WO2022100691A1 - Audio recognition method and device - Google Patents

Audio recognition method and device

Info

Publication number
WO2022100691A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
sub
probability
duration
human voice
Prior art date
Application number
PCT/CN2021/130304
Other languages
French (fr)
Chinese (zh)
Inventor
贾杨
夏龙
吴凡
郭常圳
Original Assignee
北京猿力未来科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京猿力未来科技有限公司
Publication of WO2022100691A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • The present application relates to the technical field of data processing, and in particular to an audio recognition method and device.
  • Sound is produced by the vibration of a sound source and is a form of energy. The energy of the audio can therefore be analyzed, and the user's effective speaking (mouth-opening) duration is what remains after the "silent" portions are removed.
  • This method works well in a relatively quiet environment, but its performance degrades when there is strong surrounding noise and reverberation, because noise itself also carries energy.
  • Speaking-duration statistics based on speech recognition can recover the text contained in the audio and the corresponding time points, from which the user's speaking duration is computed. Ideally this solution should give the best results, but it often has the following shortcomings: 1) speech recognition models generally have high computational complexity and are not suitable for high-concurrency online environments; 2) the quality of the statistics depends on the accuracy of the speech recognition model; different recognition models must be trained for different languages (including mixed languages, dialects, and so on), and improving model performance requires large amounts of labeled data, so the scheme generalizes poorly and its up-front labeling cost is high.
  • Vggish is a technique proposed by Google for audio classification. It uses the VGG network structure from the image recognition field, proposed by Oxford's Visual Geometry Group at ILSVRC 2014, and is trained on the Youtube-100M video dataset. Vggish classifies audio according to labels derived from online-video titles, subtitles, comments, and so on, for example songs, music, sports, and speeches. Under this scheme the quality of the categories depends on manual review, otherwise the category labels contain many errors; on the other hand, if the voice-related categories of this scheme are grouped into a human-voice class and everything else into a non-voice class, the resulting model performs poorly.
  • The present application provides an audio recognition method, including: obtaining original audio; adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio to obtain extended audio; taking a third duration equal to the sum of the first and second durations as a segmentation window and, starting from the head of the extended audio, sliding the window with a first step size to obtain multiple sub-audios; computing the time-frequency feature sequence of each sub-audio; obtaining, by a neural network and from the time-frequency feature sequence, the probability that each sub-audio belongs to a specific category; and comparing each probability with a decision threshold to decide whether the sub-audio belongs to that category.
  • The time-frequency feature sequence of a sub-audio is a Mel-frequency cepstral coefficient (MFCC) feature sequence; the neural network obtains the human-voice probability of the sub-audio from the MFCC feature sequence, and each human-voice probability is compared with the decision threshold to decide whether the sub-audio is human voice.
  • After the human-voice probabilities of the sub-audios are obtained, the method further includes: obtaining the array of human-voice probabilities of all sub-audios of the original audio, and filtering the probability values in the array with a window of a first number of values to obtain filtered human-voice probabilities. Median filtering is used to filter the human-voice probability array.
  • The method further includes obtaining the audio energy value of the original audio at the determined time point of each sub-audio and setting a human-voice probability adjustment factor according to that energy value: if the energy value is greater than an upper energy limit, the adjustment factor of the sub-audio is set to 1; if the energy value is less than a lower energy limit, the adjustment factor is set to 0; otherwise the adjustment factor is normalized to a value between 0 and 1 according to the energy value. The adjustment factor of each sub-audio is multiplied by its human-voice probability to obtain the corrected human-voice probability, and the corrected probabilities are compared with the decision threshold to decide whether each sub-audio is human voice.
  • The method further includes: acquiring the sub-audios that are consecutively judged to be human voice in the original audio, acquiring the audio segment formed by the determined time points of those sub-audios, and outputting the audio segment.
  • The first duration is equal to the second duration; the audio energy value at the determined time point of a sub-audio is specifically the energy value at the sub-audio's center time point; and acquiring the audio segment formed by the determined time points of the consecutively voiced sub-audios is specifically acquiring the segment formed by the center time points of the sub-audios consecutively judged to be human voice.
  • Before the audio is output, the method further includes: if the time interval between adjacent audio segments is less than a third threshold, also acquiring the audio between the adjacent segments.
  • The above method further includes counting the duration of the output audio.
  • The application provides an audio recognition device, including: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
  • In the present application, null data is added before and after the original audio, and the original audio is windowed with a first step size using a segmentation window equal to twice the duration of the null data, producing multiple sub-audios. Based on this fine-grained segmentation of the original audio, the MFCC is computed for each sub-audio to obtain a time-frequency two-dimensional map of the sound signal.
  • In this way the method can produce category probability values even at the beginning and end of the original audio data; the probability of each sub-audio is taken as an approximation of the probability at its center time point, yielding a probability array at the granularity of original-audio time points, so that speaking segments can be detected more accurately.
  • The present application further distinguishes human voice from non-human voice with a deep learning algorithm and filters the probability values produced by the neural network, effectively removing noise points in the recognition result; such noise points arise from the fine-grained segmentation of the previous step, from model limitations, or/and from noise in the original audio. This reasonable sliding window and smoothing mechanism therefore makes the audio recognition result smoother.
  • The fine-grained sub-audio segmentation and large sub-audio overlap of the present application cause a small portion of non-voice probabilities to be pulled toward the human-voice side by their neighboring points. Therefore, exploiting the fact that the energy of noise or silence is weaker than that of human voice, the present application further corrects the network's probability values using the energy of the original audio, further improving recognition accuracy.
  • When the final audio human-voice probability array is segmented into speaking segments for statistics, a certain tolerance is applied to preserve the continuity of the speech segments. Such segmentation provides both good accuracy and corpus material of higher content quality.
  • On the premise of ensuring user interactivity and interaction efficiency, the method of the present application saves statistics time and improves feedback efficiency, adds the ability to count speaking time points and improves its accuracy, increases generalization, and counts the user's speaking duration more precisely.
  • FIG. 1 is a schematic flowchart of an audio recognition method according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the original-audio segmentation preprocessing according to an embodiment of the present application.
  • FIG. 3a is a schematic diagram of an existing neural network structure.
  • FIG. 3b is a schematic structural diagram of the neural network according to an embodiment of the present application.
  • FIG. 4 is a distribution diagram of audio human-voice probabilities before the moving average according to an embodiment of the present application.
  • FIG. 5 is a distribution diagram of audio human-voice probabilities after the moving average according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the tolerant merging process.
  • Although the terms "first", "second", "third", and so on may be used in this application to describe various information, such information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another.
  • For example, without departing from the scope of the present application, first information may also be referred to as second information and, similarly, second information may also be referred to as first information.
  • A feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "plurality" means two or more unless expressly and specifically defined otherwise.
  • The present application provides an audio recognition method for recognizing, within a piece of original audio, the audio belonging to a certain audio category. For example, the human-voice portions can be accurately identified from a piece of original audio, so that the duration of human voice can be counted or/and the human-voice audio in the original audio can be further output.
  • The following embodiments take the application of human-voice recognition in an online education scenario as an example.
  • The idea of the present invention is to obtain sub-audios by fine-grained windowing of the original audio, to take the probability of each sub-audio as an approximation of the probability at its center time point, to compute the MFCC of each sub-audio to obtain a time-frequency two-dimensional representation of the sound signal, to distinguish human voice from non-human voice with a deep learning algorithm, and to count the effective speaking duration with a well-designed sliding window and smoothing mechanism.
  • A specific embodiment of the present invention will be described with reference to FIG. 1.
  • Step 11: obtain the original audio.
  • The original audio may contain both the desired human voice and other non-human-voice audio such as background sounds and noise.
  • Step 12: add null data before the head and after the tail of the original audio, respectively, to obtain the extended audio.
  • In one embodiment, the original audio is subdivided into smaller sub-audios: a segment of empty audio is added to each end of the original audio to obtain the extended audio, the extended audio is split into sub-audios according to a segmentation window value, and the ratio of the empty-audio duration to the segmentation window value is kept at 1:2.
  • FIG. 2 shows a preferred implementation of the audio extension provided by an embodiment of the present invention.
  • In this embodiment, to achieve accurate statistics of speaking time points, the sub-audios need a finer segmentation granularity. As shown in the figure, a is the original audio array. First, null data of equal duration, namely 480 milliseconds (ms) of zeros, is added to the head and the tail of the original audio a to obtain the extended audio b.
  • The number of zeros within the 480 ms is determined by the audio sampling frequency, i.e. the data rate within the 480 ms equals the sampling frequency.
  • The 480 ms duration of the null data added before the head and after the tail of the original audio is only exemplary; the present invention does not exclude other values of this duration.
  • Step 13: take twice the above duration as the segmentation window and obtain multiple sub-audios in sequence from the head of the extended audio with a first step size.
  • In this embodiment the segmentation window is 960 ms, i.e. twice 480 ms, and the segmentation step size is 10 ms, so the minimum segmentation granularity of the sub-audios is 10 ms.
  • Following this segmentation, several sub-audios are obtained; adjacent sub-audios are offset by 10 ms and each sub-audio lasts 960 ms.
  • Assuming the start time and end time of a sub-audio are t_i and t_i + 0.96 s in the original audio, the human-voice probability computed for that sub-audio's feature map in the subsequent steps is taken as the human-voice probability of the audio at time t_i + 0.48 s. The probability computed from the first sub-audio therefore serves as the human-voice probability at the start of the original audio, and the probability computed from the last sub-audio serves as the human-voice probability at its end.
  • Because null data is added before the head and after the tail of the original audio and is combined with half a segmentation window of real data, the present invention approximates the human-voice probability at each time point in this way, allowing more accurate detection of speaking segments.
  • In this embodiment the first step size, i.e. the segmentation granularity, is 10 ms; the present invention does not restrict the choice of other granularities.
  • Step 14: compute the time-frequency feature sequence of each sub-audio.
  • For human-voice recognition, the vibration of the human vocal cords and the opening and closing of the oral cavity follow general laws that are directly reflected in the frequency-domain characteristics of the sound. The embodiment of the present invention uses Mel-frequency cepstral coefficients (MFCC), spectral coefficients obtained by a linear transform of the logarithmic energy spectrum on the nonlinear mel scale of sound frequency; they match the human ear's perception of sound frequency and characterize the frequency-domain properties of the sound.
  • For each sub-audio obtained by segmentation, the short-time Fourier transform is computed with a preset window length and step size to obtain the MFCC feature sequence. The embodiment of the present invention uses a window length of 25 ms and a step size of 10 ms.
  • Step 15: the neural network obtains, from the time-frequency feature sequence, the probability that each sub-audio belongs to a specific category.
  • The MFCC feature sequences are fed into the trained neural network model in time order, and the model predicts the probability corresponding to each audio segment; this probability ranges from 0 to 1.
  • The trained neural network model uses 3x3 convolution kernels and pooling layers to keep the model parameters small. Training of the neural network comprises two stages: pre-training and fine-tuning. The left diagram shows the 500-class classification model, which is first trained on a sound dataset. The right diagram shows the binary classification model, which reuses the lower network structure and parameters of the 500-class model and is converged with the back-propagation algorithm. This binary model identifies whether an audio segment contains human voice and outputs the probability that the current audio segment contains human voice.
  • By introducing pre-training and fine-tuning, the network trained in the present invention focuses more closely on the human-voice versus non-human-voice classification scenario, improving model performance.
  • The trained convolutional neural network (CNN) of the embodiment of the present invention consists of input and output layers and multiple hidden layers, the hidden layers being composed mainly of a series of convolutional layers, pooling layers, and fully connected layers.
  • A convolutional layer generally defines a convolution kernel whose size represents the receptive field of that layer of the network. Different kernels are slid over the input feature map and dot products are taken with it, projecting the information inside the receptive field onto an element of the next layer and thereby enriching the information. In general the kernel is much smaller than the input feature map and is applied to it in an overlapping or parallel fashion.
  • A pooling layer is in effect a nonlinear form of downsampling; there are several different nonlinear pooling functions, such as max pooling and mean pooling. In a CNN, pooling layers are usually inserted periodically between the convolutional layers.
  • The fully connected layers fuse the high-level feature information abstracted by the convolutional and pooling layers and finally produce the classification result.
  • Step 16: compare each probability with the decision threshold to decide whether the sub-audio belongs to the specific category.
  • A decision threshold is set as the basis for judging human voice: if the probability is greater than the threshold the sub-audio is judged to be human voice, and if it is less than the threshold it is judged to be non-human voice.
  • The original audio a is thus divided into voiced and unvoiced segments, which gives the duration of human voice in the original audio, i.e. the user's speaking-duration information. The original audio can also be segmented according to each speaking segment, making it easy to subsequently output the voiced audio, for example to evaluate learning status.
  • In another embodiment, the obtained probability values can also be preprocessed by the following methods in order to optimize them.
  • The human-voice probability array of the original audio obtained by the method described above contains noise points. This can be seen in the 200-millisecond human-voice probability distribution shown in FIG. 4, where the ordinate is the probability that an audio point is human voice, the abscissa is time, and each point represents 10 ms: the 0-1 probability values along the time axis show many abrupt changes, i.e. glitches. Sliding-average preprocessing is therefore applied to the current probabilities to make the distribution smoother, giving the 200-millisecond human-voice probability distribution shown in FIG. 5.
  • The sliding-average preprocessing uses median sliding filtering. Let P = {p_1, p_2, p_3, ..., p_i, ..., p_n}, where n is the total number of sub-audios obtained by segmenting the original audio and p_i is the probability that the i-th sub-audio is human voice, and let w_smooth be the selected window size; the probability that the i-th sub-audio is human voice after median filtering is p'_i = median(p_{i-(w_smooth-1)/2}, ..., p_{i+(w_smooth-1)/2}).
  • In this embodiment the window is 31, i.e. 31 values in the sub-audio human-voice probability array: the median of the 31 adjacent probability values is taken as the probability of the middle point, and the probability of every point is recalculated in this way with a step size of 1. (A code sketch of this smoothing, and of the energy correction and tolerant merging described below, follows this definitions list.)
  • The median filtering above is one implementation of the present invention; the present invention does not exclude other filtering methods.
  • A decision algorithm can then be used to distinguish human voice from non-human voice, determine the speaking segments, and count the user's speaking duration. However, during filtering the probabilities of a small portion of non-voice audio are modified by the surrounding points. The embodiment of the present invention therefore exploits the fact that the energy of noise or silence is weaker than that of human voice and uses the energy of the original audio to further correct the human-voice probabilities, improving accuracy.
  • Let the moving-averaged audio human-voice probability array be P' = {p'_1, p'_2, ..., p'_n}. Because the original audio is sliced with a 10 ms step size to obtain the sub-audios, the human-voice probabilities are spaced 10 ms apart; the energy array Power of the original audio is therefore also computed with a 10 ms step, so that the time points of the energy array correspond to the time points of the human-voice probability array.
  • The values of the Power array are normalized to between 0 and 1, and an upper energy limit P_up and a lower energy limit P_down are set. The adjustment factor w_i is then obtained as follows: if the audio energy at a time point is greater than P_up, w_i takes the value 1; if it is less than P_down, w_i takes the value 0; otherwise w_i is normalized to a value between 0 and 1 according to the energy value (e.g. linearly between P_down and P_up). (A sketch follows this list.)
  • Each probability adjustment factor, lying between 0 and 1, adjusts the human-voice probability at the corresponding time point, finally giving the energy-corrected audio human-voice probability array P_T.
  • In the above embodiments the obtained probabilities are first given sliding-average preprocessing, then energy-correction preprocessing, and finally a decision algorithm distinguishes human voice from non-human voice, determines the speaking segments, and counts the user's speaking duration. Alternatively, energy-correction preprocessing can be performed first and sliding-average preprocessing afterwards.
  • The present invention may also adopt only one of the two preprocessing methods above to improve the accuracy of human-voice recognition.
  • The human-voice audio obtained by the above method may additionally be given tolerant merging, as follows. The judgment of whether each time point of the original audio is human voice is obtained: if the probability at time point i reaches the decision threshold, the audio at that time point of the original audio is human voice; otherwise it is non-human voice. (A sketch follows this list.)
  • The original audio a is thereby divided into voiced and unvoiced segments. If the interval between two sub-audios judged to be human voice, i.e. the number of intervening sub-audios judged to be non-human voice, is less than a third threshold, the audio between the center time points of the two voiced sub-audios is also acquired.
  • In this embodiment the third threshold is 500 milliseconds, which is only exemplary and is not limited by the present invention.
  • The user's speaking-duration information obtained by accumulating the durations of all segments after tolerant merging is more reasonable than that obtained without it, and the resulting voiced audio preserves the continuity of the speech segments.
  • In the embodiment above, step 12 adds null data of equal duration before the head and after the tail of the original audio, for example 480 milliseconds each, and step 13 divides the original audio with a window of twice 480 milliseconds, i.e. 960 milliseconds, to obtain the multiple sub-audios.
  • In another embodiment, the durations of the null data added before the head and after the tail of the original audio need not be equal: null data of a first duration is added before the head and null data of a second duration after the tail, and a third duration equal to the sum of the first and second durations is used as the segmentation window to split the original audio into sub-audios.
  • For example, if the first duration is 240 milliseconds and the second duration is 720 milliseconds, the segmentation window is their sum, 960 milliseconds; the duration of each sub-audio obtained in this way is the same as in the embodiment above, still 960 ms.
  • In this case the computed human-voice probability of a sub-audio is taken as the human-voice probability at the point one quarter of the way through the sub-audio, i.e. at time t_i + 0.24 s.
  • Correspondingly, the audio segment composed of the quarter-point time points of the sub-audios consecutively judged to be human voice is acquired. Since the sub-audios are obtained by segmenting the original audio with the first step size, adjacent quarter-point time points are separated by the first step size, for example the 10 ms used in the embodiment above.
  • The resulting array of sub-audio human-voice probabilities may be filtered using the same methods described above.
  • The present application further provides an audio recognition device, including: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
  • The specific manner in which each module performs its operations has been described in detail in the method embodiments and will not be repeated here.
  • Each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s).
  • The functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
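
The sliding median smoothing referred to above (a window of 31 probability values, step 1, one value per 10 ms time point) can be sketched as follows. This is a minimal NumPy illustration rather than code from the patent; the edge handling (padding with the border values) is an assumption.

```python
import numpy as np

def median_smooth(probs: np.ndarray, w_smooth: int = 31) -> np.ndarray:
    """Replace each voice probability with the median of the w_smooth values centred on it."""
    half = w_smooth // 2
    padded = np.pad(probs, half, mode="edge")   # edge padding at the array ends is an assumption
    return np.array([np.median(padded[i:i + w_smooth]) for i in range(len(probs))])
```

Applied to the per-10 ms probability array, this removes the isolated glitches of the kind visible in FIG. 4 and yields the smoother distribution of FIG. 5.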
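The energy-based correction referred to above can be sketched as below: a per-10 ms energy array is computed from the original audio, normalized to between 0 and 1, and mapped to an adjustment factor w_i that is 1 above the upper limit P_up, 0 below the lower limit P_down, and scaled in between; the factor multiplies the smoothed probability at the same time point. The sampling rate, the limit values, and the linear mapping between the limits are illustrative assumptions.

```python
import numpy as np

def energy_adjustment(audio: np.ndarray, sr: int = 16000, hop_ms: int = 10,
                      p_up: float = 0.5, p_down: float = 0.05) -> np.ndarray:
    """Per-10 ms adjustment factor w_i in [0, 1] derived from the original audio's energy."""
    hop = int(sr * hop_ms / 1000)
    frames = len(audio) // hop
    power = np.array([np.mean(audio[i * hop:(i + 1) * hop] ** 2) for i in range(frames)])
    power = power / (power.max() + 1e-12)          # normalise the Power array to [0, 1]
    w = (power - p_down) / (p_up - p_down)         # linear between the two limits (assumed mapping)
    return np.clip(w, 0.0, 1.0)                    # 1 above p_up, 0 below p_down

def apply_energy_correction(smoothed_probs: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Multiply each probability by the adjustment factor at the corresponding time point."""
    m = min(len(smoothed_probs), len(w))           # the two arrays may differ by one frame
    return smoothed_probs[:m] * w[:m]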
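Thresholding the corrected probabilities gives a voice or non-voice label per 10 ms point; the tolerant merging referred to above then bridges gaps shorter than the third threshold (500 ms in the embodiment) between adjacent voiced segments before the speaking duration is accumulated. The sketch below assumes a decision threshold of 0.5, which the patent does not fix.

```python
import numpy as np

def voiced_segments(labels: np.ndarray, hop_ms: int = 10, merge_gap_ms: int = 500):
    """Group consecutive voiced points into (start_s, end_s) segments, merging short gaps."""
    segments, start = [], None
    for i, voiced in enumerate(labels):
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    merged = []
    for s, e in segments:
        if merged and (s - merged[-1][1]) * hop_ms < merge_gap_ms:
            merged[-1] = (merged[-1][0], e)        # tolerant merge across a short gap
        else:
            merged.append((s, e))
    return [(s * hop_ms / 1000.0, e * hop_ms / 1000.0) for s, e in merged]

# Example use: labels = corrected_probs > 0.5
# segments = voiced_segments(labels); speaking_duration_s = sum(e - s for s, e in segments)
```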

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An audio recognition method and device. The audio recognition method comprises: obtaining original audio (11); adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio to obtain extended audio; taking a third duration that is the sum of the first duration and the second duration as a segmentation window and, starting from the head of the extended audio, sliding the window with a first step size to obtain a plurality of sub-audios; calculating time-frequency feature sequences of the sub-audios respectively (14); obtaining, by a neural network and according to the time-frequency feature sequences, the probabilities that the sub-audios belong to a specific category; and comparing the probabilities with a decision threshold respectively to determine whether the sub-audios belong to the specific category (16).

Description

Audio recognition method and device

Cross-Reference to Related Applications

This application claims priority to Chinese patent application No. 202011260050.5, entitled "An audio recognition method and device", filed with the China Patent Office on November 12, 2020, the entire contents of which are incorporated herein by reference.
Technical Field

The present application relates to the technical field of data processing, and in particular to an audio recognition method and device.
Background

With the development of Internet technology, online education and similar industries have flourished and the number of online learners has grown sharply. Teachers evaluate students' classroom participation through subjective impressions and statistics of how long each student speaks, and give feedback to improve teaching effectiveness.

At present, the prior art offers several schemes for counting a user's speaking duration.

Switch-based speaking-duration statistics: a recording button that can be turned on and off is placed in the user client, and the user must press the button before speaking. The aim of this scheme is simple and direct duration counting, but when the audience is a group of young children, who tend to be poorly compliant and unstructured in their behavior, pressing a button before speaking is inefficient, and keeping the microphone open throughout is better suited to this scenario. Moreover, the user must operate the control manually before any information can be conveyed, which reduces interactivity to some extent, and the validity of the statistics depends entirely on the user's self-discipline.

Speaking-duration statistics based on audio energy analysis: sound is produced by the vibration of a sound source and is a form of energy, so the energy of the audio can be analyzed and the user's effective speaking duration is what remains after the "silent" portions are removed. This method works well in a relatively quiet environment, but its performance degrades when there is strong surrounding noise and reverberation, because noise itself also carries energy.

Speaking-duration statistics based on speech recognition: speech recognition algorithms such as Gaussian mixture models with hidden Markov models, neural networks with hidden Markov models, and Connectionist Temporal Classification (CTC) can recover the text contained in the audio together with the corresponding time points, from which the user's speaking duration can be counted. Ideally this scheme should give the best results, but it often has the following shortcomings: 1) speech recognition models generally have high computational complexity and are not suitable for high-concurrency online use; 2) the quality of the statistics depends on the accuracy of the speech recognition model; different recognition models must be trained for different languages (including mixed languages, dialects, etc.), and improving model performance requires large amounts of labeled data, so the scheme generalizes poorly and its up-front labeling cost is high.

Vggish is a technique proposed by Google for audio classification. It uses the VGG network structure from the image recognition field, proposed by Oxford's Visual Geometry Group at ILSVRC 2014, and is trained on the Youtube-100M video dataset. Vggish classifies audio according to labels derived from online-video titles, subtitles, comments, and so on, such as songs, music, sports, and speeches. Under this scheme the quality of the categories depends on manual review, otherwise the category labels contain many errors; on the other hand, if the voice-related categories are grouped into a human-voice class and the rest into a non-voice class, the resulting model performs poorly.
Summary of the Invention

The present application provides an audio recognition method, including: obtaining original audio; adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio to obtain extended audio; taking a third duration equal to the sum of the first and second durations as a segmentation window and, starting from the head of the extended audio, sliding the window with a first step size to obtain multiple sub-audios; computing the time-frequency feature sequence of each sub-audio; obtaining, by a neural network and from the time-frequency feature sequence, the probability that each sub-audio belongs to a specific category; and comparing each probability with a decision threshold to decide whether the sub-audio belongs to the specific category.

The time-frequency feature sequence of a sub-audio is a Mel-frequency cepstral coefficient (MFCC) feature sequence; the neural network obtains the human-voice probability of each sub-audio from the MFCC feature sequence, and each human-voice probability is compared with the decision threshold to decide whether the sub-audio is human voice.

In the above method, after the human-voice probabilities of the sub-audios are obtained, the method further includes: obtaining the array of human-voice probabilities of all sub-audios of the original audio, and filtering the probability values in the array with a window of a first number of values to obtain filtered human-voice probabilities. Median filtering is used to filter the human-voice probability array.

In the above method, the audio energy value of the original audio at the determined time point of each sub-audio is obtained, and a human-voice probability adjustment factor is set according to that energy value: if the energy value is greater than an upper energy limit, the adjustment factor of the sub-audio is set to 1; if the energy value is less than a lower energy limit, the adjustment factor is set to 0; otherwise the adjustment factor is normalized to between 0 and 1 according to the energy value. The adjustment factor of each sub-audio is multiplied by its human-voice probability to give the corrected human-voice probability, and the corrected probabilities are compared with the decision threshold to decide whether each sub-audio is human voice.

The method further includes: acquiring the sub-audios consecutively judged to be human voice in the original audio, acquiring the audio segment formed by the determined time points of those sub-audios, and outputting the audio segment.

In the above method, the first duration is equal to the second duration; the audio energy value at the determined time point of a sub-audio is specifically the energy value at the sub-audio's center time point; and acquiring the audio segment formed by the determined time points of the consecutively voiced sub-audios is specifically acquiring the segment formed by their center time points.

Before the audio is output, the method further includes: if the time interval between adjacent audio segments is less than a third threshold, also acquiring the audio between the adjacent segments.

The above method further includes counting the duration of the output audio.

The present application provides an audio recognition device, including:

a processor; and

a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.

In the present application, null data is added before and after the original audio, and the original audio is windowed with a first step size using a segmentation window equal to twice the duration of the null data, producing multiple sub-audios; based on this fine-grained segmentation, the MFCC is computed for each sub-audio to obtain a time-frequency two-dimensional map of the sound signal. The method can therefore produce category probability values even at the beginning and end of the original audio; the probability of each sub-audio is taken as an approximation of the probability at its center time point, yielding a probability array at the granularity of original-audio time points and enabling more accurate detection of speaking segments.

In another aspect, the present application distinguishes human voice from non-human voice with a deep learning algorithm and further filters the probability values produced by the neural network, effectively removing noise points in the recognition result; such noise points arise from the fine-grained segmentation of the previous step, from model limitations, or/and from noise in the original audio. This reasonable sliding window and smoothing mechanism therefore makes the audio recognition result smoother.

The fine-grained sub-audio segmentation and large sub-audio overlap of the present application cause a small portion of non-voice probabilities to be pulled toward the human-voice side by their neighboring points. Therefore, exploiting the fact that the energy of noise or silence is weaker than that of human voice, the present application further corrects the network's probability values using the energy of the original audio, further improving recognition accuracy.

Further, when the final audio human-voice probability array is segmented into speaking segments for statistics, a certain tolerance is applied to preserve the continuity of the speech segments; such segmentation provides both good accuracy and corpus material of higher content quality.

In summary, on the premise of ensuring user interactivity and interaction efficiency, the method of the present application saves statistics time and improves feedback efficiency, adds the ability to count speaking time points and improves its accuracy, increases generalization, and counts the user's speaking duration more precisely.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present application.
Description of the Drawings

The above and other objects, features, and advantages of the present application will become more apparent from the following more detailed description of exemplary embodiments of the present application taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same parts.

FIG. 1 is a schematic flowchart of an audio recognition method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of the original-audio segmentation preprocessing according to an embodiment of the present application;

FIG. 3a is a schematic diagram of an existing neural network structure;

FIG. 3b is a schematic structural diagram of the neural network according to an embodiment of the present application;

FIG. 4 is a distribution diagram of audio human-voice probabilities before the moving average according to an embodiment of the present application;

FIG. 5 is a distribution diagram of audio human-voice probabilities after the moving average according to an embodiment of the present application;

FIG. 6 is a schematic diagram of the tolerant merging process.
Detailed Description

Preferred embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present application, it should be understood that the present application may be embodied in various forms and should not be limited by the embodiments set forth herein; rather, these embodiments are provided so that this application will be thorough and complete and will fully convey its scope to those skilled in the art.

The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "said" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms "first", "second", "third", and so on may be used in this application to describe various information, such information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information and, similarly, second information may also be referred to as first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "plurality" means two or more unless expressly and specifically defined otherwise.
The present application provides an audio recognition method for recognizing, within a piece of original audio, the audio belonging to a certain audio category. For example, the human-voice portions can be accurately identified from a piece of original audio so that the duration of human voice can be counted, or/and the human-voice audio in the original audio can be further output.

The following embodiments take the application of human-voice recognition in an online education scenario as an example.

The field of education has the concept of speaking-duration statistics. In offline daily education, for example, the teacher acts as the executor and evaluator of these statistics, evaluating students' classroom participation through subjective impressions and counting how long each student speaks, and giving feedback to improve teaching effectiveness. With the development of online education, the number of online learners has grown sharply, and speaking-duration statistics remain one of the indicators by which online education evaluates student participation, so a technical solution is needed to count students' speaking durations accurately and effectively.

The idea of the present invention is to obtain sub-audios by fine-grained windowing of the original audio, to take the probability of each sub-audio as an approximation of the probability at its center time point, to compute the MFCC of each sub-audio to obtain a time-frequency two-dimensional representation of the sound signal, to distinguish human voice from non-human voice with a deep learning algorithm, and to count the effective speaking duration with a well-designed sliding window and smoothing mechanism.
A specific embodiment of the present invention is described with reference to FIG. 1.

Step 11: obtain the original audio.

The original audio file is obtained. For example, when a student studies online and answers by voice following the prompts of the learning software, a smart device captures the original audio of the spoken answer through its microphone. The original audio may contain both the desired human voice and other non-human-voice audio such as background sounds and noise.
Step 12: add null data before the head and after the tail of the original audio, respectively, to obtain the extended audio.

In one embodiment, the original audio is subdivided into smaller sub-audios: a segment of empty audio is added to each end of the original audio to obtain the extended audio, the extended audio is split into sub-audios according to a segmentation window value, and the ratio of the empty-audio duration to the segmentation window value is kept at 1:2.

FIG. 2 shows a preferred implementation of the audio extension provided by an embodiment of the present invention.

In this embodiment, to achieve accurate statistics of speaking time points, the sub-audios need a finer segmentation granularity. As shown in the figure, a is the original audio array. First, null data of equal duration, namely 480 milliseconds (ms) of zeros, is added to the head and the tail of the original audio a to obtain the extended audio b. The number of zeros within the 480 ms is determined by the audio sampling frequency, i.e. the data rate within the 480 ms equals the sampling frequency.

The 480 ms duration of the null data added before the head and after the tail of the original audio in this embodiment is only exemplary; the present invention does not exclude other values of this duration.
Step 13: take twice the above duration as the segmentation window and obtain multiple sub-audios in sequence from the head of the extended audio with a first step size.

As shown in FIG. 2, in this embodiment the segmentation window used when splitting the original audio into sub-audios is 960 ms, i.e. twice 480 ms, and the segmentation step size is 10 ms, so the minimum segmentation granularity of the sub-audios is 10 ms. Following this segmentation, several sub-audios are obtained; adjacent sub-audios are offset by 10 ms and each sub-audio lasts 960 ms.

Assuming the start time and end time of a given sub-audio are t_i and t_i + 0.96 s in the original audio, the embodiment of the present invention takes the human-voice probability computed for that sub-audio's feature map in the subsequent steps as the human-voice probability of the audio at time t_i + 0.48 s. The probability computed from the first sub-audio therefore serves as the human-voice probability at the start of the original audio, and the probability computed from the last sub-audio serves as the human-voice probability at its end.

It can be seen that, because null data is added before the head and after the tail of the original audio and is combined with half a segmentation window of real data, the invention approximates the human-voice probability at each time point in this way and can therefore detect speaking segments more accurately.

In this embodiment of the invention the first step size, i.e. the segmentation granularity, is 10 ms; the present invention does not restrict the choice of other granularities.
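
As a minimal sketch of steps 12 and 13 under the 480 ms / 960 ms / 10 ms values above, the padding and sliding-window segmentation could look as follows. The 16 kHz sampling rate and the function and variable names are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def split_into_sub_audios(audio: np.ndarray, sr: int = 16000,
                          pad_ms: int = 480, win_ms: int = 960, hop_ms: int = 10):
    """Pad the original audio with zeros at both ends, then slide a 960 ms window in 10 ms steps."""
    pad = int(sr * pad_ms / 1000)                  # 480 ms of zeros (7680 samples at 16 kHz)
    win = int(sr * win_ms / 1000)                  # 960 ms segmentation window
    hop = int(sr * hop_ms / 1000)                  # 10 ms step, the segmentation granularity
    extended = np.concatenate([np.zeros(pad), audio, np.zeros(pad)])
    subs, centres = [], []
    for start in range(0, len(extended) - win + 1, hop):
        subs.append(extended[start:start + win])
        # The probability later computed for this sub-audio is attributed to the
        # centre of the window, i.e. t_i + 0.48 s expressed in original-audio time.
        centres.append((start + win // 2 - pad) / sr)
    return np.stack(subs), np.array(centres)
```

In this sketch the first window is centred on time 0 of the original audio and the last window is centred at (approximately) its final sample, which is why probability values are available even at the very beginning and end of the recording.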
Step 14: compute the time-frequency feature sequence of each sub-audio.

For human-voice recognition, the vibration of the human vocal cords and the opening and closing of the oral cavity follow general laws that are directly reflected in the frequency-domain characteristics of the sound. The embodiment of the present invention uses Mel-frequency cepstral coefficients (MFCC), spectral coefficients obtained by a linear transform of the logarithmic energy spectrum on the nonlinear mel scale of sound frequency; they match the human ear's perception of sound frequency and characterize the frequency-domain properties of the sound.

For each sub-audio obtained by segmentation, the short-time Fourier transform is computed with a preset window length and step size to obtain the MFCC feature sequence. The embodiment of the present invention uses a window length of 25 ms and a step size of 10 ms to compute the short-time Fourier transform and obtain the MFCC features.
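
A sketch of this feature extraction using the librosa library is shown below. The 25 ms window and 10 ms hop follow the embodiment, while the sampling rate and the number of coefficients (n_mfcc=40) are illustrative assumptions, since the patent does not specify them.

```python
import numpy as np
import librosa

def sub_audio_mfcc(sub_audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """MFCC time-frequency feature matrix for one 960 ms sub-audio (25 ms window, 10 ms hop)."""
    win_length = int(0.025 * sr)                   # 25 ms analysis window
    hop_length = int(0.010 * sr)                   # 10 ms step
    mfcc = librosa.feature.mfcc(y=sub_audio.astype(np.float32), sr=sr, n_mfcc=40,
                                n_fft=win_length, hop_length=hop_length,
                                win_length=win_length)
    return mfcc.T                                  # shape (frames, n_mfcc): the feature sequence
```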
Step 15: the neural network obtains, from the time-frequency feature sequence, the probability that each sub-audio belongs to a specific category.

The MFCC feature sequences are fed into the trained neural network model, and the probabilities corresponding to the audio segments output by the model are obtained. In this embodiment, the audio segments are fed into the trained neural network model in time order, and the model predicts the probability corresponding to each audio segment; this probability can range from 0 to 1.

For example, as shown in FIG. 3, the trained neural network model uses 3x3 convolution kernels and pooling layers to keep the model parameters small. Training of the neural network comprises two stages: pre-training and fine-tuning. The left diagram shows the 500-class classification model, which is first trained on a sound dataset. The right diagram shows the binary classification model, which reuses the lower network structure and parameters of the 500-class model and is converged with the back-propagation algorithm. This binary model identifies whether an audio segment contains human voice and outputs the probability that the current audio segment contains human voice. By introducing pre-training and fine-tuning, the network trained in the present invention focuses more closely on the human-voice versus non-human-voice classification scenario, improving model performance.
As shown in FIG. 3, when performing human-voice recognition on audio, the trained convolutional neural network (CNN) of this embodiment of the invention consists of an input layer, an output layer and multiple hidden layers, the hidden layers being composed mainly of a series of convolutional layers, pooling layers and fully connected layers.
A convolutional layer generally defines a convolution kernel whose size characterizes the receptive field of that layer. By sliding different convolution kernels over the input feature map and taking the dot product with it, the information within the receptive field is projected onto a single element of the next layer, thereby concentrating the information. In general, the convolution kernel is much smaller than the input feature map and is applied to it in overlapping or parallel positions.
The pooling layer is in effect a nonlinear form of downsampling; there are various nonlinear pooling functions, such as max pooling and mean pooling. Typically, pooling layers are inserted periodically between the convolutional layers of a CNN.
The fully connected layers fuse the high-level feature information abstracted by the convolutional and pooling layers and ultimately produce the classification result.
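The exact architecture of FIG. 3 is not reproduced in the text; the PyTorch sketch below only illustrates the kind of network described — 3×3 convolutions, pooling, and a fully connected head that outputs a single human-voice probability. All layer sizes are illustrative assumptions. In the two-stage scheme, the convolutional trunk would first be trained with a 500-way classification head and then reused under this binary head during fine-tuning.

    import torch
    import torch.nn as nn

    class VoiceCNN(nn.Module):
        """Small CNN over an MFCC feature map of shape (1, n_frames, n_mfcc)
        that outputs the probability that the sub-audio contains a human voice."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, 1), nn.Sigmoid(),   # binary human-voice head
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x)).squeeze(-1)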
Step 16: compare each probability with a decision threshold to decide whether the sub-audio belongs to the specific class.
The decision threshold is set as the basis for deciding whether the audio is human voice: if the probability is greater than the decision threshold, the sub-audio is judged to be human voice; if the probability is less than the decision threshold, it is judged to be non-human voice.
After the above steps, the original audio a has been divided into human-voice and non-human-voice segments. By accumulating the durations of all human-voice segments, the duration of human voice in the original audio, i.e. the user's speaking-duration information, is obtained. Moreover, the original audio can be segmented according to the information of each speaking segment, which facilitates the subsequent output of the human-voice audio, for example for evaluating learning progress.
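As a minimal sketch under the same assumptions as above (10 ms per probability point, an illustrative threshold of 0.5), the speaking duration can be accumulated as follows.

    import numpy as np

    STEP_S = 0.01   # each probability point covers 10 ms of original audio

    def speaking_duration(probs: np.ndarray, threshold: float = 0.5) -> float:
        """Total human-voice duration in seconds, counted from the
        per-point probabilities of the original audio."""
        return float(np.sum(probs > threshold) * STEP_S)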
In yet another embodiment of the invention, after the neural network obtains in step 15 the probability that each sub-audio belongs to a specific class from the temporal feature sequence, the obtained probability values may additionally be preprocessed with the following methods in order to optimize them.
1) Moving-average preprocessing of the probabilities obtained so far.
Owing to the segmentation granularity and to noise, the human-voice probability array of the original audio obtained by the method described above contains noisy points. This is visible in the 200 ms human-voice probability plot of FIG. 4, in which the ordinate is the probability that the audio point is human voice, the abscissa is time, and each point represents 10 ms. The probability values between 0 and 1 along the time axis show many abrupt jumps, i.e. glitches. Therefore, moving-average preprocessing is applied to the probabilities obtained so far so that the probability distribution becomes smoother, yielding the 200 ms human-voice probability plot shown in FIG. 5.
The moving-average preprocessing uses a sliding window filter (referred to in this embodiment as median sliding filtering); as described below, the filtered probability that the i-th sub-audio is human voice is the average of the probabilities inside the window:
p_i^f = (p_Lo + p_(Lo+1) + ... + p_Hi) / (Hi − Lo + 1)
Here the human-voice probability array of all sub-audios in the original audio is
P = {p_1, p_2, p_3, ..., p_i, ..., p_n}, where n is the total number of sub-audios obtained by segmenting the original audio and p_i is the probability that the i-th sub-audio is human voice.
w_smooth is the selected window size. In this embodiment, for example, the window is chosen as 31, i.e. the window covers 31 values of the sub-audio human-voice probability array.
For p_i, the lower and upper index bounds of the moving average are determined.
The lower index bound is Lo = max(0, i − 15), where 0 denotes the first probability value in the array;
the upper index bound is Hi = min(n, i + 15), where n denotes the last probability value in the array.
In this embodiment, the filtering takes the average of the probability values of the 31 neighbouring points as the probability value of the middle point; following this method, the probability value of every point is recomputed with a step size of 1.
Comparing FIG. 4 and FIG. 5, it can be seen that after the moving average the glitches in the sub-audio human-voice probability curve are effectively corrected, which improves the accuracy of speaking-segment segmentation to a certain extent.
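A sketch of this window-31 smoothing, with the window clipped at the ends of the array, might look as follows (the function name is illustrative).

    import numpy as np

    def smooth_probs(p: np.ndarray, w_smooth: int = 31) -> np.ndarray:
        """Replace each probability by the mean of the values inside a
        w_smooth-point window centred on it, clipped at the array ends."""
        half = w_smooth // 2                     # 15 for a 31-point window
        out = np.empty(len(p), dtype=float)
        for i in range(len(p)):
            lo = max(0, i - half)
            hi = min(len(p), i + half + 1)       # exclusive upper bound
            out[i] = p[lo:hi].mean()
        return out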
The above filtering is one implementation of the invention; the invention does not exclude the use of other filtering methods.
After the filtering preprocessing, a decision algorithm is used to distinguish human voice from non-human voice, the human-voice speaking segments are determined, and the user's speaking duration is counted.
2) Energy-correction preprocessing.
After the moving-average preprocessing, because this embodiment of the invention uses fine-grained sub-audio segmentation and the sub-audios overlap substantially, the probabilities of a small number of non-human-voice points are pushed towards human voice by the surrounding points during filtering, i.e. their human-voice probability increases even though they are in fact non-human voice.
To address this, this embodiment of the invention exploits the fact that the energy of noise or silence is weaker than that of human voice, and uses the energy of the original audio to further correct the human-voice probabilities so as to improve accuracy.
The moving-averaged audio human-voice probability array is
P^f = {p_1^f, p_2^f, p_3^f, ..., p_i^f, ..., p_n^f}.
With a window size of 10 ms and a step size of 10 ms, the energy array of the original audio is computed:
Power = {w_1, w_2, w_3, ..., w_i, ..., w_n}.
Because, in the embodiment described above, the original audio is sliced into sub-audios with a step size of 10 ms, giving human-voice probabilities at 10 ms intervals, the energy array of the original audio is likewise computed with a step size of 10 ms here, so that the instants of the energy array of the original audio correspond to the instants of the human-voice probability array of the original audio.
The values of the Power array are normalized to between 0 and 1. With an energy upper limit P_up and an energy lower limit P_down, w_i can be normalized as follows:
w_i^f = 1, if w_i > P_up;
w_i^f = 0, if w_i < P_down;
w_i^f = (w_i − P_down) / (P_up − P_down), otherwise.
As can be seen from the above formula, when the audio energy at an instant is greater than the energy upper limit P_up, w_i^f takes the value 1, and when the audio energy at an instant is less than the energy lower limit P_down, w_i^f takes the value 0, giving the normalized energy array
W^f = {w_1^f, w_2^f, w_3^f, ..., w_i^f, ..., w_n^f}.
The array P^f and the array W^f are multiplied value by value (a dot product of corresponding elements), giving the energy-corrected audio human-voice probability array P_T, i.e. p_i^T = p_i^f · w_i^f. After this operation, when the audio energy at an instant is greater than the energy upper limit P_up, the human-voice probability value at that instant is unchanged; if the audio energy at an instant is less than the energy lower limit P_down, the human-voice probability value at that instant becomes 0.
In this embodiment, if the audio energy lies between the energy lower limit and the energy upper limit (the limit values included), the resulting probability adjustment factor lies between 0 and 1; this adjustment factor scales the human-voice probability value at the corresponding instant, finally giving the energy-corrected audio human-voice probability array P_T.
It can be seen from the above that, by using the energy array of the original audio, if the audio energy at an instant is below the energy lower limit, the audio at that instant is considered non-human voice and its human-voice probability is set to zero; in this way, further non-human-voice portions of the audio are removed.
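Under the same assumptions as the earlier sketches (10 ms frames aligned with the probability points; P_up, P_down and all names are illustrative), the energy correction could be written as follows.

    import numpy as np

    def frame_energies(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
        """Mean squared amplitude over consecutive 10 ms frames of the original audio."""
        hop = int(0.01 * sr)
        n = len(audio) // hop
        frames = audio[: n * hop].reshape(n, hop)
        return (frames ** 2).mean(axis=1)

    def energy_correct(p_smooth: np.ndarray, energy: np.ndarray,
                       p_down: float, p_up: float) -> np.ndarray:
        """Scale each human-voice probability by a factor derived from the
        frame energy: 0 below p_down, 1 above p_up, linear in between."""
        w = np.clip((energy - p_down) / (p_up - p_down), 0.0, 1.0)
        m = min(len(p_smooth), len(w))       # align the two arrays
        return p_smooth[:m] * w[:m]          # element-wise product of P^f and W^f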
In the above embodiment, the obtained probabilities first undergo moving-average preprocessing and then energy-correction preprocessing, and finally a decision algorithm distinguishes human voice from non-human voice, determines the human-voice speaking segments and counts the user's speaking duration. The two kinds of preprocessing applied to the obtained probabilities, energy correction and moving average, have no required order; the energy-correction preprocessing may also be performed first, followed by the moving-average preprocessing.
The invention may also use only one of the two preprocessing methods above to improve the accuracy of human-voice recognition.
As a further optimization of the above embodiments, before the human-voice duration is counted or the human-voice audio is output, tolerant-merging processing may also be applied to the human-voice audio obtained by the above method.
Specifically, considering the continuity of human speech, and especially the online-learning scenario of children and teenagers, there are often brief pauses between the words of a sentence that expresses a complete meaning, typically for breathing or to convey an emotion. In this embodiment, when the speaking segments are segmented and counted from the final audio human-voice probability array, the recognition result obtained by the steps described above is not followed strictly; instead, a certain tolerance is allowed in order to preserve the continuity of the speech segments. Segmenting in this way provides good accuracy and also supplies teachers with evaluation material of higher content quality, making it easier for them to assess students' learning.
The tolerant-merging method is as follows.
A decision threshold is set as the basis for deciding whether audio is human voice.
On the basis of the above embodiment, the final probability array P_T is combined with the decision threshold to obtain, for each time node of the original audio, a decision on whether it is human voice: if p_i^T is greater than the decision threshold, the audio at instant i of the original audio is human voice; otherwise, it corresponds to non-human voice.
Through the above steps, the original audio a has been divided into human-voice and non-human-voice segments. If the number of sub-audios judged non-human voice lying between two sub-audios judged human voice is smaller than a third threshold, the audio between the centre instants of those two human-voice sub-audios is additionally acquired.
Specifically, as shown in FIG. 6, if the original audio contains two human-voice segments a_i and a_{i+1} whose start and end time nodes are (t_i^start, t_i^end) and (t_{i+1}^start, t_{i+1}^end) respectively, and the gap between them is below the third threshold, the two segments are merged into one. In this embodiment the third threshold is taken as 500 milliseconds; this value is merely exemplary and the invention is not limited to it.
After the tolerant-merging processing, the user's speaking-duration information obtained by accumulating the durations of all segments is more reasonable than the speaking-duration information obtained without tolerant merging, and the resulting human-voice audio preserves the continuity of the speech segments.
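A sketch of this tolerant merge over the per-point decisions might look as follows; the gap is expressed in 10 ms points, so the 500 ms third threshold of this embodiment corresponds to 50 points (all names are illustrative).

    def voice_segments(is_voice, max_gap: int = 50):
        """Turn a per-point human-voice decision sequence (10 ms per point)
        into (start, end) index pairs, merging segments separated by a
        gap shorter than max_gap points."""
        segments, start = [], None
        for i, v in enumerate(is_voice):
            if v and start is None:
                start = i
            elif not v and start is not None:
                segments.append([start, i])
                start = None
        if start is not None:
            segments.append([start, len(is_voice)])

        merged = []
        for seg in segments:
            if merged and seg[0] - merged[-1][1] < max_gap:
                merged[-1][1] = seg[1]        # bridge the short pause
            else:
                merged.append(seg)
        return [tuple(s) for s in merged]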
In the embodiment described above, step 12 specifically adds empty data of equal duration before the head and after the tail of the original audio, for example 480 milliseconds each, and step 13 segments the original audio into multiple sub-audios using a window of twice 480 milliseconds, i.e. 960 milliseconds.
In other embodiments of the invention, the durations of the empty data added before the head and after the tail of the original audio may be unequal. That is, empty data of a first duration is added before the head of the original audio and empty data of a second duration is added after the tail of the original audio, and the original audio is segmented into sub-audios with a segmentation window of a third duration equal to the sum of the first duration and the second duration.
For example, the first duration is 240 milliseconds and the second duration is 720 milliseconds, so the segmentation window is the sum of the first and second durations, i.e. 960 milliseconds. The duration of the sub-audios obtained in this way is the same as in the above embodiment, still 960 ms.
With this segmentation scheme, the computed human-voice probability of a sub-audio is taken approximately as the human-voice probability at the 1/4 point of the sub-audio. Assuming the start and end instants of a given sub-audio correspond to t_i and t_i + 0.96 s in the original audio, the sub-audio's human-voice probability is taken approximately as the human-voice probability at instant t_i + 0.24 s. Further, when the human-voice audio is output, the audio segment composed of the 1/4 instants of the sub-audios consecutively judged to be human voice is obtained. Since the original audio is segmented with the first step size, the 1/4 instants of adjacent sub-audios are spaced by the first step size, for example the 10 ms used in the above embodiment.
The resulting human-voice probability array of the sub-audios can be filtered with the same method described above.
When the audio-energy correction preprocessing is applied to the resulting sub-audio human-voice probability array, the preferred approach is to compute the energy value at the 1/4 point of each sub-audio. For example, assuming the start and end instants of a given sub-audio correspond to t_i and t_i + 0.96 s in the original audio, the energy value at instant t_i + 0.24 s is computed, and the probability correction factor of that sub-audio (t_i, t_i + 0.96 s) is obtained from this energy value.
Corresponding to the foregoing embodiments of the application-function implementation method, the present application further provides an audio recognition device. The device comprises:
a processor; and
a memory on which executable code is stored, and when the executable code is executed by the processor, the processor is caused to perform the method described above. As for the device of the above embodiment, the specific manner in which each module performs its operations has already been described in detail in the embodiments of the method and will not be elaborated here.
Those skilled in the art will also appreciate that the various exemplary logic blocks, modules, circuits and algorithm steps described in connection with this application may be implemented as electronic hardware, computer software, or a combination of the two.
The flowcharts and block diagrams in the figures illustrate the possible architectures, functions and operations of systems and methods according to multiple embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present application have been described above; the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or their improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

  1. An audio recognition method, characterized by comprising:
    obtaining original audio, adding empty data of a first duration before the head of the original audio, and adding empty data of a second duration after the tail of the original audio, to obtain expanded audio;
    using a third duration equal to the sum of the first duration and the second duration as a segmentation window, and, starting from the head of the expanded audio with a first step size, windowing in turn to obtain a plurality of sub-audios;
    separately computing the time-frequency feature sequence of each sub-audio;
    obtaining, by a neural network, the probability that the sub-audio belongs to a specific class according to the time-frequency feature sequence;
    comparing the probabilities respectively with a decision threshold to decide whether the sub-audio belongs to the specific class.
  2. The method according to claim 1, characterized in that
    the time-frequency feature sequence of the sub-audio is a Mel-frequency cepstral coefficient feature sequence;
    and the neural network obtains the human-voice probability of the sub-audio according to the Mel-frequency cepstral coefficient feature sequence;
    and the human-voice probabilities are respectively compared with a decision threshold to decide whether the sub-audio is human voice.
  3. The method according to claim 2, characterized in that, after the human-voice probability of the sub-audio is obtained, the method further comprises:
    obtaining an array of the human-voice probabilities of all sub-audios of the original audio;
    filtering the probability values in the array with a first number as the window, to obtain filtered probabilities.
  4. The method according to claim 3, characterized in that the array of human-voice probabilities is filtered by median filtering.
  5. The method according to claim 2 or 3, characterized in that deciding, according to a preset rule, whether the sub-audio is human voice from the human-voice probability comprises:
    obtaining the audio energy value at a determined time point of the sub-audio in the original audio; and setting a human-voice probability adjustment factor according to the energy value, including:
    if the energy value is greater than an energy upper limit, setting the human-voice probability adjustment factor of the sub-audio to 1;
    if the energy value is less than an energy lower limit, setting the human-voice probability adjustment factor of the sub-audio to 0;
    if the energy value is neither greater than the energy upper limit nor less than the energy lower limit, normalizing the human-voice probability adjustment factor to between 0 and 1 according to the energy value;
    multiplying the human-voice probability adjustment factor of the sub-audio by the human-voice probability of the sub-audio to obtain a corrected sub-audio human-voice probability;
    and comparing the corrected sub-audio human-voice probabilities respectively with a decision threshold to decide whether the sub-audio is human voice.
  6. The method according to claim 5, characterized in that the method further comprises:
    obtaining the sub-audios of the original audio that are consecutively judged to be human voice;
    obtaining the audio segment composed of the determined time points of the sub-audios consecutively judged to be human voice;
    outputting the audio segment.
  7. The method according to claim 6, characterized in that:
    the first duration is equal to the second duration;
    and the audio energy value at the determined time point of the sub-audio is specifically the audio energy value at the centre time point of the sub-audio;
    the obtaining of the audio segment composed of the determined time points of the sub-audios consecutively judged to be human voice is specifically obtaining the audio segment composed of the centre time points of the sub-audios consecutively judged to be human voice.
  8. The method according to claim 6 or 7, characterized in that, before the outputting of the audio, the method further comprises:
    if the time interval between adjacent said audio segments is smaller than a third threshold, obtaining the audio segment between the adjacent audio segments.
  9. The method according to claim 8, characterized by further comprising:
    counting the duration of the output audio.
  10. A device for audio recognition, characterized by comprising:
    a processor; and
    a memory on which executable code is stored, and when the executable code is executed by the processor, the processor is caused to perform the method according to any one of claims 1-9.
PCT/CN2021/130304 2020-11-12 2021-11-12 Audio recognition method and device WO2022100691A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011260050.5A CN112270933B (en) 2020-11-12 2020-11-12 Audio identification method and device
CN202011260050.5 2020-11-12

Publications (1)

Publication Number Publication Date
WO2022100691A1 true WO2022100691A1 (en) 2022-05-19

Family

ID=74339924

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/130304 WO2022100691A1 (en) 2020-11-12 2021-11-12 Audio recognition method and device

Country Status (2)

Country Link
CN (1) CN112270933B (en)
WO (1) WO2022100691A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579022A (en) * 2022-12-09 2023-01-06 南方电网数字电网研究院有限公司 Superposition sound detection method and device, computer equipment and storage medium
CN115840877A (en) * 2022-12-06 2023-03-24 中国科学院空间应用工程与技术中心 Distributed stream processing method and system for MFCC extraction, storage medium and computer

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN113593603A (en) * 2021-07-27 2021-11-02 浙江大华技术股份有限公司 Audio category determination method and device, storage medium and electronic device

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030088411A1 (en) * 2001-11-05 2003-05-08 Changxue Ma Speech recognition by dynamical noise model adaptation
CN1666252A (en) * 2002-07-08 2005-09-07 里昂中央理工学院 Method and apparatus for classifying sound signals
CN108288465A (en) * 2018-01-29 2018-07-17 中译语通科技股份有限公司 Intelligent sound cuts the method for axis, information data processing terminal, computer program
CN109712641A (en) * 2018-12-24 2019-05-03 重庆第二师范学院 A kind of processing method of audio classification and segmentation based on support vector machines
CN110085251A (en) * 2019-04-26 2019-08-02 腾讯音乐娱乐科技(深圳)有限公司 Voice extracting method, voice extraction element and Related product
CN110349597A (en) * 2019-07-03 2019-10-18 山东师范大学 A kind of speech detection method and device
CN110782920A (en) * 2019-11-05 2020-02-11 广州虎牙科技有限公司 Audio recognition method and device and data processing equipment
US20200074997A1 (en) * 2018-08-31 2020-03-05 CloudMinds Technology, Inc. Method and system for detecting voice activity in noisy conditions
CN111145763A (en) * 2019-12-17 2020-05-12 厦门快商通科技股份有限公司 GRU-based voice recognition method and system in audio
CN111613213A (en) * 2020-04-29 2020-09-01 广州三人行壹佰教育科技有限公司 Method, device, equipment and storage medium for audio classification
CN111883182A (en) * 2020-07-24 2020-11-03 平安科技(深圳)有限公司 Human voice detection method, device, equipment and storage medium
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610707B (en) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN107527626A (en) * 2017-08-30 2017-12-29 北京嘉楠捷思信息技术有限公司 Audio identification system
CN109859771B (en) * 2019-01-15 2021-03-30 华南理工大学 Sound scene clustering method for jointly optimizing deep layer transformation characteristics and clustering process
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN110827798B (en) * 2019-11-12 2020-09-11 广州欢聊网络科技有限公司 Audio signal processing method and device


Also Published As

Publication number Publication date
CN112270933A (en) 2021-01-26
CN112270933B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN108564942B (en) Voice emotion recognition method and system based on adjustable sensitivity
WO2022100691A1 (en) Audio recognition method and device
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
Kinnunen Spectral features for automatic text-independent speaker recognition
Schuller Intelligent audio analysis
WO2022100692A1 (en) Human voice audio recording method and apparatus
Qamhan et al. Digital audio forensics: microphone and environment classification using deep learning
Sefara The effects of normalisation methods on speech emotion recognition
CN113488063B (en) Audio separation method based on mixed features and encoding and decoding
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
Xiao et al. Hierarchical classification of emotional speech
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
Hashem et al. Speech emotion recognition approaches: A systematic review
Chakroun et al. Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments
Chetouani et al. Time-scale feature extractions for emotional speech characterization: applied to human centered interaction analysis
Grewal et al. Isolated word recognition system for English language
Eyben et al. Audiovisual vocal outburst classification in noisy acoustic conditions
Laghari et al. Robust speech emotion recognition for sindhi language based on deep convolutional neural network
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
CN116259312A (en) Method for automatically editing task by aiming at voice and neural network model training method
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21891211

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21891211

Country of ref document: EP

Kind code of ref document: A1