WO2022100691A1 - Audio recognition method and device - Google Patents

Audio recognition method and device

Info

Publication number
WO2022100691A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
sub
probability
duration
human voice
Prior art date
Application number
PCT/CN2021/130304
Other languages
French (fr)
Chinese (zh)
Inventor
贾杨
夏龙
吴凡
郭常圳
Original Assignee
北京猿力未来科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京猿力未来科技有限公司
Publication of WO2022100691A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • The present application relates to the technical field of data processing, and in particular to an audio recognition method and device.
  • Sound is produced by the vibration of a sound source and is a form of energy. The energy of the audio can therefore be analyzed, and the user's effective speaking (mouth-opening) duration is what remains after the "silent" portions are removed.
  • This method works well in a relatively quiet environment, but its performance degrades when there is strong surrounding noise and reverberation, because noise itself also carries energy.
  • Speaking-duration statistics based on speech recognition can recover the text contained in the audio and the corresponding time points, from which the user's speaking duration is computed. Ideally this solution should give the best results, but it often has the following shortcomings: 1) speech recognition models generally have high computational complexity and are not suitable for high-concurrency online environments; 2) the quality of the statistics depends on the accuracy of the speech recognition model; different recognition models must be trained for different languages (including mixed languages, dialects, and so on), and improving model performance requires large amounts of labeled data, so the scheme generalizes poorly and its up-front labeling cost is high.
  • Vggish is a technique proposed by Google for audio classification. It uses the VGG network structure from the image recognition field, proposed by Oxford's Visual Geometry Group at ILSVRC 2014, and is trained on the Youtube-100M video dataset. Vggish classifies audio according to labels derived from online-video titles, subtitles, comments, and so on, for example songs, music, sports, and speeches. Under this scheme the quality of the categories depends on manual review, otherwise the category labels contain many errors; on the other hand, if the voice-related categories of this scheme are grouped into a human-voice class and everything else into a non-voice class, the resulting model performs poorly.
  • The present application provides an audio recognition method, including: obtaining original audio; adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio to obtain extended audio; taking a third duration equal to the sum of the first and second durations as a segmentation window and, starting from the head of the extended audio, sliding the window with a first step size to obtain multiple sub-audios; computing the time-frequency feature sequence of each sub-audio; obtaining, by a neural network and from the time-frequency feature sequence, the probability that each sub-audio belongs to a specific category; and comparing each probability with a decision threshold to decide whether the sub-audio belongs to that category.
  • The time-frequency feature sequence of a sub-audio is a Mel-frequency cepstral coefficient (MFCC) feature sequence; the neural network obtains the human-voice probability of the sub-audio from the MFCC feature sequence, and each human-voice probability is compared with the decision threshold to decide whether the sub-audio is human voice.
  • After the human-voice probabilities of the sub-audios are obtained, the method further includes: obtaining the array of human-voice probabilities of all sub-audios of the original audio, and filtering the probability values in the array with a window of a first number of values to obtain filtered human-voice probabilities. Median filtering is used to filter the human-voice probability array.
  • The method further includes obtaining the audio energy value of the original audio at the determined time point of each sub-audio and setting a human-voice probability adjustment factor according to that energy value: if the energy value is greater than an upper energy limit, the adjustment factor of the sub-audio is set to 1; if the energy value is less than a lower energy limit, the adjustment factor is set to 0; otherwise the adjustment factor is normalized to a value between 0 and 1 according to the energy value. The adjustment factor of each sub-audio is multiplied by its human-voice probability to obtain the corrected human-voice probability, and the corrected probabilities are compared with the decision threshold to decide whether each sub-audio is human voice.
  • The method further includes: acquiring the sub-audios that are consecutively judged to be human voice in the original audio, acquiring the audio segment formed by the determined time points of those sub-audios, and outputting the audio segment.
  • The first duration is equal to the second duration; the audio energy value at the determined time point of a sub-audio is specifically the energy value at the sub-audio's center time point; and acquiring the audio segment formed by the determined time points of the consecutively voiced sub-audios is specifically acquiring the segment formed by the center time points of the sub-audios consecutively judged to be human voice.
  • Before the audio is output, the method further includes: if the time interval between adjacent audio segments is less than a third threshold, also acquiring the audio between the adjacent segments.
  • The above method further includes counting the duration of the output audio.
  • The application provides an audio recognition device, including: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
  • In the present application, null data is added before and after the original audio, and the original audio is windowed with a first step size using a segmentation window equal to twice the duration of the null data, producing multiple sub-audios. Based on this fine-grained segmentation of the original audio, the MFCC is computed for each sub-audio to obtain a time-frequency two-dimensional map of the sound signal.
  • In this way the method can produce category probability values even at the beginning and end of the original audio data; the probability of each sub-audio is taken as an approximation of the probability at its center time point, yielding a probability array at the granularity of original-audio time points, so that speaking segments can be detected more accurately.
  • The present application further distinguishes human voice from non-human voice with a deep learning algorithm and filters the probability values produced by the neural network, effectively removing noise points in the recognition result; such noise points arise from the fine-grained segmentation of the previous step, from model limitations, or/and from noise in the original audio. This reasonable sliding window and smoothing mechanism therefore makes the audio recognition result smoother.
  • The fine-grained sub-audio segmentation and large sub-audio overlap of the present application cause a small portion of non-voice probabilities to be pulled toward the human-voice side by their neighboring points. Therefore, exploiting the fact that the energy of noise or silence is weaker than that of human voice, the present application further corrects the network's probability values using the energy of the original audio, further improving recognition accuracy.
  • When the final audio human-voice probability array is segmented into speaking segments for statistics, a certain tolerance is applied to preserve the continuity of the speech segments. Such segmentation provides both good accuracy and corpus material of higher content quality.
  • On the premise of ensuring user interactivity and interaction efficiency, the method of the present application saves statistics time and improves feedback efficiency, adds the ability to count speaking time points and improves its accuracy, increases generalization, and counts the user's speaking duration more precisely.
  • FIG. 1 is a schematic flowchart of an audio recognition method according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the original-audio segmentation preprocessing according to an embodiment of the present application.
  • FIG. 3a is a schematic diagram of an existing neural network structure.
  • FIG. 3b is a schematic structural diagram of the neural network according to an embodiment of the present application.
  • FIG. 4 is a distribution diagram of audio human-voice probabilities before the moving average according to an embodiment of the present application.
  • FIG. 5 is a distribution diagram of audio human-voice probabilities after the moving average according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the tolerant merging process.
  • Although the terms "first", "second", "third", and so on may be used in this application to describe various information, such information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another.
  • For example, without departing from the scope of the present application, first information may also be referred to as second information and, similarly, second information may also be referred to as first information.
  • A feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "plurality" means two or more unless expressly and specifically defined otherwise.
  • The present application provides an audio recognition method for recognizing, within a piece of original audio, the audio belonging to a certain audio category. For example, the human-voice portions can be accurately identified from a piece of original audio, so that the duration of human voice can be counted or/and the human-voice audio in the original audio can be further output.
  • The following embodiments take the application of human-voice recognition in an online education scenario as an example.
  • The idea of the present invention is to obtain sub-audios by fine-grained windowing of the original audio, to take the probability of each sub-audio as an approximation of the probability at its center time point, to compute the MFCC of each sub-audio to obtain a time-frequency two-dimensional representation of the sound signal, to distinguish human voice from non-human voice with a deep learning algorithm, and to count the effective speaking duration with a well-designed sliding window and smoothing mechanism.
  • A specific embodiment of the present invention will be described with reference to FIG. 1.
  • Step 11: obtain the original audio.
  • The original audio may contain both the desired human voice and other non-human-voice audio such as background sounds and noise.
  • Step 12: add null data before the head and after the tail of the original audio, respectively, to obtain the extended audio.
  • In one embodiment, the original audio is subdivided into smaller sub-audios: a segment of empty audio is added to each end of the original audio to obtain the extended audio, the extended audio is split into sub-audios according to a segmentation window value, and the ratio of the empty-audio duration to the segmentation window value is kept at 1:2.
  • FIG. 2 shows a preferred implementation of the audio extension provided by an embodiment of the present invention.
  • In this embodiment, to achieve accurate statistics of speaking time points, the sub-audios need a finer segmentation granularity. As shown in the figure, a is the original audio array. First, null data of equal duration, namely 480 milliseconds (ms) of zeros, is added to the head and the tail of the original audio a to obtain the extended audio b.
  • The number of zeros within the 480 ms is determined by the audio sampling frequency, i.e. the data rate within the 480 ms equals the sampling frequency.
  • The 480 ms duration of the null data added before the head and after the tail of the original audio is only exemplary; the present invention does not exclude other values of this duration.
  • Step 13: take twice the above duration as the segmentation window and obtain multiple sub-audios in sequence from the head of the extended audio with a first step size.
  • In this embodiment the segmentation window is 960 ms, i.e. twice 480 ms, and the segmentation step size is 10 ms, so the minimum segmentation granularity of the sub-audios is 10 ms.
  • Following this segmentation, several sub-audios are obtained; adjacent sub-audios are offset by 10 ms and each sub-audio lasts 960 ms.
  • Assuming the start time and end time of a sub-audio are t_i and t_i + 0.96 s in the original audio, the human-voice probability computed for that sub-audio's feature map in the subsequent steps is taken as the human-voice probability of the audio at time t_i + 0.48 s. The probability computed from the first sub-audio therefore serves as the human-voice probability at the start of the original audio, and the probability computed from the last sub-audio serves as the human-voice probability at its end.
  • Because null data is added before the head and after the tail of the original audio and is combined with half a segmentation window of real data, the present invention approximates the human-voice probability at each time point in this way, allowing more accurate detection of speaking segments.
  • In this embodiment the first step size, i.e. the segmentation granularity, is 10 ms; the present invention does not restrict the choice of other granularities.
  • Step 14: compute the time-frequency feature sequence of each sub-audio.
  • For human-voice recognition, the vibration of the human vocal cords and the opening and closing of the oral cavity follow general laws that are directly reflected in the frequency-domain characteristics of the sound. The embodiment of the present invention uses Mel-frequency cepstral coefficients (MFCC), spectral coefficients obtained by a linear transform of the logarithmic energy spectrum on the nonlinear mel scale of sound frequency; they match the human ear's perception of sound frequency and characterize the frequency-domain properties of the sound.
  • For each sub-audio obtained by segmentation, the short-time Fourier transform is computed with a preset window length and step size to obtain the MFCC feature sequence. The embodiment of the present invention uses a window length of 25 ms and a step size of 10 ms.
  • Step 15: the neural network obtains, from the time-frequency feature sequence, the probability that each sub-audio belongs to a specific category.
  • The MFCC feature sequences are fed into the trained neural network model in time order, and the model predicts the probability corresponding to each audio segment; this probability ranges from 0 to 1.
  • The trained neural network model uses 3x3 convolution kernels and pooling layers to keep the model parameters small. Training of the neural network comprises two stages: pre-training and fine-tuning. The left diagram shows the 500-class classification model, which is first trained on a sound dataset. The right diagram shows the binary classification model, which reuses the lower network structure and parameters of the 500-class model and is converged with the back-propagation algorithm. This binary model identifies whether an audio segment contains human voice and outputs the probability that the current audio segment contains human voice.
  • By introducing pre-training and fine-tuning, the network trained in the present invention focuses more closely on the human-voice versus non-human-voice classification scenario, improving model performance.
  • The trained convolutional neural network (CNN) of the embodiment of the present invention consists of input and output layers and multiple hidden layers, the hidden layers being composed mainly of a series of convolutional layers, pooling layers, and fully connected layers.
  • A convolutional layer generally defines a convolution kernel whose size represents the receptive field of that layer of the network. Different kernels are slid over the input feature map and dot products are taken with it, projecting the information inside the receptive field onto an element of the next layer and thereby enriching the information. In general the kernel is much smaller than the input feature map and is applied to it in an overlapping or parallel fashion.
  • A pooling layer is in effect a nonlinear form of downsampling; there are several different nonlinear pooling functions, such as max pooling and mean pooling. In a CNN, pooling layers are usually inserted periodically between the convolutional layers.
  • The fully connected layers fuse the high-level feature information abstracted by the convolutional and pooling layers and finally produce the classification result.
  • Step 16: compare each probability with the decision threshold to decide whether the sub-audio belongs to the specific category.
  • A decision threshold is set as the basis for judging human voice: if the probability is greater than the threshold the sub-audio is judged to be human voice, and if it is less than the threshold it is judged to be non-human voice.
  • The original audio a is thus divided into voiced and unvoiced segments, which gives the duration of human voice in the original audio, i.e. the user's speaking-duration information. The original audio can also be segmented according to each speaking segment, making it easy to subsequently output the voiced audio, for example to evaluate learning status.
  • In another embodiment, the obtained probability values can also be preprocessed by the following methods in order to optimize them.
  • The human-voice probability array of the original audio obtained by the method described above contains noise points. This can be seen in the 200-millisecond human-voice probability distribution shown in FIG. 4, where the ordinate is the probability that an audio point is human voice, the abscissa is time, and each point represents 10 ms: the 0-1 probability values along the time axis show many abrupt changes, i.e. glitches. Sliding-average preprocessing is therefore applied to the current probabilities to make the distribution smoother, giving the 200-millisecond human-voice probability distribution shown in FIG. 5.
  • The sliding-average preprocessing uses median sliding filtering. Let P = {p_1, p_2, p_3, ..., p_i, ..., p_n}, where n is the total number of sub-audios obtained by segmenting the original audio and p_i is the probability that the i-th sub-audio is human voice, and let w_smooth be the selected window size; the probability that the i-th sub-audio is human voice after median filtering is p'_i = median(p_{i-(w_smooth-1)/2}, ..., p_{i+(w_smooth-1)/2}).
  • In this embodiment the window is 31, i.e. 31 values in the sub-audio human-voice probability array: the median of the 31 adjacent probability values is taken as the probability of the middle point, and the probability of every point is recalculated in this way with a step size of 1. (A code sketch of this smoothing, and of the energy correction and tolerant merging described below, follows this definitions list.)
  • The median filtering above is one implementation of the present invention; the present invention does not exclude other filtering methods.
  • A decision algorithm can then be used to distinguish human voice from non-human voice, determine the speaking segments, and count the user's speaking duration. However, during filtering the probabilities of a small portion of non-voice audio are modified by the surrounding points. The embodiment of the present invention therefore exploits the fact that the energy of noise or silence is weaker than that of human voice and uses the energy of the original audio to further correct the human-voice probabilities, improving accuracy.
  • Let the moving-averaged audio human-voice probability array be P' = {p'_1, p'_2, ..., p'_n}. Because the original audio is sliced with a 10 ms step size to obtain the sub-audios, the human-voice probabilities are spaced 10 ms apart; the energy array Power of the original audio is therefore also computed with a 10 ms step, so that the time points of the energy array correspond to the time points of the human-voice probability array.
  • The values of the Power array are normalized to between 0 and 1, and an upper energy limit P_up and a lower energy limit P_down are set. The adjustment factor w_i is then obtained as follows: if the audio energy at a time point is greater than P_up, w_i takes the value 1; if it is less than P_down, w_i takes the value 0; otherwise w_i is normalized to a value between 0 and 1 according to the energy value (e.g. linearly between P_down and P_up). (A sketch follows this list.)
  • Each probability adjustment factor, lying between 0 and 1, adjusts the human-voice probability at the corresponding time point, finally giving the energy-corrected audio human-voice probability array P_T.
  • In the above embodiments the obtained probabilities are first given sliding-average preprocessing, then energy-correction preprocessing, and finally a decision algorithm distinguishes human voice from non-human voice, determines the speaking segments, and counts the user's speaking duration. Alternatively, energy-correction preprocessing can be performed first and sliding-average preprocessing afterwards.
  • The present invention may also adopt only one of the two preprocessing methods above to improve the accuracy of human-voice recognition.
  • The human-voice audio obtained by the above method may additionally be given tolerant merging, as follows. The judgment of whether each time point of the original audio is human voice is obtained: if the probability at time point i reaches the decision threshold, the audio at that time point of the original audio is human voice; otherwise it is non-human voice. (A sketch follows this list.)
  • The original audio a is thereby divided into voiced and unvoiced segments. If the interval between two sub-audios judged to be human voice, i.e. the number of intervening sub-audios judged to be non-human voice, is less than a third threshold, the audio between the center time points of the two voiced sub-audios is also acquired.
  • In this embodiment the third threshold is 500 milliseconds, which is only exemplary and is not limited by the present invention.
  • The user's speaking-duration information obtained by accumulating the durations of all segments after tolerant merging is more reasonable than that obtained without it, and the resulting voiced audio preserves the continuity of the speech segments.
  • In the embodiment above, step 12 adds null data of equal duration before the head and after the tail of the original audio, for example 480 milliseconds each, and step 13 divides the original audio with a window of twice 480 milliseconds, i.e. 960 milliseconds, to obtain the multiple sub-audios.
  • In another embodiment, the durations of the null data added before the head and after the tail of the original audio need not be equal: null data of a first duration is added before the head and null data of a second duration after the tail, and a third duration equal to the sum of the first and second durations is used as the segmentation window to split the original audio into sub-audios.
  • For example, if the first duration is 240 milliseconds and the second duration is 720 milliseconds, the segmentation window is their sum, 960 milliseconds; the duration of each sub-audio obtained in this way is the same as in the embodiment above, still 960 ms.
  • In this case the computed human-voice probability of a sub-audio is taken as the human-voice probability at the point one quarter of the way through the sub-audio, i.e. at time t_i + 0.24 s.
  • Correspondingly, the audio segment composed of the quarter-point time points of the sub-audios consecutively judged to be human voice is acquired. Since the sub-audios are obtained by segmenting the original audio with the first step size, adjacent quarter-point time points are separated by the first step size, for example the 10 ms used in the embodiment above.
  • The resulting array of sub-audio human-voice probabilities may be filtered using the same methods described above.
  • The present application further provides an audio recognition device, including: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
  • The specific manner in which each module performs its operations has been described in detail in the method embodiments and will not be repeated here.
  • Each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s).
  • The functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
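
The sliding median smoothing referred to above (a window of 31 probability values, step 1, one value per 10 ms time point) can be sketched as follows. This is a minimal NumPy illustration rather than code from the patent; the edge handling (padding with the border values) is an assumption.

```python
import numpy as np

def median_smooth(probs: np.ndarray, w_smooth: int = 31) -> np.ndarray:
    """Replace each voice probability with the median of the w_smooth values centred on it."""
    half = w_smooth // 2
    padded = np.pad(probs, half, mode="edge")   # edge padding at the array ends is an assumption
    return np.array([np.median(padded[i:i + w_smooth]) for i in range(len(probs))])
```

Applied to the per-10 ms probability array, this removes the isolated glitches of the kind visible in FIG. 4 and yields the smoother distribution of FIG. 5.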
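The energy-based correction referred to above can be sketched as below: a per-10 ms energy array is computed from the original audio, normalized to between 0 and 1, and mapped to an adjustment factor w_i that is 1 above the upper limit P_up, 0 below the lower limit P_down, and scaled in between; the factor multiplies the smoothed probability at the same time point. The sampling rate, the limit values, and the linear mapping between the limits are illustrative assumptions.

```python
import numpy as np

def energy_adjustment(audio: np.ndarray, sr: int = 16000, hop_ms: int = 10,
                      p_up: float = 0.5, p_down: float = 0.05) -> np.ndarray:
    """Per-10 ms adjustment factor w_i in [0, 1] derived from the original audio's energy."""
    hop = int(sr * hop_ms / 1000)
    frames = len(audio) // hop
    power = np.array([np.mean(audio[i * hop:(i + 1) * hop] ** 2) for i in range(frames)])
    power = power / (power.max() + 1e-12)          # normalise the Power array to [0, 1]
    w = (power - p_down) / (p_up - p_down)         # linear between the two limits (assumed mapping)
    return np.clip(w, 0.0, 1.0)                    # 1 above p_up, 0 below p_down

def apply_energy_correction(smoothed_probs: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Multiply each probability by the adjustment factor at the corresponding time point."""
    m = min(len(smoothed_probs), len(w))           # the two arrays may differ by one frame
    return smoothed_probs[:m] * w[:m]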
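Thresholding the corrected probabilities gives a voice or non-voice label per 10 ms point; the tolerant merging referred to above then bridges gaps shorter than the third threshold (500 ms in the embodiment) between adjacent voiced segments before the speaking duration is accumulated. The sketch below assumes a decision threshold of 0.5, which the patent does not fix.

```python
import numpy as np

def voiced_segments(labels: np.ndarray, hop_ms: int = 10, merge_gap_ms: int = 500):
    """Group consecutive voiced points into (start_s, end_s) segments, merging short gaps."""
    segments, start = [], None
    for i, voiced in enumerate(labels):
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    merged = []
    for s, e in segments:
        if merged and (s - merged[-1][1]) * hop_ms < merge_gap_ms:
            merged[-1] = (merged[-1][0], e)        # tolerant merge across a short gap
        else:
            merged.append((s, e))
    return [(s * hop_ms / 1000.0, e * hop_ms / 1000.0) for s, e in merged]

# Example use: labels = corrected_probs > 0.5
# segments = voiced_segments(labels); speaking_duration_s = sum(e - s for s, e in segments)
```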

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An audio recognition method and device. The audio recognition method comprises: obtaining original audio (11); adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio to obtain extended audio; taking a third duration that is the sum of the first duration and the second duration as a segmentation window and, starting from the head of the extended audio, sliding the window with a first step size to obtain a plurality of sub-audios; calculating time-frequency feature sequences of the sub-audios respectively (14); obtaining, by a neural network and according to the time-frequency feature sequences, the probabilities that the sub-audios belong to a specific category; and comparing the probabilities with a decision threshold respectively to determine whether the sub-audios belong to the specific category (16).

Description

Audio recognition method and device

Cross-Reference to Related Applications

This application claims priority to Chinese patent application No. 202011260050.5, entitled "An audio recognition method and device", filed with the China Patent Office on November 12, 2020, the entire contents of which are incorporated herein by reference.
Technical Field

The present application relates to the technical field of data processing, and in particular to an audio recognition method and device.
Background

With the development of Internet technology, online education and similar industries have flourished and the number of online learners has grown sharply. Teachers evaluate students' classroom participation through subjective impressions and statistics of how long each student speaks, and give feedback to improve teaching effectiveness.

At present, the prior art offers several schemes for counting a user's speaking duration.

Switch-based speaking-duration statistics: a recording button that can be turned on and off is placed in the user client, and the user must press the button before speaking. The aim of this scheme is simple and direct duration counting, but when the audience is a group of young children, who tend to be poorly compliant and unstructured in their behavior, pressing a button before speaking is inefficient, and keeping the microphone open throughout is better suited to this scenario. Moreover, the user must operate the control manually before any information can be conveyed, which reduces interactivity to some extent, and the validity of the statistics depends entirely on the user's self-discipline.

Speaking-duration statistics based on audio energy analysis: sound is produced by the vibration of a sound source and is a form of energy, so the energy of the audio can be analyzed and the user's effective speaking duration is what remains after the "silent" portions are removed. This method works well in a relatively quiet environment, but its performance degrades when there is strong surrounding noise and reverberation, because noise itself also carries energy.

Speaking-duration statistics based on speech recognition: speech recognition algorithms such as Gaussian mixture models with hidden Markov models, neural networks with hidden Markov models, and Connectionist Temporal Classification (CTC) can recover the text contained in the audio together with the corresponding time points, from which the user's speaking duration can be counted. Ideally this scheme should give the best results, but it often has the following shortcomings: 1) speech recognition models generally have high computational complexity and are not suitable for high-concurrency online use; 2) the quality of the statistics depends on the accuracy of the speech recognition model; different recognition models must be trained for different languages (including mixed languages, dialects, etc.), and improving model performance requires large amounts of labeled data, so the scheme generalizes poorly and its up-front labeling cost is high.

Vggish is a technique proposed by Google for audio classification. It uses the VGG network structure from the image recognition field, proposed by Oxford's Visual Geometry Group at ILSVRC 2014, and is trained on the Youtube-100M video dataset. Vggish classifies audio according to labels derived from online-video titles, subtitles, comments, and so on, such as songs, music, sports, and speeches. Under this scheme the quality of the categories depends on manual review, otherwise the category labels contain many errors; on the other hand, if the voice-related categories are grouped into a human-voice class and the rest into a non-voice class, the resulting model performs poorly.
Summary of the Invention

The present application provides an audio recognition method, including: obtaining original audio; adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio to obtain extended audio; taking a third duration equal to the sum of the first and second durations as a segmentation window and, starting from the head of the extended audio, sliding the window with a first step size to obtain multiple sub-audios; computing the time-frequency feature sequence of each sub-audio; obtaining, by a neural network and from the time-frequency feature sequence, the probability that each sub-audio belongs to a specific category; and comparing each probability with a decision threshold to decide whether the sub-audio belongs to the specific category.

The time-frequency feature sequence of a sub-audio is a Mel-frequency cepstral coefficient (MFCC) feature sequence; the neural network obtains the human-voice probability of each sub-audio from the MFCC feature sequence, and each human-voice probability is compared with the decision threshold to decide whether the sub-audio is human voice.

In the above method, after the human-voice probabilities of the sub-audios are obtained, the method further includes: obtaining the array of human-voice probabilities of all sub-audios of the original audio, and filtering the probability values in the array with a window of a first number of values to obtain filtered human-voice probabilities. Median filtering is used to filter the human-voice probability array.

In the above method, the audio energy value of the original audio at the determined time point of each sub-audio is obtained, and a human-voice probability adjustment factor is set according to that energy value: if the energy value is greater than an upper energy limit, the adjustment factor of the sub-audio is set to 1; if the energy value is less than a lower energy limit, the adjustment factor is set to 0; otherwise the adjustment factor is normalized to between 0 and 1 according to the energy value. The adjustment factor of each sub-audio is multiplied by its human-voice probability to give the corrected human-voice probability, and the corrected probabilities are compared with the decision threshold to decide whether each sub-audio is human voice.

The method further includes: acquiring the sub-audios consecutively judged to be human voice in the original audio, acquiring the audio segment formed by the determined time points of those sub-audios, and outputting the audio segment.

In the above method, the first duration is equal to the second duration; the audio energy value at the determined time point of a sub-audio is specifically the energy value at the sub-audio's center time point; and acquiring the audio segment formed by the determined time points of the consecutively voiced sub-audios is specifically acquiring the segment formed by their center time points.

Before the audio is output, the method further includes: if the time interval between adjacent audio segments is less than a third threshold, also acquiring the audio between the adjacent segments.

The above method further includes counting the duration of the output audio.

The present application provides an audio recognition device, including:

a processor; and

a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.

In the present application, null data is added before and after the original audio, and the original audio is windowed with a first step size using a segmentation window equal to twice the duration of the null data, producing multiple sub-audios; based on this fine-grained segmentation, the MFCC is computed for each sub-audio to obtain a time-frequency two-dimensional map of the sound signal. The method can therefore produce category probability values even at the beginning and end of the original audio; the probability of each sub-audio is taken as an approximation of the probability at its center time point, yielding a probability array at the granularity of original-audio time points and enabling more accurate detection of speaking segments.

In another aspect, the present application distinguishes human voice from non-human voice with a deep learning algorithm and further filters the probability values produced by the neural network, effectively removing noise points in the recognition result; such noise points arise from the fine-grained segmentation of the previous step, from model limitations, or/and from noise in the original audio. This reasonable sliding window and smoothing mechanism therefore makes the audio recognition result smoother.

The fine-grained sub-audio segmentation and large sub-audio overlap of the present application cause a small portion of non-voice probabilities to be pulled toward the human-voice side by their neighboring points. Therefore, exploiting the fact that the energy of noise or silence is weaker than that of human voice, the present application further corrects the network's probability values using the energy of the original audio, further improving recognition accuracy.

Further, when the final audio human-voice probability array is segmented into speaking segments for statistics, a certain tolerance is applied to preserve the continuity of the speech segments; such segmentation provides both good accuracy and corpus material of higher content quality.

In summary, on the premise of ensuring user interactivity and interaction efficiency, the method of the present application saves statistics time and improves feedback efficiency, adds the ability to count speaking time points and improves its accuracy, increases generalization, and counts the user's speaking duration more precisely.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present application.
Description of the Drawings

The above and other objects, features, and advantages of the present application will become more apparent from the following more detailed description of exemplary embodiments of the present application taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same parts.

FIG. 1 is a schematic flowchart of an audio recognition method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of the original-audio segmentation preprocessing according to an embodiment of the present application;

FIG. 3a is a schematic diagram of an existing neural network structure;

FIG. 3b is a schematic structural diagram of the neural network according to an embodiment of the present application;

FIG. 4 is a distribution diagram of audio human-voice probabilities before the moving average according to an embodiment of the present application;

FIG. 5 is a distribution diagram of audio human-voice probabilities after the moving average according to an embodiment of the present application;

FIG. 6 is a schematic diagram of the tolerant merging process.
Detailed Description

Preferred embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present application, it should be understood that the present application may be embodied in various forms and should not be limited by the embodiments set forth herein; rather, these embodiments are provided so that this application will be thorough and complete and will fully convey its scope to those skilled in the art.

The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "said" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms "first", "second", "third", and so on may be used in this application to describe various information, such information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information and, similarly, second information may also be referred to as first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "plurality" means two or more unless expressly and specifically defined otherwise.
The present application provides an audio recognition method for recognizing, within a piece of original audio, the audio belonging to a certain audio category. For example, the human-voice portions can be accurately identified from a piece of original audio so that the duration of human voice can be counted, or/and the human-voice audio in the original audio can be further output.

The following embodiments take the application of human-voice recognition in an online education scenario as an example.

The field of education has the concept of speaking-duration statistics. In offline daily education, for example, the teacher acts as the executor and evaluator of these statistics, evaluating students' classroom participation through subjective impressions and counting how long each student speaks, and giving feedback to improve teaching effectiveness. With the development of online education, the number of online learners has grown sharply, and speaking-duration statistics remain one of the indicators by which online education evaluates student participation, so a technical solution is needed to count students' speaking durations accurately and effectively.

The idea of the present invention is to obtain sub-audios by fine-grained windowing of the original audio, to take the probability of each sub-audio as an approximation of the probability at its center time point, to compute the MFCC of each sub-audio to obtain a time-frequency two-dimensional representation of the sound signal, to distinguish human voice from non-human voice with a deep learning algorithm, and to count the effective speaking duration with a well-designed sliding window and smoothing mechanism.
A specific embodiment of the present invention is described with reference to FIG. 1.

Step 11: obtain the original audio.

The original audio file is obtained. For example, when a student studies online and answers by voice following the prompts of the learning software, a smart device captures the original audio of the spoken answer through its microphone. The original audio may contain both the desired human voice and other non-human-voice audio such as background sounds and noise.
Step 12: add null data before the head and after the tail of the original audio, respectively, to obtain the extended audio.

In one embodiment, the original audio is subdivided into smaller sub-audios: a segment of empty audio is added to each end of the original audio to obtain the extended audio, the extended audio is split into sub-audios according to a segmentation window value, and the ratio of the empty-audio duration to the segmentation window value is kept at 1:2.

FIG. 2 shows a preferred implementation of the audio extension provided by an embodiment of the present invention.

In this embodiment, to achieve accurate statistics of speaking time points, the sub-audios need a finer segmentation granularity. As shown in the figure, a is the original audio array. First, null data of equal duration, namely 480 milliseconds (ms) of zeros, is added to the head and the tail of the original audio a to obtain the extended audio b. The number of zeros within the 480 ms is determined by the audio sampling frequency, i.e. the data rate within the 480 ms equals the sampling frequency.

The 480 ms duration of the null data added before the head and after the tail of the original audio in this embodiment is only exemplary; the present invention does not exclude other values of this duration.
Step 13: take twice the above duration as the segmentation window and obtain multiple sub-audios in sequence from the head of the extended audio with a first step size.

As shown in FIG. 2, in this embodiment the segmentation window used when splitting the original audio into sub-audios is 960 ms, i.e. twice 480 ms, and the segmentation step size is 10 ms, so the minimum segmentation granularity of the sub-audios is 10 ms. Following this segmentation, several sub-audios are obtained; adjacent sub-audios are offset by 10 ms and each sub-audio lasts 960 ms.

Assuming the start time and end time of a given sub-audio are t_i and t_i + 0.96 s in the original audio, the embodiment of the present invention takes the human-voice probability computed for that sub-audio's feature map in the subsequent steps as the human-voice probability of the audio at time t_i + 0.48 s. The probability computed from the first sub-audio therefore serves as the human-voice probability at the start of the original audio, and the probability computed from the last sub-audio serves as the human-voice probability at its end.

It can be seen that, because null data is added before the head and after the tail of the original audio and is combined with half a segmentation window of real data, the invention approximates the human-voice probability at each time point in this way and can therefore detect speaking segments more accurately.

In this embodiment of the invention the first step size, i.e. the segmentation granularity, is 10 ms; the present invention does not restrict the choice of other granularities.
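
As a minimal sketch of steps 12 and 13 under the 480 ms / 960 ms / 10 ms values above, the padding and sliding-window segmentation could look as follows. The 16 kHz sampling rate and the function and variable names are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def split_into_sub_audios(audio: np.ndarray, sr: int = 16000,
                          pad_ms: int = 480, win_ms: int = 960, hop_ms: int = 10):
    """Pad the original audio with zeros at both ends, then slide a 960 ms window in 10 ms steps."""
    pad = int(sr * pad_ms / 1000)                  # 480 ms of zeros (7680 samples at 16 kHz)
    win = int(sr * win_ms / 1000)                  # 960 ms segmentation window
    hop = int(sr * hop_ms / 1000)                  # 10 ms step, the segmentation granularity
    extended = np.concatenate([np.zeros(pad), audio, np.zeros(pad)])
    subs, centres = [], []
    for start in range(0, len(extended) - win + 1, hop):
        subs.append(extended[start:start + win])
        # The probability later computed for this sub-audio is attributed to the
        # centre of the window, i.e. t_i + 0.48 s expressed in original-audio time.
        centres.append((start + win // 2 - pad) / sr)
    return np.stack(subs), np.array(centres)
```

In this sketch the first window is centred on time 0 of the original audio and the last window is centred at (approximately) its final sample, which is why probability values are available even at the very beginning and end of the recording.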
Step 14: compute the time-frequency feature sequence of each sub-audio.

For human-voice recognition, the vibration of the human vocal cords and the opening and closing of the oral cavity follow general laws that are directly reflected in the frequency-domain characteristics of the sound. The embodiment of the present invention uses Mel-frequency cepstral coefficients (MFCC), spectral coefficients obtained by a linear transform of the logarithmic energy spectrum on the nonlinear mel scale of sound frequency; they match the human ear's perception of sound frequency and characterize the frequency-domain properties of the sound.

For each sub-audio obtained by segmentation, the short-time Fourier transform is computed with a preset window length and step size to obtain the MFCC feature sequence. The embodiment of the present invention uses a window length of 25 ms and a step size of 10 ms to compute the short-time Fourier transform and obtain the MFCC features.
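
A sketch of this feature extraction using the librosa library is shown below. The 25 ms window and 10 ms hop follow the embodiment, while the sampling rate and the number of coefficients (n_mfcc=40) are illustrative assumptions, since the patent does not specify them.

```python
import numpy as np
import librosa

def sub_audio_mfcc(sub_audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """MFCC time-frequency feature matrix for one 960 ms sub-audio (25 ms window, 10 ms hop)."""
    win_length = int(0.025 * sr)                   # 25 ms analysis window
    hop_length = int(0.010 * sr)                   # 10 ms step
    mfcc = librosa.feature.mfcc(y=sub_audio.astype(np.float32), sr=sr, n_mfcc=40,
                                n_fft=win_length, hop_length=hop_length,
                                win_length=win_length)
    return mfcc.T                                  # shape (frames, n_mfcc): the feature sequence
```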
Step 15: the neural network obtains, from the time-frequency feature sequence, the probability that each sub-audio belongs to a specific category.

The MFCC feature sequences are fed into the trained neural network model, and the probabilities corresponding to the audio segments output by the model are obtained. In this embodiment, the audio segments are fed into the trained neural network model in time order, and the model predicts the probability corresponding to each audio segment; this probability can range from 0 to 1.

For example, as shown in FIG. 3, the trained neural network model uses 3x3 convolution kernels and pooling layers to keep the model parameters small. Training of the neural network comprises two stages: pre-training and fine-tuning. The left diagram shows the 500-class classification model, which is first trained on a sound dataset. The right diagram shows the binary classification model, which reuses the lower network structure and parameters of the 500-class model and is converged with the back-propagation algorithm. This binary model identifies whether an audio segment contains human voice and outputs the probability that the current audio segment contains human voice. By introducing pre-training and fine-tuning, the network trained in the present invention focuses more closely on the human-voice versus non-human-voice classification scenario, improving model performance.
As shown in FIG. 3, when performing human-voice recognition on audio, the trained convolutional neural network (CNN) of this embodiment of the invention consists of an input layer, an output layer and multiple hidden layers, the hidden layers being composed mainly of a series of convolutional layers, pooling layers and fully connected layers.
A convolutional layer generally defines a convolution kernel whose size characterizes the receptive field of that layer. By sliding different convolution kernels over the input feature map and taking the dot product with it, the information within the receptive field is projected onto a single element of the next layer, thereby concentrating the information. In general, the convolution kernel is much smaller than the input feature map and is applied to it in overlapping or parallel positions.
The pooling layer is in effect a nonlinear form of downsampling; there are various nonlinear pooling functions, such as max pooling and mean pooling. Typically, pooling layers are inserted periodically between the convolutional layers of a CNN.
The fully connected layers fuse the high-level feature information abstracted by the convolutional and pooling layers and ultimately produce the classification result.
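The exact architecture of FIG. 3 is not reproduced in the text; the PyTorch sketch below only illustrates the kind of network described — 3×3 convolutions, pooling, and a fully connected head that outputs a single human-voice probability. All layer sizes are illustrative assumptions. In the two-stage scheme, the convolutional trunk would first be trained with a 500-way classification head and then reused under this binary head during fine-tuning.

    import torch
    import torch.nn as nn

    class VoiceCNN(nn.Module):
        """Small CNN over an MFCC feature map of shape (1, n_frames, n_mfcc)
        that outputs the probability that the sub-audio contains a human voice."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, 1), nn.Sigmoid(),   # binary human-voice head
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x)).squeeze(-1)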
Step 16: compare each probability with a decision threshold to decide whether the sub-audio belongs to the specific class.
The decision threshold is set as the basis for deciding whether the audio is human voice: if the probability is greater than the decision threshold, the sub-audio is judged to be human voice; if the probability is less than the decision threshold, it is judged to be non-human voice.
After the above steps, the original audio a has been divided into human-voice and non-human-voice segments. By accumulating the durations of all human-voice segments, the duration of human voice in the original audio, i.e. the user's speaking-duration information, is obtained. Moreover, the original audio can be segmented according to the information of each speaking segment, which facilitates the subsequent output of the human-voice audio, for example for evaluating learning progress.
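As a minimal sketch under the same assumptions as above (10 ms per probability point, an illustrative threshold of 0.5), the speaking duration can be accumulated as follows.

    import numpy as np

    STEP_S = 0.01   # each probability point covers 10 ms of original audio

    def speaking_duration(probs: np.ndarray, threshold: float = 0.5) -> float:
        """Total human-voice duration in seconds, counted from the
        per-point probabilities of the original audio."""
        return float(np.sum(probs > threshold) * STEP_S)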
In yet another embodiment of the invention, after the neural network obtains in step 15 the probability that each sub-audio belongs to a specific class from the temporal feature sequence, the obtained probability values may additionally be preprocessed with the following methods in order to optimize them.
1) Moving-average preprocessing of the probabilities obtained so far.
Owing to the segmentation granularity and to noise, the human-voice probability array of the original audio obtained by the method described above contains noisy points. This is visible in the 200 ms human-voice probability plot of FIG. 4, in which the ordinate is the probability that the audio point is human voice, the abscissa is time, and each point represents 10 ms. The probability values between 0 and 1 along the time axis show many abrupt jumps, i.e. glitches. Therefore, moving-average preprocessing is applied to the probabilities obtained so far so that the probability distribution becomes smoother, yielding the 200 ms human-voice probability plot shown in FIG. 5.
The moving-average preprocessing uses a sliding window filter (referred to in this embodiment as median sliding filtering); as described below, the filtered probability that the i-th sub-audio is human voice is the average of the probabilities inside the window:
p_i^f = (p_Lo + p_(Lo+1) + ... + p_Hi) / (Hi − Lo + 1)
Here the human-voice probability array of all sub-audios in the original audio is
P = {p_1, p_2, p_3, ..., p_i, ..., p_n}, where n is the total number of sub-audios obtained by segmenting the original audio and p_i is the probability that the i-th sub-audio is human voice.
w_smooth is the selected window size. In this embodiment, for example, the window is chosen as 31, i.e. the window covers 31 values of the sub-audio human-voice probability array.
For p_i, the lower and upper index bounds of the moving average are determined.
The lower index bound is Lo = max(0, i − 15), where 0 denotes the first probability value in the array;
the upper index bound is Hi = min(n, i + 15), where n denotes the last probability value in the array.
In this embodiment, the filtering takes the average of the probability values of the 31 neighbouring points as the probability value of the middle point; following this method, the probability value of every point is recomputed with a step size of 1.
Comparing FIG. 4 and FIG. 5, it can be seen that after the moving average the glitches in the sub-audio human-voice probability curve are effectively corrected, which improves the accuracy of speaking-segment segmentation to a certain extent.
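A sketch of this window-31 smoothing, with the window clipped at the ends of the array, might look as follows (the function name is illustrative).

    import numpy as np

    def smooth_probs(p: np.ndarray, w_smooth: int = 31) -> np.ndarray:
        """Replace each probability by the mean of the values inside a
        w_smooth-point window centred on it, clipped at the array ends."""
        half = w_smooth // 2                     # 15 for a 31-point window
        out = np.empty(len(p), dtype=float)
        for i in range(len(p)):
            lo = max(0, i - half)
            hi = min(len(p), i + half + 1)       # exclusive upper bound
            out[i] = p[lo:hi].mean()
        return out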
The above filtering is one implementation of the invention; the invention does not exclude the use of other filtering methods.
After the filtering preprocessing, a decision algorithm is used to distinguish human voice from non-human voice, the human-voice speaking segments are determined, and the user's speaking duration is counted.
2) Energy-correction preprocessing.
After the moving-average preprocessing, because this embodiment of the invention uses fine-grained sub-audio segmentation and the sub-audios overlap substantially, the probabilities of a small number of non-human-voice points are pushed towards human voice by the surrounding points during filtering, i.e. their human-voice probability increases even though they are in fact non-human voice.
To address this, this embodiment of the invention exploits the fact that the energy of noise or silence is weaker than that of human voice, and uses the energy of the original audio to further correct the human-voice probabilities so as to improve accuracy.
The moving-averaged audio human-voice probability array is
P^f = {p_1^f, p_2^f, p_3^f, ..., p_i^f, ..., p_n^f}.
With a window size of 10 ms and a step size of 10 ms, the energy array of the original audio is computed:
Power = {w_1, w_2, w_3, ..., w_i, ..., w_n}.
Because, in the embodiment described above, the original audio is sliced into sub-audios with a step size of 10 ms, giving human-voice probabilities at 10 ms intervals, the energy array of the original audio is likewise computed with a step size of 10 ms here, so that the instants of the energy array of the original audio correspond to the instants of the human-voice probability array of the original audio.
The values of the Power array are normalized to between 0 and 1. With an energy upper limit P_up and an energy lower limit P_down, w_i can be normalized as follows:
w_i^f = 1, if w_i > P_up;
w_i^f = 0, if w_i < P_down;
w_i^f = (w_i − P_down) / (P_up − P_down), otherwise.
As can be seen from the above formula, when the audio energy at an instant is greater than the energy upper limit P_up, w_i^f takes the value 1, and when the audio energy at an instant is less than the energy lower limit P_down, w_i^f takes the value 0, giving the normalized energy array
W^f = {w_1^f, w_2^f, w_3^f, ..., w_i^f, ..., w_n^f}.
The array P^f and the array W^f are multiplied value by value (a dot product of corresponding elements), giving the energy-corrected audio human-voice probability array P_T, i.e. p_i^T = p_i^f · w_i^f. After this operation, when the audio energy at an instant is greater than the energy upper limit P_up, the human-voice probability value at that instant is unchanged; if the audio energy at an instant is less than the energy lower limit P_down, the human-voice probability value at that instant becomes 0.
In this embodiment, if the audio energy lies between the energy lower limit and the energy upper limit (the limit values included), the resulting probability adjustment factor lies between 0 and 1; this adjustment factor scales the human-voice probability value at the corresponding instant, finally giving the energy-corrected audio human-voice probability array P_T.
It can be seen from the above that, by using the energy array of the original audio, if the audio energy at an instant is below the energy lower limit, the audio at that instant is considered non-human voice and its human-voice probability is set to zero; in this way, further non-human-voice portions of the audio are removed.
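Under the same assumptions as the earlier sketches (10 ms frames aligned with the probability points; P_up, P_down and all names are illustrative), the energy correction could be written as follows.

    import numpy as np

    def frame_energies(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
        """Mean squared amplitude over consecutive 10 ms frames of the original audio."""
        hop = int(0.01 * sr)
        n = len(audio) // hop
        frames = audio[: n * hop].reshape(n, hop)
        return (frames ** 2).mean(axis=1)

    def energy_correct(p_smooth: np.ndarray, energy: np.ndarray,
                       p_down: float, p_up: float) -> np.ndarray:
        """Scale each human-voice probability by a factor derived from the
        frame energy: 0 below p_down, 1 above p_up, linear in between."""
        w = np.clip((energy - p_down) / (p_up - p_down), 0.0, 1.0)
        m = min(len(p_smooth), len(w))       # align the two arrays
        return p_smooth[:m] * w[:m]          # element-wise product of P^f and W^f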
In the above embodiment, the obtained probabilities first undergo moving-average preprocessing and then energy-correction preprocessing, and finally a decision algorithm distinguishes human voice from non-human voice, determines the human-voice speaking segments and counts the user's speaking duration. The two kinds of preprocessing applied to the obtained probabilities, energy correction and moving average, have no required order; the energy-correction preprocessing may also be performed first, followed by the moving-average preprocessing.
The invention may also use only one of the two preprocessing methods above to improve the accuracy of human-voice recognition.
As a further optimization of the above embodiments, before the human-voice duration is counted or the human-voice audio is output, tolerant-merging processing may also be applied to the human-voice audio obtained by the above method.
Specifically, considering the continuity of human speech, and especially the online-learning scenario of children and teenagers, there are often brief pauses between the words of a sentence that expresses a complete meaning, typically for breathing or to convey an emotion. In this embodiment, when the speaking segments are segmented and counted from the final audio human-voice probability array, the recognition result obtained by the steps described above is not followed strictly; instead, a certain tolerance is allowed in order to preserve the continuity of the speech segments. Segmenting in this way provides good accuracy and also supplies teachers with evaluation material of higher content quality, making it easier for them to assess students' learning.
The tolerant-merging method is as follows.
A decision threshold is set as the basis for deciding whether audio is human voice.
On the basis of the above embodiment, the final probability array P_T is combined with the decision threshold to obtain, for each time node of the original audio, a decision on whether it is human voice: if p_i^T is greater than the decision threshold, the audio at instant i of the original audio is human voice; otherwise, it corresponds to non-human voice.
Through the above steps, the original audio a has been divided into human-voice and non-human-voice segments. If the number of sub-audios judged non-human voice lying between two sub-audios judged human voice is smaller than a third threshold, the audio between the centre instants of those two human-voice sub-audios is additionally acquired.
Specifically, as shown in FIG. 6, if the original audio contains two human-voice segments a_i and a_{i+1} whose start and end time nodes are (t_i^start, t_i^end) and (t_{i+1}^start, t_{i+1}^end) respectively, and the gap between them is below the third threshold, the two segments are merged into one. In this embodiment the third threshold is taken as 500 milliseconds; this value is merely exemplary and the invention is not limited to it.
After the tolerant-merging processing, the user's speaking-duration information obtained by accumulating the durations of all segments is more reasonable than the speaking-duration information obtained without tolerant merging, and the resulting human-voice audio preserves the continuity of the speech segments.
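A sketch of this tolerant merge over the per-point decisions might look as follows; the gap is expressed in 10 ms points, so the 500 ms third threshold of this embodiment corresponds to 50 points (all names are illustrative).

    def voice_segments(is_voice, max_gap: int = 50):
        """Turn a per-point human-voice decision sequence (10 ms per point)
        into (start, end) index pairs, merging segments separated by a
        gap shorter than max_gap points."""
        segments, start = [], None
        for i, v in enumerate(is_voice):
            if v and start is None:
                start = i
            elif not v and start is not None:
                segments.append([start, i])
                start = None
        if start is not None:
            segments.append([start, len(is_voice)])

        merged = []
        for seg in segments:
            if merged and seg[0] - merged[-1][1] < max_gap:
                merged[-1][1] = seg[1]        # bridge the short pause
            else:
                merged.append(seg)
        return [tuple(s) for s in merged]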
In the embodiment described above, step 12 specifically adds empty data of equal duration before the head and after the tail of the original audio, for example 480 milliseconds each, and step 13 segments the original audio into multiple sub-audios using a window of twice 480 milliseconds, i.e. 960 milliseconds.
In other embodiments of the invention, the durations of the empty data added before the head and after the tail of the original audio may be unequal. That is, empty data of a first duration is added before the head of the original audio and empty data of a second duration is added after the tail of the original audio, and the original audio is segmented into sub-audios with a segmentation window of a third duration equal to the sum of the first duration and the second duration.
For example, the first duration is 240 milliseconds and the second duration is 720 milliseconds, so the segmentation window is the sum of the first and second durations, i.e. 960 milliseconds. The duration of the sub-audios obtained in this way is the same as in the above embodiment, still 960 ms.
With this segmentation scheme, the computed human-voice probability of a sub-audio is taken approximately as the human-voice probability at the 1/4 point of the sub-audio. Assuming the start and end instants of a given sub-audio correspond to t_i and t_i + 0.96 s in the original audio, the sub-audio's human-voice probability is taken approximately as the human-voice probability at instant t_i + 0.24 s. Further, when the human-voice audio is output, the audio segment composed of the 1/4 instants of the sub-audios consecutively judged to be human voice is obtained. Since the original audio is segmented with the first step size, the 1/4 instants of adjacent sub-audios are spaced by the first step size, for example the 10 ms used in the above embodiment.
The resulting human-voice probability array of the sub-audios can be filtered with the same method described above.
When the audio-energy correction preprocessing is applied to the resulting sub-audio human-voice probability array, the preferred approach is to compute the energy value at the 1/4 point of each sub-audio. For example, assuming the start and end instants of a given sub-audio correspond to t_i and t_i + 0.96 s in the original audio, the energy value at instant t_i + 0.24 s is computed, and the probability correction factor of that sub-audio (t_i, t_i + 0.96 s) is obtained from this energy value.
Corresponding to the foregoing embodiments of the application-function implementation method, the present application further provides an audio recognition device. The device comprises:
a processor; and
a memory on which executable code is stored, and when the executable code is executed by the processor, the processor is caused to perform the method described above. As for the device of the above embodiment, the specific manner in which each module performs its operations has already been described in detail in the embodiments of the method and will not be elaborated here.
Those skilled in the art will also appreciate that the various exemplary logic blocks, modules, circuits and algorithm steps described in connection with this application may be implemented as electronic hardware, computer software, or a combination of the two.
The flowcharts and block diagrams in the figures illustrate the possible architectures, functions and operations of systems and methods according to multiple embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present application have been described above; the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or their improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

  1. An audio recognition method, characterized by comprising:
    obtaining original audio, adding empty data of a first duration before the head of the original audio, and adding empty data of a second duration after the tail of the original audio, to obtain expanded audio;
    using a third duration equal to the sum of the first duration and the second duration as a segmentation window, and, starting from the head of the expanded audio with a first step size, windowing in turn to obtain a plurality of sub-audios;
    separately computing the time-frequency feature sequence of each sub-audio;
    obtaining, by a neural network, the probability that the sub-audio belongs to a specific class according to the time-frequency feature sequence;
    comparing the probabilities respectively with a decision threshold to decide whether the sub-audio belongs to the specific class.
  2. The method according to claim 1, characterized in that
    the time-frequency feature sequence of the sub-audio is a Mel-frequency cepstral coefficient feature sequence;
    and the neural network obtains the human-voice probability of the sub-audio according to the Mel-frequency cepstral coefficient feature sequence;
    and the human-voice probabilities are respectively compared with a decision threshold to decide whether the sub-audio is human voice.
  3. The method according to claim 2, characterized in that, after the human-voice probability of the sub-audio is obtained, the method further comprises:
    obtaining an array of the human-voice probabilities of all sub-audios of the original audio;
    filtering the probability values in the array with a first number as the window, to obtain filtered probabilities.
  4. The method according to claim 3, characterized in that the array of human-voice probabilities is filtered by median filtering.
  5. The method according to claim 2 or 3, characterized in that deciding, according to a preset rule, whether the sub-audio is human voice from the human-voice probability comprises:
    obtaining the audio energy value at a determined time point of the sub-audio in the original audio; and setting a human-voice probability adjustment factor according to the energy value, including:
    if the energy value is greater than an energy upper limit, setting the human-voice probability adjustment factor of the sub-audio to 1;
    if the energy value is less than an energy lower limit, setting the human-voice probability adjustment factor of the sub-audio to 0;
    if the energy value is neither greater than the energy upper limit nor less than the energy lower limit, normalizing the human-voice probability adjustment factor to between 0 and 1 according to the energy value;
    multiplying the human-voice probability adjustment factor of the sub-audio by the human-voice probability of the sub-audio to obtain a corrected sub-audio human-voice probability;
    and comparing the corrected sub-audio human-voice probabilities respectively with a decision threshold to decide whether the sub-audio is human voice.
  6. The method according to claim 5, characterized in that the method further comprises:
    obtaining the sub-audios of the original audio that are consecutively judged to be human voice;
    obtaining the audio segment composed of the determined time points of the sub-audios consecutively judged to be human voice;
    outputting the audio segment.
  7. The method according to claim 6, characterized in that:
    the first duration is equal to the second duration;
    and the audio energy value at the determined time point of the sub-audio is specifically the audio energy value at the centre time point of the sub-audio;
    the obtaining of the audio segment composed of the determined time points of the sub-audios consecutively judged to be human voice is specifically obtaining the audio segment composed of the centre time points of the sub-audios consecutively judged to be human voice.
  8. The method according to claim 6 or 7, characterized in that, before the outputting of the audio, the method further comprises:
    if the time interval between adjacent said audio segments is smaller than a third threshold, obtaining the audio segment between the adjacent audio segments.
  9. The method according to claim 8, characterized by further comprising:
    counting the duration of the output audio.
  10. A device for audio recognition, characterized by comprising:
    a processor; and
    a memory on which executable code is stored, and when the executable code is executed by the processor, the processor is caused to perform the method according to any one of claims 1-9.
PCT/CN2021/130304 2020-11-12 2021-11-12 Audio recognition method and device WO2022100691A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011260050.5A CN112270933B (en) 2020-11-12 2020-11-12 Audio identification method and device
CN202011260050.5 2020-11-12

Publications (1)

Publication Number Publication Date
WO2022100691A1 true WO2022100691A1 (en) 2022-05-19

Family

ID=74339924

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/130304 WO2022100691A1 (en) 2020-11-12 2021-11-12 Audio recognition method and device

Country Status (2)

Country Link
CN (1) CN112270933B (en)
WO (1) WO2022100691A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579022A (en) * 2022-12-09 2023-01-06 南方电网数字电网研究院有限公司 Superposition sound detection method and device, computer equipment and storage medium
CN115840877A (en) * 2022-12-06 2023-03-24 中国科学院空间应用工程与技术中心 Distributed stream processing method and system for MFCC extraction, storage medium and computer

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN113593603A (en) * 2021-07-27 2021-11-02 浙江大华技术股份有限公司 Audio category determination method and device, storage medium and electronic device

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030088411A1 (en) * 2001-11-05 2003-05-08 Changxue Ma Speech recognition by dynamical noise model adaptation
CN1666252A (en) * 2002-07-08 2005-09-07 里昂中央理工学院 Method and apparatus for classifying sound signals
CN108288465A (en) * 2018-01-29 2018-07-17 中译语通科技股份有限公司 Intelligent sound cuts the method for axis, information data processing terminal, computer program
CN109712641A (en) * 2018-12-24 2019-05-03 重庆第二师范学院 A kind of processing method of audio classification and segmentation based on support vector machines
CN110085251A (en) * 2019-04-26 2019-08-02 腾讯音乐娱乐科技(深圳)有限公司 Voice extracting method, voice extraction element and Related product
CN110349597A (en) * 2019-07-03 2019-10-18 山东师范大学 A kind of speech detection method and device
CN110782920A (en) * 2019-11-05 2020-02-11 广州虎牙科技有限公司 Audio recognition method and device and data processing equipment
US20200074997A1 (en) * 2018-08-31 2020-03-05 CloudMinds Technology, Inc. Method and system for detecting voice activity in noisy conditions
CN111145763A (en) * 2019-12-17 2020-05-12 厦门快商通科技股份有限公司 GRU-based voice recognition method and system in audio
CN111613213A (en) * 2020-04-29 2020-09-01 广州三人行壹佰教育科技有限公司 Method, device, equipment and storage medium for audio classification
CN111883182A (en) * 2020-07-24 2020-11-03 平安科技(深圳)有限公司 Human voice detection method, device, equipment and storage medium
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610707B (en) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN107527626A (en) * 2017-08-30 2017-12-29 北京嘉楠捷思信息技术有限公司 Audio identification system
CN109859771B (en) * 2019-01-15 2021-03-30 华南理工大学 Sound scene clustering method for jointly optimizing deep layer transformation characteristics and clustering process
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN110827798B (en) * 2019-11-12 2020-09-11 广州欢聊网络科技有限公司 Audio signal processing method and device


Also Published As

Publication number Publication date
CN112270933A (en) 2021-01-26
CN112270933B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN108564942B (en) Voice emotion recognition method and system based on adjustable sensitivity
WO2022100691A1 (en) Audio recognition method and device
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
Kinnunen Spectral features for automatic text-independent speaker recognition
Schuller Intelligent audio analysis
WO2022100692A1 (en) Human voice audio recording method and apparatus
Qamhan et al. Digital audio forensics: microphone and environment classification using deep learning
Sefara The effects of normalisation methods on speech emotion recognition
CN113488063B (en) Audio separation method based on mixed features and encoding and decoding
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
Xiao et al. Hierarchical classification of emotional speech
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
Hashem et al. Speech emotion recognition approaches: A systematic review
Chakroun et al. Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments
Chetouani et al. Time-scale feature extractions for emotional speech characterization: applied to human centered interaction analysis
Grewal et al. Isolated word recognition system for English language
Eyben et al. Audiovisual vocal outburst classification in noisy acoustic conditions
Laghari et al. Robust speech emotion recognition for sindhi language based on deep convolutional neural network
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
CN116259312A (en) Method for automatically editing task by aiming at voice and neural network model training method
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21891211

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21891211

Country of ref document: EP

Kind code of ref document: A1