CN111145763A - GRU-based voice recognition method and system in audio - Google Patents


Info

Publication number
CN111145763A
CN111145763A
Authority
CN
China
Prior art keywords
audio
data
voice
gru
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911298207.0A
Other languages
Chinese (zh)
Inventor
曾志先
肖龙源
李稀敏
蔡振华
刘晓葳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN201911298207.0A priority Critical patent/CN111145763A/en
Publication of CN111145763A publication Critical patent/CN111145763A/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a GRU-based method for recognizing human voice in audio, which comprises the following steps: S11, collecting the audio to be recognized and storing the audio data in arrays; S12, converting the arrays into voiceprint feature data; S13, recognizing the voiceprint feature data based on a preset GRU neural network recognition model, whose output layer outputs two numerical values; S14, converting the numerical values into the probability that the voiceprint feature data is human voice; S15, averaging all probabilities within a unit time and, if the average is greater than a set default threshold, determining that human voice appears in the audio within that unit time. The invention adopts an end-to-end network structure to recognize human voice in audio; the recognition effect is good, and the recognized voice segments are of high quality.

Description

GRU-based voice recognition method and system in audio
Technical Field
The invention relates to the technical field of audio identification, in particular to a method and a system for identifying human voice in audio based on GRU.
Background
With the development of voice technology and natural language processing, smart speakers have advanced rapidly in recent years, and more and more people operate devices and acquire information through them. A current smart speaker wakes the device through a voice wake-up technology and then recognizes the content of the user's speech through speech recognition technology in order to make a judgment.
Extracting human voice is a key part of speech-signal front-end processing; how to efficiently recognize the segments of a speech signal that contain human voice and extract them for speech recognition is a technical problem still to be solved.
Disclosure of Invention
To solve the above problems, the invention provides a method and a system for recognizing human voice in audio based on a GRU (Gated Recurrent Unit).
In order to achieve the purpose, the invention adopts the technical scheme that:
a method for identifying human voice in audio based on GRU includes the following steps:
s11, collecting audio to be identified, taking A second as unit time, taking B second as a window and C second as displacement time in one unit time, and storing audio data with the time length of B second in each unit time into 1 array (actual time length is used when the time length is less than B second), wherein C is more than 0 and less than or equal to B and less than or equal to A;
s12, converting the array into voiceprint characteristic data;
s13, identifying the voiceprint characteristic data based on a preset GRU neural network identification model, wherein an output layer of the preset GRU neural network identification model outputs two numerical values which are the score of the voiceprint characteristic data which is human voice and the score of the voiceprint characteristic data which is non-human voice;
s14, converting the numerical value into the probability that the voiceprint characteristic data is the voice based on a SoftMax algorithm;
s15, averaging all the probabilities in a unit time, and if the average value is larger than a set default threshold value, determining that human voice appears in the audio in the unit time;
the method for constructing the preset GRU neural network recognition model comprises the following steps:
s21, acquiring a training set, wherein the training set comprises positive samples and negative samples, the positive samples comprise voice audio data of human voices in various environments, and the negative samples comprise non-human voice audio data in various environments;
s22, converting the audio data of the training set into voiceprint characteristic data of the training set;
s23, training a GRU neural network recognition model by taking the training set voiceprint feature data as an input layer, wherein an output layer of the GRU neural network recognition model outputs two numerical values which are respectively a score of the training set voiceprint feature data being human voice and a score of the training set voiceprint feature data being non-human voice;
and S24, carrying out multiple times of iterative training, using cross entropy loss as a loss function, and optimizing a loss value through an Adam algorithm until the loss value tends to be stable, so as to finish the training.
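As a concrete illustration of step S24, the sketch below trains a toy linear classifier with softmax cross-entropy loss and a hand-rolled Adam optimizer in numpy. The data, the linear model and all hyperparameters are invented for the demo and are not the patented GRU model; the point is only to show the loss value decreasing toward stability under Adam, as S24 describes.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # labels: integer class ids (0 = non-voice, 1 = voice)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

class Adam:
    """Textbook Adam update for a single weight matrix."""
    def __init__(self, shape, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m, self.v, self.t = np.zeros(shape), np.zeros(shape), 0
    def step(self, w, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        mh = self.m / (1 - self.b1 ** self.t)
        vh = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * mh / (np.sqrt(vh) + self.eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 40))      # stand-in for 40-dim MFCC vectors
y = (X[:, 0] > 0).astype(int)      # toy voice / non-voice labels
W = np.zeros((40, 2))              # linear stand-in for the GRU network
opt = Adam(W.shape)
losses = []
for _ in range(200):
    p = softmax(X @ W)
    losses.append(cross_entropy(p, y))
    grad_logits = p.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0   # d(loss)/d(logits)
    W = opt.step(W, X.T @ grad_logits / len(y))
print(losses[0], losses[-1])       # loss falls from ln(2) and stabilizes
```

The same loss/optimizer pairing applies unchanged when the logits come from a recurrent network instead of this linear stand-in.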
Preferably, the preset GRU neural network recognition model is an RNN cyclic network with a 3-layer GRU structure, and the number of neurons in a hidden layer is 300.
Preferably, a PyAudio tool is used for collecting the audio to be recognized or the audio data of the training set, wherein the collected data is character string data, and a numpy tool is used for converting the character string data into numerical data.
Preferably, the audio data of the array or the training set is converted into 40-dimensional MFCC feature data using the python_speech_features tool.
More preferably, the MFCC feature data is numerically standardized, the standardized value being calculated as (raw value - mean)/standard deviation.
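The standardization formula can be sketched directly in numpy; the tiny matrix below is a stand-in for real 40-dimensional MFCC features.

```python
import numpy as np

def standardize(feat):
    # (raw value - mean) / standard deviation, as described above
    return (feat - feat.mean()) / feat.std()

mfcc = np.array([[1.0, 2.0], [3.0, 6.0]])  # toy stand-in for MFCC frames
z = standardize(mfcc)
print(z.mean(), z.std())  # approximately 0 and 1 after standardization
```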
Preferably, the positive samples comprise vocal segments cut from songs sung by singers, dialogue in talk shows, or speech in story programs.
Based on the same inventive concept, the invention also provides a system for recognizing human voice in audio based on GRU, which comprises:
the audio acquisition terminal is used for acquiring audio data to be identified;
and the identification module is used for identifying the audio data based on the method and outputting an identification result.
The invention has the beneficial effects that:
(1) An end-to-end network structure is used for recognition: audio of one unit-time length is input, and the judgment of whether the current audio contains human voice is output directly, so recognition is fast.
(2) The audio data within a unit time is judged multiple times and the result is determined from the average value, so recognition accuracy is high and the loss of the recognized voice segments is reduced.
(3) The network structure used is small and occupies few resources.
Drawings
Fig. 1 is a flowchart of a human voice recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a human voice recognition system according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention clearer and more obvious, the present invention is further described in detail with reference to specific embodiments below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment provides a method and a system for recognizing human voice in audio based on a GRU applied to a smart speaker, as shown in fig. 1, comprising the following steps:
s1, a microphone of the intelligent sound box collects user voice in real time, the system monitors audio data of the microphone in a circulating mode by using a PyAudio tool, 2 seconds are used as unit time, 1 second is used as a window, 0.1 second is used as displacement time, at the moment, the audio in one unit time is divided into 10 audio in 1 second, and each 1 audio data is respectively stored into 1 array. Because the data collected by the PyAudio is in a string format, we convert the string data into a numeric format through the frompbuffer of the numpy tool.
S2, the audio data array in numerical format is converted into 40-dimensional MFCC features through the python_speech_features tool and numerically standardized: the mean is computed with numpy's mean method, the standard deviation with numpy's std method, and the standardized value is (raw value - mean)/standard deviation. Standardization reduces the influence of outlying data, such as suddenly appearing noise, on the whole audio.
S3, before the audio data is fed into the preset GRU neural network recognition model, front-end processing such as noise reduction, de-reverberation and echo cancellation reduces interference with the model's judgment; the preprocessed MFCC feature data is then fed into the model, whose output layer outputs two numerical values: the first is the score that the audio corresponding to the current array is human voice, and the second is the score that it is non-human voice.
In this embodiment, an RNN recurrent network with a 3-layer GRU structure is used, and the number of neurons in the hidden layer is 300. An RNN is used as the network structure because it can make full use of information in the time sequence and judge probabilities by combining preceding and following information; since audio data is built exactly on such temporal relations, the model better fits the actual requirement and the recognition result is more accurate.
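For illustration, a forward pass through a stack of three GRU cells can be written out in plain numpy. This is a hand-rolled sketch with random weights and a 16-unit hidden layer to keep the demo small; the patent's model uses 300 hidden neurons and trained weights.

```python
import numpy as np

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state."""
    def __init__(self, in_dim, hid_dim, rng):
        s = 1.0 / np.sqrt(hid_dim)
        self.Wz = rng.uniform(-s, s, (in_dim + hid_dim, hid_dim))
        self.Wr = rng.uniform(-s, s, (in_dim + hid_dim, hid_dim))
        self.Wh = rng.uniform(-s, s, (in_dim + hid_dim, hid_dim))
    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = 1.0 / (1.0 + np.exp(-(xh @ self.Wz)))      # update gate
        r = 1.0 / (1.0 + np.exp(-(xh @ self.Wr)))      # reset gate
        cand = np.tanh(np.concatenate([x, r * h]) @ self.Wh)
        return (1.0 - z) * h + z * cand

rng = np.random.default_rng(0)
HID = 16   # demo size; the patent uses 300 hidden neurons
layers = [GRUCell(40, HID, rng)] + [GRUCell(HID, HID, rng) for _ in range(2)]
Wout = rng.uniform(-0.1, 0.1, (HID, 2))       # two output scores

frames = rng.normal(size=(100, 40))           # stand-in: 40-dim MFCC frames
states = [np.zeros(HID) for _ in layers]
for x in frames:                              # unrolled over the time sequence
    for i, cell in enumerate(layers):
        inp = x if i == 0 else states[i - 1]  # layer i reads layer i-1's output
        states[i] = cell.step(inp, states[i])
scores = states[-1] @ Wout                    # [voice score, non-voice score]
print(scores.shape)  # (2,)
```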
The preset GRU neural network recognition model is constructed as follows: a large amount of voice-related audio, such as songs sung by singers, conversations in interview programs and narration in story programs, is collected from the Internet, and the voice segments in this audio are cut out as positive samples; this audio is low-interference voice audio, so the GRU neural network recognition model can more easily learn the voice features in it. In addition, human voices in various environments are collected through recording equipment as further positive training samples, and non-human-voice audio segments, such as animal calls, car and train whistles and the sound of waves, are collected from network audio recorded in quiet environments as negative training samples. The GRU neural network recognition model is trained with these positive and negative samples, with 80% of the audio data used as the training set and 20% as the test set, cross-entropy loss as the loss function, and the loss value optimized by the Adam algorithm. After 2000 iterations of training, the loss value stabilizes at about 0.35, at which point the construction of the preset GRU neural network recognition model is complete.
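The 80/20 split of the collected samples into training and test sets can be sketched as follows; the clip counts and labels are placeholders, not the patent's actual data.

```python
import random

random.seed(0)                                 # reproducible demo
pos = [("voice", i) for i in range(400)]       # stand-ins for voice clips
neg = [("non-voice", i) for i in range(600)]   # stand-ins for non-voice clips
samples = pos + neg
random.shuffle(samples)                        # mix classes before splitting

cut = int(0.8 * len(samples))                  # 80% train / 20% test
train_set, test_set = samples[:cut], samples[cut:]
print(len(train_set), len(test_set))  # 800 200
```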
The MFCC features can better reflect the features of human voice heard by human ears, and the audio data of the training set is converted into the MFCC features with the dimension of 40 through the python _ speech _ features tool to train the model.
S4, the two numerical values of the output layer are converted into probabilities through the SoftMax algorithm, which normalizes the probability that the current audio data is human voice and the probability that it is non-human voice; these serve as the judgment result.
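The SoftMax conversion of the two output scores into complementary probabilities can be sketched as:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.5])            # [voice score, non-voice score]
p_voice, p_nonvoice = softmax(scores)    # the two probabilities sum to 1
print(round(p_voice, 3), round(p_nonvoice, 3))  # 0.818 0.182
```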
S5, the average of the probability results of the 10 segments within the 2-second unit time is calculated, and if the average is greater than the set default threshold, the current 2 seconds of audio is judged to contain human voice. The voice portion is then extracted separately for speech recognition and voice wake-up operations.
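The per-unit decision in S5 reduces to an average and a threshold comparison. The probability values below and the 0.5 threshold are illustrative assumptions; the patent only speaks of "a set default threshold".

```python
# Voice probabilities for the 10 one-second windows of one 2-second unit
probs = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.8, 0.9, 0.65]
THRESHOLD = 0.5  # assumed default threshold

mean_p = sum(probs) / len(probs)
has_voice = mean_p > THRESHOLD   # True: this 2-second unit contains voice
print(round(mean_p, 2), has_voice)  # 0.79 True
```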
This embodiment also provides a GRU-based system for recognizing human voice in audio applied to a smart speaker, as shown in FIG. 2, comprising an audio collection terminal 1 and a recognition module 2 arranged on the smart speaker.
The audio collection terminal 1 collects the audio data to be recognized and sends it to the recognition module 2. After receiving the audio data, the recognition module 2 recognizes it based on the above method and extracts the voice portion separately for speech recognition and voice wake-up operations.
The specific network used by the system of the invention has the advantages of small structure and less occupied resources, so the system is suitable for mobile equipment and embedded equipment.
Those skilled in the art will understand that all or part of the steps in the above embodiments of the audio data recognition method may be implemented by a program instructing related hardware, the program being stored in a storage medium and including several instructions to enable a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the method of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
While the above shows and describes the preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein; these embodiments are not to be construed as excluding other embodiments, and the invention is capable of use in various other combinations, modifications and environments, and of changes within the scope of the inventive concept described herein, commensurate with the above teachings or with the skill and knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A method for identifying human voice in audio based on GRU is characterized by comprising the following steps:
s11, collecting audio to be identified, taking A second as unit time, taking B second as a window and C second as displacement time in one unit time, and storing audio data with the time length of B second in each unit time into 1 array (actual time length is used when the time length is less than B second), wherein C is more than 0 and less than or equal to B and less than or equal to A;
s12, converting the array into voiceprint characteristic data;
s13, identifying the voiceprint characteristic data based on a preset GRU neural network identification model, wherein an output layer of the preset GRU neural network identification model outputs two numerical values which are the score of the voiceprint characteristic data which is human voice and the score of the voiceprint characteristic data which is non-human voice;
s14, converting the numerical value into the probability that the voiceprint characteristic data is the voice based on a SoftMax algorithm;
s15, averaging all the probabilities in a unit time, and if the average value is larger than a set default threshold value, determining that human voice appears in the audio in the unit time;
the method for constructing the preset GRU neural network recognition model comprises the following steps:
s21, acquiring a training set, wherein the training set comprises positive samples and negative samples, the positive samples comprise voice audio data of human voices in various environments, and the negative samples comprise non-human voice audio data in various environments;
s22, converting the audio data of the training set into voiceprint characteristic data of the training set;
s23, training a GRU neural network recognition model by taking the training set voiceprint feature data as an input layer, wherein an output layer of the GRU neural network recognition model outputs two numerical values which are respectively a score of the training set voiceprint feature data being human voice and a score of the training set voiceprint feature data being non-human voice;
and S24, carrying out multiple times of iterative training, using cross entropy loss as a loss function, and optimizing a loss value through an Adam algorithm until the loss value tends to be stable, so as to finish the training.
2. The method of claim 1, wherein the predetermined GRU neural network recognition model is an RNN cyclic network with a 3-layer GRU structure, and the number of hidden layer neurons is 300.
3. The method of claim 1, wherein PyAudio tool is used to collect the audio to be recognized or the audio data of the training set, wherein the collected data is character string data, and a numpy tool is used to convert the character string data into numerical data.
4. The method of claim 1, wherein the audio data of the array or the training set is converted into 40-dimensional MFCC feature data using the python_speech_features tool.
5. The method of claim 4, wherein the MFCC feature data is numerically standardized, the standardized value being calculated as (raw value - mean)/standard deviation.
6. The method of claim 1, wherein the positive samples comprise vocal segments cut from songs sung by singers, dialogue in talk shows, or speech in story programs.
7. A system for identifying human voice in audio based on GRU, comprising:
the audio acquisition terminal is used for acquiring audio data to be identified;
an identification module for identifying the audio data based on the method of any one of claims 1 to 6 and outputting the identification result.
CN201911298207.0A 2019-12-17 2019-12-17 GRU-based voice recognition method and system in audio Pending CN111145763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911298207.0A CN111145763A (en) 2019-12-17 2019-12-17 GRU-based voice recognition method and system in audio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911298207.0A CN111145763A (en) 2019-12-17 2019-12-17 GRU-based voice recognition method and system in audio

Publications (1)

Publication Number Publication Date
CN111145763A true CN111145763A (en) 2020-05-12

Family

ID=70518539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911298207.0A Pending CN111145763A (en) 2019-12-17 2019-12-17 GRU-based voice recognition method and system in audio

Country Status (1)

Country Link
CN (1) CN111145763A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050187761A1 (en) * 2004-02-10 2005-08-25 Samsung Electronics Co., Ltd. Apparatus, method, and medium for distinguishing vocal sound from other sounds
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 A kind of audio data recognition methods and human voice anti-replay identifying system
CN110085251A (en) * 2019-04-26 2019-08-02 腾讯音乐娱乐科技(深圳)有限公司 Voice extracting method, voice extraction element and Related product


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397073A (en) * 2020-11-04 2021-02-23 北京三快在线科技有限公司 Audio data processing method and device
CN112397073B (en) * 2020-11-04 2023-11-21 北京三快在线科技有限公司 Audio data processing method and device
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device
WO2022100691A1 (en) * 2020-11-12 2022-05-19 北京猿力未来科技有限公司 Audio recognition method and device
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN112863548A (en) * 2021-01-22 2021-05-28 北京百度网讯科技有限公司 Method for training audio detection model, audio detection method and device thereof
CN113284501A (en) * 2021-05-18 2021-08-20 平安科技(深圳)有限公司 Singer identification method, singer identification device, singer identification equipment and storage medium
CN113284501B (en) * 2021-05-18 2024-03-08 平安科技(深圳)有限公司 Singer identification method, singer identification device, singer identification equipment and storage medium
CN115065912A (en) * 2022-06-22 2022-09-16 广州市迪声音响有限公司 Feedback inhibition device for screening sound box energy based on voiceprint screen technology

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN110428810B (en) Voice wake-up recognition method and device and electronic equipment
CN111145763A (en) GRU-based voice recognition method and system in audio
KR100636317B1 (en) Distributed Speech Recognition System and method
US9336780B2 (en) Identification of a local speaker
WO2021139425A1 (en) Voice activity detection method, apparatus and device, and storage medium
CN107871499B (en) Speech recognition method, system, computer device and computer-readable storage medium
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
WO2014153800A1 (en) Voice recognition system
CN105679310A (en) Method and system for speech recognition
CN110827795A (en) Voice input end judgment method, device, equipment, system and storage medium
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN111210829A (en) Speech recognition method, apparatus, system, device and computer readable storage medium
CN113192535B (en) Voice keyword retrieval method, system and electronic device
CN113823293B (en) Speaker recognition method and system based on voice enhancement
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
CN109065026B (en) Recording control method and device
CN112185425A (en) Audio signal processing method, device, equipment and storage medium
Chakroun et al. Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments
CN112116909A (en) Voice recognition method, device and system
CN114664303A (en) Continuous voice instruction rapid recognition control system
US11769491B1 (en) Performing utterance detection using convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200512