CN111145763A - GRU-based voice recognition method and system in audio - Google Patents
- Publication number
- CN111145763A (application number CN201911298207.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING; G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The invention discloses a GRU-based method for identifying human voice in audio, comprising the following steps: S11, collecting the audio to be identified and storing the audio data in arrays; S12, converting the arrays into voiceprint feature data; S13, recognizing the voiceprint feature data with a preset GRU neural network recognition model, the output layer of which outputs two values; S14, converting the values into the probability that the voiceprint feature data is human voice; S15, averaging all the probabilities within a unit time and, if the average exceeds a set default threshold, determining that human voice occurs in the audio during that unit time. The invention uses an end-to-end network structure to identify human voice in audio; recognition performance is good, and the recognized voice segments are of high quality.
Description
Technical Field
The invention relates to the technical field of audio recognition, and in particular to a GRU-based method and system for identifying human voice in audio.
Background
With the development of speech technology and natural language processing, smart speakers have advanced rapidly in recent years, and more and more people operate devices and obtain information through them. Current smart speakers wake the device with a voice wake-up technology and then recognize the content of the user's speech with a speech recognition technology in order to act on it.
Extracting human voice is a core part of speech-signal front-end processing; how to efficiently identify the segments of a speech signal that contain human voice and extract them for speech recognition is a technical problem still to be solved.
Disclosure of Invention
The invention provides a method and a system for identifying human voice in audio based on a GRU (Gated Recurrent Unit), aiming to solve the above problems.
To achieve this purpose, the invention adopts the following technical scheme:
a method for identifying human voice in audio based on GRU includes the following steps:
S11, collecting the audio to be identified; taking A seconds as the unit time, with a window of B seconds and a displacement of C seconds within one unit time, storing each B-second stretch of audio data in each unit time into one array (the actual length is used when it is less than B seconds), where 0 < C ≤ B ≤ A;
s12, converting the array into voiceprint characteristic data;
S13, recognizing the voiceprint feature data with a preset GRU neural network recognition model, the output layer of which outputs two values: the score that the voiceprint feature data is human voice and the score that it is non-human voice;
S14, converting the values into the probability that the voiceprint feature data is human voice using the SoftMax algorithm;
S15, averaging all the probabilities within one unit time and, if the average exceeds a set default threshold, determining that human voice appears in the audio within that unit time;
the method for constructing the preset GRU neural network recognition model comprises the following steps:
s21, acquiring a training set, wherein the training set comprises positive samples and negative samples, the positive samples comprise voice audio data of human voices in various environments, and the negative samples comprise non-human voice audio data in various environments;
s22, converting the audio data of the training set into voiceprint characteristic data of the training set;
s23, training a GRU neural network recognition model by taking the training set voiceprint feature data as an input layer, wherein an output layer of the GRU neural network recognition model outputs two numerical values which are respectively a score of the training set voiceprint feature data being human voice and a score of the training set voiceprint feature data being non-human voice;
S24, performing multiple training iterations, using cross-entropy loss as the loss function and optimizing the loss value with the Adam algorithm until the loss value stabilizes, at which point training is complete.
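The training recipe of step S24 — softmax outputs scored with cross-entropy loss and optimized with Adam until the loss stabilizes — can be sketched in miniature. The sketch below is illustrative only: it trains a plain linear softmax classifier (not the patent's 3-layer GRU) on synthetic 40-dimensional stand-ins for MFCC features, with a hand-rolled Adam update; all data, shapes, and hyperparameters here are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # mean negative log-probability assigned to the true class
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

rng = np.random.default_rng(0)
# Synthetic stand-ins for 40-dimensional MFCC features: class 0 = voice, class 1 = non-voice
X = np.vstack([rng.normal(0.5, 1.0, (100, 40)), rng.normal(-0.5, 1.0, (100, 40))])
y = np.array([0] * 100 + [1] * 100)

W, b = np.zeros((40, 2)), np.zeros(2)
mW, vW = np.zeros_like(W), np.zeros_like(W)
mb, vb = np.zeros_like(b), np.zeros_like(b)
beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 0.01

losses = []
for t in range(1, 201):  # iterate until the loss stabilizes
    p = softmax(X @ W + b)
    losses.append(cross_entropy(p, y))
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1
    grad /= len(y)
    gW, gb = X.T @ grad, grad.sum(axis=0)
    # Adam update with bias-corrected first and second moments
    mW = beta1 * mW + (1 - beta1) * gW
    vW = beta2 * vW + (1 - beta2) * gW ** 2
    mb = beta1 * mb + (1 - beta1) * gb
    vb = beta2 * vb + (1 - beta2) * gb ** 2
    W -= lr * (mW / (1 - beta1 ** t)) / (np.sqrt(vW / (1 - beta2 ** t)) + eps)
    b -= lr * (mb / (1 - beta1 ** t)) / (np.sqrt(vb / (1 - beta2 ** t)) + eps)

print(losses[0] > losses[-1])  # True: the loss decreases over training
```

In practice the same loss/optimizer combination would be applied to the GRU network's parameters through a deep-learning framework's automatic differentiation rather than hand-derived gradients.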
Preferably, the preset GRU neural network recognition model is a recurrent network (RNN) with a 3-layer GRU structure, with 300 neurons in the hidden layer.
Preferably, a PyAudio tool is used to collect the audio to be recognized or the training-set audio data; the collected data is byte-string data, and a numpy tool converts the byte strings into numerical data.
Preferably, the audio data of the array or of the training set is converted into 40-dimensional MFCC feature data using the python_speech_features tool.
More preferably, the MFCC feature data is numerically standardized, the standardized value being calculated as (raw value − mean)/standard deviation.
Preferably, the positive samples include human-voice segments cut from songs sung by singers, dialogue in talk shows, or sound from story programs.
Based on the same inventive concept, the invention also provides a system for recognizing human voice in audio based on GRU, which comprises:
the audio acquisition terminal is used for acquiring audio data to be identified;
and the identification module is used for identifying the audio data based on the method and outputting an identification result.
The invention has the beneficial effects that:
(1) An end-to-end network structure is used for recognition: audio of unit-time length is input, and a judgment of whether the current audio contains human voice is output directly, so recognition is fast.
(2) The audio data within each unit time is judged multiple times and the result is decided by the average, so recognition accuracy is high and, at the same time, loss in the recognized voice segments is reduced.
(3) The particular network structure is small and occupies few resources.
Drawings
Fig. 1 is a flowchart of a human voice recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a human voice recognition system according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention clearer and more obvious, the present invention is further described in detail with reference to specific embodiments below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
This embodiment provides a GRU-based method and system for recognizing human voice in audio, applied to a smart speaker; as shown in Fig. 1, the method comprises the following steps:
s1, a microphone of the intelligent sound box collects user voice in real time, the system monitors audio data of the microphone in a circulating mode by using a PyAudio tool, 2 seconds are used as unit time, 1 second is used as a window, 0.1 second is used as displacement time, at the moment, the audio in one unit time is divided into 10 audio in 1 second, and each 1 audio data is respectively stored into 1 array. Because the data collected by the PyAudio is in a string format, we convert the string data into a numeric format through the frompbuffer of the numpy tool.
S2, converting the numeric audio array into 40-dimensional MFCC features with the python_speech_features tool and standardizing the values: the mean is computed with numpy's mean method, the standard deviation with numpy's std method, and the standardized value is (raw value − mean)/standard deviation. Standardization reduces the influence of outlying data, such as suddenly occurring noise, on the audio as a whole.
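The standardization in step S2 can be sketched as below. This is a minimal illustration: the random matrix stands in for a real MFCC feature matrix, and global (rather than per-coefficient) statistics are assumed, which the patent does not specify.

```python
import numpy as np

# Hypothetical stand-in for an MFCC feature matrix: 100 frames x 40 coefficients
rng = np.random.default_rng(0)
mfcc = rng.normal(5.0, 3.0, (100, 40))

# (raw value - mean) / standard deviation, using numpy's mean and std methods
normalized = (mfcc - mfcc.mean()) / mfcc.std()

# After standardization the features have mean ~0 and standard deviation ~1,
# which damps the influence of outliers such as sudden noise bursts
print(normalized.shape)  # (100, 40)
```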
S3, before the audio data is passed into the preset GRU neural network recognition model, front-end processing such as noise reduction, de-reverberation, and echo cancellation reduces interference in the model's judgment. The preprocessed MFCC feature data is then passed into the model, whose output layer outputs two values: the first is the score that the audio corresponding to the current array is human voice, and the second the score that it is non-human voice.
This embodiment uses a recurrent network (RNN) with a 3-layer GRU structure, with 300 neurons in the hidden layer. A recurrent network is chosen as the network structure because it makes full use of temporal information, combining preceding and following context when judging probabilities; since audio data is inherently sequential, the model better fits the actual requirement and the recognition result is more accurate.
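The recurrence inside one GRU layer can be illustrated with a single numpy cell step. This uses the standard GRU formulation (update gate, reset gate, candidate state); the patent does not give its exact equations, the random weights are placeholders for learned parameters, and a real model would stack three such layers.

```python
import numpy as np

def gru_step(x, h, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU time step in the standard formulation."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ Wz + h @ Uz + bz)             # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)             # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh + bh)  # candidate hidden state
    return (1 - z) * h + z * h_cand               # gated mix of old and new state

rng = np.random.default_rng(0)
n_in, n_hidden = 40, 300  # 40-dim MFCC input; 300 hidden neurons, as in the embodiment

# 9 parameter tensors: (Wz, Uz, bz), (Wr, Ur, br), (Wh, Uh, bh)
shapes = [(n_in, n_hidden), (n_hidden, n_hidden), (n_hidden,)] * 3
params = [rng.normal(0, 0.1, s) for s in shapes]

h = np.zeros(n_hidden)
for _ in range(10):  # run 10 feature frames through the cell
    x = rng.normal(0, 1, n_in)
    h = gru_step(x, h, *params)

print(h.shape)  # (300,)
```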
The preset GRU neural network recognition model is constructed as follows. A large amount of voice-related audio — songs sung by singers, conversations in interview programs, sounds in story programs — is collected from the Internet, and the human-voice segments are cut out as positive samples; because this audio is low-interference human voice, the GRU neural network recognition model learns voice features from it more easily. In addition, human voices in various environments are recorded with recording equipment as further positive samples, and non-human-voice audio segments — animal calls, car and train horns, the sound of waves, quiet environments — are collected from online audio as negative samples. The GRU neural network recognition model is trained with these positive and negative samples: 80% of the audio data serves as the training set and 20% as the test set, cross-entropy loss is used as the loss function, and the loss value is optimized with the Adam algorithm. After 2000 training iterations the loss value stabilizes at about 0.35, at which point construction of the preset GRU neural network recognition model is complete.
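The 80%/20% train/test partition described above can be sketched as follows; the clip identifiers and labels are hypothetical.

```python
import random

# Hypothetical labelled clips: (clip_id, label) with 1 = human voice, 0 = non-voice
clips = [(f"clip_{i:04d}", i % 2) for i in range(1000)]

random.seed(42)
random.shuffle(clips)              # shuffle before splitting to mix both classes
split = int(len(clips) * 0.8)      # 80% training set, 20% test set
train_set, test_set = clips[:split], clips[split:]

print(len(train_set), len(test_set))  # 800 200
```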
MFCC features reflect well the characteristics of human voice as heard by the human ear; the training-set audio data is converted into 40-dimensional MFCC features with the python_speech_features tool to train the model.
S4, the two values of the output layer are converted into probabilities by the SoftMax algorithm, which normalizes the probability that the current audio data is human voice against the probability that it is not; this serves as the judgment result.
S5, the average of the probability results of the 10 clips within the 2-second unit time is computed; if the average exceeds the set default threshold, the current 2 seconds of audio is judged to contain human voice, and the voice portion is extracted separately for speech recognition and voice wake-up.
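Steps S4 and S5 — SoftMax over the two output scores, then averaging over the unit time and comparing against a threshold — can be sketched as below. The score values and the 0.5 threshold are assumed for illustration; the patent only calls for "a set default threshold".

```python
import math

def softmax2(voice_score, nonvoice_score):
    """Convert the model's two output scores into a human-voice probability."""
    m = max(voice_score, nonvoice_score)       # subtract max for numerical stability
    ev = math.exp(voice_score - m)
    en = math.exp(nonvoice_score - m)
    return ev / (ev + en)

# Hypothetical (voice, non-voice) score pairs for the 10 clips in one 2-second unit
scores = [(2.1, -0.3), (1.8, 0.2), (0.5, 0.4), (2.4, -1.0), (1.2, 0.1),
          (1.9, -0.2), (0.8, 0.6), (2.2, -0.5), (1.5, 0.0), (1.1, 0.3)]

probs = [softmax2(v, n) for v, n in scores]
avg = sum(probs) / len(probs)

THRESHOLD = 0.5  # assumed default threshold
print(avg > THRESHOLD)  # True: human voice detected in this 2-second unit
```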
This embodiment also provides a GRU-based system for identifying human voice in audio applied to a smart speaker; as shown in Fig. 2, it comprises an audio acquisition terminal 1 and a recognition module 2 arranged on the smart speaker.
The audio acquisition terminal 1 collects the audio data to be identified and sends it to the recognition module 2. After receiving the audio data, the recognition module 2 identifies it by the method above and extracts the human-voice portion separately for speech recognition and voice wake-up.
Because the particular network used by the system is small in structure and occupies few resources, the system is suitable for mobile and embedded devices.
Those skilled in the art will understand that all or part of the steps in the above embodiments of the audio-data recognition method may be implemented by a program instructing the related hardware, the program being stored in a storage medium and comprising several instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the method of the embodiments of the present application. The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media capable of storing program code.
While the preferred embodiments of the present invention have been shown and described above, the invention is not limited to the forms disclosed herein; these should not be construed as excluding other embodiments, and the invention is capable of use in various other combinations, modifications, and environments, and of changes within the scope of the inventive concept described herein, in accordance with the above teachings or the skill and knowledge of the relevant art. Modifications and variations effected by those skilled in the art without departing from the spirit and scope of the invention fall within the protection of the appended claims.
Claims (7)
1. A method for identifying human voice in audio based on GRU is characterized by comprising the following steps:
S11, collecting the audio to be identified; taking A seconds as the unit time, with a window of B seconds and a displacement of C seconds within one unit time, storing each B-second stretch of audio data in each unit time into one array (the actual length is used when it is less than B seconds), where 0 < C ≤ B ≤ A;
s12, converting the array into voiceprint characteristic data;
S13, recognizing the voiceprint feature data with a preset GRU neural network recognition model, the output layer of which outputs two values: the score that the voiceprint feature data is human voice and the score that it is non-human voice;
S14, converting the values into the probability that the voiceprint feature data is human voice using the SoftMax algorithm;
S15, averaging all the probabilities within one unit time and, if the average exceeds a set default threshold, determining that human voice appears in the audio within that unit time;
the method for constructing the preset GRU neural network recognition model comprises the following steps:
s21, acquiring a training set, wherein the training set comprises positive samples and negative samples, the positive samples comprise voice audio data of human voices in various environments, and the negative samples comprise non-human voice audio data in various environments;
s22, converting the audio data of the training set into voiceprint characteristic data of the training set;
s23, training a GRU neural network recognition model by taking the training set voiceprint feature data as an input layer, wherein an output layer of the GRU neural network recognition model outputs two numerical values which are respectively a score of the training set voiceprint feature data being human voice and a score of the training set voiceprint feature data being non-human voice;
S24, performing multiple training iterations, using cross-entropy loss as the loss function and optimizing the loss value with the Adam algorithm until the loss value stabilizes, at which point training is complete.
2. The method of claim 1, wherein the predetermined GRU neural network recognition model is an RNN cyclic network with a 3-layer GRU structure, and the number of hidden layer neurons is 300.
3. The method of claim 1, wherein a PyAudio tool is used to collect the audio to be recognized or the training-set audio data, the collected data being byte-string data, and a numpy tool converts the byte strings into numerical data.
4. The method of claim 1, wherein the audio data of the array or of the training set is converted into 40-dimensional MFCC feature data using the python_speech_features tool.
5. The method of claim 4, wherein the MFCC feature data is numerically standardized, the standardized value being calculated as (raw value − mean)/standard deviation.
6. The method of claim 1, wherein the positive samples comprise human-voice segments cut from songs sung by singers, dialogue in talk shows, or sound from story programs.
7. A system for identifying human voice in audio based on GRU, comprising:
the audio acquisition terminal is used for acquiring audio data to be identified;
an identification module for identifying the audio data based on the method of any one of claims 1 to 6 and outputting the identification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911298207.0A CN111145763A (en) | 2019-12-17 | 2019-12-17 | GRU-based voice recognition method and system in audio |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111145763A true CN111145763A (en) | 2020-05-12 |
Family
ID=70518539
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911298207.0A Pending CN111145763A (en) | 2019-12-17 | 2019-12-17 | GRU-based voice recognition method and system in audio |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111145763A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN112397073A * | 2020-11-04 | 2021-02-23 | 北京三快在线科技有限公司 | Audio data processing method and device |
CN112397073B * | 2020-11-04 | 2023-11-21 | 北京三快在线科技有限公司 | Audio data processing method and device |
CN112270933A * | 2020-11-12 | 2021-01-26 | 北京猿力未来科技有限公司 | Audio identification method and device |
WO2022100691A1 * | 2020-11-12 | 2022-05-19 | 北京猿力未来科技有限公司 | Audio recognition method and device |
CN112270933B * | 2020-11-12 | 2024-03-12 | 北京猿力未来科技有限公司 | Audio identification method and device |
CN112863548A * | 2021-01-22 | 2021-05-28 | 北京百度网讯科技有限公司 | Method for training audio detection model, audio detection method and device thereof |
CN113284501A * | 2021-05-18 | 2021-08-20 | 平安科技(深圳)有限公司 | Singer identification method, singer identification device, singer identification equipment and storage medium |
CN113284501B * | 2021-05-18 | 2024-03-08 | 平安科技(深圳)有限公司 | Singer identification method, singer identification device, singer identification equipment and storage medium |
CN115065912A * | 2022-06-22 | 2022-09-16 | 广州市迪声音响有限公司 | Feedback inhibition device for screening sound box energy based on voiceprint screen technology |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050187761A1 (en) * | 2004-02-10 | 2005-08-25 | Samsung Electronics Co., Ltd. | Apparatus, method, and medium for distinguishing vocal sound from other sounds |
CN109599117A (en) * | 2018-11-14 | 2019-04-09 | 厦门快商通信息技术有限公司 | A kind of audio data recognition methods and human voice anti-replay identifying system |
CN110085251A (en) * | 2019-04-26 | 2019-08-02 | 腾讯音乐娱乐科技(深圳)有限公司 | Voice extracting method, voice extraction element and Related product |
Legal Events
Code | Title | Description
---|---|---|
PB01 | Publication | Application publication date: 20200512 |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | |