CN111145763A - GRU-based voice recognition method and system in audio - Google Patents


Info

Publication number
CN111145763A
CN111145763A
Authority
CN
China
Prior art keywords
audio
data
voice
gru
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911298207.0A
Other languages
Chinese (zh)
Inventor
曾志先
肖龙源
李稀敏
蔡振华
刘晓葳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN201911298207.0A priority Critical patent/CN111145763A/en
Publication of CN111145763A publication Critical patent/CN111145763A/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a GRU-based method for recognizing human voice in audio, which comprises the following steps: S11, collecting the audio to be recognized and storing the audio data in arrays; S12, converting the arrays into voiceprint feature data; S13, recognizing the voiceprint feature data based on a preset GRU neural network recognition model, whose output layer outputs two numerical values; S14, converting the numerical values into the probability that the voiceprint feature data is human voice; S15, averaging all probabilities within a unit time and, if the average is greater than a set default threshold, determining that human voice appears in the audio within that unit time. The invention adopts an end-to-end network structure to recognize human voice in audio; the recognition effect is good, and the recognized voice segments are of high quality.

Description

GRU-based voice recognition method and system in audio
Technical Field
The invention relates to the technical field of audio identification, in particular to a method and a system for identifying human voice in audio based on GRU.
Background
With the development of voice technology and natural language processing, smart speakers have advanced rapidly in recent years, and more and more people operate devices and acquire information through them. A current smart speaker wakes the device through a voice wake-up technology and then recognizes the content of the user's speech through speech recognition technology in order to make a judgment.
Extracting human voice is a key part of speech-signal front-end processing; how to efficiently recognize the segments of a speech signal that contain human voice and extract them for speech recognition is a technical problem still to be solved.
Disclosure of Invention
To solve the above problems, the invention provides a method and a system for recognizing human voice in audio based on a GRU (Gated Recurrent Unit).
In order to achieve the purpose, the invention adopts the technical scheme that:
a method for identifying human voice in audio based on GRU includes the following steps:
s11, collecting audio to be identified, taking A second as unit time, taking B second as a window and C second as displacement time in one unit time, and storing audio data with the time length of B second in each unit time into 1 array (actual time length is used when the time length is less than B second), wherein C is more than 0 and less than or equal to B and less than or equal to A;
s12, converting the array into voiceprint characteristic data;
s13, identifying the voiceprint characteristic data based on a preset GRU neural network identification model, wherein an output layer of the preset GRU neural network identification model outputs two numerical values which are the score of the voiceprint characteristic data which is human voice and the score of the voiceprint characteristic data which is non-human voice;
s14, converting the numerical value into the probability that the voiceprint characteristic data is the voice based on a SoftMax algorithm;
s15, averaging all the probabilities in a unit time, and if the average value is larger than a set default threshold value, determining that human voice appears in the audio in the unit time;
the method for constructing the preset GRU neural network recognition model comprises the following steps:
s21, acquiring a training set, wherein the training set comprises positive samples and negative samples, the positive samples comprise voice audio data of human voices in various environments, and the negative samples comprise non-human voice audio data in various environments;
s22, converting the audio data of the training set into voiceprint characteristic data of the training set;
s23, training a GRU neural network recognition model by taking the training set voiceprint feature data as an input layer, wherein an output layer of the GRU neural network recognition model outputs two numerical values which are respectively a score of the training set voiceprint feature data being human voice and a score of the training set voiceprint feature data being non-human voice;
and S24, carrying out multiple times of iterative training, using cross entropy loss as a loss function, and optimizing a loss value through an Adam algorithm until the loss value tends to be stable, so as to finish the training.
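As a concrete illustration of step S24, the sketch below trains a toy linear classifier with softmax cross-entropy loss and a hand-rolled Adam optimizer in numpy. The data, the linear model and all hyperparameters are invented for the demo and are not the patented GRU model; the point is only to show the loss value decreasing toward stability under Adam, as S24 describes.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # labels: integer class ids (0 = non-voice, 1 = voice)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

class Adam:
    """Textbook Adam update for a single weight matrix."""
    def __init__(self, shape, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m, self.v, self.t = np.zeros(shape), np.zeros(shape), 0
    def step(self, w, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        mh = self.m / (1 - self.b1 ** self.t)
        vh = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * mh / (np.sqrt(vh) + self.eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 40))      # stand-in for 40-dim MFCC vectors
y = (X[:, 0] > 0).astype(int)      # toy voice / non-voice labels
W = np.zeros((40, 2))              # linear stand-in for the GRU network
opt = Adam(W.shape)
losses = []
for _ in range(200):
    p = softmax(X @ W)
    losses.append(cross_entropy(p, y))
    grad_logits = p.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0   # d(loss)/d(logits)
    W = opt.step(W, X.T @ grad_logits / len(y))
print(losses[0], losses[-1])       # loss falls from ln(2) and stabilizes
```

The same loss/optimizer pairing applies unchanged when the logits come from a recurrent network instead of this linear stand-in.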
Preferably, the preset GRU neural network recognition model is an RNN cyclic network with a 3-layer GRU structure, and the number of neurons in a hidden layer is 300.
Preferably, a PyAudio tool is used for collecting the audio to be recognized or the audio data of the training set, wherein the collected data is character string data, and a numpy tool is used for converting the character string data into numerical data.
Preferably, the audio data of the array or the training set is converted into 40-dimensional MFCC feature data using the python_speech_features tool.
More preferably, the MFCC feature data is numerically standardized, the standardized value being calculated as (raw value - mean)/standard deviation.
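The standardization formula can be sketched directly in numpy; the tiny matrix below is a stand-in for real 40-dimensional MFCC features.

```python
import numpy as np

def standardize(feat):
    # (raw value - mean) / standard deviation, as described above
    return (feat - feat.mean()) / feat.std()

mfcc = np.array([[1.0, 2.0], [3.0, 6.0]])  # toy stand-in for MFCC frames
z = standardize(mfcc)
print(z.mean(), z.std())  # approximately 0 and 1 after standardization
```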
Preferably, the positive samples comprise vocal segments cut from songs sung by singers, dialogue in talk shows, or speech in story programs.
Based on the same inventive concept, the invention also provides a system for recognizing human voice in audio based on GRU, which comprises:
the audio acquisition terminal is used for acquiring audio data to be identified;
and the identification module is used for identifying the audio data based on the method and outputting an identification result.
The invention has the beneficial effects that:
(1) An end-to-end network structure is used for recognition: audio of one unit-time length is input, and the judgment of whether the current audio contains human voice is output directly, so recognition is fast.
(2) The audio data within a unit time is judged multiple times and the result is determined from the average value, so recognition accuracy is high and the loss of the recognized voice segments is reduced.
(3) The network structure used is small and occupies few resources.
Drawings
Fig. 1 is a flowchart of a human voice recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a human voice recognition system according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention clearer and more obvious, the present invention is further described in detail with reference to specific embodiments below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment provides a method and a system for recognizing human voice in audio based on a GRU applied to a smart speaker, as shown in fig. 1, comprising the following steps:
s1, a microphone of the intelligent sound box collects user voice in real time, the system monitors audio data of the microphone in a circulating mode by using a PyAudio tool, 2 seconds are used as unit time, 1 second is used as a window, 0.1 second is used as displacement time, at the moment, the audio in one unit time is divided into 10 audio in 1 second, and each 1 audio data is respectively stored into 1 array. Because the data collected by the PyAudio is in a string format, we convert the string data into a numeric format through the frompbuffer of the numpy tool.
S2, the audio data array in numerical format is converted into 40-dimensional MFCC features through the python_speech_features tool and numerically standardized: the mean is computed with numpy's mean method, the standard deviation with numpy's std method, and the standardized value is (raw value - mean)/standard deviation. Standardization reduces the influence of outlying data, such as suddenly appearing noise, on the whole audio.
S3, before the audio data is fed into the preset GRU neural network recognition model, front-end processing such as noise reduction, de-reverberation and echo cancellation reduces interference with the model's judgment; the preprocessed MFCC feature data is then fed into the model, whose output layer outputs two numerical values: the first is the score that the audio corresponding to the current array is human voice, and the second is the score that it is non-human voice.
In this embodiment, an RNN recurrent network with a 3-layer GRU structure is used, and the number of neurons in the hidden layer is 300. An RNN is used as the network structure because it can make full use of information in the time sequence and judge probabilities by combining preceding and following information; since audio data is built exactly on such temporal relations, the model better fits the actual requirement and the recognition result is more accurate.
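For illustration, a forward pass through a stack of three GRU cells can be written out in plain numpy. This is a hand-rolled sketch with random weights and a 16-unit hidden layer to keep the demo small; the patent's model uses 300 hidden neurons and trained weights.

```python
import numpy as np

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state."""
    def __init__(self, in_dim, hid_dim, rng):
        s = 1.0 / np.sqrt(hid_dim)
        self.Wz = rng.uniform(-s, s, (in_dim + hid_dim, hid_dim))
        self.Wr = rng.uniform(-s, s, (in_dim + hid_dim, hid_dim))
        self.Wh = rng.uniform(-s, s, (in_dim + hid_dim, hid_dim))
    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = 1.0 / (1.0 + np.exp(-(xh @ self.Wz)))      # update gate
        r = 1.0 / (1.0 + np.exp(-(xh @ self.Wr)))      # reset gate
        cand = np.tanh(np.concatenate([x, r * h]) @ self.Wh)
        return (1.0 - z) * h + z * cand

rng = np.random.default_rng(0)
HID = 16   # demo size; the patent uses 300 hidden neurons
layers = [GRUCell(40, HID, rng)] + [GRUCell(HID, HID, rng) for _ in range(2)]
Wout = rng.uniform(-0.1, 0.1, (HID, 2))       # two output scores

frames = rng.normal(size=(100, 40))           # stand-in: 40-dim MFCC frames
states = [np.zeros(HID) for _ in layers]
for x in frames:                              # unrolled over the time sequence
    for i, cell in enumerate(layers):
        inp = x if i == 0 else states[i - 1]  # layer i reads layer i-1's output
        states[i] = cell.step(inp, states[i])
scores = states[-1] @ Wout                    # [voice score, non-voice score]
print(scores.shape)  # (2,)
```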
The preset GRU neural network recognition model is constructed as follows: a large amount of voice-related audio, such as songs sung by singers, conversations in interview programs and narration in story programs, is collected from the Internet, and the voice segments in this audio are cut out as positive samples; this audio is low-interference voice audio, so the GRU neural network recognition model can more easily learn the voice features in it. In addition, human voices in various environments are collected through recording equipment as further positive training samples, and non-human-voice audio segments, such as animal calls, car and train whistles and the sound of waves, are collected from network audio recorded in quiet environments as negative training samples. The GRU neural network recognition model is trained with these positive and negative samples, with 80% of the audio data used as the training set and 20% as the test set, cross-entropy loss as the loss function, and the loss value optimized by the Adam algorithm. After 2000 iterations of training, the loss value stabilizes at about 0.35, at which point the construction of the preset GRU neural network recognition model is complete.
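The 80/20 split of the collected samples into training and test sets can be sketched as follows; the clip counts and labels are placeholders, not the patent's actual data.

```python
import random

random.seed(0)                                 # reproducible demo
pos = [("voice", i) for i in range(400)]       # stand-ins for voice clips
neg = [("non-voice", i) for i in range(600)]   # stand-ins for non-voice clips
samples = pos + neg
random.shuffle(samples)                        # mix classes before splitting

cut = int(0.8 * len(samples))                  # 80% train / 20% test
train_set, test_set = samples[:cut], samples[cut:]
print(len(train_set), len(test_set))  # 800 200
```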
The MFCC features can better reflect the features of human voice heard by human ears, and the audio data of the training set is converted into the MFCC features with the dimension of 40 through the python _ speech _ features tool to train the model.
S4, the two numerical values of the output layer are converted into probabilities through the SoftMax algorithm, which normalizes the probability that the current audio data is human voice and the probability that it is non-human voice; these serve as the judgment result.
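The SoftMax conversion of the two output scores into complementary probabilities can be sketched as:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.5])            # [voice score, non-voice score]
p_voice, p_nonvoice = softmax(scores)    # the two probabilities sum to 1
print(round(p_voice, 3), round(p_nonvoice, 3))  # 0.818 0.182
```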
S5, the average of the probability results of the 10 segments within the 2-second unit time is calculated, and if the average is greater than the set default threshold, the current 2 seconds of audio is judged to contain human voice. The voice portion is then extracted separately for speech recognition and voice wake-up operations.
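The per-unit decision in S5 reduces to an average and a threshold comparison. The probability values below and the 0.5 threshold are illustrative assumptions; the patent only speaks of "a set default threshold".

```python
# Voice probabilities for the 10 one-second windows of one 2-second unit
probs = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.8, 0.9, 0.65]
THRESHOLD = 0.5  # assumed default threshold

mean_p = sum(probs) / len(probs)
has_voice = mean_p > THRESHOLD   # True: this 2-second unit contains voice
print(round(mean_p, 2), has_voice)  # 0.79 True
```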
This embodiment also provides a GRU-based system for recognizing human voice in audio applied to a smart speaker, as shown in FIG. 2, comprising an audio collection terminal 1 and a recognition module 2 arranged on the smart speaker.
The audio collection terminal 1 collects the audio data to be recognized and sends it to the recognition module 2. After receiving the audio data, the recognition module 2 recognizes it based on the above method and extracts the voice portion separately for speech recognition and voice wake-up operations.
The specific network used by the system of the invention has the advantages of small structure and less occupied resources, so the system is suitable for mobile equipment and embedded equipment.
Those skilled in the art will understand that all or part of the steps in the above embodiments of the audio data recognition method may be implemented by a program instructing related hardware, the program being stored in a storage medium and including several instructions to enable a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the method of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
While the above shows and describes the preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein; these embodiments are not to be construed as excluding other embodiments, and the invention is capable of use in various other combinations, modifications and environments, and of changes within the scope of the inventive concept described herein, commensurate with the above teachings or with the skill and knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A method for identifying human voice in audio based on GRU is characterized by comprising the following steps:
s11, collecting audio to be identified, taking A second as unit time, taking B second as a window and C second as displacement time in one unit time, and storing audio data with the time length of B second in each unit time into 1 array (actual time length is used when the time length is less than B second), wherein C is more than 0 and less than or equal to B and less than or equal to A;
s12, converting the array into voiceprint characteristic data;
s13, identifying the voiceprint characteristic data based on a preset GRU neural network identification model, wherein an output layer of the preset GRU neural network identification model outputs two numerical values which are the score of the voiceprint characteristic data which is human voice and the score of the voiceprint characteristic data which is non-human voice;
s14, converting the numerical value into the probability that the voiceprint characteristic data is the voice based on a SoftMax algorithm;
s15, averaging all the probabilities in a unit time, and if the average value is larger than a set default threshold value, determining that human voice appears in the audio in the unit time;
the method for constructing the preset GRU neural network recognition model comprises the following steps:
s21, acquiring a training set, wherein the training set comprises positive samples and negative samples, the positive samples comprise voice audio data of human voices in various environments, and the negative samples comprise non-human voice audio data in various environments;
s22, converting the audio data of the training set into voiceprint characteristic data of the training set;
s23, training a GRU neural network recognition model by taking the training set voiceprint feature data as an input layer, wherein an output layer of the GRU neural network recognition model outputs two numerical values which are respectively a score of the training set voiceprint feature data being human voice and a score of the training set voiceprint feature data being non-human voice;
and S24, carrying out multiple times of iterative training, using cross entropy loss as a loss function, and optimizing a loss value through an Adam algorithm until the loss value tends to be stable, so as to finish the training.
2. The method of claim 1, wherein the predetermined GRU neural network recognition model is an RNN cyclic network with a 3-layer GRU structure, and the number of hidden layer neurons is 300.
3. The method of claim 1, wherein PyAudio tool is used to collect the audio to be recognized or the audio data of the training set, wherein the collected data is character string data, and a numpy tool is used to convert the character string data into numerical data.
4. The method of claim 1, wherein the audio data of the array or the training set is converted into 40-dimensional MFCC feature data using the python_speech_features tool.
5. The method of claim 4, wherein the MFCC feature data is numerically standardized, the standardized value being calculated as (raw value - mean)/standard deviation.
6. The method of claim 1, wherein the positive samples comprise vocal segments cut from songs sung by singers, dialogue in talk shows, or speech in story programs.
7. A system for identifying human voice in audio based on GRU, comprising:
the audio acquisition terminal is used for acquiring audio data to be identified;
an identification module for identifying the audio data based on the method of any one of claims 1 to 6 and outputting the identification result.
CN201911298207.0A 2019-12-17 2019-12-17 GRU-based voice recognition method and system in audio Pending CN111145763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911298207.0A CN111145763A (en) 2019-12-17 2019-12-17 GRU-based voice recognition method and system in audio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911298207.0A CN111145763A (en) 2019-12-17 2019-12-17 GRU-based voice recognition method and system in audio

Publications (1)

Publication Number Publication Date
CN111145763A true CN111145763A (en) 2020-05-12

Family

ID=70518539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911298207.0A Pending CN111145763A (en) 2019-12-17 2019-12-17 GRU-based voice recognition method and system in audio

Country Status (1)

Country Link
CN (1) CN111145763A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050187761A1 (en) * 2004-02-10 2005-08-25 Samsung Electronics Co., Ltd. Apparatus, method, and medium for distinguishing vocal sound from other sounds
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 A kind of audio data recognition methods and human voice anti-replay identifying system
CN110085251A (en) * 2019-04-26 2019-08-02 腾讯音乐娱乐科技(深圳)有限公司 Voice extracting method, voice extraction element and Related product


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397073A (en) * 2020-11-04 2021-02-23 北京三快在线科技有限公司 Audio data processing method and device
CN112397073B (en) * 2020-11-04 2023-11-21 北京三快在线科技有限公司 Audio data processing method and device
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device
WO2022100691A1 (en) * 2020-11-12 2022-05-19 北京猿力未来科技有限公司 Audio recognition method and device
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN112863548A (en) * 2021-01-22 2021-05-28 北京百度网讯科技有限公司 Method for training audio detection model, audio detection method and device thereof
CN113284501A (en) * 2021-05-18 2021-08-20 平安科技(深圳)有限公司 Singer identification method, singer identification device, singer identification equipment and storage medium
CN113284501B (en) * 2021-05-18 2024-03-08 平安科技(深圳)有限公司 Singer identification method, singer identification device, singer identification equipment and storage medium
CN115065912A (en) * 2022-06-22 2022-09-16 广州市迪声音响有限公司 Feedback inhibition device for screening sound box energy based on voiceprint screen technology

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN110428810B (en) Voice wake-up recognition method and device and electronic equipment
CN111145763A (en) GRU-based voice recognition method and system in audio
KR100636317B1 (en) Distributed Speech Recognition System and method
US9336780B2 (en) Identification of a local speaker
WO2021139425A1 (en) Voice activity detection method, apparatus and device, and storage medium
CN107871499B (en) Speech recognition method, system, computer device and computer-readable storage medium
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
WO2014153800A1 (en) Voice recognition system
CN105679310A (en) Method and system for speech recognition
CN110827795A (en) Voice input end judgment method, device, equipment, system and storage medium
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN111210829A (en) Speech recognition method, apparatus, system, device and computer readable storage medium
CN113192535B (en) Voice keyword retrieval method, system and electronic device
CN113823293B (en) Speaker recognition method and system based on voice enhancement
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
CN109065026B (en) Recording control method and device
CN112185425A (en) Audio signal processing method, device, equipment and storage medium
Chakroun et al. Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments
CN112116909A (en) Voice recognition method, device and system
CN114664303A (en) Continuous voice instruction rapid recognition control system
US11769491B1 (en) Performing utterance detection using convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200512