CN112185357A - Device and method for simultaneously recognizing human voice and non-human voice - Google Patents
- Publication number
- CN112185357A; application number CN202011384504.XA
- Authority
- CN
- China
- Prior art keywords
- human voice
- recognition
- voice
- human
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G06N3/02, G06N3/08 — Computing arrangements based on specific computational models; Neural networks; Learning methods
- G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Abstract
A device for simultaneously recognizing human voice and non-human voice comprises a sound source input unit, a feature extraction unit connected with the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected with one recognition result processing unit. The N recognition models comprise two types: human voice recognition models and non-human voice recognition models. Each recognition result processing unit judges whether the output of its recognition model is human voice or non-human voice. The device further comprises a recognition result fusion unit, which triggers the upper-layer application according to the results of the human voice and non-human voice recognition result processing units. The invention also discloses a method for simultaneously recognizing human voice and non-human voice. The invention can recognize the multiple source signals of a complex sound mixture simultaneously and separately; while preserving the recognition accuracy of both, the recognition response is fast and sensitive.
Description
Technical Field
The invention belongs to the technical field of intelligent voice recognition, and particularly relates to a device and a method for recognizing human voice and non-human voice simultaneously.
Background
Most current products based on voice interaction focus either on recognizing human speech or on recognizing one particular type of signal in an audio stream. Audio recognition, however, is not limited to human speech: sound is a multi-source, complex signal. As speech recognition is rapidly deployed in real products, products with more varied demands have emerged. Recognizing human voice and non-human voice simultaneously makes effective use of multi-source information, enriches the practicality of products, and is becoming a trend in the future development of voice recognition.
Human voice recognition faces the following main problems. 1) The amount of speech information is large. Speech patterns differ not only between speakers but also for the same speaker; for example, a speaker's signal differs between casual and careful speech, and the way a person speaks changes over time. 2) Speech is ambiguous. Different words uttered by a speaker may sound similar, which is common in both English and Chinese. 3) The phonetic characteristics of a single letter or word are influenced by context, changing accent, pitch, volume, and articulation speed. 4) Environmental noise and interference severely affect speech recognition and lower the recognition rate. Because of these factors, it is difficult for different users or evaluators to obtain consistent recognition rates in speech recognition tests.
Non-human voice recognition faces the following main problems. 1) Data collection is limited. Non-human sounds are of many kinds, and recording them is restricted to specific environments and occasions, for example snoring, earthquake-warning tones, or a child crying; the amount of corpus data directly affects recognition quality. 2) It is easily disturbed by the human voice environment. When non-human sounds must be recognized against a background of human speech, the noise level directly affects recognition quality. 3) Recognition is slow: recognizing human and non-human sounds simultaneously may require two or more models running at once, and on-device recognition is limited by hardware and memory. For these reasons, achieving good recognition with little human intervention is the main technical difficulty in non-human sound recognition.
Disclosure of Invention
In order to overcome the defects of the existing corpus processing technology, the invention discloses a device and a method for simultaneously identifying human voice and non-human voice.
The device for simultaneously recognizing human voice and non-human voice comprises a sound source input unit, a feature extraction unit connected with the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected with one recognition result processing unit; the N recognition models comprise two types, human voice recognition models and non-human voice recognition models; N is greater than or equal to 2;
the input ends of the N recognition models are all connected with the output end of the feature extraction unit, and the output ends of the N recognition result processing units are all connected with the input end of the recognition result fusion unit;
each recognition result processing unit judges whether the output of its recognition model is human voice or non-human voice;
the device further comprises a recognition result fusion unit, which triggers the upper-layer application according to the results of the human voice and non-human voice recognition result processing units.
Specifically, the recognition model takes the following form:

W* = argmax_W P(Y|W) · P(W)

The first part P(Y|W) represents the probability of the corresponding speech Y occurring given the text sequence W, i.e. the acoustic model; the second part P(W) represents the probability of the text sequence W, i.e. the language model. The subscript W of the argmax function ranges over the words that make up the text sequence W.
Preferably, the determination manner of the recognition result processing unit is specifically:
for human voice recognition, the N-frame average probability and the N-frame cumulative probability of the last N frames of decoding results output by the human voice recognition model are calculated; if the N-frame average probability reaches the specified human-voice average threshold and the N-frame cumulative probability reaches the specified human-voice cumulative threshold, the human voice recognition result is output;
for non-human voice recognition, the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model are calculated; if the N-frame average probability reaches the specified non-human-voice average threshold and the N-frame cumulative probability reaches the specified non-human-voice cumulative threshold, the non-human voice recognition result is output;
wherein the average of the N frame results reaching the specified average threshold P(mean) corresponds to equation (4):

(1/N) · Σ_{i=1}^{N} P(i) ≥ P(mean)    (4)

and the cumulative value of the N frame results reaching the specified cumulative threshold P(acc) corresponds to equation (5):

Σ_{i=1}^{N} P(i) ≥ P(acc)    (5)

where P(i) denotes the probability of the i-th frame.
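The two-threshold rule above can be sketched in a few lines. This is an illustrative implementation of equations (4) and (5), not code from the patent; the threshold values in the example are invented.

```python
# Per-unit N-frame decision: a recognition result is emitted only when both
# the average probability (equation (4)) and the cumulative probability
# (equation (5)) over the last N frames reach their specified thresholds.

def frame_decision(frame_probs, p_mean, p_acc):
    """Return True when the N-frame average and cumulative probabilities
    both reach their specified thresholds."""
    n = len(frame_probs)
    total = sum(frame_probs)   # cumulative probability, equation (5)
    average = total / n        # average probability, equation (4)
    return average >= p_mean and total >= p_acc

# Example: 5 frames of per-frame probabilities from one recognition model.
probs = [0.9, 0.8, 0.85, 0.95, 0.9]
print(frame_decision(probs, p_mean=0.8, p_acc=4.0))  # → True
```

Requiring both thresholds suppresses spurious triggers: a single high-probability frame cannot satisfy the cumulative condition, and a long run of mediocre frames cannot satisfy the average condition.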
Preferably, the human voice recognition model is trained by adopting the following method:
preparing a human voice corpus, a non-human voice corpus, and corpora of various non-target human and non-human sounds and noises, and extracting features from them for neural network training;
the training features are divided into key features and non-key features; the human voice features are labeled with their corresponding texts and used as the key features input to the human voice neural network;
an arbitrary portion of the non-human voice features, together with the various non-target human-voice and non-human-voice noise features, is labeled as noise and used as the non-key features input to the human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
the key features and non-key features together form the full input for training the human voice acoustic model, and neural network training outputs the human voice recognition model.
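The corpus-assembly steps above can be sketched as follows. This is a hypothetical illustration: the feature vectors, the `<noise>` label, and the helper name are invented; only the 1:3 non-key/key data-volume cap comes from the patent.

```python
# Assemble the training set for the human-voice acoustic model: key features
# carry their transcripts as labels, non-key features are labeled as noise,
# and the non-key/key data-volume ratio is capped at 1:3.

import random

def build_training_set(key_feats, nonkey_pool, max_ratio=1/3):
    """key_feats: list of (feature, transcript); nonkey_pool: list of features."""
    limit = int(len(key_feats) * max_ratio)          # cap noise volume at 1:3
    nonkey = random.sample(nonkey_pool, min(limit, len(nonkey_pool)))
    data = [(f, text) for f, text in key_feats]      # key: labeled with text
    data += [(f, "<noise>") for f in nonkey]         # non-key: labeled as noise
    return data

key = [([0.1, 0.2], "turn on light"),
       ([0.3, 0.4], "air conditioner on"),
       ([0.5, 0.6], "intelligent housekeeper")]
noise_pool = [[0.7, 0.8], [0.9, 1.0]]
dataset = build_training_set(key, noise_pool)
print(len(dataset))  # 3 key samples + at most 1 noise sample
```

The same routine applies to the non-human voice model with the roles of the two corpora swapped, as the next section describes.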
Preferably, the non-human voice recognition model is trained by adopting the following method:
labeling the non-human voice features with their corresponding texts as the key features input to the non-human voice neural network, and labeling an arbitrary portion of the human voice features, together with various non-target human-voice and non-human-voice noise features, as noise, used as the non-key features input to the non-human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
the key non-human voice features and the non-key noise features together form the full input for training the non-human voice acoustic model, and neural network training outputs the non-human voice acoustic model.
The invention also discloses a method for simultaneously identifying the human voice and the non-human voice, which comprises the following steps:
preprocessing an input sound signal;
extracting acoustic characteristic signals from the preprocessed sound signals;
simultaneously inputting the feature signals into N recognition models comprising two types, human voice recognition models and non-human voice recognition models;
the N recognition models respectively input their recognition results into the N recognition result processing units;
each recognition result processing unit judges whether the output of its recognition model is human voice or non-human voice.
Preferably, the determination manner of the recognition result processing unit is specifically:
for human voice recognition, the N-frame average probability and the N-frame cumulative probability of the last N frames of decoding results output by the human voice recognition model are calculated; if the N-frame average probability reaches the specified human-voice average threshold and the N-frame cumulative probability reaches the specified human-voice cumulative threshold, the human voice recognition result is output;
for non-human voice recognition, the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model are calculated; if the N-frame average probability reaches the specified non-human-voice average threshold and the N-frame cumulative probability reaches the specified non-human-voice cumulative threshold, the non-human voice recognition result is output;
wherein the average of the N frame results reaching the specified average threshold P(mean) corresponds to equation (4):

(1/N) · Σ_{i=1}^{N} P(i) ≥ P(mean)    (4)

and the cumulative value of the N frame results reaching the specified cumulative threshold P(acc) corresponds to equation (5):

Σ_{i=1}^{N} P(i) ≥ P(acc)    (5)

where P(i) denotes the probability of the i-th frame.
Preferably, the preprocessing comprises: endpoint detection, to find the start and end points of the voice signal; pre-emphasis, to boost the high-frequency components of the voice and remove the effect of lip radiation; and framing, to convert the audio into short-time-stationary segments, with a window applied after framing to emphasize the central portion of each frame.
Preferably, the manner of extracting the acoustic feature signal is any one of MFCC, LPC, PLP, and LPCC.
The invention solves the technical problem of recognizing the multiple sources of a complex sound signal simultaneously and separately in a complex acoustic environment, i.e. recognizing human voice and non-human voice at the same time, with good recognition for non-specific speakers. Human voice recognition is only slightly affected by the target non-human sound environment; non-human sound recognition achieves good overall accuracy through corpus expansion and related methods and is only slightly affected by the human voice environment; and while preserving both recognition effects, the recognition response is fast and sensitive.
Drawings
FIG. 1 is a schematic diagram of an embodiment of training human voice and non-human voice acoustic models according to the present invention;
fig. 2 is a schematic diagram of an embodiment of the device for simultaneously recognizing human voice and non-human voice according to the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The device for simultaneously recognizing human voice and non-human voice comprises a sound source input unit, a feature extraction unit connected with the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected with one recognition result processing unit; the N recognition models comprise two types, human voice recognition models and non-human voice recognition models; N is greater than or equal to 2;
the input ends of the N recognition models are all connected with the output end of the feature extraction unit, and the output ends of the N recognition result processing units are all connected with the input end of the recognition result fusion unit;
each recognition result processing unit judges whether the output of its recognition model is human voice or non-human voice;
the device further comprises a recognition result fusion unit, which triggers the upper-layer application according to the results of the human voice and non-human voice recognition result processing units.
The method provided by the invention recognizes the multiple sources of a complex sound signal simultaneously and separately: human voice signals are converted into corresponding texts or commands, and non-human sound signals of specified types are converted into user-defined outputs. The device mainly comprises a human/non-human voice neural network training module and a human/non-human voice joint recognition module.
The human/non-human voice neural network training module uses a neural network architecture to train an Acoustic Model (AM) from labeled corpus input. For unlabeled sound input, the AM predicts pronunciation probabilities and outputs the statistics; combined with a Language Model (LM), Viterbi decoding then yields the best-path output text and its joint probability.
A typical speech recognition probability model is shown in equation (1):

W* = argmax_W P(W|Y)    (1)

In equation (1), W denotes a text sequence, Y denotes the input speech signal, and P(W|Y) is the probability of the text sequence given the unlabeled speech; the recognizer outputs the sequence W* that maximizes it. argmax returns the argument that maximizes a function, and its subscript W ranges over the words that make up the text sequence W.

Equation (2) follows from Bayes' rule:

P(W|Y) = P(Y|W) · P(W) / P(Y)    (2)

The denominator P(Y) in equation (2) is the probability of the audio; once the input is given it is constant and can be treated as P(Y) = 1, which yields equation (3):

W* = argmax_W P(Y|W) · P(W)    (3)

The first part P(Y|W) in equation (3) represents the probability of the corresponding speech occurring given the text sequence, i.e. the Acoustic Model (AM); the second part P(W) represents the probability of the text sequence, i.e. the Language Model (LM). In practical applications, the acoustic model is obtained by training a neural network on a large-scale corpus, the language model is built from the entries expected in the intended usage scenario, and decoding combines the two models to obtain an effective recognition result.
In order to train an acoustic model with balanced recognition effect, the distribution of linguistic data needs to be controlled, and a typical process diagram of a training structure is shown in fig. 1.
A human voice corpus, a non-human voice corpus, and corpora of various non-target human and non-human sounds and noises are prepared, and features are extracted from them for neural network training. The features are divided into key features and non-key features. For the human voice model, the human voice features are labeled with their corresponding texts as the key features input to the human voice neural network, while part of the non-human voice features and the various non-target noise features are labeled as noise as the non-key features; the key voice features and non-key noise features together form the full input for training the human voice acoustic model, and neural network training outputs the human voice acoustic model. For the non-human voice model, the non-human voice features are labeled with their corresponding texts as the key features input to the non-human voice neural network, while part of the human voice features and the various non-target noise features are labeled as noise as the non-key features. In both cases the ratio of the data volume of non-key features to key features does not exceed 1:3. The key non-human voice features and the non-key noise features together form the full input for training the non-human voice acoustic model, and neural network training outputs the non-human acoustic model.
To reduce the chance that a target non-human sound is mistakenly recognized as human voice by the human voice acoustic model, part of the non-human voice features are used as noise during training; corpus material containing no obvious human voice is randomly selected from the non-human voice features to serve as the human-voice model's noise.
The device for simultaneously identifying the human voice and the non-human voice comprises a sound source input unit, a feature extraction unit, a model identification unit, an identification result processing unit and an identification result fusion unit, and a specific structural schematic diagram is shown in figure 2.
The sound source input unit performs the necessary preprocessing on the multi-source complex signal input in real time, which may comprise human voice, non-human sound, and noise.
The preprocessing generally includes the following steps: endpoint detection, to find the start and end points of the voice signal; pre-emphasis, to boost the high-frequency components of the voice and remove the effect of lip radiation; and framing, to convert the audio into short-time-stationary segments, with a window applied after framing to emphasize the central portion of each frame.
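The pre-emphasis, framing, and windowing steps above can be sketched as follows. The filter coefficient 0.97 and the frame sizes are conventional choices, not values given in the patent, and endpoint detection is omitted for brevity.

```python
# Minimal preprocessing sketch: pre-emphasis, framing into overlapping
# short-time-stationary segments, and Hamming windowing of each frame.

import math

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1], boosting high frequencies."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    """Apply a Hamming window, emphasizing the center of the frame."""
    n = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]

signal = [math.sin(0.1 * t) for t in range(400)]      # dummy input signal
frames = [hamming(f) for f in frame_signal(pre_emphasis(signal), 160, 80)]
print(len(frames), len(frames[0]))  # → 4 160
```

With a 16 kHz sampling rate, a 160-sample frame with an 80-sample hop corresponds to the common 10 ms frame / 5 ms shift configuration.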
The feature extraction unit extracts features from the preprocessed voice signal; the feature extraction methods include, but are not limited to, MFCC (Mel-Frequency Cepstral Coefficients), LPC (Linear Prediction Coefficients), PLP (Perceptual Linear Prediction), and LPCC (Linear Prediction Cepstral Coefficients).
the model identification unit comprises N identification models (N is more than or equal to 2) and is divided into a human voice identification model and a non-human voice identification model, for example, as in the specific embodiment shown in FIG. 2, the model identification unit comprises 1 human voice identification model and N non-human voice identification models. Correspondingly, the system also comprises a recognition result processing unit of 1 human voice and a recognition result processing unit of n non-human voices.
Each recognition model receives the same feature signal from the feature extraction unit and recognizes each frame of features in parallel. The human voice acoustic model decodes in combination with the language model to obtain the recognition content and the corresponding probabilities. The non-human voice acoustic model may likewise decode with a language model to obtain recognition content and probabilities, or it may take the best output of the acoustic model directly as the recognition content, using the average and cumulative probabilities of multi-frame results as the judgment basis; this omits the decoding step and saves computing resources.
The recognition result processing unit is divided into N processing units (N ≥ 2), connected one-to-one with the recognition models; different processing units are independent of each other, and each performs independent judgment on every frame of recognition results output by its connected model.
The recognition result processing units perform the human voice and non-human voice judgments simultaneously:
for human voice recognition, the N-frame average probability and the N-frame cumulative probability of the last N frames of decoding results output by the human voice recognition model are calculated; if the N-frame average probability reaches the specified human-voice average threshold and the N-frame cumulative probability reaches the specified human-voice cumulative threshold, the human voice recognition result is output;
for non-human voice recognition, the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model are calculated; if the N-frame average probability reaches the specified non-human-voice average threshold and the N-frame cumulative probability reaches the specified non-human-voice cumulative threshold, the non-human voice recognition result is output.
It should be noted that each recognition result processing unit only processes the results output by its corresponding recognition model. For example, there may be two human voice processing units, one recognizing English command words and one recognizing Chinese command words, and two non-human voice processing units, one recognizing a baby's cry and one recognizing an earthquake-warning tone; each outputs its own recognition results, and the average and cumulative values of the result probabilities are calculated uniformly.
The average of the multi-frame results, i.e. the N-frame average probability, reaching the specified average threshold P(mean) is equation (4):

(1/N) · Σ_{i=1}^{N} P(i) ≥ P(mean)    (4)

and the cumulative value of the multi-frame results, i.e. the N-frame cumulative probability, reaching the specified cumulative threshold P(acc) is equation (5):

Σ_{i=1}^{N} P(i) ≥ P(acc)    (5)

where P(i) denotes the probability of the i-th frame; when both conditions hold, the recognition result is output.
The recognition result fusion unit triggers the corresponding upper-layer application according to the results of the human voice and non-human voice recognition result processing units. The following outcomes may exist: human voice recognized and non-human voice recognized; human voice recognized but non-human voice not recognized; human voice not recognized but non-human voice recognized; neither recognized.
For example, when both a human voice and a non-human sound are recognized, the recognition result fusion unit triggers the upper-layer application and, by judging priority and whether the two recognitions conflict, responds preferentially to one of them. If the human voice recognition result is "turn off light" while the non-human recognition result is "earthquake warning sound A", the unit preferentially responds to the non-human result by flashing the light.
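The priority judgment above can be sketched as a small lookup. This is a hypothetical illustration: the priority table, event names, and function name are invented; the patent leaves the concrete policy to the upper-layer application.

```python
# Fusion-unit sketch: when both a human voice command and a non-human sound
# are recognized, the higher-priority (lower-numbered) result wins.

PRIORITY = {"earthquake warning sound A": 0,  # safety-critical events first
            "baby cry": 1,
            "turn off light": 5,              # ordinary voice commands later
            "air conditioner on": 5}

def fuse(human_result, nonhuman_result):
    """Return the recognition result the upper-layer application should act on."""
    candidates = [r for r in (human_result, nonhuman_result) if r is not None]
    if not candidates:
        return None                           # neither recognized: no trigger
    return min(candidates, key=lambda r: PRIORITY.get(r, 9))

print(fuse("turn off light", "earthquake warning sound A"))  # safety event wins
print(fuse("turn off light", None))                          # lone result passes through
```

All four outcomes listed above are covered: two results resolve by priority, one result passes through, and no result returns `None`.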
To address the limited resources and slow response of on-device recognition of human and non-human sounds in multi-source complex signals, the invention adjusts the corpus proportions and shrinks the acoustic model through its training parameters during model training; the smaller the acoustic model, the fewer resources recognition requires. This allows multiple recognition models to run in parallel, reduces resource usage on the end product, and effectively improves recognition speed.
To achieve good human voice and non-human voice recognition, the corpora are expanded through collection and the model parameters are controlled, keeping the recognition model small while maintaining good accuracy. To address the susceptibility of the human and non-human models to interference from non-target sounds in complex environments during recognition, the recognition results of multiple frames of the voice signal are judged jointly during result processing and thresholds are set, largely reducing the interference of non-target environmental sounds on target sound recognition.
When no target non-human sound is present, human voice test recognition is normal, e.g. "intelligent housekeeper" and "air conditioner on"; when target non-human sounds are played in a loop, their recognition is good, e.g. "earthquake warning A" and "earthquake warning B"; and when target non-human sounds are looped while human voice tests run simultaneously, both human and non-human recognition remain good.
The invention solves the technical problem of simultaneously and separately recognizing multi-source complex signals in a complex sound environment, i.e. recognizing human voice and non-human voice at the same time, with good recognition for non-specific speakers. Human voice recognition is only slightly affected by the target non-human-voice environment; through corpus expansion and similar methods, non-human voice recognition performs well overall and is only slightly affected by the human-voice environment. While maintaining these recognition effects, the recognition response is fast and sensitive.
The foregoing describes preferred embodiments of the present invention. Provided they are not obviously contradictory, the preferred embodiments may be combined in any manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventors' verification process and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made using the description and drawings of the present invention are likewise included within the scope of the invention.
Claims (9)
1. A device for simultaneously recognizing human voice and non-human voice, characterized by comprising a sound source input unit, a feature extraction unit connected with the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected with one recognition result processing unit; the N recognition models consist of two kinds of recognition models, namely human voice recognition models and non-human voice recognition models; N is greater than or equal to 2;
the input ends of the N recognition models are all connected with the output end of the feature extraction unit, and the output ends of the N recognition result processing units are all connected with the input end of a recognition result fusion unit;
each recognition result processing unit judges the output result of its recognition model and recognizes it as human voice or non-human voice;
the device further comprises the recognition result fusion unit, which is used for triggering the upper-layer application according to the results of the human voice and non-human voice recognition result processing units.
2. The device for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the recognition model is of the form
W* = argmax_W P(Y|W)·P(W),
where the first part P(Y|W) represents the probability that the corresponding speech Y occurs given the text sequence W, i.e. the acoustic model; the second part represents the probability P(W) of the text sequence W, i.e. the language model; and the subscript W of the argmax function ranges over the characters or words that make up the text sequence W.
3. The device for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the recognition result processing unit is specifically configured to:
for human voice recognition, calculate the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results output by the human voice recognition model, and output a human voice recognition result if the N-frame average probability reaches the specified human voice average threshold and the N-frame cumulative probability reaches the specified human voice cumulative threshold;
for non-human voice recognition, calculate the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model, and output a non-human voice recognition result if the N-frame average probability reaches the specified non-human voice average threshold and the N-frame cumulative probability reaches the specified non-human voice cumulative threshold;
wherein the average value of the N frame results reaching the specified average threshold P(mean) means that formula (4) holds: (1/N)·Σ_{i=1..N} p(i) ≥ P(mean),
and the probability cumulative value of the N frame results reaching the specified cumulative threshold P(acc) means that formula (5) holds: Σ_{i=1..N} p(i) ≥ P(acc);
where p(i) represents the probability of the i-th frame.
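The joint judgment over N frame probabilities can be sketched as follows; the threshold values in the usage comments are illustrative assumptions, not values from the patent:

```python
# Sketch of the recognition-result processing unit's joint judgment
# (equations (4) and (5)): a recognition result is emitted only when both the
# N-frame average probability and the N-frame cumulative probability reach
# their specified thresholds.
def passes_thresholds(frame_probs, p_mean, p_acc):
    """frame_probs: per-frame decoding probabilities p(1)..p(N)."""
    n = len(frame_probs)
    total = sum(frame_probs)   # cumulative probability, formula (5)
    mean = total / n           # average probability, formula (4)
    return mean >= p_mean and total >= p_acc

# Three confident frames pass; three weak frames are rejected as interference.
print(passes_thresholds([0.9, 0.8, 0.85], p_mean=0.8, p_acc=2.0))  # → True
print(passes_thresholds([0.5, 0.4, 0.3], p_mean=0.8, p_acc=2.0))   # → False
```

Requiring both conditions is what suppresses brief non-target sounds: a single loud frame can raise one frame's probability, but not the average and cumulative values over all N frames.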
4. The device for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the human voice recognition model is trained by the following method:
preparing a human voice corpus, a non-human voice corpus, and corpora of various non-target human voices and non-target non-human-voice noises, and extracting features from them for neural network training;
the training features are divided into key features and non-key features; the human voice features are labeled with their corresponding texts and used as the key features input to the human voice neural network;
any part of the non-human voice features, together with the various non-target human voice and non-target non-human-voice noise features, is selected, labeled as noise, and used as the non-key features input to the human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
and the key features and non-key features are taken as the entire input for human voice acoustic model training, and neural network training is performed to output the human voice recognition model.
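A minimal sketch of assembling such a training set, assuming in-memory lists of clips; the function name, data layout, and cap enforcement are illustrative assumptions:

```python
# Labeled speech utterances form the key features; a slice of non-human sounds
# and noise is relabeled as "noise" (non-key features), capped so the non-key
# data volume never exceeds 1/3 of the key data volume.
import random

def build_training_set(speech, nonhuman, noise, max_ratio=1/3):
    """speech: (clip, transcript) pairs; nonhuman/noise: clip lists.
    Returns (sample, label) pairs respecting the 1:3 non-key/key cap."""
    key = [(clip, text) for clip, text in speech]   # transcribed speech
    budget = int(len(key) * max_ratio)              # 1:3 data-volume cap
    pool = [(clip, "noise") for clip in nonhuman + noise]
    random.shuffle(pool)                            # sample non-key data
    return key + pool[:budget]
```

The same routine, with the roles of the speech and non-human clips swapped, would cover the non-human voice model of claim 5.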
5. The device for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the non-human voice recognition model is trained by the following method:
the non-human voice features are labeled with their corresponding texts and used as the key features input to the non-human voice neural network; any part of the human voice features, together with the various non-target human voice and non-target non-human-voice noise features, is selected, labeled as noise, and used as the non-key features input to the non-human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
and the key non-human voice features and the non-key noise features are taken as the entire input for non-human voice acoustic model training, and neural network training is performed to output the non-human voice acoustic model.
6. A method for simultaneously recognizing human voice and non-human voice, characterized by comprising the following steps:
preprocessing an input sound signal;
extracting acoustic feature signals from the preprocessed sound signal;
inputting the feature signals simultaneously into N recognition models consisting of two kinds of recognition models, namely human voice recognition models and non-human voice recognition models;
the N recognition models respectively inputting their recognition results into N recognition result processing units;
each recognition result processing unit judging the output result of its recognition model and recognizing it as human voice or non-human voice.
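The parallel structure of the steps above can be sketched as follows; the model and processor callables are placeholders for the trained recognition models and result processing units:

```python
# One feature extraction feeds N recognition models in parallel, and each
# model's output goes to its own recognition-result processing unit.
from concurrent.futures import ThreadPoolExecutor

def recognize(features, models, processors):
    """Run N models on the same features and post-process each output."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        raw = list(pool.map(lambda m: m(features), models))
    # Each processing unit judges its own model's output independently.
    return [proc(out) for proc, out in zip(processors, raw)]
```

Running the models in parallel on the shared features is what keeps the response fast even with N ≥ 2 models on a terminal product.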
7. The method for simultaneously recognizing human voice and non-human voice according to claim 6, wherein the judgment by the recognition result processing unit is specifically as follows:
for human voice recognition, calculate the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results output by the human voice recognition model, and output a human voice recognition result if the N-frame average probability reaches the specified human voice average threshold and the N-frame cumulative probability reaches the specified human voice cumulative threshold;
for non-human voice recognition, calculate the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model, and output a non-human voice recognition result if the N-frame average probability reaches the specified non-human voice average threshold and the N-frame cumulative probability reaches the specified non-human voice cumulative threshold;
wherein the average value of the N frame results reaching the specified average threshold P(mean) means that formula (4) holds: (1/N)·Σ_{i=1..N} p(i) ≥ P(mean),
and the cumulative value of the N frame results reaching the specified cumulative threshold P(acc) means that formula (5) holds: Σ_{i=1..N} p(i) ≥ P(acc);
where p(i) represents the probability of the i-th frame.
8. The method for simultaneously recognizing human voice and non-human voice according to claim 6, wherein the preprocessing comprises the following processing: endpoint detection is carried out to find the starting point and end point of the voice signal; pre-emphasis is applied to boost the high-frequency part of the speech and remove the influence of lip radiation; and the audio is divided by framing into segments with short-time stationarity, each frame after framing being windowed to emphasize its central segment.
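The pre-emphasis, framing, and windowing chain can be sketched numerically as follows; the frame length, frame shift, and 0.97 pre-emphasis coefficient are common defaults, not values taken from the patent:

```python
# Pre-emphasis boosts high frequencies (countering lip-radiation roll-off),
# framing cuts the audio into short quasi-stationary segments, and a Hamming
# window tapers each frame to emphasize its central portion.
import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, alpha=0.97):
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing into overlapping short-time segments
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    # Windowing: taper every frame with a Hamming window
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # 1 s of audio at 16 kHz
print(frames.shape)  # → (98, 400)
```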
9. The method for simultaneously recognizing human voice and non-human voice according to claim 6, wherein the acoustic feature signal is extracted by any one of MFCC, LPC, PLP and LPCC.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011384504.XA CN112185357A (en) | 2020-12-02 | 2020-12-02 | Device and method for simultaneously recognizing human voice and non-human voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112185357A true CN112185357A (en) | 2021-01-05 |
Family
ID=73918359
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113593523A (en) * | 2021-01-20 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Speech detection method and device based on artificial intelligence and electronic equipment |
CN114299950A (en) * | 2021-12-30 | 2022-04-08 | 北京字跳网络技术有限公司 | Subtitle generating method, device and equipment |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103700370A (en) * | 2013-12-04 | 2014-04-02 | 北京中科模识科技有限公司 | Broadcast television voice recognition method and system |
CN103761968A (en) * | 2008-07-02 | 2014-04-30 | 谷歌公司 | Speech recognition with parallel recognition tasks |
CN108734192A (en) * | 2018-01-31 | 2018-11-02 | 国家电网公司 | A kind of support vector machines mechanical failure diagnostic method based on voting mechanism |
CN109243467A (en) * | 2018-11-14 | 2019-01-18 | 龙马智声(珠海)科技有限公司 | Sound-groove model construction method, method for recognizing sound-groove and system |
CN109872713A (en) * | 2019-03-05 | 2019-06-11 | 深圳市友杰智新科技有限公司 | A kind of voice awakening method and device |
CN110992966A (en) * | 2019-12-25 | 2020-04-10 | 开放智能机器(上海)有限公司 | Human voice separation method and system |
CN111246285A (en) * | 2020-03-24 | 2020-06-05 | 北京奇艺世纪科技有限公司 | Method for separating sound in comment video and method and device for adjusting volume |
CN111370025A (en) * | 2020-02-25 | 2020-07-03 | 广州酷狗计算机科技有限公司 | Audio recognition method and device and computer storage medium |
CN111418008A (en) * | 2017-11-30 | 2020-07-14 | 三星电子株式会社 | Method for providing service based on location of sound source and voice recognition apparatus therefor |
CN111583932A (en) * | 2020-04-30 | 2020-08-25 | 厦门快商通科技股份有限公司 | Sound separation method, device and equipment based on human voice model |
CN111883113A (en) * | 2020-07-30 | 2020-11-03 | 云知声智能科技股份有限公司 | Voice recognition method and device |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210105 |