CN112185357A - Device and method for simultaneously recognizing human voice and non-human voice - Google Patents


Info

    • Publication number: CN112185357A (application number CN202011384504.XA)
    • Authority: CN (China)
    • Prior art keywords: human voice; recognition; voice; human; frame
    • Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
    • Inventors: 张琼方, 何云鹏, 许兵
    • Original and current assignee: Chipintelli Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
    • Other languages: Chinese (zh)
    • Application filed by Chipintelli Technology Co Ltd
    • Priority to CN202011384504.XA

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

A device for simultaneously recognizing human voice and non-human voice comprises a sound source input unit, a feature extraction unit connected to the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected to one recognition result processing unit; the N recognition models consist of two kinds of model, namely human voice recognition models and non-human voice recognition models; each recognition result processing unit judges the output of its recognition model as human voice or non-human voice; the device further comprises a recognition result fusion unit, which triggers the upper-layer application according to the results of the human voice and non-human voice recognition result processing units. The invention also discloses a method for simultaneously recognizing human voice and non-human voice. The invention solves the problem of simultaneously and separately recognizing the multi-source complex signals in a sound source; while preserving both recognition effects, the recognition response is fast and sensitive.

Description

Device and method for simultaneously recognizing human voice and non-human voice
Technical Field
The invention belongs to the technical field of intelligent voice recognition, and in particular relates to a device and a method for simultaneously recognizing human voice and non-human voice.
Background
Most current voice-interaction products focus on recognizing human speech, or one particular class of signal within an audio stream. Speech recognition, however, is not limited to human voice: sound is a multi-source, complex signal. As speech recognition is rapidly deployed, products with more varied demands keep emerging. Recognizing human voice and non-human sounds simultaneously makes effective use of multi-source information, enriches the practicality of products, and is a trend in the future development of speech recognition.
Human voice recognition mainly faces the following problems. 1) The amount of speech information is large. Speech patterns differ not only between speakers but also for the same speaker; for example, a speaker's speech differs between casual and careful speaking, and the way a person speaks varies over time. 2) Ambiguity of speech. When a speaker talks, different words may sound similar; this is common in both English and Chinese. 3) The phonetic characteristics of individual letters or words are influenced by their context, changing accent, pitch, volume, and articulation speed. 4) Environmental noise and interference seriously affect speech recognition and lower the recognition rate. Because of these factors, it is difficult for users or evaluators to agree on recognition rates in voice recognition tests.
Non-human voice recognition mainly faces the following problems. 1) Data collection is limited. Non-human sounds are of many kinds, and collection in specific environments and places is limited, for example for snoring, earthquake warning tones, or a child's crying, and the amount of corpus data directly affects the recognition result. 2) Susceptibility to interference from the human acoustic environment. When non-human sounds are recognized against a background of human voices, the noise level directly affects recognition quality. 3) Slow recognition speed. Recognizing human and non-human sounds simultaneously may require two or more models running at the same time, and recognition speed is limited by the hardware and memory available for on-device speech recognition. For these reasons, obtaining good recognition with little human intervention is the main technical difficulty in non-human voice recognition.
Disclosure of Invention
To overcome the defects of the existing corpus processing technology, the invention discloses a device and a method for simultaneously recognizing human voice and non-human voice.
The device for simultaneously recognizing human voice and non-human voice according to the invention comprises a sound source input unit, a feature extraction unit connected to the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected to one recognition result processing unit; the N recognition models consist of two kinds of model, namely human voice recognition models and non-human voice recognition models; N is greater than or equal to 2;
the input ends of the N recognition models are all connected to the output end of the feature extraction unit, and the output ends of the N recognition result processing units are all connected to the input end of the recognition result fusion unit;
each recognition result processing unit judges the output results of its recognition model as human voice or non-human voice;
the device further comprises the recognition result fusion unit, which triggers the upper-layer application according to the results of the human voice and non-human voice recognition result processing units.
Specifically, the recognition model is in the following form:
$$ W^{*} = \arg\max_{W} P(Y \mid W)\, P(W) $$
the first part P (Y | W) represents the probability of the occurrence of the corresponding speech given the text sequence W, i.e. the acoustic model; the second part represents the probability p (W) of the text sequence W, i.e. the language model, the subscript W of the argmax function representing the words or words that make up W of the text sequence.
Preferably, the judgment by the recognition result processing unit is specifically:
for human voice recognition, calculating the N-frame average probability and the N-frame cumulative probability of the N frames of decoding results output by the human voice recognition model, and outputting a human voice recognition result if the N-frame average probability reaches the specified human voice average threshold and the N-frame cumulative probability reaches the specified human voice cumulative threshold;
for non-human voice recognition, calculating the N-frame average probability and the N-frame cumulative probability of the N frames of decoding results of the non-human voice recognition model, and outputting a non-human voice recognition result if the N-frame average probability reaches the specified non-human voice average threshold and the N-frame cumulative probability reaches the specified non-human voice cumulative threshold;
wherein the average of the N frame results reaching the specified average threshold P(mean) satisfies formula (4),
$$ \frac{1}{N}\sum_{i=1}^{N} P(i) \ge P(\mathrm{mean}) \qquad (4) $$
and the cumulative value of the N frame results reaching the specified cumulative threshold P(acc) satisfies formula (5);
$$ \sum_{i=1}^{N} P(i) \ge P(\mathrm{acc}) \qquad (5) $$
where P(i) represents the probability of the i-th frame.
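As a sketch of the two-threshold judgment of formulas (4) and (5), a processing unit's decision on the last N frames of per-frame probabilities can be written as follows (the function and argument names are illustrative, not from the patent):

```python
def judge(frame_probs, mean_thresh, acc_thresh):
    """Trigger a recognition result only when both the N-frame
    average probability (formula 4) and the N-frame cumulative
    probability (formula 5) reach their thresholds."""
    total = sum(frame_probs)
    return total / len(frame_probs) >= mean_thresh and total >= acc_thresh
```

Requiring both conditions is what reduces false triggers: a single high-probability frame cannot pass the cumulative threshold, and many weak frames cannot pass the average threshold.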
Preferably, the human voice recognition model is trained by the following method:
preparing a human voice corpus, a non-human voice corpus, and a corpus of various non-target human voices and non-target non-human noises, and extracting features for neural network training;
the training features are divided into key features and non-key features, and the human voice features, labeled with their corresponding texts, serve as the key features input to the human voice neural network;
any part of the non-human voice features, together with the noise features of the various non-target human and non-human voices, labeled as noise, serve as the non-key features input to the human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
the key features and the non-key features together form the input of the human voice acoustic model training, and neural network training is performed to output the human voice recognition model.
Preferably, the non-human voice recognition model is trained by the following method:
the non-human voice features, labeled with their corresponding texts, serve as the key features input to the non-human voice neural network, and any part of the human voice features, together with the noise features of the various non-target human and non-human voices, labeled as noise, serve as the non-key features input to the non-human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
the key non-human voice features and the non-key noise features together form the input of the non-human voice acoustic model training, and neural network training is performed to output the non-human voice acoustic model.
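A minimal sketch of the corpus assembly described above, under the assumption (not specified by the patent) that key features arrive as (feature, text label) pairs and noise features as bare features; the 1:3 cap on non-key to key data volume is enforced by count:

```python
import random

def build_training_set(key_feats, noise_feats, max_ratio=1/3):
    """Assemble acoustic-model training input: all key (labeled)
    features plus a random subset of noise features, capping the
    non-key to key data ratio at 1:3 as the patent prescribes."""
    cap = int(len(key_feats) * max_ratio)
    noise = random.sample(noise_feats, min(cap, len(noise_feats)))
    # Key features keep their text labels; non-key features are
    # uniformly labeled "noise".
    return [(f, lab) for f, lab in key_feats] + [(f, "noise") for f in noise]
```

The same helper works symmetrically for both models: for the human voice model the key side is labeled speech, for the non-human model the key side is the labeled target sounds.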
The invention also discloses a method for simultaneously recognizing human voice and non-human voice, comprising the following steps:
preprocessing the input sound signal;
extracting acoustic feature signals from the preprocessed sound signal;
inputting the feature signals simultaneously into N recognition models consisting of two kinds of model, namely human voice recognition models and non-human voice recognition models;
the N recognition models respectively input their recognition results into the N recognition result processing units;
each recognition result processing unit judges the output of its recognition model as human voice or non-human voice.
Preferably, the judgment by the recognition result processing unit is specifically:
for human voice recognition, calculating the N-frame average probability and the N-frame cumulative probability of the N frames of decoding results output by the human voice recognition model, and outputting a human voice recognition result if the N-frame average probability reaches the specified human voice average threshold and the N-frame cumulative probability reaches the specified human voice cumulative threshold;
for non-human voice recognition, calculating the N-frame average probability and the N-frame cumulative probability of the N frames of decoding results of the non-human voice recognition model, and outputting a non-human voice recognition result if the N-frame average probability reaches the specified non-human voice average threshold and the N-frame cumulative probability reaches the specified non-human voice cumulative threshold;
wherein the average of the N frame results reaching the specified average threshold P(mean) satisfies formula (4),
$$ \frac{1}{N}\sum_{i=1}^{N} P(i) \ge P(\mathrm{mean}) \qquad (4) $$
and the cumulative value of the N frame results reaching the specified cumulative threshold P(acc) satisfies formula (5);
$$ \sum_{i=1}^{N} P(i) \ge P(\mathrm{acc}) \qquad (5) $$
where P(i) represents the probability of the i-th frame.
Preferably, the preprocessing comprises the following operations: endpoint detection to find the start and end points of the voice signal; pre-emphasis to boost the high-frequency part of the speech and remove the influence of lip radiation; and framing to turn the audio into short-time stationary segments, with windowing applied after framing to emphasize the central part of each short segment.
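As an illustrative sketch of the pre-emphasis, framing, and windowing steps (the patent fixes no concrete parameters; the frame length, hop size, and pre-emphasis coefficient below are common defaults for 16 kHz audio, not values from the description):

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis y[n] = x[n] - alpha*x[n-1], then slicing into
    overlapping frames, then a Hamming window so each frame's
    center is emphasized over its edges."""
    x = np.asarray(signal, dtype=float)
    emphasized = np.append(x[0], x[1:] - alpha * x[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)
```

Endpoint detection is omitted here; a simple energy threshold over these frames would be one way to locate the start and end points the patent mentions.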
Preferably, the manner of extracting the acoustic feature signal is any one of MFCC, LPC, PLP, and LPCC.
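The patent does not prescribe an implementation of these features. As a minimal, numpy-only sketch of the MFCC option (sampling rate, mel filter count, and cepstral coefficient count are assumed defaults), the chain power spectrum, mel filterbank, log, DCT-II looks like:

```python
import numpy as np

def mfcc(frames, sr=16000, n_mels=26, n_ceps=13):
    """Compute MFCCs for pre-windowed frames of shape
    (n_frames, frame_len)."""
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank on mel-spaced center frequencies.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep the
    # first n_ceps coefficients as the cepstral features.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1))
                 / (2 * n_mels))
    return log_energy @ dct.T
```

Because every recognition model in the device consumes the same feature stream, this extraction runs once per frame regardless of how many models follow it.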
The invention solves the technical problem of simultaneously and separately recognizing the multi-source complex signals in a sound source under a complex acoustic environment, that is, recognizing human voice and non-human sounds at the same time, with a good recognition effect for non-specific speakers. Human voice recognition is little affected by the target non-human sound environment; through corpus expansion and similar methods, the overall recognition of non-human sounds is good and little affected by the human voice environment; and with both recognition effects preserved, the recognition response is fast and sensitive.
Drawings
FIG. 1 is a schematic diagram of an embodiment of training the human voice and non-human voice acoustic models according to the present invention;
FIG. 2 is a schematic diagram of an embodiment of the device for simultaneously recognizing human voice and non-human voice according to the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The device for simultaneously recognizing human voice and non-human voice according to the invention comprises a sound source input unit, a feature extraction unit connected to the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected to one recognition result processing unit; the N recognition models consist of two kinds of model, namely human voice recognition models and non-human voice recognition models; N is greater than or equal to 2;
the input ends of the N recognition models are all connected to the output end of the feature extraction unit, and the output ends of the N recognition result processing units are all connected to the input end of the recognition result fusion unit;
each recognition result processing unit judges the output results of its recognition model as human voice or non-human voice;
the device further comprises the recognition result fusion unit, which triggers the upper-layer application according to the results of the human voice and non-human voice recognition result processing units.
The method provided by the invention simultaneously and separately recognizes the multi-source complex signals in a sound source, converting human voice signals into corresponding texts or commands and converting non-human sound signals of specified types into user-defined outputs. The device mainly comprises a human/non-human voice neural network training module and a human/non-human voice joint recognition module.
The neural network training module uses a neural network architecture to train an acoustic model (AM) on labeled corpus input; the AM performs prediction statistics on the pronunciation probabilities of unlabeled sound source input and outputs the statistical result, while Viterbi decoding combined with a language model (LM) yields the optimal-path output text and the corresponding joint probability.
A typical speech recognition probability model is shown in formula (1),
$$ W^{*} = \arg\max_{W} P(W \mid Y) \qquad (1) $$
In formula (1), W denotes a text sequence, Y denotes the input speech signal, and P(W|Y) is the probability of the text sequence given the speech; for unlabeled input speech, the recognizer seeks the text sequence of maximum probability. argmax returns the argument that maximizes its function, and the subscript W of the argmax function ranges over the characters or words that make up the text sequence W.
Formula (2) can be derived from the Bayes rule,
$$ W^{*} = \arg\max_{W} \frac{P(Y \mid W)\, P(W)}{P(Y)} \qquad (2) $$
The denominator P(Y) in formula (2) represents the probability of the audio; once an input is given it is a constant and can be treated as P(Y) = 1, which yields formula (3).
$$ W^{*} = \arg\max_{W} P(Y \mid W)\, P(W) \qquad (3) $$
The first part P(Y|W) in formula (3) represents the probability that the corresponding speech occurs given the text sequence, i.e. the acoustic model (AM); the second part represents the probability P(W) of the text sequence, i.e. the language model (LM). In practical application, the acoustic model AM is obtained by training a neural network on a large-scale corpus, the language model LM is modeled on the entries that can appear in the intended usage scenario, and decoding combines the two models to obtain an effective recognition result.
To train an acoustic model with a balanced recognition effect, the distribution of the corpus must be controlled; a typical diagram of the training structure is shown in FIG. 1.
A human voice corpus, a non-human voice corpus, and a corpus of various non-target human voices and non-target non-human noises are prepared, and features are extracted for neural network training. The features are divided into key features and non-key features. The human voice features, labeled with their corresponding texts, serve as the key features input to the human voice neural network; part of the non-human voice features, together with the noise features of the various non-target human and non-human voices, labeled as noise, serve as the non-key features input to the human voice neural network. The key voice features and the non-key noise features together form the input of the human voice acoustic model training, and neural network training outputs the human voice acoustic model. Symmetrically, the non-human voice features, labeled with their corresponding texts, serve as the key features input to the non-human voice neural network, while part of the human voice features, together with the noise features of the various non-target human and non-human voices, labeled as noise, serve as the non-key features input to the non-human voice neural network; in both cases, the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3. The key non-human voice features and the non-key noise features together form the input of the non-human voice acoustic model training, and neural network training outputs the non-human voice acoustic model.
To reduce the chance that a target non-human sound signal is misrecognized as human voice by the human voice acoustic model, part of the non-human voice features are used as noise during training; corpus segments among the non-human voice features that contain no obvious human voice are randomly selected as the noise for the human voice model.
The device for simultaneously recognizing human voice and non-human voice comprises a sound source input unit, a feature extraction unit, a model recognition unit, recognition result processing units, and a recognition result fusion unit; a schematic diagram of the structure is shown in FIG. 2.
The sound source input unit performs the necessary preprocessing on the multi-source complex sound signal input in real time, which may contain human voice, non-human sounds, and noise.
The preprocessing generally includes the following operations: endpoint detection to find the start and end points of the voice signal; pre-emphasis to boost the high-frequency part of the speech and remove the influence of lip radiation; and framing to turn the audio into short-time stationary segments, with windowing applied after framing to emphasize the central part of each short segment.
The feature extraction unit extracts features from the preprocessed sound signal; the feature extraction modes include but are not limited to MFCC (Mel-Frequency Cepstral Coefficients), LPC (Linear Prediction Coefficients), PLP (Perceptual Linear Prediction), and LPCC (Linear Prediction Cepstral Coefficients).
the model identification unit comprises N identification models (N is more than or equal to 2) and is divided into a human voice identification model and a non-human voice identification model, for example, as in the specific embodiment shown in FIG. 2, the model identification unit comprises 1 human voice identification model and N non-human voice identification models. Correspondingly, the system also comprises a recognition result processing unit of 1 human voice and a recognition result processing unit of n non-human voices.
Each recognition model receives the same feature signal from the feature extraction unit and recognizes each frame of features in parallel. The human voice acoustic model decodes together with the language model to obtain the recognized content and the corresponding probabilities. The non-human voice acoustic model may likewise decode with a language model to obtain the recognized content and probabilities; alternatively, the best output of the acoustic model can be used directly as the recognized content, with the average and cumulative probabilities of the multi-frame results as the judgment basis, which omits the decoding step and saves computing resources.
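As an illustrative sketch of this step (the callable-model interface and thread-based parallelism are assumptions for the example, not the patent's implementation), feeding the same features to all N models in parallel can look like:

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_all(features, models):
    """Score the same feature frames with every model in parallel;
    each result is then handed to that model's own processing
    unit. `models` maps a name to a callable returning
    (recognized_content, per_frame_probabilities)."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, features)
                   for name, fn in models.items()}
        return {name: f.result() for name, f in futures.items()}
```

Because the models share one feature stream and run concurrently, adding a non-human voice model does not add another feature-extraction pass.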
The recognition result processing units number N in total (N ≥ 2) and are connected one-to-one to the recognition models; different recognition result processing units are independent of one another, and each processing unit independently judges every frame of recognition result output by the recognition model connected to it.
The recognition result processing units perform the human voice and non-human voice judgments at the same time:
for human voice recognition, calculating the N-frame average probability and the N-frame cumulative probability of the N frames of decoding results output by the human voice recognition model, and outputting a human voice recognition result if the N-frame average probability reaches the specified human voice average threshold and the N-frame cumulative probability reaches the specified human voice cumulative threshold.
For non-human voice recognition, calculating the N-frame average probability and the N-frame cumulative probability of the N frames of decoding results of the non-human voice recognition model, and outputting a non-human voice recognition result if the N-frame average probability reaches the specified non-human voice average threshold and the N-frame cumulative probability reaches the specified non-human voice cumulative threshold.
It should be noted that each recognition result processing unit only processes the result output by its corresponding recognition model. For example, there may be two human voice recognition result processing units, one recognizing English command words and one recognizing Chinese command words, and two non-human voice recognition result processing units, one recognizing a baby's crying and one recognizing earthquake warning tones; each outputs its own recognition result, and the average and cumulative values of the result probabilities are computed in the same way.
The average of the multi-frame results, i.e. the N-frame average probability, reaching the specified average threshold P(mean) is formula (4),
$$ \frac{1}{N}\sum_{i=1}^{N} P(i) \ge P(\mathrm{mean}) \qquad (4) $$
and the cumulative value of the multi-frame results, i.e. the N-frame cumulative probability, reaching the specified cumulative threshold P(acc) is formula (5);
$$ \sum_{i=1}^{N} P(i) \ge P(\mathrm{acc}) \qquad (5) $$
where P(i) represents the probability of the i-th frame; when both conditions hold, the recognition result is output.
The recognition result fusion unit triggers the corresponding upper-layer application according to the results of the human voice and non-human voice recognition result processing units; the following outcomes may occur: human voice recognized and non-human voice recognized; human voice recognized but non-human voice not recognized; human voice not recognized but non-human voice recognized; neither recognized.
For example, when both a human voice and a non-human sound are recognized, the recognition result fusion unit may trigger the upper-layer application and, by judging priority and whether the two recognitions conflict, respond preferentially to one of them: if the human voice recognition result is "turn off the light" and the non-human voice recognition result is "earthquake warning sound A", the light flashing of the non-human voice recognition result is responded to preferentially.
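A minimal sketch of such a priority-based fusion decision, assuming non-human results (e.g. an earthquake warning) outrank human voice commands by default; the function, argument, and label names are illustrative, not from the patent:

```python
def fuse(human_result, nonhuman_result, priority=("nonhuman", "human")):
    """Given the two processing units' outputs (None means not
    recognized), pick which recognition to respond to, walking
    the priority order until a recognized result is found."""
    results = {"human": human_result, "nonhuman": nonhuman_result}
    for source in priority:
        if results[source] is not None:
            return source, results[source]
    return None, None  # neither recognized: nothing to trigger
```

Swapping the `priority` tuple would let an application prefer human commands instead; the patent leaves this policy to the upper-layer application.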
To address the constraints and slow response of recognizing human and non-human sounds in multi-source complex signals on terminal products, the invention adjusts the corpus proportions and shrinks the acoustic model through the training parameters during model training. The smaller the acoustic model, the fewer resources recognition requires; this allows several recognition models to run in parallel, lowers resource usage on the terminal product, and effectively improves recognition speed.
To achieve good human and non-human recognition, corpora of many kinds are collected and expanded and the model parameters are controlled, keeping the recognition model small while the recognition effect remains good. To address the models' susceptibility to interference from non-target sounds in complex environments, the recognition results of multi-frame sound signals are judged jointly during result processing and thresholds are set, which greatly reduces the interference of non-target environmental sounds with target sound recognition.
When no target non-human sound is present, human voice test recognition is normal, e.g. for "smart housekeeper" and "turn on the air conditioner"; when target non-human sounds are played in a loop, their recognition is good, e.g. for "earthquake warning A" and "earthquake warning B"; and when target non-human sounds are played in a loop while the human voice test runs simultaneously, both recognition effects remain good.
The invention solves the technical problem of simultaneously and separately recognizing the multi-source complex signals in a sound source under a complex acoustic environment, that is, recognizing human voice and non-human sounds at the same time, with a good recognition effect for non-specific speakers. Human voice recognition is little affected by the target non-human sound environment; through corpus expansion and similar methods, the overall recognition of non-human sounds is good and little affected by the human voice environment; and with both recognition effects preserved, the recognition response is fast and sensitive.
The foregoing describes preferred embodiments of the present invention; provided they do not plainly contradict one another, the preferred embodiments may be combined with one another in any manner. The specific parameters in the embodiments and examples are given only to illustrate the inventors' verification process clearly and are not intended to limit the scope of the invention. The scope of the invention is defined by the claims, and equivalent structural changes made using the description and drawings of the invention are likewise included within that scope.

Claims (9)

1. A device for simultaneously recognizing human voice and non-human voice, characterized by comprising a sound source input unit, a feature extraction unit connected with the sound source input unit, N recognition models and N recognition result processing units, each recognition model being connected with one recognition result processing unit; the N recognition models comprise two types of recognition model, namely a human voice recognition model and a non-human voice recognition model; N is greater than or equal to 2;
the input ends of the N recognition models are all connected with the output end of the feature extraction unit, and the output ends of the N recognition result processing units are all connected with the input end of the recognition result fusion unit;
each recognition result processing unit judges whether the output result of its recognition model is human voice or non-human voice;
the device further comprises a recognition result fusion unit, which triggers the upper-layer application according to the results of the human-voice and non-human-voice recognition result processing units.
2. The apparatus for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the recognition model is of the form:
$$W^{*} = \underset{W}{\operatorname{argmax}}\; P(Y \mid W)\, P(W)$$
the first part, P(Y|W), represents the probability of the corresponding speech Y occurring given the text sequence W, i.e. the acoustic model; the second part represents the probability P(W) of the text sequence W, i.e. the language model; the subscript W of the argmax function ranges over the words or characters that make up the text sequence W.
3. The apparatus for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the recognition result processing unit is specifically configured to:
for the human voice recognition, calculating the N-frame average probability and N-frame cumulative probability of the N-frame decoding results output by the human voice recognition model, and outputting a human voice recognition result if the N-frame average probability reaches the specified human-voice average threshold and the N-frame cumulative probability reaches the specified human-voice cumulative threshold;
for the non-human voice recognition, calculating the N-frame average probability and N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model, and outputting a non-human voice recognition result if the N-frame average probability reaches the specified non-human-voice average threshold and the N-frame cumulative probability reaches the specified non-human-voice cumulative threshold;
wherein the average of the N frame results reaching the specified average threshold P(mean) means that formula (4) holds:
$$\frac{1}{N}\sum_{i=1}^{N} P(i) \ge P(\mathrm{mean}) \tag{4}$$
and the accumulated probability of the N frame results reaching the specified cumulative threshold P(acc) means that formula (5) holds:
$$\sum_{i=1}^{N} P(i) \ge P(\mathrm{acc}) \tag{5}$$
where P(i) denotes the probability of the i-th frame.
4. The apparatus for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the human voice recognition model is trained by the following method:
preparing a human voice corpus, a non-human voice corpus, and corpora of various non-target human voices and non-target non-human-voice noises, and extracting features from them for neural network training;
the training features are divided into key features and non-key features; the human voice features are labelled with their corresponding texts and used as the key features input to the human voice neural network;
selecting any part of the non-human voice features, together with the various non-target human voice and non-human-voice noise features, labelling them as noise, and using them as the non-key features input to the human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
and taking the key features and non-key features as the complete input for human voice acoustic model training, performing neural network training to output the human voice recognition model.
5. The apparatus for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the non-human voice recognition model is trained by the following method:
labelling the non-human voice features with their corresponding texts as the key features input to the non-human voice neural network, and labelling any part of the human voice features, together with the various non-target human voice and non-human-voice noise features, as noise to serve as the non-key features input to the non-human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
and taking the key non-human voice features and the non-key noise features as the complete input for non-human voice acoustic model training, performing neural network training to output the non-human voice acoustic model.
6. A method for simultaneously recognizing human voice and non-human voice is characterized by comprising the following steps:
preprocessing an input sound signal;
extracting acoustic feature signals from the preprocessed sound signal;
inputting the feature signals simultaneously into N recognition models comprising two types of recognition model, namely a human voice recognition model and a non-human voice recognition model;
the N recognition models respectively input their recognition results into the N recognition result processing units, and
each recognition result processing unit judges whether the output result of its recognition model is human voice or non-human voice.
7. The method for simultaneously recognizing human voice and non-human voice according to claim 6, wherein the determination manner of the recognition result processing unit is specifically as follows:
for the human voice recognition, calculating the N-frame average probability and N-frame cumulative probability of the N-frame decoding results output by the human voice recognition model, and outputting a human voice recognition result if the N-frame average probability reaches the specified human-voice average threshold and the N-frame cumulative probability reaches the specified human-voice cumulative threshold;
for the non-human voice recognition, calculating the N-frame average probability and N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model, and outputting a non-human voice recognition result if the N-frame average probability reaches the specified non-human-voice average threshold and the N-frame cumulative probability reaches the specified non-human-voice cumulative threshold;
wherein the average of the N frame results reaching the specified average threshold P(mean) means that formula (4) holds:
$$\frac{1}{N}\sum_{i=1}^{N} P(i) \ge P(\mathrm{mean}) \tag{4}$$
and the cumulative value of the N frame results reaching the specified cumulative threshold P(acc) means that formula (5) holds:
$$\sum_{i=1}^{N} P(i) \ge P(\mathrm{acc}) \tag{5}$$
where P(i) denotes the probability of the i-th frame.
8. The method for simultaneously recognizing human voice and non-human voice according to claim 6, wherein the preprocessing comprises the following operations: endpoint detection to find the start and end points of the voice signal; pre-emphasis to boost the high-frequency part of the speech and remove the influence of lip radiation; and framing to give the audio short-time stationarity, with windowing applied after framing to emphasize the central segment of each short frame.
9. The method for simultaneously recognizing human voice and non-human voice according to claim 6, wherein the manner of extracting the acoustic feature signal is any one of MFCC, LPC, PLP and LPCC.
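The preprocessing chain of claim 8 (pre-emphasis, framing, windowing) can be sketched in pure Python. The frame length, hop size, Hamming window, and pre-emphasis coefficient below are common choices assumed for illustration, not values specified by the patent:

```python
import math

def pre_emphasize(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: boosts high frequencies to offset
    # the roll-off caused by lip radiation (alpha = 0.97 is a common
    # choice, not a value given by the patent).
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len=160, hop=80):
    # Split into overlapping short frames (short-time stationarity) and
    # apply a Hamming window, which emphasizes each frame's centre.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

emphasized = pre_emphasize([math.sin(0.1 * n) for n in range(400)])
windowed_frames = frame_and_window(emphasized)
```

Features such as MFCC (claim 9) would then be computed from each windowed frame.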
CN202011384504.XA 2020-12-02 2020-12-02 Device and method for simultaneously recognizing human voice and non-human voice Pending CN112185357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011384504.XA CN112185357A (en) 2020-12-02 2020-12-02 Device and method for simultaneously recognizing human voice and non-human voice

Publications (1)

Publication Number Publication Date
CN112185357A true CN112185357A (en) 2021-01-05

Family

ID=73918359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011384504.XA Pending CN112185357A (en) 2020-12-02 2020-12-02 Device and method for simultaneously recognizing human voice and non-human voice

Country Status (1)

Country Link
CN (1) CN112185357A (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103700370A (en) * 2013-12-04 2014-04-02 北京中科模识科技有限公司 Broadcast television voice recognition method and system
CN103761968A (en) * 2008-07-02 2014-04-30 谷歌公司 Speech recognition with parallel recognition tasks
CN108734192A (en) * 2018-01-31 2018-11-02 国家电网公司 A kind of support vector machines mechanical failure diagnostic method based on voting mechanism
CN109243467A (en) * 2018-11-14 2019-01-18 龙马智声(珠海)科技有限公司 Sound-groove model construction method, method for recognizing sound-groove and system
CN109872713A (en) * 2019-03-05 2019-06-11 深圳市友杰智新科技有限公司 A kind of voice awakening method and device
CN110992966A (en) * 2019-12-25 2020-04-10 开放智能机器(上海)有限公司 Human voice separation method and system
CN111246285A (en) * 2020-03-24 2020-06-05 北京奇艺世纪科技有限公司 Method for separating sound in comment video and method and device for adjusting volume
CN111370025A (en) * 2020-02-25 2020-07-03 广州酷狗计算机科技有限公司 Audio recognition method and device and computer storage medium
CN111418008A (en) * 2017-11-30 2020-07-14 三星电子株式会社 Method for providing service based on location of sound source and voice recognition apparatus therefor
CN111583932A (en) * 2020-04-30 2020-08-25 厦门快商通科技股份有限公司 Sound separation method, device and equipment based on human voice model
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593523A (en) * 2021-01-20 2021-11-02 腾讯科技(深圳)有限公司 Speech detection method and device based on artificial intelligence and electronic equipment
CN113593523B (en) * 2021-01-20 2024-06-21 腾讯科技(深圳)有限公司 Speech detection method and device based on artificial intelligence and electronic equipment
CN114299950A (en) * 2021-12-30 2022-04-08 北京字跳网络技术有限公司 Subtitle generating method, device and equipment

Similar Documents

Publication Publication Date Title
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN107767863B (en) Voice awakening method and system and intelligent terminal
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN102982811B (en) Voice endpoint detection method based on real-time decoding
CN107972028B (en) Man-machine interaction method and device and electronic equipment
CN105679310A (en) Method and system for speech recognition
CN109887511A (en) A kind of voice wake-up optimization method based on cascade DNN
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
CN106653002A (en) Literal live broadcasting method and platform
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
CN110853669B (en) Audio identification method, device and equipment
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
CN113192535A (en) Voice keyword retrieval method, system and electronic device
Rabiee et al. Persian accents identification using an adaptive neural network
CN110246518A (en) Speech-emotion recognition method, device, system and storage medium based on more granularity sound state fusion features
CN112216270B (en) Speech phoneme recognition method and system, electronic equipment and storage medium
Gharsellaoui et al. Automatic emotion recognition using auditory and prosodic indicative features
Dusan et al. On integrating insights from human speech perception into automatic speech recognition.
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
Tawaqal et al. Recognizing five major dialects in Indonesia based on MFCC and DRNN
Sinha et al. Fusion of multi-stream speech features for dialect classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105