CN112185357A - Device and method for simultaneously recognizing human voice and non-human voice - Google Patents
- Publication number
- CN112185357A; application number CN202011384504.XA
- Authority
- CN
- China
- Prior art keywords
- human voice
- recognition
- voice
- human
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G06N3/02, G06N3/08 — Computing arrangements based on specific computational models; Neural networks; Learning methods
- G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Abstract
A device for simultaneously recognizing human voice and non-human voice comprises a sound source input unit, a feature extraction unit connected with the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected with one recognition result processing unit. The N recognition models comprise two types: human voice recognition models and non-human voice recognition models. Each recognition result processing unit judges whether the output of its recognition model is human voice or non-human voice. The device further comprises a recognition result fusion unit, which triggers the upper-layer application according to the results of the human voice and non-human voice recognition result processing units. The invention also discloses a method for simultaneously recognizing human voice and non-human voice. The invention can recognize the multiple source signals of a complex sound mixture simultaneously and separately; while preserving the recognition accuracy of both, the recognition response is fast and sensitive.
Description
Technical Field
The invention belongs to the technical field of intelligent voice recognition, and particularly relates to a device and a method for recognizing human voice and non-human voice simultaneously.
Background
Most current products based on voice interaction focus either on recognizing human speech or on recognizing one particular type of signal in an audio stream. Audio recognition, however, is not limited to human speech: sound is a multi-source, complex signal. As speech recognition is rapidly deployed in real products, products with more varied demands have emerged. Recognizing human voice and non-human voice simultaneously makes effective use of multi-source information, enriches the practicality of products, and is becoming a trend in the future development of voice recognition.
Human voice recognition faces the following main problems. 1) The amount of speech information is large. Speech patterns differ not only between speakers but also for the same speaker; for example, a speaker's signal differs between casual and careful speech, and the way a person speaks changes over time. 2) Speech is ambiguous. Different words uttered by a speaker may sound similar, which is common in both English and Chinese. 3) The phonetic characteristics of a single letter or word are influenced by context, changing accent, pitch, volume, and articulation speed. 4) Environmental noise and interference severely affect speech recognition and lower the recognition rate. Because of these factors, it is difficult for different users or evaluators to obtain consistent recognition rates in speech recognition tests.
Non-human voice recognition faces the following main problems. 1) Data collection is limited. Non-human sounds are of many kinds, and recording them is restricted to specific environments and occasions, for example snoring, earthquake-warning tones, or a child crying; the amount of corpus data directly affects recognition quality. 2) It is easily disturbed by the human voice environment. When non-human sounds must be recognized against a background of human speech, the noise level directly affects recognition quality. 3) Recognition is slow: recognizing human and non-human sounds simultaneously may require two or more models running at once, and on-device recognition is limited by hardware and memory. For these reasons, achieving good recognition with little human intervention is the main technical difficulty in non-human sound recognition.
Disclosure of Invention
In order to overcome the defects of the existing corpus processing technology, the invention discloses a device and a method for simultaneously identifying human voice and non-human voice.
The device for simultaneously recognizing human voice and non-human voice comprises a sound source input unit, a feature extraction unit connected with the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected with one recognition result processing unit; the N recognition models comprise two types, human voice recognition models and non-human voice recognition models; N is greater than or equal to 2;
the input ends of the N recognition models are all connected with the output end of the feature extraction unit, and the output ends of the N recognition result processing units are all connected with the input end of the recognition result fusion unit;
each recognition result processing unit judges whether the output of its recognition model is human voice or non-human voice;
the device further comprises a recognition result fusion unit, which triggers the upper-layer application according to the results of the human voice and non-human voice recognition result processing units.
Specifically, the recognition model takes the following form:

W* = argmax_W P(Y|W) · P(W)

The first part P(Y|W) represents the probability of the corresponding speech Y occurring given the text sequence W, i.e. the acoustic model; the second part P(W) represents the probability of the text sequence W, i.e. the language model. The subscript W of the argmax function ranges over the words that make up the text sequence W.
Preferably, the determination manner of the recognition result processing unit is specifically:
for human voice recognition, the N-frame average probability and the N-frame cumulative probability of the last N frames of decoding results output by the human voice recognition model are calculated; if the N-frame average probability reaches the specified human-voice average threshold and the N-frame cumulative probability reaches the specified human-voice cumulative threshold, the human voice recognition result is output;
for non-human voice recognition, the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model are calculated; if the N-frame average probability reaches the specified non-human-voice average threshold and the N-frame cumulative probability reaches the specified non-human-voice cumulative threshold, the non-human voice recognition result is output;
wherein the average of the N frame results reaching the specified average threshold P(mean) corresponds to equation (4):

(1/N) · Σ_{i=1}^{N} P(i) ≥ P(mean)    (4)

and the cumulative value of the N frame results reaching the specified cumulative threshold P(acc) corresponds to equation (5):

Σ_{i=1}^{N} P(i) ≥ P(acc)    (5)

where P(i) denotes the probability of the i-th frame.
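The two-threshold rule above can be sketched in a few lines. This is an illustrative implementation of equations (4) and (5), not code from the patent; the threshold values in the example are invented.

```python
# Per-unit N-frame decision: a recognition result is emitted only when both
# the average probability (equation (4)) and the cumulative probability
# (equation (5)) over the last N frames reach their specified thresholds.

def frame_decision(frame_probs, p_mean, p_acc):
    """Return True when the N-frame average and cumulative probabilities
    both reach their specified thresholds."""
    n = len(frame_probs)
    total = sum(frame_probs)   # cumulative probability, equation (5)
    average = total / n        # average probability, equation (4)
    return average >= p_mean and total >= p_acc

# Example: 5 frames of per-frame probabilities from one recognition model.
probs = [0.9, 0.8, 0.85, 0.95, 0.9]
print(frame_decision(probs, p_mean=0.8, p_acc=4.0))  # → True
```

Requiring both thresholds suppresses spurious triggers: a single high-probability frame cannot satisfy the cumulative condition, and a long run of mediocre frames cannot satisfy the average condition.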
Preferably, the human voice recognition model is trained by adopting the following method:
preparing a human voice corpus, a non-human voice corpus, and corpora of various non-target human and non-human sounds and noises, and extracting features from them for neural network training;
the training features are divided into key features and non-key features; the human voice features are labeled with their corresponding texts and used as the key features input to the human voice neural network;
an arbitrary portion of the non-human voice features, together with the various non-target human-voice and non-human-voice noise features, is labeled as noise and used as the non-key features input to the human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
the key features and non-key features together form the full input for training the human voice acoustic model, and neural network training outputs the human voice recognition model.
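The corpus-assembly steps above can be sketched as follows. This is a hypothetical illustration: the feature vectors, the `<noise>` label, and the helper name are invented; only the 1:3 non-key/key data-volume cap comes from the patent.

```python
# Assemble the training set for the human-voice acoustic model: key features
# carry their transcripts as labels, non-key features are labeled as noise,
# and the non-key/key data-volume ratio is capped at 1:3.

import random

def build_training_set(key_feats, nonkey_pool, max_ratio=1/3):
    """key_feats: list of (feature, transcript); nonkey_pool: list of features."""
    limit = int(len(key_feats) * max_ratio)          # cap noise volume at 1:3
    nonkey = random.sample(nonkey_pool, min(limit, len(nonkey_pool)))
    data = [(f, text) for f, text in key_feats]      # key: labeled with text
    data += [(f, "<noise>") for f in nonkey]         # non-key: labeled as noise
    return data

key = [([0.1, 0.2], "turn on light"),
       ([0.3, 0.4], "air conditioner on"),
       ([0.5, 0.6], "intelligent housekeeper")]
noise_pool = [[0.7, 0.8], [0.9, 1.0]]
dataset = build_training_set(key, noise_pool)
print(len(dataset))  # 3 key samples + at most 1 noise sample
```

The same routine applies to the non-human voice model with the roles of the two corpora swapped, as the next section describes.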
Preferably, the non-human voice recognition model is trained by adopting the following method:
labeling the non-human voice features with their corresponding texts as the key features input to the non-human voice neural network, and labeling an arbitrary portion of the human voice features, together with various non-target human-voice and non-human-voice noise features, as noise, used as the non-key features input to the non-human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
the key non-human voice features and the non-key noise features together form the full input for training the non-human voice acoustic model, and neural network training outputs the non-human voice acoustic model.
The invention also discloses a method for simultaneously identifying the human voice and the non-human voice, which comprises the following steps:
preprocessing an input sound signal;
extracting acoustic characteristic signals from the preprocessed sound signals;
simultaneously inputting the feature signals into N recognition models comprising two types, human voice recognition models and non-human voice recognition models;
the N recognition models respectively input their recognition results into the N recognition result processing units;
each recognition result processing unit judges whether the output of its recognition model is human voice or non-human voice.
Preferably, the determination manner of the recognition result processing unit is specifically:
for human voice recognition, the N-frame average probability and the N-frame cumulative probability of the last N frames of decoding results output by the human voice recognition model are calculated; if the N-frame average probability reaches the specified human-voice average threshold and the N-frame cumulative probability reaches the specified human-voice cumulative threshold, the human voice recognition result is output;
for non-human voice recognition, the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model are calculated; if the N-frame average probability reaches the specified non-human-voice average threshold and the N-frame cumulative probability reaches the specified non-human-voice cumulative threshold, the non-human voice recognition result is output;
wherein the average of the N frame results reaching the specified average threshold P(mean) corresponds to equation (4):

(1/N) · Σ_{i=1}^{N} P(i) ≥ P(mean)    (4)

and the cumulative value of the N frame results reaching the specified cumulative threshold P(acc) corresponds to equation (5):

Σ_{i=1}^{N} P(i) ≥ P(acc)    (5)

where P(i) denotes the probability of the i-th frame.
Preferably, the preprocessing comprises: endpoint detection, to find the start and end points of the voice signal; pre-emphasis, to boost the high-frequency components of the voice and remove the effect of lip radiation; and framing, to convert the audio into short-time-stationary segments, with a window applied after framing to emphasize the central portion of each frame.
Preferably, the manner of extracting the acoustic feature signal is any one of MFCC, LPC, PLP, and LPCC.
The invention solves the technical problem of recognizing the multiple sources of a complex sound signal simultaneously and separately in a complex acoustic environment, i.e. recognizing human voice and non-human voice at the same time, with good recognition for non-specific speakers. Human voice recognition is only slightly affected by the target non-human sound environment; non-human sound recognition achieves good overall accuracy through corpus expansion and related methods and is only slightly affected by the human voice environment; and while preserving both recognition effects, the recognition response is fast and sensitive.
Drawings
FIG. 1 is a schematic diagram of an embodiment of training human voice and non-human voice acoustic models according to the present invention;
fig. 2 is a schematic diagram of an embodiment of the device for simultaneously recognizing human voice and non-human voice according to the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The device for simultaneously recognizing human voice and non-human voice comprises a sound source input unit, a feature extraction unit connected with the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected with one recognition result processing unit; the N recognition models comprise two types, human voice recognition models and non-human voice recognition models; N is greater than or equal to 2;
the input ends of the N recognition models are all connected with the output end of the feature extraction unit, and the output ends of the N recognition result processing units are all connected with the input end of the recognition result fusion unit;
each recognition result processing unit judges whether the output of its recognition model is human voice or non-human voice;
the device further comprises a recognition result fusion unit, which triggers the upper-layer application according to the results of the human voice and non-human voice recognition result processing units.
The method provided by the invention recognizes the multiple sources of a complex sound signal simultaneously and separately: human voice signals are converted into corresponding texts or commands, and non-human sound signals of specified types are converted into user-defined outputs. The device mainly comprises a human/non-human voice neural network training module and a human/non-human voice joint recognition module.
The human/non-human voice neural network training module uses a neural network architecture to train an Acoustic Model (AM) from labeled corpus input. For unlabeled sound input, the AM predicts pronunciation probabilities and outputs the statistics; combined with a Language Model (LM), Viterbi decoding then yields the best-path output text and its joint probability.
A typical speech recognition probability model is shown in equation (1):

W* = argmax_W P(W|Y)    (1)

In equation (1), W denotes a text sequence, Y denotes the input speech signal, and P(W|Y) is the probability of the text sequence given the unlabeled speech; the recognizer outputs the sequence W* that maximizes it. argmax returns the argument that maximizes a function, and its subscript W ranges over the words that make up the text sequence W.

Equation (2) follows from Bayes' rule:

P(W|Y) = P(Y|W) · P(W) / P(Y)    (2)

The denominator P(Y) in equation (2) is the probability of the audio; once the input is given it is constant and can be treated as P(Y) = 1, which yields equation (3):

W* = argmax_W P(Y|W) · P(W)    (3)

The first part P(Y|W) in equation (3) represents the probability of the corresponding speech occurring given the text sequence, i.e. the Acoustic Model (AM); the second part P(W) represents the probability of the text sequence, i.e. the Language Model (LM). In practical applications, the acoustic model is obtained by training a neural network on a large-scale corpus, the language model is built from the entries expected in the intended usage scenario, and decoding combines the two models to obtain an effective recognition result.
In order to train an acoustic model with balanced recognition effect, the distribution of linguistic data needs to be controlled, and a typical process diagram of a training structure is shown in fig. 1.
A human voice corpus, a non-human voice corpus, and corpora of various non-target human and non-human sounds and noises are prepared, and features are extracted from them for neural network training. The features are divided into key features and non-key features. For the human voice model, the human voice features are labeled with their corresponding texts as the key features input to the human voice neural network, while part of the non-human voice features and the various non-target noise features are labeled as noise as the non-key features; the key voice features and non-key noise features together form the full input for training the human voice acoustic model, and neural network training outputs the human voice acoustic model. For the non-human voice model, the non-human voice features are labeled with their corresponding texts as the key features input to the non-human voice neural network, while part of the human voice features and the various non-target noise features are labeled as noise as the non-key features. In both cases the ratio of the data volume of non-key features to key features does not exceed 1:3. The key non-human voice features and the non-key noise features together form the full input for training the non-human voice acoustic model, and neural network training outputs the non-human acoustic model.
To reduce the chance that a target non-human sound is mistakenly recognized as human voice by the human voice acoustic model, part of the non-human voice features are used as noise during training; corpus material containing no obvious human voice is randomly selected from the non-human voice features to serve as the human-voice model's noise.
The device for simultaneously identifying the human voice and the non-human voice comprises a sound source input unit, a feature extraction unit, a model identification unit, an identification result processing unit and an identification result fusion unit, and a specific structural schematic diagram is shown in figure 2.
The sound source input unit performs the necessary preprocessing on the multi-source complex signal input in real time, which may comprise human voice, non-human sound, and noise.
The preprocessing generally includes the following steps: endpoint detection, to find the start and end points of the voice signal; pre-emphasis, to boost the high-frequency components of the voice and remove the effect of lip radiation; and framing, to convert the audio into short-time-stationary segments, with a window applied after framing to emphasize the central portion of each frame.
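The pre-emphasis, framing, and windowing steps above can be sketched as follows. The filter coefficient 0.97 and the frame sizes are conventional choices, not values given in the patent, and endpoint detection is omitted for brevity.

```python
# Minimal preprocessing sketch: pre-emphasis, framing into overlapping
# short-time-stationary segments, and Hamming windowing of each frame.

import math

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1], boosting high frequencies."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    """Apply a Hamming window, emphasizing the center of the frame."""
    n = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]

signal = [math.sin(0.1 * t) for t in range(400)]      # dummy input signal
frames = [hamming(f) for f in frame_signal(pre_emphasis(signal), 160, 80)]
print(len(frames), len(frames[0]))  # → 4 160
```

With a 16 kHz sampling rate, a 160-sample frame with an 80-sample hop corresponds to the common 10 ms frame / 5 ms shift configuration.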
The feature extraction unit extracts features from the preprocessed voice signal; the feature extraction methods include, but are not limited to, MFCC (Mel-Frequency Cepstral Coefficients), LPC (Linear Prediction Coefficients), PLP (Perceptual Linear Prediction), and LPCC (Linear Prediction Cepstral Coefficients).
the model identification unit comprises N identification models (N is more than or equal to 2) and is divided into a human voice identification model and a non-human voice identification model, for example, as in the specific embodiment shown in FIG. 2, the model identification unit comprises 1 human voice identification model and N non-human voice identification models. Correspondingly, the system also comprises a recognition result processing unit of 1 human voice and a recognition result processing unit of n non-human voices.
Each recognition model receives the same feature signal from the feature extraction unit and recognizes each frame of features in parallel. The human voice acoustic model decodes in combination with the language model to obtain the recognition content and the corresponding probabilities. The non-human voice acoustic model may likewise decode with a language model to obtain recognition content and probabilities, or it may take the best output of the acoustic model directly as the recognition content, using the average and cumulative probabilities of multi-frame results as the judgment basis; this omits the decoding step and saves computing resources.
The recognition result processing unit is divided into N processing units (N ≥ 2), connected one-to-one with the recognition models; different processing units are independent of each other, and each performs independent judgment on every frame of recognition results output by its connected model.
The recognition result processing units perform the human voice and non-human voice judgments simultaneously:
for human voice recognition, the N-frame average probability and the N-frame cumulative probability of the last N frames of decoding results output by the human voice recognition model are calculated; if the N-frame average probability reaches the specified human-voice average threshold and the N-frame cumulative probability reaches the specified human-voice cumulative threshold, the human voice recognition result is output;
for non-human voice recognition, the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model are calculated; if the N-frame average probability reaches the specified non-human-voice average threshold and the N-frame cumulative probability reaches the specified non-human-voice cumulative threshold, the non-human voice recognition result is output.
It should be noted that each recognition result processing unit only processes the results output by its corresponding recognition model. For example, there may be two human voice processing units, one recognizing English command words and one recognizing Chinese command words, and two non-human voice processing units, one recognizing a baby's cry and one recognizing an earthquake-warning tone; each outputs its own recognition results, and the average and cumulative values of the result probabilities are calculated uniformly.
The average of the multi-frame results, i.e. the N-frame average probability, reaching the specified average threshold P(mean) is equation (4):

(1/N) · Σ_{i=1}^{N} P(i) ≥ P(mean)    (4)

and the cumulative value of the multi-frame results, i.e. the N-frame cumulative probability, reaching the specified cumulative threshold P(acc) is equation (5):

Σ_{i=1}^{N} P(i) ≥ P(acc)    (5)

where P(i) denotes the probability of the i-th frame; when both conditions hold, the recognition result is output.
The recognition result fusion unit triggers the corresponding upper-layer application according to the results of the human voice and non-human voice recognition result processing units. The following outcomes may exist: human voice recognized and non-human voice recognized; human voice recognized but non-human voice not recognized; human voice not recognized but non-human voice recognized; neither recognized.
For example, when both a human voice and a non-human sound are recognized, the recognition result fusion unit triggers the upper-layer application and, by judging priority and whether the two recognitions conflict, responds preferentially to one of them. If the human voice recognition result is "turn off light" while the non-human recognition result is "earthquake warning sound A", the unit preferentially responds to the non-human result by flashing the light.
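The priority judgment above can be sketched as a small lookup. This is a hypothetical illustration: the priority table, event names, and function name are invented; the patent leaves the concrete policy to the upper-layer application.

```python
# Fusion-unit sketch: when both a human voice command and a non-human sound
# are recognized, the higher-priority (lower-numbered) result wins.

PRIORITY = {"earthquake warning sound A": 0,  # safety-critical events first
            "baby cry": 1,
            "turn off light": 5,              # ordinary voice commands later
            "air conditioner on": 5}

def fuse(human_result, nonhuman_result):
    """Return the recognition result the upper-layer application should act on."""
    candidates = [r for r in (human_result, nonhuman_result) if r is not None]
    if not candidates:
        return None                           # neither recognized: no trigger
    return min(candidates, key=lambda r: PRIORITY.get(r, 9))

print(fuse("turn off light", "earthquake warning sound A"))  # safety event wins
print(fuse("turn off light", None))                          # lone result passes through
```

All four outcomes listed above are covered: two results resolve by priority, one result passes through, and no result returns `None`.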
To address the limited resources and slow response of on-device recognition of human and non-human sounds in multi-source complex signals, the invention adjusts the corpus proportions and shrinks the acoustic model through its training parameters during model training; the smaller the acoustic model, the fewer resources recognition requires. This allows multiple recognition models to run in parallel, reduces resource usage on the end product, and effectively improves recognition speed.
To achieve good human voice and non-human voice recognition, the corpora are expanded through collection and the model parameters are controlled, keeping the recognition model small while maintaining good accuracy. To address the susceptibility of the human and non-human models to interference from non-target sounds in complex environments during recognition, the recognition results of multiple frames of the voice signal are judged jointly during result processing and thresholds are set, largely reducing the interference of non-target environmental sounds on target sound recognition.
When no target non-human sound is present, human voice test recognition is normal, e.g. "intelligent housekeeper" and "air conditioner on"; when target non-human sounds are played in a loop, their recognition is good, e.g. "earthquake warning A" and "earthquake warning B"; and when target non-human sounds are looped while human voice tests run simultaneously, both human and non-human recognition remain good.
The invention solves the technical problem of simultaneously and separately recognizing multi-source complex signals in a complex sound environment, i.e. recognizing human voice and non-human voice at the same time, with good recognition for non-specific speakers. Human voice recognition is only slightly affected by the target non-human-voice environment; through corpus expansion and similar methods, non-human voice recognition performs well overall and is only slightly affected by the human-voice environment. While maintaining these recognition effects, the recognition response is fast and sensitive.
The foregoing describes preferred embodiments of the present invention. Provided they are not obviously contradictory, the preferred embodiments may be combined in any manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventors' verification process and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made using the description and drawings of the present invention are likewise included within the scope of the invention.
Claims (9)
1. A device for simultaneously recognizing human voice and non-human voice, characterized by comprising a sound source input unit, a feature extraction unit connected with the sound source input unit, N recognition models, and N recognition result processing units, each recognition model being connected with one recognition result processing unit; the N recognition models consist of two kinds of recognition models, namely human voice recognition models and non-human voice recognition models; N is greater than or equal to 2;
the input ends of the N recognition models are all connected with the output end of the feature extraction unit, and the output ends of the N recognition result processing units are all connected with the input end of a recognition result fusion unit;
each recognition result processing unit judges the output result of its recognition model and recognizes it as human voice or non-human voice;
the device further comprises the recognition result fusion unit, which is used for triggering the upper-layer application according to the results of the human voice and non-human voice recognition result processing units.
2. The device for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the recognition model is of the form
W* = argmax_W P(Y|W)·P(W),
where the first part P(Y|W) represents the probability that the corresponding speech Y occurs given the text sequence W, i.e. the acoustic model; the second part represents the probability P(W) of the text sequence W, i.e. the language model; and the subscript W of the argmax function ranges over the characters or words that make up the text sequence W.
3. The device for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the recognition result processing unit is specifically configured to:
for human voice recognition, calculate the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results output by the human voice recognition model, and output a human voice recognition result if the N-frame average probability reaches the specified human voice average threshold and the N-frame cumulative probability reaches the specified human voice cumulative threshold;
for non-human voice recognition, calculate the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model, and output a non-human voice recognition result if the N-frame average probability reaches the specified non-human voice average threshold and the N-frame cumulative probability reaches the specified non-human voice cumulative threshold;
wherein the average value of the N frame results reaching the specified average threshold P(mean) means that formula (4) holds: (1/N)·Σ_{i=1..N} p(i) ≥ P(mean),
and the probability cumulative value of the N frame results reaching the specified cumulative threshold P(acc) means that formula (5) holds: Σ_{i=1..N} p(i) ≥ P(acc);
where p(i) represents the probability of the i-th frame.
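The joint judgment over N frame probabilities can be sketched as follows; the threshold values in the usage comments are illustrative assumptions, not values from the patent:

```python
# Sketch of the recognition-result processing unit's joint judgment
# (equations (4) and (5)): a recognition result is emitted only when both the
# N-frame average probability and the N-frame cumulative probability reach
# their specified thresholds.
def passes_thresholds(frame_probs, p_mean, p_acc):
    """frame_probs: per-frame decoding probabilities p(1)..p(N)."""
    n = len(frame_probs)
    total = sum(frame_probs)   # cumulative probability, formula (5)
    mean = total / n           # average probability, formula (4)
    return mean >= p_mean and total >= p_acc

# Three confident frames pass; three weak frames are rejected as interference.
print(passes_thresholds([0.9, 0.8, 0.85], p_mean=0.8, p_acc=2.0))  # → True
print(passes_thresholds([0.5, 0.4, 0.3], p_mean=0.8, p_acc=2.0))   # → False
```

Requiring both conditions is what suppresses brief non-target sounds: a single loud frame can raise one frame's probability, but not the average and cumulative values over all N frames.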
4. The device for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the human voice recognition model is trained by the following method:
preparing a human voice corpus, a non-human voice corpus, and corpora of various non-target human voices and non-target non-human-voice noises, and extracting features from them for neural network training;
the training features are divided into key features and non-key features; the human voice features are labeled with their corresponding texts and used as the key features input to the human voice neural network;
any part of the non-human voice features, together with the various non-target human voice and non-target non-human-voice noise features, is selected, labeled as noise, and used as the non-key features input to the human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
and the key features and non-key features are taken as the entire input for human voice acoustic model training, and neural network training is performed to output the human voice recognition model.
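A minimal sketch of assembling such a training set, assuming in-memory lists of clips; the function name, data layout, and cap enforcement are illustrative assumptions:

```python
# Labeled speech utterances form the key features; a slice of non-human sounds
# and noise is relabeled as "noise" (non-key features), capped so the non-key
# data volume never exceeds 1/3 of the key data volume.
import random

def build_training_set(speech, nonhuman, noise, max_ratio=1/3):
    """speech: (clip, transcript) pairs; nonhuman/noise: clip lists.
    Returns (sample, label) pairs respecting the 1:3 non-key/key cap."""
    key = [(clip, text) for clip, text in speech]   # transcribed speech
    budget = int(len(key) * max_ratio)              # 1:3 data-volume cap
    pool = [(clip, "noise") for clip in nonhuman + noise]
    random.shuffle(pool)                            # sample non-key data
    return key + pool[:budget]
```

The same routine, with the roles of the speech and non-human clips swapped, would cover the non-human voice model of claim 5.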
5. The device for simultaneously recognizing human voice and non-human voice according to claim 1, wherein the non-human voice recognition model is trained by the following method:
the non-human voice features are labeled with their corresponding texts and used as the key features input to the non-human voice neural network; any part of the human voice features, together with the various non-target human voice and non-target non-human-voice noise features, is selected, labeled as noise, and used as the non-key features input to the non-human voice neural network;
wherein the ratio of the data volume of the non-key features to that of the key features does not exceed 1:3;
and the key non-human voice features and the non-key noise features are taken as the entire input for non-human voice acoustic model training, and neural network training is performed to output the non-human voice acoustic model.
6. A method for simultaneously recognizing human voice and non-human voice, characterized by comprising the following steps:
preprocessing an input sound signal;
extracting acoustic feature signals from the preprocessed sound signal;
inputting the feature signals simultaneously into N recognition models consisting of two kinds of recognition models, namely human voice recognition models and non-human voice recognition models;
the N recognition models respectively inputting their recognition results into N recognition result processing units;
each recognition result processing unit judging the output result of its recognition model and recognizing it as human voice or non-human voice.
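The parallel structure of the steps above can be sketched as follows; the model and processor callables are placeholders for the trained recognition models and result processing units:

```python
# One feature extraction feeds N recognition models in parallel, and each
# model's output goes to its own recognition-result processing unit.
from concurrent.futures import ThreadPoolExecutor

def recognize(features, models, processors):
    """Run N models on the same features and post-process each output."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        raw = list(pool.map(lambda m: m(features), models))
    # Each processing unit judges its own model's output independently.
    return [proc(out) for proc, out in zip(processors, raw)]
```

Running the models in parallel on the shared features is what keeps the response fast even with N ≥ 2 models on a terminal product.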
7. The method for simultaneously recognizing human voice and non-human voice according to claim 6, wherein the judgment by the recognition result processing unit is specifically as follows:
for human voice recognition, calculate the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results output by the human voice recognition model, and output a human voice recognition result if the N-frame average probability reaches the specified human voice average threshold and the N-frame cumulative probability reaches the specified human voice cumulative threshold;
for non-human voice recognition, calculate the N-frame average probability and the N-frame cumulative probability of the N-frame decoding results of the non-human voice recognition model, and output a non-human voice recognition result if the N-frame average probability reaches the specified non-human voice average threshold and the N-frame cumulative probability reaches the specified non-human voice cumulative threshold;
wherein the average value of the N frame results reaching the specified average threshold P(mean) means that formula (4) holds: (1/N)·Σ_{i=1..N} p(i) ≥ P(mean),
and the cumulative value of the N frame results reaching the specified cumulative threshold P(acc) means that formula (5) holds: Σ_{i=1..N} p(i) ≥ P(acc);
where p(i) represents the probability of the i-th frame.
8. The method for simultaneously recognizing human voice and non-human voice according to claim 6, wherein the preprocessing comprises the following processing: endpoint detection is carried out to find the starting point and end point of the voice signal; pre-emphasis is applied to boost the high-frequency part of the speech and remove the influence of lip radiation; and the audio is divided by framing into segments with short-time stationarity, each frame after framing being windowed to emphasize its central segment.
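The pre-emphasis, framing, and windowing chain can be sketched numerically as follows; the frame length, frame shift, and 0.97 pre-emphasis coefficient are common defaults, not values taken from the patent:

```python
# Pre-emphasis boosts high frequencies (countering lip-radiation roll-off),
# framing cuts the audio into short quasi-stationary segments, and a Hamming
# window tapers each frame to emphasize its central portion.
import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, alpha=0.97):
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing into overlapping short-time segments
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    # Windowing: taper every frame with a Hamming window
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # 1 s of audio at 16 kHz
print(frames.shape)  # → (98, 400)
```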
9. The method for simultaneously recognizing human voice and non-human voice according to claim 6, wherein the acoustic feature signal is extracted by any one of MFCC, LPC, PLP and LPCC.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011384504.XA CN112185357A (en) | 2020-12-02 | 2020-12-02 | Device and method for simultaneously recognizing human voice and non-human voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112185357A true CN112185357A (en) | 2021-01-05 |
Family
ID=73918359
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113593523A (en) * | 2021-01-20 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Speech detection method and device based on artificial intelligence and electronic equipment |
CN114299950A (en) * | 2021-12-30 | 2022-04-08 | 北京字跳网络技术有限公司 | Subtitle generating method, device and equipment |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103700370A (en) * | 2013-12-04 | 2014-04-02 | 北京中科模识科技有限公司 | Broadcast television voice recognition method and system |
CN103761968A (en) * | 2008-07-02 | 2014-04-30 | 谷歌公司 | Speech recognition with parallel recognition tasks |
CN108734192A (en) * | 2018-01-31 | 2018-11-02 | 国家电网公司 | A kind of support vector machines mechanical failure diagnostic method based on voting mechanism |
CN109243467A (en) * | 2018-11-14 | 2019-01-18 | 龙马智声(珠海)科技有限公司 | Sound-groove model construction method, method for recognizing sound-groove and system |
CN109872713A (en) * | 2019-03-05 | 2019-06-11 | 深圳市友杰智新科技有限公司 | A kind of voice awakening method and device |
CN110992966A (en) * | 2019-12-25 | 2020-04-10 | 开放智能机器(上海)有限公司 | Human voice separation method and system |
CN111246285A (en) * | 2020-03-24 | 2020-06-05 | 北京奇艺世纪科技有限公司 | Method for separating sound in comment video and method and device for adjusting volume |
CN111370025A (en) * | 2020-02-25 | 2020-07-03 | 广州酷狗计算机科技有限公司 | Audio recognition method and device and computer storage medium |
CN111418008A (en) * | 2017-11-30 | 2020-07-14 | 三星电子株式会社 | Method for providing service based on location of sound source and voice recognition apparatus therefor |
CN111583932A (en) * | 2020-04-30 | 2020-08-25 | 厦门快商通科技股份有限公司 | Sound separation method, device and equipment based on human voice model |
CN111883113A (en) * | 2020-07-30 | 2020-11-03 | 云知声智能科技股份有限公司 | Voice recognition method and device |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210105 |