CN106531158A - Method and device for recognizing answer voice - Google Patents
Method and device for recognizing answer voice
- Publication number: CN106531158A
- Application number: CN201611081923.XA
- Authority: CN (China)
- Prior art keywords: response, voice, identified, identification model, response mode
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Abstract
The invention relates to the field of computer paralinguistics, and in particular to a method and a device for recognizing answer voice, which address the inaccuracy of current answer voice recognition methods. The method of the embodiment of the invention comprises the following steps: a to-be-recognized answer voice is acquired; an answer mode recognition model is used to determine the answer mode corresponding to the to-be-recognized answer voice; if the answer mode is a formal answer mode, the to-be-recognized answer voice is input to a first voice recognition system; and if the answer mode is an informal answer mode, the to-be-recognized answer voice is input to a second voice recognition system. When recognizing an answer voice, the method first determines whether the answer voice is in the formal or the informal answer mode and then inputs it to the voice recognition system provided for that mode, thereby improving overall voice recognition performance.
Description
Technical field
The present invention relates to the field of computer paralinguistics, and in particular to a method and a device for recognizing response voice.
Background
In recent years, computer paralinguistics has become a research hotspot in the field of speech and language processing, and the development of speech recognition technology plays an important role in promoting the development and application of intelligent, human-centered human-machine interaction technologies.
Speech recognition is the technology of automatically converting speech into text using a computer. Speech has always been an important medium of interaction in human life, so enabling machines to recognize speech is a crucial step. At present, voice recorders are used on many occasions to record speech, and the recorded speech then needs to be analyzed. For example, in a flight scenario, the speech on board the aircraft is recorded by a cockpit voice recorder, and after the flight ends the flight quality is evaluated by recognizing the speech stored in the cockpit voice recorder. At present, the voice information recorded in a voice recorder is recognized automatically by machine. Specifically, the speech recorded in the voice recorder is divided by endpoint detection into individual response voices to be recognized, each response voice to be recognized is input into a speech recognition system, and the speech recognition system recognizes it. Depending on the interlocutor and the environment, response voices to be recognized are divided into formal response voices and informal response voices. The two differ in speech environment and in the speaker's tone and intonation, so the prior-art approach of inputting every acquired response voice directly into a single speech recognition system often fails to recognize the response voice accurately.
In summary, current response voice recognition methods are not accurate enough when recognizing response voices.
Summary of the invention
The present invention provides a method and a device for recognizing response voice, to solve the problem that current response voice recognition methods are not accurate enough when recognizing response voices.
Based on the above problem, an embodiment of the present invention provides a method for recognizing response voice, including:
Obtaining a response voice to be recognized;
Determining, using a response mode recognition model, the response mode corresponding to the response voice to be recognized, wherein the response mode recognition model is a supervised machine learning model;
If the response mode is the formal response mode, inputting the response voice to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to it;
If the response mode is the informal response mode, inputting the response voice to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to it;
Wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
When recognizing a response voice, the embodiment of the present invention first obtains the response voice to be recognized, then uses the response mode recognition model to determine its response mode, and inputs formal-mode and informal-mode response voices into different speech recognition systems. The first speech recognition system is used to recognize formal response voices, the second speech recognition system is used to recognize informal response voices, and the two systems are configured with different parameters. Using a different speech recognition system for each response mode makes the recognition of the response voice to be recognized more accurate.
Optionally, determining the response mode corresponding to the response voice to be recognized using the response mode recognition model specifically includes:
Inputting the speech features extracted from the response voice to be recognized into the response mode recognition model;
Obtaining the response mode, output by the response mode recognition model, that corresponds to the response voice to be recognized.
After performing feature extraction on the response voice to be recognized, the embodiment of the present invention inputs the extracted speech features into the response mode recognition model, and the response mode recognition model determines the corresponding response mode.
Optionally, the speech features include frame-level features, segment-level features, and utterance-level features;
Speech features are extracted from a response voice as follows:
Using a feature extraction tool, the frame-level features of the response voice to be recognized are extracted according to a preset frame length and frame shift;
The frame-level features are smoothed by filtering, and a difference operation is applied to the smoothed frame-level features to determine the segment-level features of the response voice to be recognized;
The segment-level features are analyzed according to preset statistical parameters to determine the utterance-level features of the response voice to be recognized.
Because the embodiment of the present invention extracts frame-level, segment-level, and utterance-level speech features from the response voice to be recognized, the response mode recognition model can accurately recognize the response mode corresponding to the response voice to be recognized.
Optionally, the response mode recognition model is obtained as follows:
Determining a training set containing multiple response voices and a test set containing multiple response voices, wherein the response voices in the training set are different from the response voices in the test set;
For each response voice in the training set, inputting the speech features extracted from that response voice into the untrained response mode recognition model for training;
For each response voice in the test set, inputting the speech features extracted from that response voice into the trained response mode recognition model, and obtaining the response mode that the model outputs for that response voice;
Determining the correct recognition rate of the trained response mode recognition model from the response modes it outputs for the response voices in the test set; if the correct recognition rate is greater than a set threshold, determining that training of the response mode recognition model is complete and saving the trained response mode recognition model.
The embodiment of the present invention trains the response mode recognition model with the multiple response voices in the training set and, after training, uses the response voices in the test set to judge whether the trained model meets the requirement. When the correct recognition rate of the response mode recognition model on the response voices in the test set is greater than the set threshold, training is determined to be complete and the trained model is saved. If the correct recognition rate is not greater than the set threshold, the model is trained again with the response voices in the training set until its correct recognition rate exceeds the set threshold. This ensures that the resulting response mode recognition model recognizes the response mode of a response voice to be recognized more accurately.
Optionally, the response mode recognition model is a support vector machine (SVM) model.
In another aspect, an embodiment of the present invention further provides a device for recognizing response voice, including:
An acquisition module, configured to obtain a response voice to be recognized;
A recognition module, configured to determine, using a response mode recognition model, the response mode corresponding to the response voice to be recognized, wherein the response mode recognition model is a supervised machine learning model;
A judging module, configured to: if the response mode is the formal response mode, input the response voice to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to it; and if the response mode is the informal response mode, input the response voice to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to it; wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
Optionally, the recognition module is specifically configured to:
Input the speech features extracted from the response voice to be recognized into the response mode recognition model, and obtain the response mode, output by the response mode recognition model, that corresponds to the response voice to be recognized.
Optionally, the speech features include frame-level features, segment-level features, and utterance-level features;
The recognition module is specifically configured to:
Extract speech features from a response voice as follows:
Using a feature extraction tool, extract the frame-level features of the response voice to be recognized according to a preset frame length and frame shift; smooth the frame-level features by filtering and apply a difference operation to the smoothed frame-level features to determine the segment-level features of the response voice to be recognized; and analyze the segment-level features according to preset statistical parameters to determine the utterance-level features of the response voice to be recognized.
Optionally, the acquisition module is further configured to:
Obtain the response mode recognition model as follows:
Determine a training set containing multiple response voices and a test set containing multiple response voices, wherein the response voices in the training set are different from the response voices in the test set; for each response voice in the training set, input the speech features extracted from that response voice into the untrained response mode recognition model for training; for each response voice in the test set, input the speech features extracted from that response voice into the trained response mode recognition model, and obtain the response mode that the model outputs for that response voice; determine the correct recognition rate of the trained response mode recognition model from the response modes it outputs for the response voices in the test set; and, if the correct recognition rate is greater than a set threshold, determine that training of the response mode recognition model is complete and save the trained response mode recognition model.
Optionally, the response mode recognition model is a support vector machine (SVM) model.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for recognizing response voice according to an embodiment of the present invention;
Fig. 2 is a flowchart of speech feature extraction according to an embodiment of the present invention;
Fig. 3 is a flowchart of the method for obtaining the response mode recognition model according to an embodiment of the present invention;
Fig. 4 is an overall flowchart of the method for obtaining the response mode recognition model according to an embodiment of the present invention;
Fig. 5A is a schematic diagram of the recognition accuracy corresponding to the SVM kernel functions according to an embodiment of the present invention;
Fig. 5B is a performance comparison chart of the SVM kernel functions according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the device for recognizing response voice according to an embodiment of the present invention.
Detailed description
In the embodiment of the present invention, a response voice to be recognized is obtained; the response mode corresponding to the response voice to be recognized is determined using a response mode recognition model, wherein the response mode recognition model is a supervised machine learning model; if the response mode is the formal response mode, the response voice to be recognized is input into a first speech recognition system, so that the first speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to it; if the response mode is the informal response mode, the response voice to be recognized is input into a second speech recognition system, so that the second speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to it; wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
When recognizing a response voice, the embodiment of the present invention first obtains the response voice to be recognized, then uses the response mode recognition model to determine its response mode, and inputs formal-mode and informal-mode response voices into different speech recognition systems. The first speech recognition system is used to recognize formal response voices, the second speech recognition system is used to recognize informal response voices, and the two systems are configured with different parameters. The embodiment first determines whether a response voice is in the formal or the informal response mode and then uses a different speech recognition system for each response mode, which improves overall speech recognition performance and makes the recognition of the response voice to be recognized more accurate.
It should be noted that the method of the embodiment of the present invention for recognizing the response mode of a response voice can be used not only to improve the performance of speech recognition systems but also in other higher-level systems, such as speaker recognition systems and abnormal sound monitoring systems.
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, the method for recognizing response voice according to the embodiment of the present invention includes:
Step 101: obtain a response voice to be recognized;
Step 102: determine, using a response mode recognition model, the response mode corresponding to the response voice to be recognized, wherein the response mode recognition model is a supervised machine learning model;
Step 103: if the response mode is the formal response mode, input the response voice to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to it; if the response mode is the informal response mode, input the response voice to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to it; wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
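Steps 101 to 103 amount to a classify-then-route pipeline. The following is a minimal sketch of that control flow only; the `Audio` type, the mode labels, and the three callables are hypothetical placeholders introduced for illustration, not components named in the patent.

```python
from typing import Callable, Sequence

Audio = Sequence[float]   # hypothetical: one response voice as raw samples

FORMAL = "formal"         # hypothetical label for the formal response mode
INFORMAL = "informal"     # hypothetical label for the informal response mode


def recognize_response(
    voice: Audio,
    classify_mode: Callable[[Audio], str],
    formal_asr: Callable[[Audio], str],
    informal_asr: Callable[[Audio], str],
) -> str:
    """Route one response voice to the recognizer that matches its response mode."""
    mode = classify_mode(voice)      # step 102: determine the response mode
    if mode == FORMAL:               # step 103: first speech recognition system
        return formal_asr(voice)
    if mode == INFORMAL:             # step 103: second speech recognition system
        return informal_asr(voice)
    raise ValueError(f"unknown response mode: {mode!r}")


# Usage with stand-in components (all three callables are placeholders).
text = recognize_response(
    [0.0, 0.1, -0.1],
    classify_mode=lambda v: FORMAL,
    formal_asr=lambda v: "text from the first system",
    informal_asr=lambda v: "text from the second system",
)
```

Passing the two recognizers in as callables keeps the routing logic independent of how each speech recognition system is configured.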
In the embodiment of the present invention, the response mode corresponding to a response voice to be recognized is either the formal response mode or the informal response mode.
The embodiment of the present invention can be applied to a flight scenario to recognize the response mode of response voices in flight, that is, to determine whether an in-flight response voice is in the formal response mode or the informal response mode. Voices in the formal response mode are instruction-related dialogue between the pilot and the ground control center. For example, the pilot sends a request for instructions to the ground control center, the ground control center responds to the pilot's request, and the pilot replies to the ground control center with a confirmation.
Voices in the informal response mode are dialogue between the pilot and the copilot, or dialogue between the pilot and the ground control tower. Examples include casual conversation between the pilot and the copilot, guidance speech between the pilot and the copilot concerning the course of the flight, and the pilot reporting the aircraft status to the ground control tower.
It should be noted that the embodiment of the present invention is not limited to flight scenarios. The response mode recognition method of the embodiment can be used in any language context, and the definitions of the formal and informal response modes differ from one context to another. For example, if A and B are football match commentators, then when determining the response mode of the dialogue between A and B, dialogue about the football match is defined as formal-mode dialogue, and dialogue between A and B that is unrelated to the football match is defined as informal-mode dialogue.
When determining the response mode corresponding to the response voice to be recognized using the response mode recognition model, the embodiment of the present invention specifically uses the following method:
Optionally, the speech features extracted from the response voice to be recognized are input into the response mode recognition model, and the response mode that the response mode recognition model outputs for the response voice to be recognized is obtained.
The response mode recognition model of the embodiment of the present invention is a supervised machine learning model; specifically, it is an SVM (support vector machine) model.
After obtaining the response voice to be recognized, the embodiment of the present invention uses a feature extraction tool to extract the speech features of the response voice to be recognized.
In implementation, the embodiment of the present invention extracts the speech features of the response voice to be recognized in a layered manner.
The speech features of the embodiment of the present invention include frame-level features, segment-level features, and utterance-level (part-level) features.
Specifically, the embodiment of the present invention uses the openSMILE feature extraction tool to extract the speech features of the response voice to be recognized layer by layer.
Optionally, using the feature extraction tool, the frame-level features of the response voice to be recognized are extracted according to a preset frame length and frame shift; the frame-level features are smoothed by filtering, and a difference operation is applied to the smoothed frame-level features to determine the segment-level features of the response voice to be recognized; and the segment-level features are analyzed according to preset statistical parameters to determine the utterance-level features of the response voice to be recognized.
The method by which the embodiment of the present invention extracts speech features from the response voice to be recognized is described in detail below.
In the first step, the frame-level features of the response voice to be recognized are extracted.
The frame-level features are the first layer of speech features of the response voice to be recognized.
In implementation, the openSMILE feature extraction tool is used with a frame length of 20 ms and a frame shift of 10 ms, giving 16 feature dimensions in total. The specific frame-level feature parameters are shown in Table 1 and include:
RMSenergy (root-mean-square energy), mfcc (Mel-frequency cepstral coefficients) dimensions 1-12, zcr (zero-crossing rate), Voice_prob (voicing probability, that is, the proportion of voiced sound), and F0 (fundamental frequency computed from the cepstrum).
Table 1
Frame-level feature (abbreviation) | Description |
RMSenergy | Root-mean-square energy |
mfcc(1-12) | Mel-frequency cepstral coefficients, dimensions 1-12 |
zcr | Zero-crossing rate (frame level) |
Voice_prob | Voicing probability (proportion of voiced sound) computed by autocorrelation |
F0 | Fundamental frequency computed from the cepstrum |
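To make the framing parameters concrete, the sketch below cuts a signal into 20 ms frames with a 10 ms shift and computes two of the 16 frame-level features, RMS energy and zero-crossing rate. It is a plain-Python illustration under stated assumptions, not the openSMILE implementation: the function names are invented here, and the 16 kHz sampling rate is the one the description gives for its example corpus.

```python
import math
from typing import List, Sequence


def split_frames(samples: Sequence[float], sample_rate: int,
                 frame_ms: int = 20, shift_ms: int = 10) -> List[Sequence[float]]:
    """Cut a signal into overlapping frames (20 ms long, 10 ms apart by default)."""
    frame_len = sample_rate * frame_ms // 1000   # 320 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    last_start = len(samples) - frame_len
    # Only full frames are kept; a trailing partial frame is dropped (assumption).
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, shift)]


def rms_energy(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (frame needs >= 2 samples)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Usage: one second of a constant signal sampled at 16 kHz.
sample_rate = 16_000
frames = split_frames([0.5] * sample_rate, sample_rate)
energy_contour = [rms_energy(f) for f in frames]
zcr_contour = [zero_crossing_rate(f) for f in frames]
```

Each feature yields one value per frame, so a response voice becomes 16 per-frame contours that the later steps smooth, difference, and summarize.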
In the second step, the segment-level features of the response voice to be recognized are extracted.
The segment-level features are the second layer of speech features of the response voice to be recognized.
Specifically, the frame-level features are smoothed by filtering, and a difference operation is applied to the smoothed frame-level features to determine the segment-level features of the response voice to be recognized.
In implementation, the frame sequence obtained in the first step is smoothed with sma (smoothed by a moving average filter) using a window length of 3 frames;
After the frame sequence is smoothed, the first-order difference de (delta coefficient) of the smoothed features is computed.
The specific segment-level feature analysis functions are shown in Table 2 and include:
sma (smoothing filter) and de (first-order difference).
Table 2
Segment-level analysis function (abbreviation) | Description |
sma | Smoothing filter |
de | First-order difference |
After the first and second steps, 16*2=32-dimensional speech features are obtained: the 16 smoothed features and their 16 first-order differences.
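The second step can be sketched as follows: each frame-level contour is smoothed with a 3-frame moving average, and the first-order difference of the smoothed contour is appended, which doubles 16 contours to 32. This is an illustrative plain-Python version with assumptions of its own: the window is shortened at the edges, and the difference is a simple `x[t] - x[t-1]` with a leading zero, which may differ from the delta formula used by the feature extraction tool.

```python
from typing import List, Sequence


def moving_average(contour: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at both ends (assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(contour)):
        lo = max(0, i - half)
        hi = min(len(contour), i + half + 1)
        smoothed.append(sum(contour[lo:hi]) / (hi - lo))
    return smoothed


def first_difference(contour: Sequence[float]) -> List[float]:
    """Simple first-order difference, padded with 0.0 to keep the length."""
    if not contour:
        return []
    return [0.0] + [b - a for a, b in zip(contour, contour[1:])]


def segment_level(contours: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the smoothed contours followed by their first-order differences."""
    smoothed = [moving_average(c) for c in contours]
    deltas = [first_difference(c) for c in smoothed]
    return smoothed + deltas


# Usage: 16 identical toy contours stand in for the 16 frame-level features.
toy_contours = [[1.0, 2.0, 3.0, 4.0] for _ in range(16)]
segment_contours = segment_level(toy_contours)   # 32 contours
```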
In the third step, the utterance-level features of the response voice to be recognized are extracted.
The utterance-level features are the third layer of speech features of the response voice to be recognized.
Specifically, the segment-level features are analyzed according to preset statistical parameters to determine the utterance-level features of the response voice to be recognized.
In implementation, statistical analysis is performed on the features output by the second step. It mainly involves 12 statistical parameters: the segment-level features output by the second step are analyzed according to these 12 statistical parameters to obtain the utterance-level features of the response voice to be recognized.
The 12 preset statistical parameters are shown in Table 3 and include:
max (maximum of the envelope), min (minimum of the envelope), range (range of variation of the envelope), maxpos (position of the maximum), minpos (absolute position of the envelope minimum), amean (arithmetic mean of the envelope), linregc1 (slope of the linear approximation of the envelope), linregc2 (offset of the linear approximation of the envelope), linregerrQ (root mean square of the difference between the linear prediction of the envelope and the actual values), stddev (standard deviation), skewness (third-order skewness), and kurtosis (fourth-order kurtosis).
Table 3
Utterance-level statistical parameter (abbreviation) | Description |
max | Maximum of the envelope |
min | Minimum of the envelope |
range | Range of variation of the envelope |
maxpos | Position of the maximum |
minpos | Absolute position of the envelope minimum |
amean | Arithmetic mean of the envelope |
linregc1 | Slope of the linear approximation of the envelope |
linregc2 | Offset of the linear approximation of the envelope |
linregerrQ | Root mean square of the difference between the linear prediction of the envelope and the actual values |
stddev | Standard deviation |
skewness | Third-order skewness |
kurtosis | Fourth-order kurtosis |
As shown in Fig. 2, when the embodiment of the present invention extracts the utterance-level features of the response voice to be recognized in the third step, it performs statistical analysis, using the 12 preset statistical parameters, on the segment-level features obtained in the second step. After the utterance-level features are extracted in the third step, 16*2*12=384-dimensional speech features are obtained.
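The third step maps every contour to 12 numbers, so 32 contours give a fixed-length vector of 32*12=384 values regardless of how long the response voice is. The sketch below computes the 12 statistics of Table 3 in plain Python. It is an assumption-laden illustration rather than the tool's implementation: population (divide-by-n) moments are used, positions are frame indices, and linregerrQ is computed as a root-mean-square error, following the wording of the description.

```python
import math
from typing import Dict, List, Sequence


def functionals(contour: Sequence[float]) -> Dict[str, float]:
    """Compute the 12 statistics of Table 3 for one contour (needs >= 2 frames)."""
    n = len(contour)
    if n < 2:
        raise ValueError("contour must contain at least two frames")
    mean = sum(contour) / n
    highest, lowest = max(contour), min(contour)
    # Least-squares line y = slope * t + offset over frame indices t = 0..n-1.
    t_mean = (n - 1) / 2
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    s_ty = sum((t - t_mean) * (y - mean) for t, y in enumerate(contour))
    slope = s_ty / s_tt
    offset = mean - slope * t_mean
    rms_err = math.sqrt(
        sum((slope * t + offset - y) ** 2 for t, y in enumerate(contour)) / n)
    std = math.sqrt(sum((y - mean) ** 2 for y in contour) / n)
    # Standardized moments; defined as 0.0 for a constant contour (assumption).
    skew = sum((y - mean) ** 3 for y in contour) / n / std ** 3 if std > 0 else 0.0
    kurt = sum((y - mean) ** 4 for y in contour) / n / std ** 4 if std > 0 else 0.0
    return {
        "max": highest, "min": lowest, "range": highest - lowest,
        "maxpos": float(contour.index(highest)),
        "minpos": float(contour.index(lowest)),
        "amean": mean, "linregc1": slope, "linregc2": offset,
        "linregerrQ": rms_err, "stddev": std,
        "skewness": skew, "kurtosis": kurt,
    }


def utterance_vector(contours: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the 12 statistics of every contour into one feature vector."""
    return [value for c in contours for value in functionals(c).values()]


# Usage: 32 toy contours stand in for the segment-level features.
stats = functionals([0.0, 1.0, 2.0, 3.0])
vector = utterance_vector([[0.0, 1.0, 2.0, 3.0]] * 32)   # 384 values
```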
After extracting the speech features of the response voice to be recognized with the feature extraction tool, the embodiment of the present invention inputs the extracted speech features into the response mode recognition model, so that the response mode recognition model recognizes the response mode corresponding to the response voice to be recognized from those speech features; the embodiment then obtains the response mode that the response mode recognition model outputs for the response voice to be recognized on the basis of the input speech features.
It should be noted that the response mode recognition model of the embodiment of the present invention is a model trained in advance for recognizing response modes.
Because the embodiment of the present invention recognizes the response mode of a response voice to be recognized mainly by means of the response mode recognition model, and that model is trained in advance, the embodiment of the present invention also includes another important component: training the response mode recognition model.
The process by which the embodiment of the present invention trains the response mode recognition model is described in detail below.
As shown in Fig. 3, the method by which the embodiment of the present invention obtains the response mode recognition model includes:
Step 301: determine a training set containing multiple response voices and a test set containing multiple response voices, wherein the response voices in the training set are different from the response voices in the test set;
Step 302: for each response voice in the training set, input the speech features extracted from that response voice into the untrained response mode recognition model for training;
Step 303: for each response voice in the test set, input the speech features extracted from that response voice into the trained response mode recognition model, and obtain the response mode that the model outputs for that response voice;
Step 304: determine the correct recognition rate of the trained response mode recognition model from the response modes it outputs for the response voices in the test set; if the correct recognition rate is greater than a set threshold, determine that training of the response mode recognition model is complete and save the trained response mode recognition model.
In step 301, when determining the training set and the test set, the embodiment of the present invention selects multiple response voices from a corpus and forms the training set or the test set from the selected response voices.
The corpus of the embodiment of the present invention consists of prerecorded speech, which includes multiple response voices in the formal response mode and multiple response voices in the informal response mode.
For example, the corpus may be 17.5 hours of speech recorded during actual flights. After recording, the 17.5 hours of speech are annotated. Suppose the annotation determines that the 17.5 hours of speech contain 18 speakers in total, 4668 response voices in the formal response mode, and 2257 response voices in the informal response mode. The ratio of formal to informal response voices is then 2.07:1. All response voices have a sampling frequency of 16 kHz and a quantization precision of 16 bits.
Multiple response voices are selected from all the response voices in the corpus to form a training set. Preferably, the ratio of formal to informal response voices in the training set is close to the ratio of formal to informal response voices in the corpus.
For example, two training sets, training set A and training set B, and one test set, test set C, are determined. The numbers and ratios of formal and informal response voices in training sets A and B and test set C are shown in Table 4:
Training set A consists of 1580 response voices in the formal response mode and 1580 response voices in the informal response mode selected from the corpus, so the ratio of formal to informal response voices in training set A is 1:1. Training set B consists of 3270 response voices in the formal response mode and 1580 response voices in the informal response mode selected from the corpus, so the ratio in training set B is 2.07:1. Test set C consists of 1400 response voices in the formal response mode and 677 response voices in the informal response mode selected from the corpus, so the ratio in test set C is 2.07:1.
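Table 4
Set | Formal response voices | Informal response voices | Ratio (formal : informal) |
---|---|---|---|
Training set A | 1580 | 1580 | 1:1 |
Training set B | 3270 | 1580 | 2.07:1 |
Test set C | 1400 | 677 | 2.07:1 |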
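The selection of a set with a class ratio close to that of the corpus can be sketched as follows. This is a minimal illustration, not the procedure of the embodiment: the corpus is a synthetic list of labeled utterance identifiers, the label values 1 (formal) and 0 (informal) and the names `class_ratio` and `select_set` are assumptions, and random sampling per class is one possible way to choose the response voices.

```python
import random


def class_ratio(n_formal, n_informal):
    """Formal-to-informal ratio, rounded to two decimals."""
    return round(n_formal / n_informal, 2)


def select_set(corpus, n_formal, n_informal, rng):
    """Draw a set with the requested number of formal (1) and informal (0) items.

    `corpus` is a list of (utterance_id, label) pairs.
    Returns (selected, remaining) so that later sets can stay disjoint.
    """
    formal = [item for item in corpus if item[1] == 1]
    informal = [item for item in corpus if item[1] == 0]
    chosen = rng.sample(formal, n_formal) + rng.sample(informal, n_informal)
    chosen_ids = {utt_id for utt_id, _ in chosen}
    remaining = [item for item in corpus if item[0] not in chosen_ids]
    return chosen, remaining


# Synthetic stand-in for the annotated corpus: 4668 formal and 2257 informal items.
corpus = [(f"utt{i:05d}", 1) for i in range(4668)]
corpus += [(f"utt{i:05d}", 0) for i in range(4668, 4668 + 2257)]

rng = random.Random(0)  # fixed seed so the selection is reproducible
test_set_c, remaining = select_set(corpus, 1400, 677, rng)

corpus_ratio = class_ratio(4668, 2257)  # ratio in the whole corpus
test_ratio = class_ratio(1400, 677)     # ratio in test set C
print(corpus_ratio, test_ratio)
```

Returning the remaining items makes it straightforward to draw a training set afterwards that shares no response voice with the test set, which is the condition stated in step 301.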
The method of training the response mode recognition model is described below, taking training sets A and B and test set C shown in Table 4 as an example.
Specifically, the embodiment of the present invention trains the response mode recognition model with each response voice in training set A and training set B. After training, each response voice in test set C is input into the trained response mode recognition model. If the correct recognition rate of the response modes that the model outputs for the response voices in test set C is greater than the set threshold, training of the response mode recognition model is determined to be complete, and the trained model is saved.
The process of training the response mode recognition model is described below for any response voice in training set A:
1. Extract the speech features of the response voice with a feature extraction tool.
The speech features are extracted by the method described above, which is not repeated here.
2. Input the speech features corresponding to the response voice into the response mode recognition model for training.
Specifically, the speech features corresponding to the response voice are input into the response mode recognition model together with the response mode corresponding to that response voice, so that the model learns the response mode that corresponds to the speech features.
In the manner just described, the embodiment of the present invention trains the response mode recognition model with the response voices in the training sets. After the model has been trained repeatedly with the multiple response voices in training set A and training set B, the response voices in test set C are used to judge whether training of the response mode recognition model is complete.
Specifically, when test set C is used to judge whether training of the response mode recognition model is complete, the following operations are performed for any response voice in test set C:
1. Extract the speech features of the response voice with a feature extraction tool;
The speech features are extracted by the method described above, which is not repeated here.
2. Input the speech features corresponding to the response voice into the trained response mode recognition model;
3. Obtain the response mode that the trained response mode recognition model outputs for the response voice.
Specifically, the response mode recognition model is preset to output "1" when it determines that the response mode corresponding to the response voice is the formal response mode, and to output "0" when it determines that the response mode is the informal response mode.
After the trained response mode recognition model has judged each response voice in test set C, the embodiment of the present invention determines the recognition result for each response voice in test set C. The recognition result that the model determines for each response voice in test set C is compared with the labeled response mode of that response voice, and the correct recognition rate on test set C is determined. If the correct recognition rate is greater than the set threshold, training of the response mode recognition model is determined to be complete, and the trained model is saved. If the correct recognition rate is not greater than the set threshold, the training set and the test set are reselected and training of the response mode recognition model continues until the correct recognition rate of the model's recognition results for the response voices in the test set is greater than the set threshold.
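The comparison of recognition results with labeled response modes can be sketched as below. This is an illustrative sketch only: the function names and the ten example labels are hypothetical, and the source does not state the value of the set threshold, so the thresholds used here are placeholders. The outputs follow the convention described above, 1 for the formal response mode and 0 for the informal response mode.

```python
def correct_recognition_rate(predicted, reference):
    """Fraction of response voices whose predicted mode matches the labeled mode."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("predicted and reference must be non-empty and equal in length")
    hits = sum(1 for p, r in zip(predicted, reference) if p == r)
    return hits / len(reference)


def training_complete(predicted, reference, threshold):
    """Training is complete only if the rate is strictly greater than the threshold."""
    return correct_recognition_rate(predicted, reference) > threshold


# Hypothetical labels and model outputs for ten test utterances.
reference = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
predicted = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

rate = correct_recognition_rate(predicted, reference)
print(rate)
```

Because the description requires the rate to be greater than the threshold, a rate exactly equal to the threshold leads to reselecting the sets and continuing training.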
Fig. 4 shows the overall flow of the method by which the embodiment of the present invention obtains the response mode recognition model.
Step 401: determine a training set comprising multiple response voices and a test set comprising multiple response voices, where the response voices in the training set are different from the response voices in the test set;
The following steps 402 and 403 are performed for any response voice in the training set.
Step 402: extract the speech features of the response voice with a feature extraction tool;
Step 403: input the extracted speech features and the response mode corresponding to the response voice into the response mode recognition model for training;
The following steps 404 and 405 are performed for any response voice in the test set.
Step 404: extract the speech features of the response voice with a feature extraction tool;
Step 405: input the extracted speech features into the response mode recognition model for recognition;
Step 406: determine the recognition result for each response voice in the test set;
Step 407: compare the recognition result for each response voice in the test set with the response mode corresponding to that response voice, and determine the correct recognition rate of the recognition results for the test set;
Step 408: judge whether the correct recognition rate is greater than the set threshold; if so, perform step 409; if not, return to step 401;
Step 409: determine that training of the response mode recognition model is complete, and save the trained response mode recognition model.
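The loop formed by steps 401 to 409 can be sketched as follows. The sketch makes several assumptions that are not in the source: the speech features of steps 402 and 404 are taken as already extracted and are replaced by synthetic two-dimensional vectors, a nearest-centroid classifier (`CentroidModel`) stands in for the recognition model, a cap on the number of rounds (`max_rounds`) is added as a safeguard, and the model is "saved" by serializing its parameters to a JSON string.

```python
import json
import random


class CentroidModel:
    """Nearest-centroid stand-in for the classifier; not the model of the embodiment."""

    def fit(self, features, labels):
        self.centroids = {}
        for label in sorted(set(labels)):
            rows = [f for f, l in zip(features, labels) if l == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, feature):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(feature, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


def select_sets(samples, test_fraction, rng):
    """Step 401: draw disjoint training and test sets from (features, label) pairs."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_until_accepted(samples, threshold, rng, max_rounds=10):
    """Repeat steps 401-408 until the correct recognition rate exceeds the threshold."""
    for _ in range(max_rounds):  # the cap on rounds is an added safeguard
        train, test = select_sets(samples, 0.3, rng)                   # step 401
        model = CentroidModel().fit([f for f, _ in train],
                                    [l for _, l in train])             # step 403
        predicted = [model.predict(f) for f, _ in test]                # steps 405-406
        hits = sum(1 for p, (_, l) in zip(predicted, test) if p == l)  # step 407
        rate = hits / len(test)
        if rate > threshold:                                           # step 408
            saved = json.dumps({str(k): v for k, v in model.centroids.items()})
            return model, rate, saved                                  # step 409
    raise RuntimeError("correct recognition rate never exceeded the threshold")


rng = random.Random(0)
# Synthetic, already-extracted feature vectors: label 1 = formal, 0 = informal.
samples = [([rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)], 1) for _ in range(50)]
samples += [([rng.gauss(-2.0, 0.5), rng.gauss(-2.0, 0.5)], 0) for _ in range(50)]

model, rate, saved = train_until_accepted(samples, threshold=0.9, rng=rng)
print(rate)
```

Any classifier exposing `fit` and `predict` in the same way could replace the stand-in without changing the loop.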
For the binary classification problem of recognizing the response mode, the embodiment of the present invention uses a support vector machine (SVM) classifier, which is suited to small amounts of data, as the response mode recognition model, and compares the following kernel functions: the linear kernel, the polynomial kernel, the Gaussian radial basis function kernel, and the arctangent kernel.
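For illustration, three of the compared kernel functions can be written in their commonly used forms as below. The source does not give the kernel formulas or their parameter values, so the forms, the parameter names `gamma` and `coef0`, and their default values are assumptions; `d` is the polynomial degree. The arctangent kernel is left out because its form is not stated in the source.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """K(x, y) = <x, y>."""
    return dot(x, y)


def polynomial_kernel(x, y, d, gamma=1.0, coef0=1.0):
    """K(x, y) = (gamma * <x, y> + coef0) ** d, with d the polynomial degree."""
    return (gamma * dot(x, y) + coef0) ** d


def gaussian_rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


x = [1.0, 2.0]
y = [3.0, 0.5]
print(linear_kernel(x, y))
print(polynomial_kernel(x, y, d=2))
print(gaussian_rbf_kernel(x, y))
```

A kernel only changes how similarity between two feature vectors is measured; the SVM training procedure itself stays the same for each kernel.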
Based on the training sets shown in Table 4, the embodiment of the present invention runs experiments with the linear kernel, the polynomial kernel, the Gaussian radial basis function kernel, and the arctangent kernel. The accuracy of the resulting recognition is shown in Fig. 5A and is as follows:
SVM kernel function | Accuracy with training set A | Accuracy with training set B |
---|---|---|
Linear | 80.30 | 81.02 |
Polynomial, d=2 | 77.95 | 79.25 |
Polynomial, d=3 | 76.17 | 81.13 |
Polynomial, d=4 | 63.79 | 63.94 |
Gaussian radial basis function | 90.71 | 91.62 |
Arctangent | 84.45 | 89.56 |
In addition, the performance comparison of the SVM model with the linear kernel, the polynomial kernel, the Gaussian radial basis function kernel, and the arctangent kernel is shown in Fig. 5B.
Based on the same inventive concept, an embodiment of the present invention further provides a device for recognizing response voice. Because the principle by which the device solves the problem is similar to that of the recognition method of the embodiment of the present invention, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
As shown in Fig. 6, the device for recognizing response voice according to the embodiment of the present invention includes:
An acquisition module 601, configured to obtain a response voice to be recognized;
A recognition module 602, configured to determine, using a response mode recognition model, the response mode corresponding to the response voice to be recognized, where the response mode recognition model is a supervised machine learning model;
A judging module 603, configured to: if the response mode is the formal response mode, input the response voice to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response voice to be recognized and outputs the corresponding text information; and if the response mode is the informal response mode, input the response voice to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response voice to be recognized and outputs the corresponding text information; where the first speech recognition system and the second speech recognition system are configured with different parameters.
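The routing performed by the judging module can be sketched as a simple dispatch. This is a hedged illustration: the function `route_response`, the stub classifier, and the two stub recognizers are hypothetical placeholders for the trained response mode recognition model and the two differently configured speech recognition systems, whose interfaces the source does not specify.

```python
FORMAL = 1    # output the description assigns to the formal response mode
INFORMAL = 0  # output the description assigns to the informal response mode


def route_response(audio, classify_mode, first_recognizer, second_recognizer):
    """Send the audio to the recognizer that matches its response mode; return the text."""
    mode = classify_mode(audio)
    if mode == FORMAL:
        return first_recognizer(audio)
    if mode == INFORMAL:
        return second_recognizer(audio)
    raise ValueError(f"unexpected response mode: {mode!r}")


# Hypothetical stubs standing in for the trained model and the two recognizers.
def stub_classifier(audio):
    return FORMAL if audio.startswith(b"F") else INFORMAL


def stub_first(audio):
    return "text from first system"


def stub_second(audio):
    return "text from second system"


formal_text = route_response(b"F-audio", stub_classifier, stub_first, stub_second)
informal_text = route_response(b"I-audio", stub_classifier, stub_first, stub_second)
print(formal_text, informal_text)
```

Passing the classifier and recognizers as callables keeps the routing logic independent of how each speech recognition system is configured.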
Optionally, the recognition module 602 is specifically configured to:
Input the speech features extracted from the response voice to be recognized into the response mode recognition model, and obtain the response mode that the response mode recognition model outputs for the response voice to be recognized.
Optionally, the speech features include frame-level features, segment-level features, and utterance-level features;
The recognition module 602 is specifically configured to:
Extract speech features from the response voice in the following manner:
Using a feature extraction tool, extract the frame-level features of the response voice to be recognized according to a preset frame length and frame shift; apply smoothing filtering to the frame-level features, and apply a difference operation to the smoothed frame-level features, to determine the segment-level features of the response voice to be recognized; and analyze the segment-level features according to preset statistical parameters to determine the utterance-level features of the response voice to be recognized.
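The three feature levels can be illustrated with a small sketch. Everything specific in it is an assumption rather than the method of the embodiment: short-time energy is used as the only frame-level feature, a 3-point moving average stands in for the smoothing filter, a first-order difference is used as the difference operation, and mean, maximum, minimum, and standard deviation stand in for the preset statistical parameters. The frame length of 400 samples and frame shift of 160 samples (25 ms and 10 ms at 16 kHz) are likewise illustrative.

```python
import math


def frame_energy(samples, frame_len, frame_shift):
    """Frame-level feature: mean squared amplitude of each frame."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_shift):
        frame = samples[start:start + frame_len]
        frames.append(sum(s * s for s in frame) / frame_len)
    return frames


def moving_average(values, width=3):
    """Smoothing filter: centered moving average, truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out


def first_difference(values):
    """Difference operation: change between consecutive smoothed frames."""
    return [b - a for a, b in zip(values, values[1:])]


def summarize(values):
    """Utterance-level features: statistics over the segment-level values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "max": max(values), "min": min(values), "std": math.sqrt(var)}


# Synthetic 0.1 s signal at 16 kHz: a 440 Hz tone with linearly rising amplitude.
samples = [(n / 1600) * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]

frame_level = frame_energy(samples, frame_len=400, frame_shift=160)
segment_level = first_difference(moving_average(frame_level))
utterance_level = summarize(segment_level)
print(len(frame_level), len(segment_level), utterance_level)
```

The utterance-level result has a fixed size regardless of how long the response voice is, which is what allows it to be fed to a classifier that expects fixed-length input.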
Optionally, the acquisition module 601 is further configured to:
Obtain the response mode recognition model in the following manner:
Determine a training set comprising multiple response voices and a test set comprising multiple response voices, where the response voices in the training set are different from the response voices in the test set; for any response voice in the training set, input the speech features extracted from that response voice into the untrained response mode recognition model to train it; for any response voice in the test set, input the speech features extracted from that response voice into the trained response mode recognition model, and obtain the response mode that the model outputs for that response voice; according to the response mode output by the trained response mode recognition model for each response voice in the test set, determine the recognition accuracy of the trained model; and if the recognition accuracy is greater than a set threshold, determine that training of the response mode recognition model is complete, and save the trained response mode recognition model.
Optionally, the response mode recognition model is a support vector machine (SVM) model.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these changes and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
Claims (10)
1. A method for recognizing response voice, characterized in that the method comprises:
Obtaining a response voice to be recognized;
Determining, using a response mode recognition model, the response mode corresponding to the response voice to be recognized, wherein the response mode recognition model is a supervised machine learning model;
If the response mode is a formal response mode, inputting the response voice to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to the response voice to be recognized;
If the response mode is an informal response mode, inputting the response voice to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to the response voice to be recognized;
Wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
2. The method according to claim 1, characterized in that determining, using the response mode recognition model, the response mode corresponding to the response voice to be recognized specifically comprises:
Inputting the speech features extracted from the response voice to be recognized into the response mode recognition model;
Obtaining the response mode that the response mode recognition model outputs for the response voice to be recognized.
3. The method according to claim 2, characterized in that the speech features include frame-level features, segment-level features, and utterance-level features;
Speech features are extracted from the response voice in the following manner:
Using a feature extraction tool, extracting the frame-level features of the response voice to be recognized according to a preset frame length and frame shift;
Applying smoothing filtering to the frame-level features, and applying a difference operation to the smoothed frame-level features, to determine the segment-level features of the response voice to be recognized;
Analyzing the segment-level features according to preset statistical parameters to determine the utterance-level features of the response voice to be recognized.
4. The method according to claim 1, characterized in that the response mode recognition model is obtained in the following manner:
Determining a training set comprising multiple response voices and a test set comprising multiple response voices, wherein the response voices in the training set are different from the response voices in the test set;
For any response voice in the training set, inputting the speech features extracted from that response voice into the untrained response mode recognition model to train it;
For any response voice in the test set, inputting the speech features extracted from that response voice into the trained response mode recognition model, and obtaining the response mode that the trained response mode recognition model outputs for that response voice;
According to the response mode output by the trained response mode recognition model for each response voice in the test set, determining the recognition accuracy of the trained response mode recognition model; if the recognition accuracy is greater than a set threshold, determining that training of the response mode recognition model is complete, and saving the trained response mode recognition model.
5. The method according to any one of claims 1 to 4, characterized in that the response mode recognition model is a support vector machine (SVM) model.
6. A device for recognizing response voice, characterized in that the device comprises:
An acquisition module, configured to obtain a response voice to be recognized;
A recognition module, configured to determine, using a response mode recognition model, the response mode corresponding to the response voice to be recognized, wherein the response mode recognition model is a supervised machine learning model;
A judging module, configured to: if the response mode is a formal response mode, input the response voice to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to the response voice to be recognized; and if the response mode is an informal response mode, input the response voice to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response voice to be recognized and outputs the text information corresponding to the response voice to be recognized; wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
7. The device according to claim 6, characterized in that the recognition module is specifically configured to:
Input the speech features extracted from the response voice to be recognized into the response mode recognition model, and obtain the response mode that the response mode recognition model outputs for the response voice to be recognized.
8. The device according to claim 7, characterized in that the speech features include frame-level features, segment-level features, and utterance-level features;
The recognition module is specifically configured to:
Extract speech features from the response voice in the following manner:
Using a feature extraction tool, extract the frame-level features of the response voice to be recognized according to a preset frame length and frame shift; apply smoothing filtering to the frame-level features, and apply a difference operation to the smoothed frame-level features, to determine the segment-level features of the response voice to be recognized; and analyze the segment-level features according to preset statistical parameters to determine the utterance-level features of the response voice to be recognized.
9. The device according to claim 6, characterized in that the acquisition module is further configured to:
Obtain the response mode recognition model in the following manner:
Determine a training set comprising multiple response voices and a test set comprising multiple response voices, wherein the response voices in the training set are different from the response voices in the test set; for any response voice in the training set, input the speech features extracted from that response voice into the untrained response mode recognition model to train it; for any response voice in the test set, input the speech features extracted from that response voice into the trained response mode recognition model, and obtain the response mode that the trained response mode recognition model outputs for that response voice; according to the response mode output by the trained response mode recognition model for each response voice in the test set, determine the recognition accuracy of the trained response mode recognition model; and if the recognition accuracy is greater than a set threshold, determine that training of the response mode recognition model is complete, and save the trained response mode recognition model.
10. The device according to any one of claims 6 to 9, characterized in that the response mode recognition model is a support vector machine (SVM) model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611081923.XA CN106531158A (en) | 2016-11-30 | 2016-11-30 | Method and device for recognizing answer voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106531158A true CN106531158A (en) | 2017-03-22 |
Family
ID=58353579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611081923.XA Pending CN106531158A (en) | 2016-11-30 | 2016-11-30 | Method and device for recognizing answer voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106531158A (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101689367A (en) * | 2007-05-31 | 2010-03-31 | 摩托罗拉公司 | Method and system to configure audio processing paths for voice recognition |
CN102237087A (en) * | 2010-04-27 | 2011-11-09 | 中兴通讯股份有限公司 | Voice control method and voice control device |
CN102789779A (en) * | 2012-07-12 | 2012-11-21 | 广东外语外贸大学 | Speech recognition system and recognition method thereof |
CN103971700A (en) * | 2013-08-01 | 2014-08-06 | 哈尔滨理工大学 | Voice monitoring method and device |
CN104464756A (en) * | 2014-12-10 | 2015-03-25 | 黑龙江真美广播通讯器材有限公司 | Small speaker emotion recognition system |
CN104464724A (en) * | 2014-12-08 | 2015-03-25 | 南京邮电大学 | Speaker recognition method for deliberately pretended voices |
CN104517609A (en) * | 2013-09-27 | 2015-04-15 | 华为技术有限公司 | Voice recognition method and device |
CN105493179A (en) * | 2013-07-31 | 2016-04-13 | 微软技术许可有限责任公司 | System with multiple simultaneous speech recognizers |
CN105529027A (en) * | 2015-12-14 | 2016-04-27 | 百度在线网络技术(北京)有限公司 | Voice identification method and apparatus |
CN105719664A (en) * | 2016-01-14 | 2016-06-29 | 盐城工学院 | Likelihood probability fuzzy entropy based voice emotion automatic identification method at tension state |
CN105812535A (en) * | 2014-12-29 | 2016-07-27 | 中兴通讯股份有限公司 | Method of recording speech communication information and terminal |
CN106033669A (en) * | 2015-03-18 | 2016-10-19 | 展讯通信(上海)有限公司 | Voice identification method and apparatus thereof |
CN106153065A (en) * | 2014-10-17 | 2016-11-23 | 现代自动车株式会社 | The method of audio frequency and video navigator, vehicle and control audio frequency and video navigator |
Non-Patent Citations (1)
Title |
---|
唐刚: "飞行员应答语音的自动识别研究", 《工程科技Ⅱ辑》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065076A (en) * | 2018-09-05 | 2018-12-21 | 深圳追科技有限公司 | Setting method, device, equipment and the storage medium of audio tag |
CN109065076B (en) * | 2018-09-05 | 2020-11-27 | 深圳追一科技有限公司 | Audio label setting method, device, equipment and storage medium |
CN109308783A (en) * | 2018-11-21 | 2019-02-05 | 黑龙江大学 | It knocks at the door automatic-answering back device anti-theft device and answer method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10388279B2 (en) | | Voice interaction apparatus and voice interaction method |
CN105374356B (en) | | Speech recognition method, speech assessment method, speech recognition system and speech assessment system |
WO2021128741A1 (en) | | Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium |
CN105632501B (en) | | Automatic accent classification method and device based on deep learning |
US9754580B2 (en) | | System and method for extracting and using prosody features |
CN107972028B (en) | | Man-machine interaction method and device and electronic equipment |
CN110570873B (en) | | Voiceprint wake-up method and device, computer equipment and storage medium |
CN110457432A (en) | | Interview scoring method, device, equipment and storage medium |
CN104573462B (en) | | Fingerprint and voiceprint dual-authentication method |
CN109545243A (en) | | Pronunciation quality evaluation method, device, electronic equipment and storage medium |
CN107767881A (en) | | Method and device for acquiring satisfaction from voice information |
CN106875943A (en) | | Speech recognition system for big data analysis |
US20180122377A1 (en) | | Voice interaction apparatus and voice interaction method |
Koolagudi et al. | | Two stage emotion recognition based on speaking rate |
CN109377981B (en) | | Phoneme alignment method and device |
CN103177733A (en) | | Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality |
CN107886968B (en) | | Voice evaluation method and system |
CN108899033B (en) | | Method and device for determining speaker characteristics |
CN112992191B (en) | | Voice endpoint detection method and device, electronic equipment and readable storage medium |
CN104008752A (en) | | Speech recognition device and method, and semiconductor integrated circuit device |
Selvaraj et al. | | Human speech emotion recognition |
CN106782503A (en) | | Automatic speech recognition method based on physiological information during phonation |
CN104299612B (en) | | Method and device for detecting imitated-voice similarity |
CN106782517A (en) | | Speech audio keyword filtering method and device |
CN110782902A (en) | | Audio data determination method, apparatus, device and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170322 |