CN106531158A - Method and device for recognizing answer voice - Google Patents

Method and device for recognizing answer voice

Info

Publication number
CN106531158A
Authority
CN
China
Prior art keywords
response
voice
identified
identification model
response mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611081923.XA
Other languages
Chinese (zh)
Inventor
谢湘
唐刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201611081923.XA priority Critical patent/CN106531158A/en
Publication of CN106531158A publication Critical patent/CN106531158A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training

Abstract

The invention relates to the field of computational paralinguistics, and in particular to a method and device for recognizing response speech, so as to solve the problem that current response speech recognition methods are not accurate enough. The method of the embodiment of the invention comprises the following steps: response speech to be recognized is acquired; a response-mode recognition model is used to determine the response mode corresponding to the response speech to be recognized; if the response mode is the formal response mode, the response speech to be recognized is input into a first speech recognition system; and if the response mode is the informal response mode, the response speech to be recognized is input into a second speech recognition system. When recognizing response speech, the method of the embodiment of the invention first recognizes whether the speech belongs to the formal or the informal response mode and then inputs it into a different speech recognition system for each mode, so that overall speech recognition performance is improved.

Description

Method and device for recognizing response speech
Technical field
The present invention relates to the field of computational paralinguistics, and in particular to a method and device for recognizing response speech.
Background technology
In recent years, computational paralinguistics has become a research hotspot in the field of speech and language processing, and the development of speech recognition technology plays an important role in promoting intelligent, human-oriented novel human-machine interaction technologies and their applications.
Speech recognition is the technology of automatically converting speech into text by computer. Speech has always been an important medium of interaction in human life, so enabling machines to recognize speech is a crucial step. At present, voice recorders are used in many settings, and the speech they record needs to be analyzed. For example, in a flight scenario, the speech on board the aircraft is recorded by the cockpit voice recorder, and after the flight ends flight quality is evaluated by recognizing the speech in the cockpit voice recorder. Currently, when the speech recorded in a voice recorder is recognized, automatic machine recognition is used: endpoint detection divides the recorded speech into sentence-level response utterances to be recognized, and each response utterance is input into a speech recognition system, which recognizes it. Depending on the addressee and the environment, response speech to be recognized is divided into formal response speech and informal response speech; formal and informal response speech correspond to different speech environments, and the speaker's tone and intonation also differ. The prior-art approach of directly feeding the acquired response speech into a speech recognition system therefore often fails to recognize it accurately.
In summary, current response speech recognition methods are not accurate enough when recognizing response speech.
Summary of the invention
The present invention provides a method and device for recognizing response speech, so as to solve the problem that current response speech recognition methods are not accurate enough when recognizing response speech.
Based on the above problem, an embodiment of the present invention provides a method for recognizing response speech, comprising:
acquiring response speech to be recognized;
determining the response mode corresponding to the response speech to be recognized by using a response-mode recognition model, wherein the response-mode recognition model is a supervised machine learning model;
if the response mode is the formal response mode, inputting the response speech to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information;
if the response mode is the informal response mode, inputting the response speech to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information;
wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
When recognizing response speech, the embodiment of the present invention first acquires the response speech to be recognized, then uses the response-mode recognition model to determine its response mode, and inputs formal and informal response speech into different speech recognition systems for recognition. Since the first speech recognition system is used for recognizing formal response speech, the second speech recognition system is used for recognizing informal response speech, and the two systems are configured with different parameters, using a different speech recognition system for each response mode makes the recognition of the response speech to be recognized more accurate.
Optionally, determining the response mode corresponding to the response speech to be recognized by using the response-mode recognition model specifically includes:
inputting the speech features extracted from the response speech to be recognized into the response-mode recognition model;
obtaining the response mode, output by the response-mode recognition model, that corresponds to the response speech to be recognized.
In the embodiment of the present invention, after feature extraction is performed on the response speech to be recognized, the extracted speech features are input into the response-mode recognition model, and the response-mode recognition model determines the response mode corresponding to the response speech to be recognized.
Optionally, the speech features include frame-level features, segment-level features and utterance-level features;
the speech features are extracted from the response speech in the following manner:
using a feature extraction tool, extracting the frame-level features of the response speech to be recognized according to a preset frame length and frame shift;
applying smoothing filtering to the frame-level features and applying a difference operation to the smoothed frame-level features, to determine the segment-level features of the response speech to be recognized;
analyzing the segment-level features according to preset statistical parameters, to determine the utterance-level features of the response speech to be recognized.
Since the embodiment of the present invention extracts frame-level, segment-level and utterance-level speech features from the response speech to be recognized, the response-mode recognition model can accurately recognize the response mode corresponding to the response speech to be recognized.
Optionally, the response-mode recognition model is obtained in the following manner:
determining a training set containing multiple response utterances and a test set containing multiple response utterances, wherein the response utterances in the training set are different from those in the test set;
for each response utterance in the training set, inputting the speech features extracted from that utterance into the response-mode recognition model before training, for training;
for each response utterance in the test set, inputting the speech features extracted from that utterance into the trained response-mode recognition model, and obtaining the response mode, output by the response-mode recognition model, that corresponds to that utterance;
determining the recognition accuracy of the trained response-mode recognition model according to the response modes output by the trained model for the response utterances in the test set, and if the recognition accuracy is greater than a set threshold, determining that training of the response-mode recognition model is complete and saving the trained response-mode recognition model.
In the embodiment of the present invention, the response-mode recognition model is trained with the multiple response utterances in the training set, and after training the response utterances in the test set are used to judge whether the trained model meets the requirement. When the recognition accuracy of the response-mode recognition model on the response utterances in the test set is greater than the set threshold, it is determined that training of the model is complete and the trained model is saved; if the recognition accuracy is not greater than the set threshold, the model is trained again with the response utterances in a training set until its recognition accuracy exceeds the set threshold. This ensures that the obtained response-mode recognition model recognizes the response mode of the response speech to be recognized more accurately.
Optionally, the response-mode recognition model is a support vector machine (SVM) model.
In another aspect, an embodiment of the present invention also provides a device for recognizing response speech, comprising:
an acquisition module, configured to acquire response speech to be recognized;
a recognition module, configured to determine the response mode corresponding to the response speech to be recognized by using a response-mode recognition model, wherein the response-mode recognition model is a supervised machine learning model;
a judgment module, configured to: if the response mode is the formal response mode, input the response speech to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information; and if the response mode is the informal response mode, input the response speech to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information; wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
Optionally, the recognition module is specifically configured to:
input the speech features extracted from the response speech to be recognized into the response-mode recognition model, and obtain the response mode, output by the response-mode recognition model, that corresponds to the response speech to be recognized.
Optionally, the speech features include frame-level features, segment-level features and utterance-level features;
the recognition module is specifically configured to:
extract the speech features from the response speech in the following manner:
using a feature extraction tool, extract the frame-level features of the response speech to be recognized according to a preset frame length and frame shift; apply smoothing filtering to the frame-level features and apply a difference operation to the smoothed frame-level features, to determine the segment-level features of the response speech to be recognized; and analyze the segment-level features according to preset statistical parameters, to determine the utterance-level features of the response speech to be recognized.
Optionally, the acquisition module is further configured to:
obtain the response-mode recognition model in the following manner:
determine a training set containing multiple response utterances and a test set containing multiple response utterances, wherein the response utterances in the training set are different from those in the test set; for each response utterance in the training set, input the speech features extracted from that utterance into the response-mode recognition model before training, for training; for each response utterance in the test set, input the speech features extracted from that utterance into the trained response-mode recognition model, and obtain the response mode, output by the response-mode recognition model, that corresponds to that utterance; determine the recognition accuracy of the trained response-mode recognition model according to the response modes output by the trained model for the response utterances in the test set, and if the recognition accuracy is greater than a set threshold, determine that training of the response-mode recognition model is complete and save the trained response-mode recognition model.
Optionally, the response-mode recognition model is a support vector machine (SVM) model.
Description of the drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the method for recognizing response speech according to an embodiment of the present invention;
Fig. 2 is a flow chart of extracting speech features according to an embodiment of the present invention;
Fig. 3 is a flow chart of the method for obtaining the response-mode recognition model according to an embodiment of the present invention;
Fig. 4 is an overall flow chart of the method for obtaining the response-mode recognition model according to an embodiment of the present invention;
Fig. 5A is a schematic diagram of the recognition accuracy obtained with different SVM kernel functions according to an embodiment of the present invention;
Fig. 5B is a performance comparison chart of the SVM kernel functions according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the device for recognizing response speech according to an embodiment of the present invention.
Specific embodiment
In the embodiment of the present invention, response speech to be recognized is acquired; a response-mode recognition model is used to determine the response mode corresponding to the response speech to be recognized, wherein the response-mode recognition model is a supervised machine learning model; if the response mode is the formal response mode, the response speech to be recognized is input into a first speech recognition system, so that the first speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information; if the response mode is the informal response mode, the response speech to be recognized is input into a second speech recognition system, so that the second speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information; wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
When recognizing response speech, the embodiment of the present invention first acquires the response speech to be recognized, uses the response-mode recognition model to determine its response mode, and inputs formal and informal response speech into different speech recognition systems for recognition. The first speech recognition system is used for recognizing formal response speech, the second speech recognition system is used for recognizing informal response speech, and the two systems are configured with different parameters. The embodiment of the present invention first recognizes whether the response speech belongs to the formal or the informal response mode and then uses a different speech recognition system for each response mode, which improves overall speech recognition performance and makes the recognition of the response speech to be recognized more accurate.
It should be noted that the method of recognizing the response mode of response speech in the embodiment of the present invention can be used not only to improve the performance of a speech recognition system, but can also be applied to other higher-level systems, such as speaker recognition systems and abnormal-sound monitoring systems.
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, the method for recognizing response speech according to an embodiment of the present invention includes:
Step 101: acquiring response speech to be recognized;
Step 102: determining the response mode corresponding to the response speech to be recognized by using a response-mode recognition model, wherein the response-mode recognition model is a supervised machine learning model;
Step 103: if the response mode is the formal response mode, inputting the response speech to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information; if the response mode is the informal response mode, inputting the response speech to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information; wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
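A minimal sketch of the routing in steps 101 to 103 is given below, assuming a trained response-mode classifier and two pre-configured recognizers exposed through hypothetical predict and transcribe methods; none of these names are interfaces defined by the patent.

```python
# Illustrative sketch of steps 101-103: first determine the response mode,
# then route the utterance to one of two differently configured recognizers.
# The classifier and recognizer objects are assumed placeholders.
class ResponseVoiceRecognizer:
    def __init__(self, mode_classifier, formal_asr, informal_asr):
        self.mode_classifier = mode_classifier  # supervised model, e.g. an SVM
        self.formal_asr = formal_asr            # first speech recognition system
        self.informal_asr = informal_asr        # second system, different parameters

    def recognize(self, utterance):
        # Step 102: determine the response mode of the utterance to be recognized.
        mode = self.mode_classifier.predict(utterance)  # "formal" or "informal"
        # Step 103: route the utterance to the recognizer configured for that mode.
        if mode == "formal":
            return self.formal_asr.transcribe(utterance)
        return self.informal_asr.transcribe(utterance)
```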
In the embodiment of the present invention, the response mode corresponding to the response speech to be recognized includes the formal response mode and the informal response mode.
The embodiment of the present invention can be applied in a flight scenario to recognize the response mode of response speech, that is, to recognize whether the response mode of in-flight response speech is the formal response mode or the informal response mode. Speech of the formal response mode is the instruction-type dialogue between the pilot and the ground control center; for example, the pilot sends a request to the ground control center, the ground control center responds to the pilot's request, and the pilot replies to the ground control center to confirm.
Speech of the informal response mode is the dialogue between the pilot and the copilot, or the dialogue between the pilot and the ground tower; for example, chat between the pilot and the copilot, guidance speech between the pilot and the copilot during the flight, or the pilot reporting the aircraft state to the ground tower center.
It should be noted that the embodiment of the present invention is not limited to flight scenarios; the response-mode recognition method of the embodiment of the present invention can be used in any speech context, and in different speech contexts the definitions of the formal response mode and the informal response mode also differ. For example, suppose A and B are football match commentators: when determining the response mode of the dialogue between A and B, the dialogue between A and B about the football match is defined as dialogue of the formal response mode, and the dialogue between A and B unrelated to the football match is defined as dialogue of the informal response mode.
When determining the response mode corresponding to the response speech to be recognized by using the response-mode recognition model, the embodiment of the present invention specifically uses the following method:
Optionally, the speech features extracted from the response speech to be recognized are input into the response-mode recognition model, and the response mode, output by the response-mode recognition model, that corresponds to the response speech to be recognized is obtained.
The response-mode recognition model of the embodiment of the present invention is a supervised machine learning model; specifically, it is a support vector machine (SVM) model.
After the response speech to be recognized is acquired, the embodiment of the present invention uses a feature extraction tool to extract the speech features from the response speech to be recognized.
In implementation, when extracting the speech features from the response speech to be recognized, the embodiment of the present invention extracts them in a layered manner.
The speech features of the embodiment of the present invention include frame-level features, segment-level features and utterance-level features.
Specifically, the embodiment of the present invention uses the openSMILE feature extraction tool to extract the speech features from the response speech to be recognized layer by layer.
Optionally, using the feature extraction tool, the frame-level features of the response speech to be recognized are extracted according to a preset frame length and frame shift; smoothing filtering is applied to the frame-level features and a difference operation is applied to the smoothed frame-level features, to determine the segment-level features of the response speech to be recognized; and the segment-level features are analyzed according to preset statistical parameters, to determine the utterance-level features of the response speech to be recognized.
The method by which the embodiment of the present invention extracts the speech features from the response speech to be recognized is described in detail below.
In the first step, the frame-level features of the response speech to be recognized are extracted.
The frame-level features are the first layer of speech features of the response speech to be recognized.
In implementation, the openSMILE feature extraction tool is used with a frame length of 20 ms and a frame shift of 10 ms, giving 16 feature dimensions in total. The specific frame-level feature parameters are shown in Table 1 and include:
RMSenergy (root-mean-square energy), mfcc dimensions 1-12 (Mel-frequency cepstral coefficients), zcr (zero-crossing rate), voice_prob (voicing probability), and F0 (fundamental frequency computed from the cepstrum).
Table 1
Abbreviation of frame-level feature    Meaning of frame-level feature
RMSenergy    Root-mean-square energy
mfcc (1-12)    Mel-frequency cepstral coefficients, dimensions 1-12
zcr    Zero-crossing rate (frame level)
voice_prob    Voicing probability computed from autocorrelation
F0    Fundamental frequency computed from the cepstrum
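A rough sketch of this first step is given below, using librosa in place of the openSMILE tool named in the text (an assumption; the exact openSMILE configuration is not reproduced), extracting the 16 frame-level dimensions of Table 1 with a 20 ms frame length and 10 ms frame shift.

```python
# Illustrative frame-level extraction (16 dims per frame): RMS energy, MFCC 1-12,
# zero-crossing rate, voicing probability and F0, 20 ms frames / 10 ms shift.
# librosa stands in for openSMILE here; this substitution is an assumption.
import numpy as np
import librosa

def frame_level_features(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    frame_length = int(0.020 * sr)   # 20 ms frame
    hop_length = int(0.010 * sr)     # 10 ms frame shift

    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=frame_length, hop_length=hop_length)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)[0]
    # Pitch tracking keeps the same 10 ms hop but uses pyin's default analysis window.
    f0, _, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop_length)
    f0 = np.nan_to_num(f0)           # unvoiced frames carry no F0 estimate

    # The extractors can differ by a frame or two in length; truncate to the shortest.
    n = min(len(rms), mfcc.shape[1], len(zcr), len(f0), len(voiced_prob))
    feats = np.vstack([rms[:n], mfcc[:, :n], zcr[:n], voiced_prob[:n], f0[:n]])
    return feats.T                   # shape: (n_frames, 16)
```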
In the second step, the segment-level features of the response speech to be recognized are extracted.
The segment-level features are the second layer of speech features of the response speech to be recognized.
Specifically, smoothing filtering is applied to the frame-level features, and a difference operation is applied to the smoothed frame-level features, to determine the segment-level features of the response speech to be recognized.
In implementation, the frame sequence obtained in the first step is smoothed by a moving-average filter with a window length of 3 frames (sma, smoothed by a moving average filter);
after the frame sequence has been smoothed, a first-order difference (de, delta coefficient) is computed on the smoothed features.
The specific segment-level feature analysis functions are shown in Table 2 and include:
sma (smoothing filter) and de (first-order difference).
Table 2
Abbreviation of segment-level analysis function    Meaning of segment-level analysis function
sma    Smoothing filter (moving average)
de    First-order difference
After the first and second steps, 16*2 = 32-dimensional speech features are obtained.
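A minimal numpy sketch of this second step, assuming the frame-level matrix from the first step: a 3-frame moving-average smoothing (sma) of each contour followed by a first-order difference (de), giving the 32 contours.

```python
# Illustrative segment-level processing: smooth each frame-level contour with a
# 3-frame moving average (sma), then take the first-order difference (de) of the
# smoothed contour, yielding 16 * 2 = 32 contours per utterance.
import numpy as np

def segment_level_features(frames):
    """frames: (n_frames, 16) frame-level features -> (n_frames, 32)."""
    kernel = np.ones(3) / 3.0
    # Smooth each of the 16 contours with a length-3 moving average.
    sma = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"),
                              axis=0, arr=frames)
    # First-order difference of the smoothed contours (prepend keeps the length).
    de = np.diff(sma, axis=0, prepend=sma[:1])
    return np.hstack([sma, de])      # shape: (n_frames, 32)
```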
In the third step, the utterance-level features of the response speech to be recognized are extracted.
The utterance-level features are the third layer of speech features of the response speech to be recognized.
Specifically, the segment-level features are analyzed according to preset statistical parameters, to determine the utterance-level features of the response speech to be recognized.
In implementation, statistical analysis is performed on the features output by the second step, mainly using 12 statistical parameters; the segment-level features output by the second step are analyzed according to these 12 statistical parameters, and the utterance-level features of the response speech to be recognized are obtained.
The 12 preset statistical parameters are shown in Table 3 and include:
max (maximum of the envelope), min (minimum of the envelope), range (range of the envelope), maxpos (position of the maximum), minpos (absolute position of the envelope minimum), amean (arithmetic mean of the envelope), linregc1 (slope of the linear approximation of the envelope), linregc2 (offset of the linear approximation of the envelope), linregerrQ (root mean square of the difference between the linear prediction and the actual value of the envelope), stddev (standard deviation), skewness (third-order skewness), and kurtosis (fourth-order kurtosis).
Table 3
Abbreviation of utterance-level statistical parameter    Meaning of utterance-level statistical parameter
max    Maximum of the envelope
min    Minimum of the envelope
range    Range of the envelope
maxpos    Position of the maximum
minpos    Absolute position of the envelope minimum
amean    Arithmetic mean of the envelope
linregc1    Slope of the linear approximation of the envelope
linregc2    Offset of the linear approximation of the envelope
linregerrQ    Root mean square of the difference between the linear prediction and the actual value of the envelope
stddev    Standard deviation
skewness    Third-order skewness
kurtosis    Fourth-order kurtosis
As shown in Fig. 2, when the embodiment of the present invention extracts the utterance-level features of the response speech to be recognized in the third step, the statistical analysis with the 12 preset statistical parameters is performed on the segment-level features obtained in the second step, so that after the third step of utterance-level feature extraction, 16*2*12 = 384-dimensional speech features are obtained.
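A sketch of this third step is shown below, assuming the 32 contours from the second step and using scipy for skewness and kurtosis; the functionals follow Table 3, though the exact openSMILE definitions of the envelope functionals may differ slightly.

```python
# Illustrative utterance-level functionals (Table 3) applied to each of the 32
# contours: max, min, range, argmax/argmin position, mean, linear-fit slope and
# offset, RMS of the linear-fit residual, std deviation, skewness and kurtosis,
# giving 32 * 12 = 384 dimensions per utterance.
import numpy as np
from scipy.stats import skew, kurtosis

def utterance_level_features(contours):
    """contours: (n_frames, 32) segment-level features -> (384,) vector."""
    t = np.arange(contours.shape[0])
    stats = []
    for c in contours.T:
        slope, offset = np.polyfit(t, c, 1)           # linear approximation of the contour
        residual = c - (slope * t + offset)
        stats.extend([
            c.max(), c.min(), c.max() - c.min(),       # max, min, range
            int(c.argmax()), int(c.argmin()),          # maxpos, minpos
            c.mean(),                                   # amean
            slope, offset,                              # linregc1, linregc2
            np.sqrt(np.mean(residual ** 2)),            # linregerrQ
            c.std(),                                    # stddev
            skew(c), kurtosis(c),                       # skewness, (excess) kurtosis
        ])
    return np.asarray(stats)                            # shape: (384,)
```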
After the embodiment of the present invention has extracted the speech features from the response speech to be recognized with the feature extraction tool, the extracted speech features are input into the response-mode recognition model, so that the response-mode recognition model recognizes the response mode corresponding to the response speech to be recognized according to these speech features; the response mode that the response-mode recognition model outputs for the input speech features is then obtained.
It should be noted that the response-mode recognition model of the embodiment of the present invention is a model that has been trained in advance for recognizing the response mode.
Since the recognition of the response mode of the response speech to be recognized in the embodiment of the present invention relies mainly on the response-mode recognition model, and the response-mode recognition model is a model trained in advance, the embodiment of the present invention also includes an important component, namely training the response-mode recognition model.
The process of training the response-mode recognition model in the embodiment of the present invention is described in detail below.
As shown in Fig. 3, the method by which the embodiment of the present invention obtains the response-mode recognition model includes:
Step 301: determining a training set containing multiple response utterances and a test set containing multiple response utterances, wherein the response utterances in the training set are different from those in the test set;
Step 302: for each response utterance in the training set, inputting the speech features extracted from that utterance into the response-mode recognition model before training, for training;
Step 303: for each response utterance in the test set, inputting the speech features extracted from that utterance into the trained response-mode recognition model, and obtaining the response mode, output by the response-mode recognition model, that corresponds to that utterance;
Step 304: determining the recognition accuracy of the trained response-mode recognition model according to the response modes output by the trained model for the response utterances in the test set, and if the recognition accuracy is greater than a set threshold, determining that training of the response-mode recognition model is complete and saving the trained response-mode recognition model.
In step 301, when determining the training set and the test set, the embodiment of the present invention selects multiple response utterances from a corpus and composes the selected response utterances into the training set or the test set.
The corpus of the embodiment of the present invention consists of speech recorded in advance, and the pre-recorded speech includes multiple response utterances of the formal response mode and of the informal response mode.
For example, the corpus may be 17.5 hours of speech recorded during actual flights. After recording, the 17.5 hours of speech are annotated; suppose the annotation determines that the 17.5 hours of speech contain 18 speakers in total, with 4668 response utterances of the formal response mode and 2257 response utterances of the informal response mode. The ratio of formal to informal response utterances is then 2.07:1, and all response utterances have a sampling frequency of 16 kHz and a quantization precision of 16 bit.
Multiple response utterances are selected from all the response utterances in the corpus to form the training set. Preferably, the ratio of formal to informal response utterances in the training set is close to the ratio of formal to informal response utterances in the corpus.
For example, two training sets, training set A and training set B, and one test set C are determined, where the numbers and ratios of formal and informal response utterances in training sets A and B and test set C are as shown in Table 4:
1580 response utterances of the formal response mode and 1580 response utterances of the informal response mode are selected from the corpus to form training set A, so that the ratio of formal to informal response utterances in training set A is 1:1; 3270 formal and 1580 informal response utterances are selected from the corpus to form training set B, with a ratio of 2.07:1; and 1400 formal and 677 informal response utterances are selected from the corpus to form test set C, with a ratio of 2.07:1.
Table 4
Data set    Formal response utterances    Informal response utterances    Ratio (formal:informal)
Training set A    1580    1580    1:1
Training set B    3270    1580    2.07:1
Test set C    1400    677    2.07:1
Taking training sets A and B and test set C shown in Table 4 as an example, the method for training the response-mode recognition model is described below.
Specifically, the embodiment of the present invention trains the response-mode recognition model with each response utterance in training set A and training set B. After training is completed, each response utterance in test set C is input into the trained response-mode recognition model; if the accuracy of the response modes output by the response-mode recognition model for the response utterances in test set C exceeds the set threshold, it is determined that training of the response-mode recognition model is complete, and the trained model is saved.
The process of training the response-mode recognition model is illustrated below for any one response utterance in training set A:
1. Using the feature extraction tool, extract the speech features of the response utterance.
The specific method of extracting the speech features of a response utterance is the method described above and is not repeated here.
2. Input the speech features corresponding to the response utterance into the response-mode recognition model for training.
Specifically, the speech features corresponding to the response utterance are input into the response-mode recognition model together with the response mode corresponding to that utterance, so that the response-mode recognition model learns the response mode corresponding to those speech features.
The embodiment of the present invention trains the response-mode recognition model with the response utterances in the training set in the manner described above. After multiple rounds of training with the response utterances in training set A and training set B, the response utterances in test set C are used to judge whether training of the response-mode recognition model is complete.
Specifically, when using test set C to judge whether training of the response-mode recognition model is complete, the following operations are performed for each response utterance in test set C:
1. Using the feature extraction tool, extract the speech features of the response utterance;
the specific method of extracting the speech features of a response utterance is the method described above and is not repeated here.
2. Input the speech features corresponding to the response utterance into the trained response-mode recognition model;
3. Obtain the response mode output by the trained response-mode recognition model for the response utterance.
Specifically, when the response-mode recognition model determines that the response mode corresponding to the response utterance is the formal response mode, it outputs "1"; when it determines that the response mode corresponding to the response utterance is the informal response mode, it outputs "0".
After the trained response-mode recognition model has judged each response utterance in test set C, the recognition result corresponding to each response utterance in test set C is determined. The recognition results output by the response-mode recognition model for the response utterances in test set C are compared with the response modes corresponding to those utterances, and the recognition accuracy on test set C is determined. If the recognition accuracy is greater than the set threshold, it is determined that training of the response-mode recognition model is complete, and the trained response-mode recognition model is saved; if the recognition accuracy is not greater than the set threshold, the training set and the test set are reselected and the response-mode recognition model continues to be trained, until its recognition accuracy on the response utterances in the test set is greater than the set threshold.
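A hedged end-to-end sketch of this train, test and accept loop follows, assuming scikit-learn's SVC as the supervised response-mode model, 384-dimensional utterance vectors as input, labels 1/0 for the formal/informal response mode, and an illustrative threshold value and file name (the patent does not fix either).

```python
# Illustrative training / evaluation loop for the response-mode model: train an
# SVM on the training-set vectors, measure accuracy on the held-out test set,
# and keep the model only if the accuracy exceeds the preset threshold.
# The threshold of 0.85 and the model path are assumptions for illustration.
import joblib
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

def train_response_mode_model(X_train, y_train, X_test, y_test,
                              threshold=0.85,
                              model_path="response_mode_svm.joblib"):
    model = make_pipeline(StandardScaler(),
                          SVC(kernel="rbf"))       # Gaussian RBF kernel performed best
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > threshold:
        joblib.dump(model, model_path)             # training considered complete
        return model, acc
    return None, acc                                # below threshold: retrain with new sets
```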
Fig. 4 shows the overall flow of the method by which the embodiment of the present invention obtains the response-mode recognition model:
Step 401: determining a training set containing multiple response utterances and a test set containing multiple response utterances, wherein the response utterances in the training set are different from those in the test set;
(the following steps 402 and 403 are performed for each response utterance in the training set)
Step 402: using the feature extraction tool, extract the speech features of the response utterance;
Step 403: input the extracted speech features, together with the response mode corresponding to the response utterance, into the response-mode recognition model for training;
(the following steps 404 and 405 are performed for each response utterance in the test set)
Step 404: using the feature extraction tool, extract the speech features of the response utterance;
Step 405: input the extracted speech features into the response-mode recognition model for recognition;
Step 406: determine the recognition result of each response utterance in the test set;
Step 407: compare the recognition result of each response utterance in the test set with the response mode corresponding to that utterance, and determine the recognition accuracy of the recognition results for the test set;
Step 408: judge whether the recognition accuracy is greater than the set threshold; if so, perform step 409, and if not, return to step 401;
Step 409: determine that training of the response-mode recognition model is complete, and save the trained response-mode recognition model.
For the two-class problem of recognizing the response mode, the embodiment of the present invention employs a support vector machine (SVM) classifier, which is suitable for small amounts of data, as the response-mode recognition model, and compares the following kernel functions: the linear kernel, the polynomial kernel, the Gaussian radial basis function kernel and the arctangent kernel.
Based on the training sets shown in Table 4, the embodiment of the present invention tests the linear kernel, the polynomial kernel, the Gaussian radial basis function kernel and the arctangent kernel respectively; the accuracy of the obtained recognition results is shown in Fig. 5A. When the SVM kernel is the linear kernel, the recognition accuracy corresponding to training set A is 80.30% and to training set B is 81.02%; with the polynomial kernel and d=2, the accuracy for training set A is 77.95% and for training set B is 79.25%; with the polynomial kernel and d=3, the accuracy for training set A is 76.17% and for training set B is 81.13%; with the polynomial kernel and d=4, the accuracy for training set A is 63.79% and for training set B is 63.94%; with the Gaussian radial basis function kernel, the accuracy for training set A is 90.71% and for training set B is 91.62%; and with the arctangent kernel, the accuracy for training set A is 84.45% and for training set B is 89.56%.
Furthermore, the performance comparison of the SVM model with the linear kernel, the polynomial kernel, the Gaussian radial basis function kernel and the arctangent kernel is shown in Fig. 5B.
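A sketch of how such a kernel comparison could be run on one train/test split with scikit-learn is shown below; the sigmoid (tanh) kernel is used as a stand-in for the arctangent kernel, which is an assumption rather than an exact reproduction of the experiment behind Figs. 5A and 5B.

```python
# Illustrative comparison of SVM kernels on a fixed train/test split, mirroring
# the experiment behind Fig. 5A/5B. sklearn's "sigmoid" (tanh) kernel stands in
# for the arctangent kernel mentioned in the text; this substitution is an
# assumption, not part of the patent.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

def compare_kernels(X_train, y_train, X_test, y_test):
    candidates = {
        "linear":           SVC(kernel="linear"),
        "polynomial (d=2)": SVC(kernel="poly", degree=2),
        "polynomial (d=3)": SVC(kernel="poly", degree=3),
        "polynomial (d=4)": SVC(kernel="poly", degree=4),
        "Gaussian RBF":     SVC(kernel="rbf"),
        "sigmoid (tanh)":   SVC(kernel="sigmoid"),
    }
    results = {}
    for name, svc in candidates.items():
        model = make_pipeline(StandardScaler(), svc)   # scale features, then fit the SVM
        model.fit(X_train, y_train)
        results[name] = accuracy_score(y_test, model.predict(X_test))
    return results   # e.g. {"Gaussian RBF": 0.91, ...}
```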
Based on the same inventive concept, an embodiment of the present invention also provides a device for recognizing response speech. Since the principle by which the device solves the problem is similar to the response-mode recognition method of the embodiment of the present invention, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again.
As shown in Fig. 6, the device for recognizing response speech according to an embodiment of the present invention includes:
an acquisition module 601, configured to acquire response speech to be recognized;
a recognition module 602, configured to determine the response mode corresponding to the response speech to be recognized by using a response-mode recognition model, wherein the response-mode recognition model is a supervised machine learning model;
a judgment module 603, configured to: if the response mode is the formal response mode, input the response speech to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information; and if the response mode is the informal response mode, input the response speech to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response speech to be recognized and outputs the corresponding text information; wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
Optionally, the recognition module 602 is specifically configured to:
input the speech features extracted from the response speech to be recognized into the response-mode recognition model, and obtain the response mode, output by the response-mode recognition model, that corresponds to the response speech to be recognized.
Optionally, the speech features include frame-level features, segment-level features and utterance-level features;
the recognition module 602 is specifically configured to:
extract the speech features from the response speech in the following manner:
using a feature extraction tool, extract the frame-level features of the response speech to be recognized according to a preset frame length and frame shift; apply smoothing filtering to the frame-level features and apply a difference operation to the smoothed frame-level features, to determine the segment-level features of the response speech to be recognized; and analyze the segment-level features according to preset statistical parameters, to determine the utterance-level features of the response speech to be recognized.
Optionally, the acquisition module 601 is further configured to:
obtain the response-mode recognition model in the following manner:
determine a training set containing multiple response utterances and a test set containing multiple response utterances, wherein the response utterances in the training set are different from those in the test set; for each response utterance in the training set, input the speech features extracted from that utterance into the response-mode recognition model before training, for training; for each response utterance in the test set, input the speech features extracted from that utterance into the trained response-mode recognition model, and obtain the response mode, output by the response-mode recognition model, that corresponds to that utterance; determine the recognition accuracy of the trained response-mode recognition model according to the response modes output by the trained model for the response utterances in the test set, and if the recognition accuracy is greater than a set threshold, determine that training of the response-mode recognition model is complete and save the trained response-mode recognition model.
Optionally, the response-mode recognition model is a support vector machine (SVM) model.
The present invention is described with reference to flow charts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or another programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once they learn the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to encompass these changes and modifications.

Claims (10)

1. A method for recognizing response speech, characterized in that the method comprises:
acquiring response speech to be recognized;
determining the response mode corresponding to the response speech to be recognized by using a response-mode recognition model, wherein the response-mode recognition model is a supervised machine learning model;
if the response mode is the formal response mode, inputting the response speech to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response speech to be recognized and outputs the text information corresponding to the response speech to be recognized;
if the response mode is the informal response mode, inputting the response speech to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response speech to be recognized and outputs the text information corresponding to the response speech to be recognized;
wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
2. The method according to claim 1, characterized in that determining the response mode corresponding to the response speech to be recognized by using the response-mode recognition model specifically comprises:
inputting the speech features extracted from the response speech to be recognized into the response-mode recognition model;
obtaining the response mode, output by the response-mode recognition model, that corresponds to the response speech to be recognized.
3. The method according to claim 2, characterized in that the speech features include frame-level features, segment-level features and utterance-level features;
the speech features are extracted from the response speech in the following manner:
using a feature extraction tool, extracting the frame-level features of the response speech to be recognized according to a preset frame length and frame shift;
applying smoothing filtering to the frame-level features and applying a difference operation to the smoothed frame-level features, to determine the segment-level features of the response speech to be recognized;
analyzing the segment-level features according to preset statistical parameters, to determine the utterance-level features of the response speech to be recognized.
4. The method according to claim 1, characterized in that the response-mode recognition model is obtained in the following manner:
determining a training set containing multiple response utterances and a test set containing multiple response utterances, wherein the response utterances in the training set are different from those in the test set;
for each response utterance in the training set, inputting the speech features extracted from that utterance into the response-mode recognition model before training, for training;
for each response utterance in the test set, inputting the speech features extracted from that utterance into the trained response-mode recognition model, and obtaining the response mode, output by the trained response-mode recognition model, that corresponds to that utterance;
determining the recognition accuracy of the trained response-mode recognition model according to the response modes output by the trained model for the response utterances in the test set, and if the recognition accuracy is greater than a set threshold, determining that training of the response-mode recognition model is complete and saving the trained response-mode recognition model.
5. The method according to any one of claims 1 to 4, characterized in that the response-mode recognition model is a support vector machine (SVM) model.
6. A device for recognizing response speech, characterized in that it comprises:
an acquisition module, configured to acquire response speech to be recognized;
a recognition module, configured to determine the response mode corresponding to the response speech to be recognized by using a response-mode recognition model, wherein the response-mode recognition model is a supervised machine learning model;
a judgment module, configured to: if the response mode is the formal response mode, input the response speech to be recognized into a first speech recognition system, so that the first speech recognition system recognizes the response speech to be recognized and outputs the text information corresponding to the response speech to be recognized; and if the response mode is the informal response mode, input the response speech to be recognized into a second speech recognition system, so that the second speech recognition system recognizes the response speech to be recognized and outputs the text information corresponding to the response speech to be recognized; wherein the first speech recognition system and the second speech recognition system are configured with different parameters.
7. The device according to claim 6, characterized in that the recognition module is specifically configured to:
input the speech features extracted from the response speech to be recognized into the response-mode recognition model, and obtain the response mode, output by the response-mode recognition model, that corresponds to the response speech to be recognized.
8. The device of claim 7, wherein the voice features include frame-level features, segment-level features, and utterance-level features; and
the recognition module is specifically configured to extract the voice features from the answer voice in the following manner:
using a feature extraction tool, extracting the frame-level features of the answer voice to be recognized according to a preset frame length and frame shift; performing smoothing filtering on the frame-level features and performing a difference operation on the smoothed frame-level features to determine the segment-level features of the answer voice to be recognized; and performing statistical analysis on the segment-level features according to preset statistical parameters to determine the utterance-level features of the answer voice to be recognized.
9. The device of claim 6, wherein the acquisition module is further configured to obtain the answer mode recognition model in the following manner:
determining a training set comprising a plurality of answer voices and a test set comprising a plurality of answer voices, wherein the answer voices in the training set are different from the answer voices in the test set; for each answer voice in the training set, inputting the voice features extracted from the answer voice into the answer mode recognition model before training, so as to train the model; for each answer voice in the test set, inputting the voice features extracted from the answer voice into the trained answer mode recognition model, and obtaining the answer mode output by the trained answer mode recognition model for the answer voice; and determining the recognition accuracy of the trained answer mode recognition model according to the answer modes output for the answer voices in the test set, and if the recognition accuracy is greater than a set threshold, determining that training is complete and saving the trained answer mode recognition model.
10. The device of any one of claims 6 to 9, wherein the answer mode recognition model is a support vector machine (SVM) model.
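
The three-level feature extraction recited in claims 3 and 8 proceeds from frame-level features (extracted with a preset frame length and frame shift) through smoothing and a difference operation to segment-level features, and then through statistical analysis to an utterance-level vector. A minimal Python sketch of that pipeline follows; the claims name neither the feature extraction tool nor the preset statistical parameters, so the use of librosa MFCCs, the moving-average smoothing window, and the mean/std/max/min functionals below are illustrative assumptions.

import numpy as np
import librosa  # assumed feature extraction tool; the claims do not name one

def extract_features(wav_path, frame_length=0.025, frame_shift=0.010, smooth_win=5):
    """Illustrative frame-level -> segment-level -> utterance-level pipeline."""
    y, sr = librosa.load(wav_path, sr=None)

    # Frame-level features: MFCCs computed with a preset frame length and frame shift.
    frame_feats = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(frame_length * sr), hop_length=int(frame_shift * sr)
    ).T  # shape: (num_frames, 13)

    # Smoothing filter over time (moving average applied to each feature dimension).
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.apply_along_axis(
        lambda column: np.convolve(column, kernel, mode='same'), 0, frame_feats)

    # Difference operation on the smoothed frame-level features -> segment-level features.
    segment_feats = np.diff(smoothed, axis=0)  # shape: (num_frames - 1, 13)

    # Statistical analysis of the segment-level features -> utterance-level feature vector.
    stats = (np.mean, np.std, np.max, np.min)  # illustrative statistical parameters
    return np.concatenate([f(segment_feats, axis=0) for f in stats])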
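
The model construction in claims 4 and 9, combined with the SVM model of claims 5 and 10, amounts to fitting a classifier on voice features from the training set and keeping it only if its recognition accuracy on the disjoint test set exceeds the set threshold. A minimal sketch under the assumption that scikit-learn's SVC and joblib are acceptable stand-ins (the claims specify neither the SVM implementation, its kernel, nor the threshold value):

import joblib
import numpy as np
from sklearn.svm import SVC

def train_answer_mode_model(train_feats, train_labels, test_feats, test_labels,
                            accuracy_threshold=0.9, model_path='answer_mode_svm.pkl'):
    """Train an answer-mode SVM and save it only if it passes the accuracy threshold."""
    model = SVC(kernel='rbf')  # claims 5 and 10: an SVM model
    model.fit(np.asarray(train_feats), np.asarray(train_labels))

    # Run every test-set answer voice (its feature vector) through the trained model.
    predictions = model.predict(np.asarray(test_feats))
    accuracy = np.mean(predictions == np.asarray(test_labels))

    # Keep the model only when the recognition accuracy exceeds the set threshold.
    if accuracy > accuracy_threshold:
        joblib.dump(model, model_path)
        return model, accuracy
    return None, accuracy

The labels here would be the two answer modes, for example the strings 'formal' and 'informal'; the 0.9 threshold and the file name are placeholders.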
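
The device of claim 6 acquires an answer voice, classifies its answer mode, and routes it to one of two differently configured voice recognition systems. A sketch of that flow, in which formal_asr and informal_asr are hypothetical recognizer objects with a recognize method standing in for the first and second voice recognition systems:

def recognize_answer_voice(wav_path, answer_mode_model, formal_asr, informal_asr):
    """Route an answer voice to one of two ASR systems according to its answer mode."""
    features = extract_features(wav_path)              # feature sketch above
    mode = answer_mode_model.predict([features])[0]    # 'formal' or 'informal'

    # Formal answers go to the first voice recognition system, informal answers to the
    # second; the two systems are assumed to differ only in their configured parameters.
    if mode == 'formal':
        return formal_asr.recognize(wav_path)
    return informal_asr.recognize(wav_path)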
CN201611081923.XA 2016-11-30 2016-11-30 Method and device for recognizing answer voice Pending CN106531158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611081923.XA CN106531158A (en) 2016-11-30 2016-11-30 Method and device for recognizing answer voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611081923.XA CN106531158A (en) 2016-11-30 2016-11-30 Method and device for recognizing answer voice

Publications (1)

Publication Number Publication Date
CN106531158A true CN106531158A (en) 2017-03-22

Family

ID=58353579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611081923.XA Pending CN106531158A (en) 2016-11-30 2016-11-30 Method and device for recognizing answer voice

Country Status (1)

Country Link
CN (1) CN106531158A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101689367A (en) * 2007-05-31 2010-03-31 摩托罗拉公司 Method and system to configure audio processing paths for voice recognition
CN102237087A (en) * 2010-04-27 2011-11-09 中兴通讯股份有限公司 Voice control method and voice control device
CN102789779A (en) * 2012-07-12 2012-11-21 广东外语外贸大学 Speech recognition system and recognition method thereof
CN105493179A (en) * 2013-07-31 2016-04-13 微软技术许可有限责任公司 System with multiple simultaneous speech recognizers
CN103971700A (en) * 2013-08-01 2014-08-06 哈尔滨理工大学 Voice monitoring method and device
CN104517609A (en) * 2013-09-27 2015-04-15 华为技术有限公司 Voice recognition method and device
CN106153065A (en) * 2014-10-17 2016-11-23 现代自动车株式会社 The method of audio frequency and video navigator, vehicle and control audio frequency and video navigator
CN104464724A (en) * 2014-12-08 2015-03-25 南京邮电大学 Speaker recognition method for deliberately pretended voices
CN104464756A (en) * 2014-12-10 2015-03-25 黑龙江真美广播通讯器材有限公司 Small speaker emotion recognition system
CN105812535A (en) * 2014-12-29 2016-07-27 中兴通讯股份有限公司 Method of recording speech communication information and terminal
CN106033669A (en) * 2015-03-18 2016-10-19 展讯通信(上海)有限公司 Voice identification method and apparatus thereof
CN105529027A (en) * 2015-12-14 2016-04-27 百度在线网络技术(北京)有限公司 Voice identification method and apparatus
CN105719664A (en) * 2016-01-14 2016-06-29 盐城工学院 Likelihood probability fuzzy entropy based voice emotion automatic identification method at tension state

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tang Gang: "Research on Automatic Recognition of Pilot Answer Voice", Engineering Science and Technology II *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065076A (en) * 2018-09-05 2018-12-21 深圳追科技有限公司 Setting method, device, equipment and the storage medium of audio tag
CN109065076B (en) * 2018-09-05 2020-11-27 深圳追一科技有限公司 Audio label setting method, device, equipment and storage medium
CN109308783A (en) * 2018-11-21 2019-02-05 黑龙江大学 It knocks at the door automatic-answering back device anti-theft device and answer method

Similar Documents

Publication Publication Date Title
US10388279B2 (en) Voice interaction apparatus and voice interaction method
CN105374356B (en) Audio recognition method, speech assessment method, speech recognition system and speech assessment system
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN105632501B (en) A kind of automatic accent classification method and device based on depth learning technology
US9754580B2 (en) System and method for extracting and using prosody features
CN107972028B (en) Man-machine interaction method and device and electronic equipment
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
CN110457432A (en) Interview methods of marking, device, equipment and storage medium
CN104573462B (en) Fingerprint and voiceprint dual-authentication method
CN109545243A (en) Pronunciation quality evaluating method, device, electronic equipment and storage medium
CN107767881A (en) A kind of acquisition methods and device of the satisfaction of voice messaging
CN106875943A (en) A kind of speech recognition system for big data analysis
US20180122377A1 (en) Voice interaction apparatus and voice interaction method
Koolagudi et al. Two stage emotion recognition based on speaking rate
CN109377981B (en) Phoneme alignment method and device
CN103177733A (en) Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality
CN107886968B (en) Voice evaluation method and system
CN108899033B (en) Method and device for determining speaker characteristics
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN104008752A (en) Speech recognition device and method, and semiconductor integrated circuit device
Selvaraj et al. Human speech emotion recognition
CN106782503A (en) Automatic speech recognition method based on physiologic information in phonation
CN104299612B (en) The detection method and device of imitative sound similarity
CN106782517A (en) A kind of speech audio keyword filter method and device
CN110782902A (en) Audio data determination method, apparatus, device and medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170322