CN104766608A - Voice control method and voice control device

Info

Publication number
CN104766608A
CN104766608A
Authority
CN
China
Prior art keywords
key data
data
keyword
speech
key
Legal status
Withdrawn
Application number
CN201410007018.4A
Other languages
Chinese (zh)
Inventor
周宁 (Zhou Ning)
Current Assignee
Shenzhen ZTE Microelectronics Technology Co Ltd
Original Assignee
Shenzhen ZTE Microelectronics Technology Co Ltd
Priority date: 2014-01-07
Filing date: 2014-01-07
Publication date: 2015-07-08
Application filed by Shenzhen ZTE Microelectronics Technology Co Ltd filed Critical Shenzhen ZTE Microelectronics Technology Co Ltd
Priority to CN201410007018.4A
Priority to PCT/CN2014/078463 (published as WO2015103836A1)
Publication of CN104766608A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The invention discloses a voice control method and a voice control device. The method comprises: acquiring voice data after a trigger operation by a user; performing voice recognition on the voice data, performing keyword matching in a predetermined manner, and obtaining recognized keyword data from the voice data; and triggering the sending of a keyword control command, the recognized keyword data serving as the control command that responds to the user operation, so as to realize voice control.

Description

Voice control method and device
Technical field
The present invention relates to voice technology, and in particular to a voice control method and a voice control device.
Background technology
In the course of working out the technical solution of the embodiments of the present application, the inventor found at least the following technical problem in the prior art:
In visual communication scenarios, as speech recognition technology comes into large-scale commercial use, users increasingly want to issue control commands by voice instead of by manual operation. At present, control schemes in the visual communication field are single in function, are all built on simple manual operation, and offer no novel practical functions. The prior art provides no effective solution to this problem.
Summary of the invention
To solve the problems existing in the prior art, embodiments of the present invention provide a voice control method and a voice control device that send control commands by voice, which is convenient for the user to operate and frees the user's hands.
A voice control method, the method comprising:
obtaining speech data after a user operation is triggered;
performing speech recognition on the speech data, performing keyword matching in a predetermined manner, and obtaining recognized keyword data from the speech data;
triggering the sending of a keyword control command, and responding to the user operation with the recognized keyword data as the control command, so as to realize voice control.
Preferably, performing speech recognition on the speech data, performing keyword matching in a predetermined manner, and obtaining the recognized keyword data from the speech data comprises:
when keyword matching is performed in a predetermined manner based on hidden Markov model (HMM) modeling, extracting MFCC feature parameters as the acoustic feature parameters for speech recognition of the speech data, and using the recognition result as reference data for keyword matching to obtain the recognized keyword data.
Preferably, the method further comprises: after the recognized keyword data is obtained, performing keyword matching optimization in a predetermined manner based on a shortest distance.
Preferably, performing keyword matching optimization in a predetermined manner based on a shortest distance comprises:
establishing a keyword data speech library;
extracting MFCC feature parameters as the acoustic feature parameters of the recognized keyword data, and performing data clustering in the keyword data speech library using vector quantization (VQ) to obtain a representative vector of each class;
obtaining, according to the representative vector of each class, the shortest distance between the MFCC feature parameters of the recognized keyword data and the representative vector of each class;
when the shortest distance successfully matches an empirical threshold, obtaining the keyword data recognized after the keyword matching optimization.
Preferably, the method further comprises:
judging, by comparing energy information of the keyword data, whether the control command has finished executing; if so, ending the current keyword matching and performing speech recognition on the speech data again.
Preferably, the keyword data comprises at least one item of basic control command information among: incoming call, call out, answer, and hang up.
A voice control device, the device comprising:
a voice acquiring unit, configured to obtain speech data after a user operation is triggered;
a keyword recognition unit, configured to perform speech recognition on the speech data, perform keyword matching in a predetermined manner, and obtain recognized keyword data from the speech data;
a voice control unit, configured to trigger the sending of a keyword control command and respond to the user operation with the recognized keyword data as the control command, so as to realize voice control.
Preferably, the keyword recognition unit is further configured to, when keyword matching is performed in a predetermined manner based on hidden Markov model (HMM) modeling, extract MFCC feature parameters as the acoustic feature parameters for speech recognition of the speech data, and use the recognition result as reference data for keyword matching to obtain the recognized keyword data.
Preferably, the keyword recognition unit is further configured to, after the recognized keyword data is obtained, perform keyword matching optimization in a predetermined manner based on a shortest distance.
Preferably, the keyword recognition unit is further configured to, when keyword matching optimization is performed in a predetermined manner based on a shortest distance: establish a keyword data speech library; extract MFCC feature parameters as the acoustic feature parameters of the recognized keyword data, and perform data clustering in the keyword data speech library using vector quantization (VQ) to obtain a representative vector of each class; obtain, according to the representative vector of each class, the shortest distance between the MFCC feature parameters of the recognized keyword data and the representative vector of each class; and, when the shortest distance successfully matches an empirical threshold, obtain the keyword data recognized after the keyword matching optimization.
Preferably, the keyword recognition unit is further configured to judge, by comparing energy information of the keyword data, whether the control command has finished executing, and if so, to end the current keyword matching and perform speech recognition on the speech data again.
Preferably, the keyword data comprises at least one item of basic control command information among: incoming call, call out, answer, and hang up.
The method of the embodiments of the present invention comprises: obtaining speech data after a user operation is triggered; performing speech recognition on the speech data, performing keyword matching in a predetermined manner, and obtaining recognized keyword data from the speech data; and triggering the sending of a keyword control command, with the recognized keyword data responding to the user operation as the control command, so as to realize voice control. Because the recognized keyword data triggers the sending of the keyword control command that responds to the user operation, the automatic matching and sending of control commands in the embodiments of the present invention replaces the existing manual operation, which is convenient for the user and frees the user's hands.
Brief description of the drawings
Fig. 1 is a flowchart of the method of an embodiment of the present invention;
Fig. 2 is a structural diagram of the device of an embodiment of the present invention;
Fig. 3 is a flowchart of an application scenario of an embodiment of the present invention;
Fig. 4 is a schematic diagram of a vector quantization example of an embodiment of the present invention;
Figs. 5-7 are flowcharts of the operation of the basic modules of the device in an application scenario of an embodiment of the present invention.
Detailed description of the embodiments
The implementation of the technical solution is described in further detail below with reference to the accompanying drawings.
The solution of the embodiments of the present invention applies speech recognition technology to keyword recognition and thereby realizes voice control. It can be used in application scenarios such as visual communication systems and calls or short messages between terminal devices. An automatically matched control command is obtained by recognizing keywords in the speech data, replacing the current manual control; as an auxiliary means, the embodiments of the present invention allow the user to carry out various control operations in a more user-friendly way.
The voice control method of the embodiment of the present invention, as shown in Fig. 1, comprises:
Step 101: obtain speech data after a user operation is triggered.
Step 102: perform speech recognition on the speech data, perform keyword matching in a predetermined manner, and obtain recognized keyword data from the speech data.
Step 103: trigger the sending of a keyword control command, and respond to the user operation with the recognized keyword data as the control command, so as to realize voice control.
Here, the keyword data comprises at least one item of basic control command information among: incoming call, call out, answer, and hang up.
Here, in step 102 speech recognition is performed on the speech data and keyword matching is performed in a predetermined manner; if recognized keyword data is obtained from the speech data, step 103 can be performed; if there is no match and no recognized keyword data can be obtained, the speech data can be sent as ordinary data.
The voice control device of the embodiment of the present invention, as shown in Fig. 2, comprises:
a voice acquiring unit 11, configured to obtain speech data after a user operation is triggered; a keyword recognition unit 12, configured to perform speech recognition on the speech data, perform keyword matching in a predetermined manner, and obtain recognized keyword data from the speech data; and a voice control unit 13, configured to trigger the sending of a keyword control command and respond to the user operation with the recognized keyword data as the control command, so as to realize voice control.
The embodiments of the present invention can be used in application scenarios such as visual communication systems and calls or short messages between terminal devices; the visual communication scenario is described in detail below.
As shown in Fig. 3, in the visual communication application scenario, the embodiment of the present invention comprises the following steps:
Step 201: obtain speech data after the user triggers a visual communication operation.
Step 202: perform keyword matching and recognition on the speech data; if a match is found, perform step 203, otherwise perform step 204.
Step 203: respond to the user operation and send a keyword control command, realizing voice control of the visual communication operation.
Step 204: send the speech data as RTP packets.
It should be noted here that in the embodiment of the present invention a voice acquiring unit is embedded mainly before RTP voice packets are sent. When step 201 is performed, a voice signal can be collected and sampled by a voice input device such as a microphone; the pre-processing of step 202 is then performed, that is, keyword matching and recognition is performed by the keyword recognition unit. If keyword data is matched, the user operation is responded to; otherwise, this segment of speech data is packed and sent. In other words, as long as the recognition result of the speech data is not a control command matching a keyword, the speech data is simply sent; if the recognition result is a control command matching a keyword, the sending of the keyword control command is triggered, and this control command is used to control the visual communication operation.
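A minimal Python sketch of this pre-send dispatch, with hypothetical helper names (capture_frame, recognize_keyword, send_control_command, send_rtp_packet) standing in for the device's own functions:

    # Speech captured from the microphone is first passed through keyword
    # recognition; only non-command speech is packed and sent as ordinary RTP data.
    def process_outgoing_speech(capture_frame, recognize_keyword,
                                send_control_command, send_rtp_packet):
        speech = capture_frame()                # step 201: sampled voice data
        keyword = recognize_keyword(speech)     # step 202: keyword matching
        if keyword is not None:
            send_control_command(keyword)       # step 203: respond to the user operation
        else:
            send_rtp_packet(speech)             # step 204: ordinary RTP speech packet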
For the keyword recognition unit, keyword matching and recognition of the speech data can adopt either: 1) a first-level predetermined-manner algorithm alone; or 2) a combination of the first-level and second-level predetermined-manner algorithms, where option 2) applies matching optimization to option 1) so as to finally obtain a more accurate keyword and send it as the final control command. The first-level predetermined-manner algorithm performs modeling based on a hidden Markov model (HMM) to realize keyword recognition; the second-level predetermined-manner algorithm realizes keyword recognition by shortest-distance matching. Option 2), the combined algorithm, is described in detail below.
To further improve keyword recognition performance, the embodiment of the present invention can adopt a combined two-layer recognition flow. The first layer uses an existing HMM-based method for modeling: the acoustic feature parameters extracted from the speech data for speech recognition are Mel-frequency cepstral coefficient (MFCC) feature parameters, and the recognized keyword data serves as the first-layer reference. The second layer establishes a keyword data speech library, extracts the MFCC acoustic feature parameters of the keyword data recognized by the first layer, performs data clustering in the keyword data speech library using vector quantization (VQ), and obtains the representative vector of each class (also called a cell). The shortest distance between the MFCC feature parameters of the recognized keyword data and the representative vector of each class is then computed and compared with an empirically obtained threshold; if a predetermined criterion is met, the candidate is taken as the final recognition result, that is, a more accurate keyword is finally obtained and sent as the final control command.
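A minimal Python sketch of the second-layer check, in which hmm_recognize, mfcc and the codebooks dictionary are hypothetical stand-ins for the first-layer recognizer, the feature extractor and the keyword speech library:

    import numpy as np

    # The first-layer (HMM) result selects a candidate command; its MFCC frames are
    # compared against the VQ codebook (representative vectors) of that command, and
    # the candidate is accepted only if the shortest distance is below an empirical
    # threshold.
    def verify_keyword(speech, hmm_recognize, mfcc, codebooks, threshold):
        candidate = hmm_recognize(speech)          # first layer: HMM-based result
        if candidate not in codebooks:
            return None
        feats = mfcc(speech)                       # (n_frames, n_coeffs) MFCC matrix
        centroids = codebooks[candidate]           # representative vector of each class
        # distance from every frame to its nearest representative vector
        d = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
        shortest = d.min(axis=1).mean()            # average shortest distance
        return candidate if shortest < threshold else None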
The algorithms used by the keyword recognition unit are described as follows:
The implementation of a speech recognition algorithm generally includes: 1) receiving the voice signal; 2) parameter extraction; 3) modeling and statistical analysis; 4) decision logic; 5) outputting the recognition result.
In speech recognition, different parameters of the voice can be extracted in order to achieve the best recognition effect.
For 2), parameter extraction, the embodiment of the present invention extracts MFCC parameters. Applying this pre-processing to the received voice signal removes redundant information that is unimportant for speech recognition and extracts the information that is useful for it. The main steps are the existing ones: pre-emphasis, framing, windowing, fast Fourier transform, and triangular band-pass filtering.
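A simplified Python sketch of such an MFCC front end, using NumPy and SciPy; the frame length, hop and filter-bank size are typical values chosen for illustration, not taken from the disclosure:

    import numpy as np
    from scipy.fftpack import dct

    # Pre-emphasis, framing, windowing, FFT power spectrum, triangular mel filter
    # bank, log, DCT. Assumes the signal is at least one frame long.
    def mfcc(signal, fs=8000, frame_len=200, hop=80, n_filt=26, n_ceps=13, nfft=256):
        signal = np.asarray(signal, dtype=float)
        emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
        n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
        frames = emphasized[idx] * np.hamming(frame_len)                    # framing + window
        power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft               # power spectrum
        # triangular mel filter bank
        mel = np.linspace(0, 2595 * np.log10(1 + fs / 2 / 700), n_filt + 2)
        hz = 700 * (10 ** (mel / 2595) - 1)
        bins = np.floor((nfft + 1) * hz / fs).astype(int)
        fbank = np.zeros((n_filt, nfft // 2 + 1))
        for m in range(1, n_filt + 1):
            l, c, r = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        feat = np.log(np.dot(power, fbank.T) + 1e-10)                       # log mel energies
        return dct(feat, type=2, axis=1, norm='ortho')[:, :n_ceps]          # cepstral coeffs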
For 3), modeling and statistical analysis, after the parameters are extracted, these parameter features are analysed by modeling to obtain the recognition result. When the present embodiment performs modeling analysis based on the first-layer reference, it adopts a parametric representation, the hidden Markov model (HMM), a probability model describing the statistical characteristics of a stochastic process. In HMM-based speech recognition, each word generates a corresponding HMM and each observation sequence consists of the speech of one word; a word is recognized by evaluating the HMMs and selecting the one whose pronunciation is most likely to have generated the observation sequence.
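An illustrative Python sketch of this per-keyword HMM scoring, assuming the third-party hmmlearn package is available; the number of states and training settings are arbitrary example choices:

    import numpy as np
    from hmmlearn import hmm   # third-party package, assumed available

    def train_keyword_models(training_data, n_states=5):
        """training_data: dict mapping keyword -> list of MFCC arrays (frames x coeffs)."""
        models = {}
        for word, samples in training_data.items():
            X = np.concatenate(samples)
            lengths = [len(s) for s in samples]
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
            m.fit(X, lengths)            # one HMM per keyword
            models[word] = m
        return models

    def hmm_recognize(models, feats):
        """Return the keyword whose HMM best explains the MFCC sequence `feats`."""
        scores = {word: m.score(feats) for word, m in models.items()}
        return max(scores, key=scores.get)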
The vector quantization of the second layer is a further optimization of the first-layer reference and is also an existing modeling and statistical analysis technique. Its basic principle is that several scalar values (for example, the feature vector extracted from one frame of speech data) are grouped into a vector and quantized as a whole in a multidimensional space, so that data can be compressed with little loss of information. Here the scalars can be understood as the extracted parameters.
A concrete example of the first-layer reference and the second-layer vector quantization is shown in Fig. 4. Suppose the recognition result obtained by the first layer is a1; the corresponding control command b1 can be looked up in the keyword speech library using a1, and this control command used for voice control also has its own MFCC parameters. Vector quantization (VQ) is then used to cluster the recognition result a1 and the control command b1, giving a vector a corresponding to a1 and a vector b corresponding to b1. The similarity of vector a and vector b, i.e. the shortest distance, is then compared: vector b is subtracted from vector a to obtain vector c, and the closer the amplitude of vector c is to 0 (or its phase to 180°), the more similar the recognition result a1 is to the control command b1.
The empirical threshold involved in the second-layer vector quantization expresses the degree of similarity between the recognition result a1 and the control command b1 and depends on the vector c obtained; its value has to be determined by repeated experiments.
The predetermined criterion involved in the second-layer vector quantization can also be understood as the decision logic. It is the core of the whole speech recognition system and mainly comprises distance measures and expert knowledge (such as word-formation rules, syntax rules and semantic rules); it computes the similarity between the input features and the patterns in the pattern library (such as matching distance or likelihood probability), judges the semantic information of the input speech, and produces the recognition result.
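A possible Python sketch of building such a VQ codebook with k-means clustering from SciPy; reference_mfccs is a hypothetical list of MFCC matrices for one control command in the keyword speech library:

    import numpy as np
    from scipy.cluster.vq import kmeans2   # SciPy vector-quantization helpers

    # MFCC frames of the reference recordings of one command are clustered with VQ,
    # and the cluster centroids serve as the representative vectors of each class.
    def build_codebook(reference_mfccs, n_classes=8):
        frames = np.vstack(reference_mfccs)                 # pool all MFCC frames
        centroids, _ = kmeans2(frames, n_classes, minit="points")
        return centroids                                    # representative vector per class

The empirical threshold is then obtained offline, by measuring the shortest distances of known-correct and known-incorrect recognitions against this codebook and choosing a separating value, in line with the repeated experiments described above.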
During the keyword recognition process, as soon as a match with a set keyword is detected, the control action corresponding to the user operation is triggered.
In a preferred implementation of the embodiment of the present invention, after the keyword control command has been triggered and has finished executing, it must be decided how keywords are recognized next. For this purpose, in the visual communication of the embodiment of the present invention, the voice signal is collected by the voice input device and recognized by the keyword recognition unit. When keyword data is recognized, the energy information of the recognized keyword can be computed and compared with the energy information of the 20 frames before and after it; if the average energy of the recognized keyword is twice the average energy of those 20 surrounding frames, it is determined that keyword data requiring control has indeed been obtained and can be used as the control command, and control operations such as incoming call, call out, answer and hang up can then be performed according to the set keyword.
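A small Python sketch of this energy confirmation, assuming the speech has already been split into frames; the factor of 2 and the 20-frame context follow the description above, while the helper names are illustrative only:

    import numpy as np

    def frame_energy(frames):
        return np.mean(frames.astype(np.float64) ** 2, axis=1)   # energy per frame

    # The detected keyword segment is accepted as a control command only if its
    # average frame energy is at least twice that of the 20 frames before and after it.
    def confirm_keyword(frames, start, end, context=20, factor=2.0):
        """frames: (n_frames, frame_len) array; [start, end) is the keyword span."""
        e = frame_energy(frames)
        keyword_energy = e[start:end].mean()
        before = e[max(0, start - context):start]
        after = e[end:end + context]
        context_energy = np.concatenate([before, after]).mean()
        return keyword_energy >= factor * context_energy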
The device of the embodiment of the present invention adds a voice acquiring unit and a keyword recognition unit on top of the existing modules. Keyword matching and recognition is realized mainly by the keyword recognition unit, and the recognized keyword is responded to as a control command.
For the visual communication application scenario, the existing modules comprise: a proxy module 21, a SIP protocol stack module 22, a component communication function library integration module 23, a signaling control scheduling module 24, a media processing scheduling module 25 and a media video display module 26. The proxy module 21 comprises a message agent 211 and a friend proxy module 212; the SIP protocol stack module 22 is responsible for receiving and sending interaction messages with the SIP server; the component communication function library integration module 23 contains the required functions and provides function calls; the signaling control scheduling module 24 processes call control commands and manages the audio and video input and output devices; the media processing scheduling module 25 is mainly used to receive the collected audio and video data; and the media video display module 26 is mainly used to display the collected video data.
Figs. 5-7 are, respectively, flowcharts of the operation of the signaling control scheduling module 24, the media processing scheduling module 25 and the media video display module 26. The proxy module 21, the SIP protocol stack module 22 and the component communication function library integration module 23 are mainly used for sending and exchanging speech data; since the embodiment of the present invention focuses on the processing of speech data, these modules have little bearing on it and are not emphasized here.
Fig. 5 is a flowchart of the operation of the signaling control scheduling module 24, which processes call control commands. Its flow comprises the following steps:
Step 401: receive a control command.
Step 402: according to the voice control command table, call the software API of the corresponding operation, so that all manual operations can be replaced.
Step 403: after the software API of the corresponding operation is called, perform the corresponding operation: open devices, handle specific transactions, or close devices.
Here, 1) various devices can be opened, for example commands to make a call (enter the dialing interface) or to open the camera or microphone; 2) specific transactions can be handled, for example calling a particular contact or switching windows, i.e. operations that would otherwise require manual action; 3) various devices can be closed, for example shutting down, going to standby, or closing the camera or microphone.
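A minimal Python sketch of such a voice-command table, with hypothetical handler callables standing in for the software APIs of the corresponding operations:

    # Step 402: the table maps each recognized keyword to the software API that
    # replaces the corresponding manual operation; step 403 then invokes it.
    def build_command_table(place_call, answer_call, hang_up, open_camera, close_camera):
        return {
            "call out": place_call,        # e.g. enter the dialing interface
            "answer": answer_call,
            "hang up": hang_up,
            "open camera": open_camera,
            "close camera": close_camera,
        }

    def dispatch(command_table, keyword, *args):
        handler = command_table.get(keyword)
        if handler is None:
            return False                   # not a control command; ignore here
        handler(*args)                     # perform the corresponding operation
        return True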
Fig. 6 is a flowchart of the operation of the media processing scheduling module 25, which collects audio data that has been judged not to be a control command. Its flow comprises the following steps:
Step 501: collect audio data and video data respectively.
Step 502: encode the collected audio data and video data respectively using open-source software.
Step 503: multiplex the encoded streams, integrating the audio data and video data into audio-video data.
Step 504: transmit this audio-video data over the network.
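An abstract Python sketch of this send-side pipeline; every callable is a hypothetical placeholder, since the disclosure names only "open-source software" rather than specific encoder or transport APIs:

    # Steps 501-504: capture audio and video, encode both, multiplex, and send.
    def media_send_loop(capture_audio, capture_video, encode_audio, encode_video,
                        mux, send, running):
        while running():
            audio = capture_audio()                  # step 501: collect audio data
            video = capture_video()                  #           and video data
            packets = mux(encode_audio(audio),       # step 502: encode both streams
                          encode_video(video))       # step 503: multiplex into A/V data
            send(packets)                            # step 504: transmit over the network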
Fig. 7 is a flowchart of the operation of the media video display module 26, which displays the collected video data. On the one hand, it decodes the video data compressed by the media processing scheduling module 25 and delivers it for display; on the other hand, it decodes the audio data with the third-party open-source software FFmpeg (a free cross-platform audio and video streaming solution providing recording, conversion and streaming of audio and video) and delivers it to the sound card so that the audio corresponding to the video data is played back. Its flow comprises the following steps:
Step 601: receive the multiplexed audio-video data.
Step 602: demultiplex, parsing the audio-video data into audio data and video data.
Step 603: obtain audio packets and video packets respectively.
Step 604: decode the audio packets and video packets respectively: decode the audio packets with open-source software to obtain pulse-code modulation (PCM) data, and decode the video packets with the in-house hardware decoder chip to obtain pictures.
Step 605: send the decoded audio data to the sound card for output, and send the decoded video data to the hardware display buffer to wait for hardware display.
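An abstract Python sketch of the receive side, mirroring steps 601-605; again every callable is a hypothetical placeholder for the real demuxer, decoders and output devices:

    # Received stream -> demultiplex -> audio decoded to PCM for the sound card,
    # video decoded by the hardware decoder into the display buffer.
    def media_receive_loop(receive, demux, decode_audio_to_pcm, decode_video_hw,
                           play_pcm, push_to_display, running):
        while running():
            stream = receive()                           # step 601: muxed A/V data
            audio_pkt, video_pkt = demux(stream)         # steps 602-603: split into packets
            play_pcm(decode_audio_to_pcm(audio_pkt))     # steps 604-605: PCM to sound card
            push_to_display(decode_video_hw(video_pkt))  # decoded frames to display buffer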
The basic visual communication modules further include a speech recognition keyword information module and a keyword processing module. By integrating these two modules into the basic modules, certain operations of the visual communication can be voice-controlled, such as incoming call, call out, answer and hang up. In the present invention, several basic commands are first set as keywords: incoming call, call out, answer and hang up. Then, during the call flow, the speech passes through the speech recognition keyword information module; when keyword information is recognized, the keyword processing module judges, by a certain criterion, whether the recognized keyword information needs further processing. If it does, the voice-controlled operation is fed back; otherwise, no control operation is performed. By this method, speech recognition technology can be applied accurately to visual communication, realizing voice-controlled operations such as incoming call, call out, answer and hang up.
In summary, the embodiments of the present invention are mainly applicable to application scenarios involving speech recognition and input, such as visual communication, dialing a phone, and sending short messages on a mobile phone. A voice acquiring module and a keyword recognition unit are added on the basis of the existing modules described above. After voice input, the required keyword is detected mainly by this keyword recognition unit, and the command triggered by this keyword then controls the operations in the visual communication, such as call out, hang up and incoming call. The keyword recognition unit adopts two-layer speech recognition control: one layer is the HMM-based speech recognition method and the other is the shortest-distance matching method. Through the two layers of recognition control, accurate keyword data is obtained; at the same time, the energy of the keyword data is compared with that of the 20 frames before and after it, so that matching determines whether the keyword data is needed.
If the integrated modules described in the embodiments of the present invention are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of the embodiments of the present invention that in essence contributes to the prior art may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present invention also provides a computer storage medium storing a computer program, the computer program being used to perform the voice control method of the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the protection scope of the present invention.

Claims (12)

1. A voice control method, characterized in that the method comprises:
obtaining speech data after a user operation is triggered;
performing speech recognition on the speech data, performing keyword matching in a predetermined manner, and obtaining recognized keyword data from the speech data;
triggering the sending of a keyword control command, and responding to the user operation with the recognized keyword data as the control command, so as to realize voice control.
2. The method according to claim 1, characterized in that performing speech recognition on the speech data, performing keyword matching in a predetermined manner, and obtaining the recognized keyword data from the speech data comprises:
when keyword matching is performed in a predetermined manner based on hidden Markov model (HMM) modeling, extracting MFCC feature parameters as the acoustic feature parameters for speech recognition of the speech data, and using the recognition result as reference data for keyword matching to obtain the recognized keyword data.
3. The method according to claim 2, characterized in that the method further comprises: after the recognized keyword data is obtained, performing keyword matching optimization in a predetermined manner based on a shortest distance.
4. The method according to claim 3, characterized in that performing keyword matching optimization in a predetermined manner based on a shortest distance comprises:
establishing a keyword data speech library;
extracting MFCC feature parameters as the acoustic feature parameters of the recognized keyword data, and performing data clustering in the keyword data speech library using vector quantization (VQ) to obtain a representative vector of each class;
obtaining, according to the representative vector of each class, the shortest distance between the MFCC feature parameters of the recognized keyword data and the representative vector of each class;
when the shortest distance successfully matches an empirical threshold, obtaining the keyword data recognized after the keyword matching optimization.
5. The method according to any one of claims 2 to 4, characterized in that the method further comprises:
judging, by comparing energy information of the keyword data, whether the control command has finished executing; if so, ending the current keyword matching and performing speech recognition on the speech data again.
6. The method according to claim 1, characterized in that the keyword data comprises at least one item of basic control command information among: incoming call, call out, answer, and hang up.
7. A voice control device, characterized in that the device comprises:
a voice acquiring unit, configured to obtain speech data after a user operation is triggered;
a keyword recognition unit, configured to perform speech recognition on the speech data, perform keyword matching in a predetermined manner, and obtain recognized keyword data from the speech data;
a voice control unit, configured to trigger the sending of a keyword control command and respond to the user operation with the recognized keyword data as the control command, so as to realize voice control.
8. The device according to claim 7, characterized in that the keyword recognition unit is further configured to, when keyword matching is performed in a predetermined manner based on hidden Markov model (HMM) modeling, extract MFCC feature parameters as the acoustic feature parameters for speech recognition of the speech data, and use the recognition result as reference data for keyword matching to obtain the recognized keyword data.
9. The device according to claim 8, characterized in that the keyword recognition unit is further configured to, after the recognized keyword data is obtained, perform keyword matching optimization in a predetermined manner based on a shortest distance.
10. The device according to claim 9, characterized in that the keyword recognition unit is further configured to, when keyword matching optimization is performed in a predetermined manner based on a shortest distance: establish a keyword data speech library; extract MFCC feature parameters as the acoustic feature parameters of the recognized keyword data, and perform data clustering in the keyword data speech library using vector quantization (VQ) to obtain a representative vector of each class; obtain, according to the representative vector of each class, the shortest distance between the MFCC feature parameters of the recognized keyword data and the representative vector of each class; and, when the shortest distance successfully matches an empirical threshold, obtain the keyword data recognized after the keyword matching optimization.
11. The device according to any one of claims 8 to 10, characterized in that the keyword recognition unit is further configured to judge, by comparing energy information of the keyword data, whether the control command has finished executing, and if so, to end the current keyword matching and perform speech recognition on the speech data again.
12. The device according to claim 7, characterized in that the keyword data comprises at least one item of basic control command information among: incoming call, call out, answer, and hang up.
CN201410007018.4A 2014-01-07 2014-01-07 Voice control method and voice control device Withdrawn CN104766608A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410007018.4A CN104766608A (en) 2014-01-07 2014-01-07 Voice control method and voice control device
PCT/CN2014/078463 WO2015103836A1 (en) 2014-01-07 2014-05-26 Voice control method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410007018.4A CN104766608A (en) 2014-01-07 2014-01-07 Voice control method and voice control device

Publications (1)

Publication Number Publication Date
CN104766608A true CN104766608A (en) 2015-07-08

Family

ID=53523498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410007018.4A Withdrawn CN104766608A (en) 2014-01-07 2014-01-07 Voice control method and voice control device

Country Status (2)

Country Link
CN (1) CN104766608A (en)
WO (1) WO2015103836A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139858A (en) * 2015-07-27 2015-12-09 联想(北京)有限公司 Information processing method and electronic equipment
CN105898496A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 HLS stream hardware decoding method based on Android device and device
CN106488286A (en) * 2015-08-28 2017-03-08 上海欢众信息科技有限公司 High in the clouds Information Collection System
CN106686445A (en) * 2015-11-05 2017-05-17 北京中广上洋科技股份有限公司 Method of carrying out on-demand jump on multimedia file
CN107249116A (en) * 2017-08-09 2017-10-13 成都全云科技有限公司 Noise echo eliminating device based on video conference
CN108242241A (en) * 2016-12-23 2018-07-03 中国农业大学 A kind of pure voice rapid screening method and its device
CN108702411A (en) * 2017-03-21 2018-10-23 华为技术有限公司 A kind of method and device of control call
WO2018219023A1 (en) * 2017-05-27 2018-12-06 腾讯科技(深圳)有限公司 Speech keyword identification method and device, terminal and server
CN109003604A (en) * 2018-06-20 2018-12-14 恒玄科技(上海)有限公司 A kind of audio recognition method that realizing low-power consumption standby and system
CN109887512A (en) * 2019-03-15 2019-06-14 深圳市奥迪信科技有限公司 Wisdom hotel guest room control method and system
CN110174924A (en) * 2018-09-30 2019-08-27 广东小天才科技有限公司 A kind of making friends method and wearable device based on wearable device
CN112086091A (en) * 2020-09-18 2020-12-15 南京孝德智能科技有限公司 Intelligent endowment service system and method
CN112687269A (en) * 2020-12-18 2021-04-20 山东盛帆蓝海电气有限公司 Building management robot voice automatic identification method and system
CN113709545A (en) * 2021-04-13 2021-11-26 腾讯科技(深圳)有限公司 Video processing method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2136559Y (en) * 1992-07-09 1993-06-16 陈康 Voice-operated automatic dialing device for telephone
CN1615508A (en) * 2001-12-17 2005-05-11 旭化成株式会社 Speech recognition method, remote controller, information terminal, telephone communication terminal and speech recognizer
CN101345668A (en) * 2008-08-22 2009-01-14 中兴通讯股份有限公司 Control method and apparatus for monitoring equipment
CN101516005A (en) * 2008-02-23 2009-08-26 华为技术有限公司 Speech recognition channel selecting system, method and channel switching device
CN101923857A (en) * 2009-06-17 2010-12-22 复旦大学 Extensible audio recognition method based on man-machine interaction
CN102568478A (en) * 2012-02-07 2012-07-11 合一网络技术(北京)有限公司 Video play control method and system based on voice recognition
CN102938811A (en) * 2012-10-15 2013-02-20 华南理工大学 Household mobile phone communication system based on voice recognition

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6141641A (en) * 1998-04-15 2000-10-31 Microsoft Corporation Dynamically configurable acoustic model for speech recognition system
CN1190772C (en) * 2002-09-30 2005-02-23 中国科学院声学研究所 Voice identifying system and compression method of characteristic vector set for voice identifying system
US7027979B2 (en) * 2003-01-14 2006-04-11 Motorola, Inc. Method and apparatus for speech reconstruction within a distributed speech recognition system
CN101154379B (en) * 2006-09-27 2011-11-23 夏普株式会社 Method and device for locating keywords in voice and voice recognition system
CN101673112A (en) * 2009-09-17 2010-03-17 李华东 Intelligent home voice controller
CN103366743A (en) * 2012-03-30 2013-10-23 北京千橡网景科技发展有限公司 Voice-command operation method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2136559Y (en) * 1992-07-09 1993-06-16 陈康 Voice-operated automatic dialing device for telephone
CN1615508A (en) * 2001-12-17 2005-05-11 旭化成株式会社 Speech recognition method, remote controller, information terminal, telephone communication terminal and speech recognizer
CN101516005A (en) * 2008-02-23 2009-08-26 华为技术有限公司 Speech recognition channel selecting system, method and channel switching device
CN101345668A (en) * 2008-08-22 2009-01-14 中兴通讯股份有限公司 Control method and apparatus for monitoring equipment
CN101923857A (en) * 2009-06-17 2010-12-22 复旦大学 Extensible audio recognition method based on man-machine interaction
CN102568478A (en) * 2012-02-07 2012-07-11 合一网络技术(北京)有限公司 Video play control method and system based on voice recognition
CN102938811A (en) * 2012-10-15 2013-02-20 华南理工大学 Household mobile phone communication system based on voice recognition

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139858A (en) * 2015-07-27 2015-12-09 联想(北京)有限公司 Information processing method and electronic equipment
CN105139858B (en) * 2015-07-27 2019-07-26 联想(北京)有限公司 A kind of information processing method and electronic equipment
CN106488286A (en) * 2015-08-28 2017-03-08 上海欢众信息科技有限公司 High in the clouds Information Collection System
CN106686445A (en) * 2015-11-05 2017-05-17 北京中广上洋科技股份有限公司 Method of carrying out on-demand jump on multimedia file
CN106686445B (en) * 2015-11-05 2019-06-11 北京中广上洋科技股份有限公司 The method that multimedia file is jumped on demand
CN105898496A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 HLS stream hardware decoding method based on Android device and device
CN108242241A (en) * 2016-12-23 2018-07-03 中国农业大学 A kind of pure voice rapid screening method and its device
CN108702411A (en) * 2017-03-21 2018-10-23 华为技术有限公司 A kind of method and device of control call
US10938978B2 (en) 2017-03-21 2021-03-02 Huawei Technologies Co., Ltd. Call control method and apparatus
WO2018219023A1 (en) * 2017-05-27 2018-12-06 腾讯科技(深圳)有限公司 Speech keyword identification method and device, terminal and server
CN107249116B (en) * 2017-08-09 2020-05-05 成都全云科技有限公司 Noise echo eliminating device based on video conference
CN107249116A (en) * 2017-08-09 2017-10-13 成都全云科技有限公司 Noise echo eliminating device based on video conference
CN109003604A (en) * 2018-06-20 2018-12-14 恒玄科技(上海)有限公司 A kind of audio recognition method that realizing low-power consumption standby and system
CN110174924A (en) * 2018-09-30 2019-08-27 广东小天才科技有限公司 A kind of making friends method and wearable device based on wearable device
CN110174924B (en) * 2018-09-30 2021-03-30 广东小天才科技有限公司 Friend making method based on wearable device and wearable device
CN109887512A (en) * 2019-03-15 2019-06-14 深圳市奥迪信科技有限公司 Wisdom hotel guest room control method and system
CN112086091A (en) * 2020-09-18 2020-12-15 南京孝德智能科技有限公司 Intelligent endowment service system and method
CN112687269A (en) * 2020-12-18 2021-04-20 山东盛帆蓝海电气有限公司 Building management robot voice automatic identification method and system
CN113709545A (en) * 2021-04-13 2021-11-26 腾讯科技(深圳)有限公司 Video processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2015103836A1 (en) 2015-07-16

Similar Documents

Publication Publication Date Title
CN104766608A (en) Voice control method and voice control device
CN110049270B (en) Multi-person conference voice transcription method, device, system, equipment and storage medium
AU2021277642A1 (en) Method and apparatus for detecting spoofing conditions
WO2021051506A1 (en) Voice interaction method and apparatus, computer device and storage medium
JP6469252B2 (en) Account addition method, terminal, server, and computer storage medium
CN108182944A (en) Control the method, apparatus and intelligent terminal of intelligent terminal
US9293140B2 (en) Speaker-identification-assisted speech processing systems and methods
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
CN102543071A (en) Voice recognition system and method used for mobile equipment
CN108010513B (en) Voice processing method and device
CN103347070B (en) Push method, terminal, server and the system of speech data
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
CN113436609B (en) Voice conversion model, training method thereof, voice conversion method and system
CN113724718B (en) Target audio output method, device and system
CN113345473B (en) Voice endpoint detection method, device, electronic equipment and storage medium
US9454959B2 (en) Method and apparatus for passive data acquisition in speech recognition and natural language understanding
CN103514882A (en) Voice identification method and system
CN110570847A (en) Man-machine interaction system and method for multi-person scene
CN113299306B (en) Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium
CN110556114B (en) Speaker identification method and device based on attention mechanism
JP6448950B2 (en) Spoken dialogue apparatus and electronic device
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN107886940A (en) Voiced translation processing method and processing device
CN113345423B (en) Voice endpoint detection method, device, electronic equipment and storage medium
CN115762500A (en) Voice processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20150708

WW01 Invention patent application withdrawn after publication