CN110428838A - Voice information recognition method, apparatus and device - Google Patents

Voice information recognition method, apparatus and device

Info

Publication number
CN110428838A
CN110428838A (application CN201910707528.5A)
Authority
CN
China
Prior art keywords
information
user
voice instruction
environment
target
Prior art date
Legal status
Pending
Application number
CN201910707528.5A
Other languages
Chinese (zh)
Inventor
王夏鸣
Current Assignee
Volkswagen Mobvoi Beijing Information Technology Co Ltd
Original Assignee
Volkswagen Mobvoi Beijing Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Volkswagen Mobvoi Beijing Information Technology Co Ltd
Priority to CN201910707528.5A
Publication of CN110428838A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24: Speech recognition using non-acoustical features
    • G10L15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Abstract

Embodiments of the invention disclose a voice information recognition method, apparatus, and device. The method includes: continuously monitoring and recognizing to-be-recognized information in a set environment region, where the to-be-recognized information includes environment voice information, user face information, user gaze information, and user lip-movement information; and, if it is determined from the user face information and the user lip-movement information, or from the user face information, the user gaze information, and the user lip-movement information, that the environment voice information includes voice instruction information issued by a target user, responding to the voice instruction information. The technical solution of the embodiments of the invention can improve voice interaction efficiency.

Description

Voice information recognition method, apparatus and device
Technical field
Embodiments of the present invention relate to the technical field of data processing, and in particular to a voice information recognition method, apparatus, and device.
Background art
Speech recognition technology is used to recognize a voice signal input by a user as a signal corresponding to a predetermined instruction, and can be applied in many fields.
Current speech recognition systems based on speech recognition technology usually require the user to start them, either by a manual wake-up or by having the device automatically detect a wake word, which then triggers the subsequent speech recognition and interaction. Once a voice dialogue task is completed, the system quickly returns to a state in which it must be woken again.
In the course of implementing the present invention, the inventor found that existing speech recognition systems have the following defect: they cannot simulate a real person-to-person conversation scenario, and the need to wake the system before every interaction reduces interaction efficiency.
Summary of the invention
Embodiments of the present invention provide a voice information recognition method, apparatus, and device that improve voice interaction efficiency.
In a first aspect, an embodiment of the present invention provides a voice information recognition method, including:
continuously monitoring and recognizing to-be-recognized information in a set environment region, where the to-be-recognized information includes environment voice information, user face information, user gaze information, and user lip-movement information; and
if it is determined, from the user face information and the user lip-movement information, or from the user face information, the user gaze information, and the user lip-movement information, that the environment voice information includes voice instruction information issued by a target user, responding to the voice instruction information.
In a second aspect, an embodiment of the present invention further provides a voice information recognition apparatus, including:
a to-be-recognized information monitoring module, configured to continuously monitor and recognize the to-be-recognized information in the set environment region, where the to-be-recognized information includes environment voice information, user face information, user gaze information, and user lip-movement information; and
a voice instruction information response module, configured to respond to the voice instruction information if it is determined, from the user face information and the user lip-movement information, or from the user face information, the user gaze information, and the user lip-movement information, that the environment voice information includes voice instruction information issued by a target user.
In a third aspect, an embodiment of the present invention further provides a terminal device, including:
one or more processors; and
a storage apparatus configured to store one or more programs,
where, when the one or more programs are executed by the one or more processors, the one or more processors implement the voice information recognition method provided by any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer storage medium on which a computer program is stored, where the program, when executed by a processor, implements the voice information recognition method provided by any embodiment of the present invention.
By continuously monitoring and recognizing the environment voice information, user face information, user gaze information, and user lip-movement information in a set environment region, and responding to voice instruction information when it is determined from the user face information, user gaze information, and user lip-movement information that the environment voice information includes voice instruction information issued by a target user, embodiments of the present invention solve the problem of low voice interaction efficiency in existing speech recognition systems and improve voice interaction efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of a voice information recognition method provided by Embodiment 1 of the present invention;
Fig. 2a is a flowchart of a voice information recognition method provided by Embodiment 2 of the present invention;
Fig. 2b is a flowchart of a voice information recognition method provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic diagram of a voice information recognition apparatus provided by Embodiment 3 of the present invention;
Fig. 4 is a structural schematic diagram of a terminal device provided by Embodiment 4 of the present invention.
Specific embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention, not to limit it.
It should also be noted that, for ease of description, the accompanying drawings show only the parts related to the present invention rather than the full content. It should be mentioned that, before the exemplary embodiments are discussed in greater detail, some of them are described as processes or methods depicted as flowcharts. Although a flowchart describes operations (or steps) as a sequential process, many of these operations can be performed in parallel, concurrently, or simultaneously, and the order of the operations can be rearranged. A process can be terminated when its operations are completed, and can also have additional steps not included in the drawing. A process can correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like.
Embodiment 1
Fig. 1 is a flowchart of a voice information recognition method provided by Embodiment 1 of the present invention. This embodiment is applicable to performing speech recognition based on multi-dimensional to-be-recognized information. The method can be executed by a voice information recognition apparatus, which can be implemented in software and/or hardware and generally integrated into a terminal device. Accordingly, as shown in Fig. 1, the method includes the following operations:
S110: continuously monitor and recognize the to-be-recognized information in a set environment region, where the to-be-recognized information includes environment voice information, user face information, user gaze information, and user lip-movement information.
Here, the set environment region can be the environment region in which the speech recognition system is applied, for example inside a vehicle or indoors. Using the voice information recognition method provided by the embodiments of the present invention, voice instructions can be recognized through speech recognition technology inside a vehicle, or attendance clock-in can be performed through speech recognition technology indoors; the embodiments of the present invention do not limit the concrete form of the set environment region. The to-be-recognized information can be the information awaiting recognition that the speech recognition system perceives, including but not limited to environment voice information, user face information, user gaze information, and user lip-movement information. The environment voice information is the voice information collected in the set environment region, including but not limited to users' voice information and noise in the environment, for example audio information. The user face information is the user's facial information, the user gaze information is information such as the angular direction of the user's gaze, and the user lip-movement information is the user's lip-motion information.
In embodiments of the present invention, the speech recognition system no longer performs speech recognition based on voice information alone, but achieves accurate speech recognition through multi-dimensional information including voice, face, gaze, and lip movement. Meanwhile, to avoid start-up through a manual wake-up by the user or automatic wake-word detection by the device, the speech recognition system can continuously monitor and recognize the environment voice information, user face information, user gaze information, and user lip-movement information in the set environment region.
S120: if it is determined, from the user face information and the user lip-movement information, or from the user face information, the user gaze information, and the user lip-movement information, that the environment voice information includes voice instruction information issued by a target user, respond to the voice instruction information.
Here, the target user can be the user on whom the speech recognition system needs to perform speech recognition, such as the driver inside a vehicle or a worker clocking in at an attendance terminal; the embodiments of the present invention do not limit the specific identity of the target user. The voice instruction information is instruction information in speech form.
Accordingly, after obtaining the environment voice information, user face information, user gaze information, and user lip-movement information in the set environment region, the speech recognition system can process the various kinds of to-be-recognized information together, determine from the processing results whether the environment voice information in the set environment region includes voice instruction information issued by a target user, and respond to that voice instruction information when it does. Specifically, the system can judge from the user face information and the user lip-movement information whether the environment voice information includes voice instruction information issued by the target user. If that judgement cannot be made from the user face information and the user lip-movement information alone, the system can further combine the user gaze information to make it.
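As a minimal sketch of this decision logic (the function name, parameters, and the convention that gaze is only consulted in the ambiguous multi-user case are invented for illustration and are not part of the patent), the face and lip-movement cues gate the response, with gaze acting as the tie-breaker:

```python
from typing import Optional

def should_respond(face_in_region: bool, lip_moving: bool,
                   gaze_at_device: Optional[bool] = None) -> bool:
    # Face presence and lip movement in the specific region are the primary cues.
    if not (face_in_region and lip_moving):
        return False
    # Single-user case: the primary cues alone suffice.
    if gaze_at_device is None:
        return True
    # Multi-user case: the target user's gaze must also point at the device.
    return gaze_at_device
```

Under this sketch, a face with lip movement triggers a response on its own, unless gaze evidence is available and contradicts it.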
By continuously monitoring and recognizing the environment voice information, user face information, user gaze information, and user lip-movement information in a set environment region, and responding to voice instruction information when it is determined from the user face information, user gaze information, and user lip-movement information that the environment voice information includes voice instruction information issued by a target user, embodiments of the present invention solve the problem of low voice interaction efficiency in existing speech recognition systems and improve voice interaction efficiency.
Embodiment 2
Fig. 2a is a flowchart of a voice information recognition method provided by Embodiment 2 of the present invention. This embodiment is refined on the basis of the above embodiment and gives specific implementations of continuously monitoring the to-be-recognized information in the set environment region and of responding to the voice instruction information. Accordingly, as shown in Fig. 2a, the method of this embodiment may include:
S210: continuously monitor and recognize the to-be-recognized information in the set environment region.
Here, S210 may specifically include the following operations:
S211: continuously monitor the environment voice information, the user face information, the user gaze information, and the user lip-movement information, and recognize the environment voice information.
In embodiments of the present invention, to improve voice interaction efficiency, the speech recognition module, the face recognition module, the eye-movement recognition module, and the lip-movement recognition module can all be started when the speech recognition system starts, so as to obtain the environment voice information, user face information, user gaze information, and user lip-movement information simultaneously. Meanwhile, the collected environment voice information can be recognized first. Optionally, the environment voice information can be obtained by collecting sound signals through hardware such as a microphone, and the user face information, user gaze information, and user lip-movement information can be obtained by collecting face images through hardware such as a camera.
S212: if it is determined that the environment voice information in the set environment region includes user voice information, recognize the user face information, the user gaze information, and the user lip-movement information.
Here, the user voice information is the voice information uttered by a user in the set environment region.
To reduce the data-processing load of the speech recognition system, in embodiments of the present invention the collected user face information, user gaze information, and user lip-movement information are recognized only after the system has detected that the environment voice information includes user voice information.
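The gating described above can be sketched as follows; `Frame`, its fields, and the `process` helper are hypothetical names assumed for illustration, with a voice-activity flag standing in for the audio-channel detection:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    has_user_speech: bool  # voice-activity decision on the audio channel
    image: str             # placeholder for the camera frame

def process(frame: Frame, analyzed: List[str]) -> None:
    # Run the (comparatively expensive) visual analysis of face, gaze,
    # and lip movement only when user speech is already present.
    if frame.has_user_speech:
        analyzed.append(frame.image)
```

Frames with no detected user speech are skipped entirely, which is the load reduction the embodiment aims at.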
S220: if it is determined, from the user face information and the user lip-movement information, or from the user face information, the user gaze information, and the user lip-movement information, that the environment voice information includes voice instruction information issued by a target user, respond to the voice instruction information.
Here, S220 may specifically include the following operations:
S221: if it is determined, from the recognition results of the user face information and the user lip-movement information, that the target user is present in a specific environment region within the set environment region, perform speech recognition on the user voice information of the target user.
Here, the specific environment region can be a particular coordinate region within the set environment region. For example, it can be the driver's region inside a vehicle, or the region in front of a device running the speech recognition system indoors; the embodiments of the present invention do not limit this.
It can be understood that at least one user may be present in the set environment region. Therefore, to prevent conversations between users or outside noise from falsely triggering recognition, after detecting that the environment voice information includes user voice information, the system can judge from the recognition results of the user face information and the user lip-movement information whether the target user is present in the specific environment region of the set environment region. If so, speech recognition can further be performed on the user voice information of the target user.
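One plausible way to implement the presence check (the bounding-box representation and the calibration values below are assumptions, not taken from the patent) is a simple containment test of the detected face box against the pre-calibrated specific region:

```python
def face_in_region(face_box, region):
    """True if the detected face bounding box lies entirely inside the
    pre-calibrated specific region of the camera picture.
    Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    fx1, fy1, fx2, fy2 = face_box
    rx1, ry1, rx2, ry2 = region
    return fx1 >= rx1 and fy1 >= ry1 and fx2 <= rx2 and fy2 <= ry2

# Hypothetical calibration of the driver's region in the camera picture.
DRIVER_REGION = (0, 0, 400, 300)
```

A real system would calibrate the region once the camera is mounted, as the in-vehicle example later in this embodiment describes.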
S222: determine, from the speech recognition result, whether the environment voice information includes voice instruction information issued by the target user.
S223: if it is determined that the environment voice information includes voice instruction information issued by the target user, respond to the voice instruction information.
Accordingly, after speech recognition is performed on the user voice information of the target user, whether the environment voice information includes voice instruction information issued by the target user can be determined from the speech recognition result. When it does, the voice instruction information is responded to.
In an optional embodiment of the present invention, determining from the recognition results of the user face information and the user lip-movement information that the target user is present in the specific environment region of the set environment region may include: when it is determined that only the specific environment region contains the user face information and user lip-movement information of the target user, determining that the target user is present only in the specific environment region. Determining from the speech recognition result whether the environment voice information includes voice instruction information issued by the target user may include: if the speech recognition result matches a preset voice instruction set, determining that the environment voice information includes the voice instruction information issued by the target user.
Here, the preset voice instruction set can be a voice instruction set defined for the concrete application scenario of the speech recognition system. Illustratively, if the speech recognition system is applied in the automotive field, the preset voice instruction set can include, but is not limited to, "navigate", "open music", "open contacts", and the like; if the system is applied in the attendance field, it can include, but is not limited to, "clock in", "clock out", and the like.
In embodiments of the present invention, if the speech recognition system determines that the specific environment region contains only the user face information and user lip-movement information of the target user, this indicates that only the target user is present in the specific environment region and has most likely issued a voice instruction to the system. In that case, the obtained speech recognition result can be matched against the preset voice instruction set in the system; on a successful match it is determined that the target user has issued voice instruction information, and the system can respond directly.
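A minimal sketch of this matching step, assuming the preset instruction set is a plain set of normalised command strings (the example commands are invented placeholders):

```python
PRESET_INSTRUCTIONS = {"navigate", "open music", "open contacts"}

def match_instruction(asr_text):
    """Normalise the ASR transcript and look it up in the preset
    voice instruction set; None means no instruction matched."""
    text = asr_text.strip().lower()
    return text if text in PRESET_INSTRUCTIONS else None
```

A production system would likely use fuzzy or semantic matching rather than exact string lookup; this sketch only shows the match-then-respond control flow.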
In an optional embodiment of the present invention, determining from the recognition results of the user face information and the user lip-movement information that the target user is present in the specific environment region of the set environment region may include: if it is determined that the set environment region contains the user face information of at least two users, and the specific environment region contains the user face information and user lip-movement information of the target user, determining that the target user is present in the specific environment region. Determining, from the user face information, the user gaze information, and the user lip-movement information, that the environment voice information includes voice instruction information issued by a target user may include: if the user gaze information of the target user matches a first preset gaze region, determining that the environment voice information includes the voice instruction information issued by the target user.
Here, the first preset gaze region can be a gaze region set according to actual needs; preferably, it can be set according to the relative position of the target user and the speech recognition system, for example the in-vehicle central control region calibrated with an angular coordinate system based on the driver's eye coordinates.
It can be understood that, when multiple users are present in the set environment region, the speech recognition system can recognize the user face information of multiple users. If it then detects user face information and user lip-movement information in the specific environment region, it can still determine that the target user is present there. However, because multiple users are present, the system cannot tell whether the target user is issuing a voice instruction to the system or talking to the other users. The speech recognition system can therefore further use the user's gaze information for recognition.
Specifically, when the speech recognition system determines that multiple users including the target user are present, it can match the recognized user gaze information of the target user against the first preset gaze region. If they match, the target user's gaze lies within the first preset gaze region, and it can be determined that the target user is issuing a voice instruction to the speech recognition system.
In an optional embodiment of the present invention, the method may further include: if the user gaze information of the target user matches a second preset gaze region, determining that the environment voice information does not include voice instruction information issued by the target user; or, if the user gaze information of the target user matches a third preset gaze region, performing speech recognition on the user voice information of the target user and determining from the speech recognition result whether the environment voice information includes voice instruction information issued by the target user.
Here, the second and third preset gaze regions can likewise be gaze regions set according to actual needs. Optionally, the second preset gaze region can be a region that deviates far from the first preset gaze region; there can be one or more second preset gaze regions, which the embodiments of the present invention do not limit. The third preset gaze region can be a region that deviates only slightly from the first preset gaze region, such as the region corresponding to the driver looking straight ahead.
Accordingly, if the user gaze information of the target user matches the second preset gaze region, the target user is not issuing a voice instruction to the speech recognition system, for example in a scenario where the target user is talking to other users; it can then be determined that the environment voice information does not include voice instruction information issued by the target user. If the user gaze information of the target user matches the third preset gaze region, the system cannot accurately determine whether the target user has issued a voice instruction. In that case, the speech recognition system can perform speech recognition on the user voice information of the target user and determine from the result whether the environment voice information includes voice instruction information issued by the target user.
In an optional embodiment of the present invention, the method may further include: when it is determined that user face information and user lip-movement information are present only in a non-specific environment region of the set environment region, not performing speech recognition on the user voice information.
Here, the non-specific environment region is the part of the set environment region outside the specific environment region, such as the front-passenger region and the rear-seat region inside a vehicle.
Accordingly, if the speech recognition system detects user face information and user lip-movement information only in the non-specific environment region of the set environment region, this indicates that users other than the target user are speaking, for example rear-seat passengers talking. The system can then directly discard the detected user voice without recognizing it, effectively filtering out external noise.
In an optional embodiment of the present invention, the preset voice instruction set includes a preset business voice instruction set and a preset start-up voice instruction set. Responding to the voice instruction information may include: if the speech recognition result matches the preset business voice instruction set, determining that the environment voice information includes the voice instruction information issued by the target user, and responding to it directly; if the speech recognition result matches the preset start-up voice instruction set, determining that the environment voice information includes the voice instruction information issued by the target user, and, when it is further determined that the subsequent voice instruction information matches the preset business voice instruction set, responding to it directly; otherwise, prompting the target user to re-enter voice instruction information according to the preset business voice instruction set.
Here, the preset business voice instruction set can be the set of voice instructions commonly used in the business field where the speech recognition system is applied, and the preset start-up voice instruction set can be the set of voice instructions used to wake the speech recognition system, such as "Hello, XX".
In embodiments of the present invention, if the speech recognition system determines that the speech recognition result matches the preset business voice instruction set, the target user is issuing a voice instruction to the system, and the voice instruction information can be responded to directly. If the result matches the preset start-up voice instruction set, the target user is trying to start the speech recognition system, and the system can wake automatically to respond. If the subsequent voice instruction information also matches the preset business voice instruction set, it can be responded to directly; if it does not, the system can inform the target user of the business scope of the speech recognition it supports and guide the user to re-enter voice instruction information according to the preset business voice instruction set.
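A compact sketch of this two-set policy, assuming exact matching against normalised transcripts (the instruction phrases and the wake phrase are placeholders, not from the patent):

```python
BUSINESS_SET = {"navigate", "open music"}
WAKE_SET = {"hello xx"}

def respond(transcript, follow_up=""):
    """Two-set policy: a business command is executed directly; a wake
    phrase is accepted and the follow-up utterance is then checked
    against the business set; anything else prompts re-entry."""
    t = transcript.strip().lower()
    if t in BUSINESS_SET:
        return "execute:" + t
    if t in WAKE_SET:
        f = follow_up.strip().lower()
        if f in BUSINESS_SET:
            return "execute:" + f
    return "prompt_reenter"
```

The single return at the end covers both an unsupported follow-up after a wake phrase and an utterance matching neither set.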
In an optional embodiment of the present invention, the set environment region is the in-vehicle environment region.
Optionally, the voice information recognition method provided by the embodiments of the present invention can be applied to an in-vehicle speech recognition system, which continuously monitors and recognizes the environment voice information, user face information, user gaze information, and user lip-movement information in the in-vehicle environment region, and recognizes and responds to the voice information issued by the target user according to the recognition results. Here, the target user can be the driver in the driver's region.
Fig. 2b is a flowchart of a voice information recognition method provided by Embodiment 2 of the present invention. In a specific example, as shown in Fig. 2b, the speech recognition system is applied in an in-vehicle scenario. Following the natural pattern of human interaction, the system combines multi-dimensional information from speech recognition, face recognition, gaze tracking, and semantic understanding to intelligently analyse the user's speech behaviour, achieving always-on, wake-free voice interaction: the driver can interact by voice at any time without waking the system, and normal in-vehicle conversation and noise are not misrecognized, further reducing safety risks. The detailed process is as follows:
Step 1: the speech recognition system starts. The microphone begins to collect in-vehicle environment voice information and inputs the collected audio stream to a voice dictation engine, which identifies the collected environment voice information. At the same time the camera is started, and face recognition, gaze tracking and lip-movement recognition begin.
Step 2: if the voice dictation engine begins to output text results, the face recognition and lip-movement recognition results are compared; once the voice dictation engine stops outputting text results, the comparison of face recognition and lip-movement results is stopped as well.
If a face is detected only in the main driving region of the picture, it is determined that only the driver is in the vehicle; if the driver's lips are moving, the process enters Step 3. The main driving region is a preferred coordinate region calibrated in advance in the camera video picture once the camera has been fixed in the vehicle; this region may be the area occupied by the driver's head in the picture.
If the driver's lips are detected to be moving, and faces are also detected in regions other than the main driving region, it is determined that there are other occupants in the vehicle besides the driver, and the process enters Step 4.
If lip movement is detected only for faces in the other regions, the speech recognition system ignores the text results output by the voice dictation engine and does not respond.
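The routing performed in Step 2 can be sketched as a small decision function. This is a simplified illustration under assumed boolean inputs; in practice each flag would come from the face-recognition and lip-movement comparison running while the dictation engine outputs text.

```python
def route_recognition(driver_lip_moving: bool,
                      other_faces_present: bool,
                      other_lips_moving: bool) -> str:
    """Decide which part of the flow handles the current dictation output."""
    if driver_lip_moving and not other_faces_present:
        return "step 3"   # driver alone in the cabin and speaking: semantic calibration
    if driver_lip_moving and other_faces_present:
        return "step 4"   # other occupants present: check the driver's gaze
    if other_lips_moving:
        return "ignore"   # only passengers are speaking: discard the dictation text
    return "ignore"       # no relevant lip movement at all
```

The key property of this routing is that the dictation engine's text is never acted on unless the driver's own lips were moving while it was produced.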
Step 3: the entire dictation text, from the start of text output to the end of text input, is input into a semantic understanding model for semantic calibration, and whether it contains and matches the wake word X is checked.
If the semantic calibration result belongs to one of the N business domains supported by the speech recognition system, it is considered that the driver is not talking to himself but issuing an instruction to the speech recognition system, and the system needs to respond to the instruction.
If the semantic calibration result does not fall within the N business domains supported by the speech recognition system, it is considered that the driver is not issuing an instruction to the system, and the system does not respond.
If the semantic calibration result is found to contain and exactly match the wake word X, the speech recognition system gives a corresponding response regardless of whether the semantics fall within the business domains supported by the system.
Specifically, when the semantic calibration result contains the wake word X and belongs to one of the N business domains supported by the speech recognition system, the system responds directly; when the semantic calibration result contains the wake word X but does not belong to the N supported business domains, the system prompts the user with the domains it supports and guides the user to re-enter a voice instruction.
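The Step 3 decision logic can be sketched as below. The domain names and wake word are illustrative assumptions, and the `domain` argument stands in for the output of a real semantic-understanding model, which the patent does not specify.

```python
SUPPORTED_DOMAINS = {"navigation", "music", "climate"}   # the N business domains (assumed)
WAKE_WORD = "hello car"                                  # stand-in for wake word X

def semantic_calibration(text: str, domain: str) -> str:
    """Decide the response given the dictated text and its classified domain."""
    if WAKE_WORD in text:
        # A wake-word match always gets a response; out-of-domain requests
        # trigger guidance rather than silence.
        if domain in SUPPORTED_DOMAINS:
            return "respond"
        return "prompt supported domains and ask to re-enter"
    if domain in SUPPORTED_DOMAINS:
        return "respond"          # driver is addressing the system
    return "no response"          # driver is not addressing the system
```

The asymmetry matters: without the wake word, out-of-domain speech is treated as conversation and ignored; with the wake word, the same speech is treated as an addressed but unsupported request.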
Step 4: from the moment the voice dictation engine begins to output text until output ends, the driver's sight-line deflection is detected. The deflection is defined as (α, β, γ), where α, β and γ are the angles to three axes, with the vehicle's longitudinal direction as the x-axis, the lateral direction as the y-axis and the vertical direction as the z-axis, and the coordinate origin located at the center of the driver's head.
If the driver's sight-line deflection coordinates are detected to lie within the in-vehicle central control region A, where A = [(α1, β1, γ1), (α2, β2, γ2), (α3, β3, γ3), (α4, β4, γ4)], it is considered that the driver is issuing a voice instruction to the system, which needs to respond. A is the carrier region, calibrated in advance in angle coordinates, of the speech recognition system such as the central control head unit or the central control in-vehicle voice robot; correspondingly, the four vertices of region A correspond to the four coordinate groups (α1, β1, γ1), (α2, β2, γ2), (α3, β3, γ3) and (α4, β4, γ4), whose values can be set adaptively according to the size of the vehicle and the specific position of the central control region A in the vehicle.
If the driver's sight-line is detected to be directed toward the in-vehicle front-passenger region B or the rear-row region C, it is considered that the driver is not issuing a voice instruction to the in-vehicle speech recognition system, and no response is needed. Here B = [(α5, β5, γ5), (α6, β6, γ6), (α7, β7, γ7), (α8, β8, γ8)], the four vertices of the front-passenger region B corresponding to the coordinate groups (α5, β5, γ5), (α6, β6, γ6), (α7, β7, γ7) and (α8, β8, γ8); likewise C = [(α9, β9, γ9), (α10, β10, γ10), (α11, β11, γ11), (α12, β12, γ12)], the four vertices of the rear-row region C corresponding to the coordinate groups (α9, β9, γ9), (α10, β10, γ10), (α11, β11, γ11) and (α12, β12, γ12). The coordinate values of regions B and C can be set adaptively according to the size of the vehicle and the specific positions of the front-passenger region and the rear-row region in the vehicle.
If the sight-line of the driver in the main driving region is detected to be directed horizontally toward the front region D, the speech recognition system cannot determine from gaze alone whether the driver is issuing a voice instruction to the system, and must decide by identifying the voice according to Step 3. Here D = [(α13, β13, γ13), (α14, β14, γ14), (α15, β15, γ15), (α16, β16, γ16)], the four vertices of the front region D likewise corresponding to the coordinate groups (α13, β13, γ13), (α14, β14, γ14), (α15, β15, γ15) and (α16, β16, γ16), whose values can be set adaptively according to actual needs. The embodiment of the present invention does not limit the specific coordinate values of the central control region A, the front-passenger region B, the rear-row region C or the front region D.
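The gaze classification of Step 4 can be sketched as a lookup over calibrated angle regions. The numeric bounds below are placeholders: the patent explicitly leaves the coordinate values to per-vehicle calibration, and for simplicity each region is modeled as an axis-aligned box in (α, β, γ) angle space rather than an arbitrary four-vertex quadrilateral.

```python
# Each region maps to ((alpha_min, alpha_max), (beta_min, beta_max),
# (gamma_min, gamma_max)); all bounds are illustrative, in degrees.
REGIONS = {
    "A_central_control": ((-30, -10), (-40, -20), (-20, 0)),   # respond
    "B_front_passenger": ((10, 40),   (-10, 10),  (-10, 10)),  # ignore
    "C_rear_row":        ((150, 180), (-20, 20),  (-10, 10)),  # ignore
    "D_front":           ((-10, 10),  (-10, 10),  (-5, 5)),    # fall back to Step 3
}

def classify_gaze(alpha: float, beta: float, gamma: float) -> str:
    """Return the calibrated region containing the gaze deflection, if any."""
    for name, ((a0, a1), (b0, b1), (g0, g1)) in REGIONS.items():
        if a0 <= alpha <= a1 and b0 <= beta <= b1 and g0 <= gamma <= g1:
            return name
    return "unknown"
```

Region A triggers a direct response, regions B and C suppress it, and region D defers the decision to the semantic calibration of Step 3.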
It can thus be seen that the voice information identification method provided by the embodiment of the present invention, when applied to the automotive field, can integrate visual and auditory information to determine the driver's speaking intention while effectively filtering speech in the in-vehicle environment that is not a voice instruction. It can realize wake-word-free voice interaction similar to conversation between people, substantially improving interaction efficiency and experience and reducing the safety risks caused by low interaction efficiency.
It should be noted that any permutation and combination of the technical features in the above embodiments also belongs to the protection scope of the present invention.
Embodiment 3
Fig. 3 is a schematic diagram of a voice information identification apparatus provided by Embodiment 3 of the present invention. As shown in Fig. 3, the apparatus includes an information-to-be-identified monitoring module 310 and a voice instruction information response module 320, in which:
the information-to-be-identified monitoring module 310 is configured to continuously monitor and identify the information to be identified in a set environment region, where the information to be identified includes environment voice information, user face information, user sight-line information and user lip-movement information;
the voice instruction information response module 320 is configured to respond to the voice instruction information if, according to the user face information and the user lip-movement information, or according to the user face information, the user sight-line information and the user lip-movement information, it is determined that the environment voice information includes voice instruction information issued by the target user.
By continuously monitoring and identifying the environment voice information, user face information, user sight-line information and user lip-movement information in the set environment region, and responding to the voice instruction information when it is determined from that information that the environment voice information includes voice instruction information issued by the target user, the embodiment of the present invention solves the problem of low voice interaction efficiency in existing speech recognition systems and improves voice interaction efficiency.
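The division of labor between the two modules of Fig. 3 can be sketched structurally as below. The class and field names are assumed for illustration; the patent describes the modules functionally, not as code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    speech_text: Optional[str]    # environment voice information (None if silent)
    target_face_present: bool     # user face information
    target_lip_moving: bool       # user lip-movement information

class VoiceInfoApparatus:
    """Sketch of Fig. 3: module 310 monitors, module 320 responds."""

    def monitor(self, obs: Observation) -> bool:
        # Module 310: is there environment voice plus a candidate face?
        return obs.speech_text is not None and obs.target_face_present

    def respond(self, obs: Observation) -> Optional[str]:
        # Module 320: respond only when face and lip movement agree that
        # the speech came from the target user.
        if self.monitor(obs) and obs.target_lip_moving:
            return f"responding to: {obs.speech_text}"
        return None
```

The point of the split is that monitoring runs continuously and cheaply, while the response path fires only when the multi-dimensional evidence agrees.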
Optionally, the information-to-be-identified monitoring module 310 includes: an information monitoring unit, configured to continuously monitor the environment voice information, the user face information, the user sight-line information and the user lip-movement information, and to identify the environment voice information; and an information identification unit, configured to identify the user face information, the user sight-line information and the user lip-movement information if it is determined that the environment voice information in the set environment region includes user voice information. The voice instruction information response module 320 includes: a voice recognition unit, configured to perform speech recognition on the user voice information of the target user if it is determined, according to the recognition results of the user face information and the user lip-movement information, that the target user is present in a specific environment region within the set environment region; and an instruction information determination unit, configured to determine, according to the speech recognition result, whether the environment voice information includes voice instruction information issued by the target user.
Optionally, the voice recognition unit is specifically configured to determine that only the target user is present in the specific environment region when it is determined that the specific environment region includes only the user face information and user lip-movement information of the target user; the instruction information determination unit is specifically configured to determine that the environment voice information includes the voice instruction information issued by the target user if the speech recognition result matches a preset voice instruction set.
Optionally, the voice recognition unit is specifically configured to determine that the target user is present in the specific environment region if it is determined that the set environment region includes the user face information of at least two users, and the specific environment region includes the user face information and user lip-movement information of the target user; the instruction information determination unit is specifically configured to determine that the environment voice information includes the voice instruction information issued by the target user if the user sight-line information of the target user matches a first preset sight region.
Optionally, the apparatus further includes an instruction information determining module, configured to: determine that the environment voice information does not include voice instruction information issued by the target user if the user sight-line information of the target user matches a second preset sight region; or, if the user sight-line information of the target user matches a third preset sight region, perform speech recognition on the user voice information of the target user and determine, according to the speech recognition result, whether the environment voice information includes voice instruction information issued by the target user.
Optionally, the apparatus further includes a voice information identification module, configured not to perform speech recognition on the user voice information when it is determined that only a non-specific environment region within the set environment region includes user face information and user lip-movement information.
Optionally, the preset voice instruction set includes a preset business voice instruction set and a preset startup voice instruction set. The voice instruction information response module 320 is specifically configured to: determine that the environment voice information includes the voice instruction information issued by the target user and respond to the voice instruction information directly if the speech recognition result matches the preset business voice instruction set; determine that the environment voice information includes the voice instruction information issued by the target user if the speech recognition result matches the preset startup voice instruction set, and respond to the voice instruction information directly when determining that the voice instruction information matches the preset business voice instruction set; otherwise, prompt the target user to re-enter voice instruction information according to the preset business voice instruction set.
Optionally, the set environment region is the in-vehicle environment region.
The above voice information identification apparatus can execute the voice information identification method provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the voice information identification method provided by any embodiment of the present invention.
Since the voice information identification apparatus introduced above is an apparatus capable of executing the voice information identification method in the embodiments of the present invention, those skilled in the art can, based on the voice information identification method described in the embodiments of the present invention, understand the specific implementation of the voice information identification apparatus of this embodiment and its various variations; therefore, how this apparatus realizes the voice information identification method of the embodiments of the present invention is not discussed in detail here. Any apparatus used by those skilled in the art to implement the voice information identification method in the embodiments of the present invention falls within the scope of protection of the present application.
Embodiment 4
Fig. 4 is a structural schematic diagram of a terminal device provided by Embodiment 4 of the present invention. Fig. 4 shows a block diagram of a terminal device 412 suitable for implementing an embodiment of the present invention. The terminal device 412 shown in Fig. 4 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 4, the terminal device 412 takes the form of a general-purpose computing device. The components of the terminal device 412 may include, but are not limited to: one or more processors 416, a storage device 428, and a bus 418 connecting the different system components (including the storage device 428 and the processors 416).
The bus 418 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The terminal device 412 typically comprises a variety of computer-system-readable media. These media can be any usable media that can be accessed by the terminal device 412, including volatile and non-volatile media, and removable and non-removable media.
The storage device 428 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 430 and/or cache memory 432. The terminal device 412 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 434 can be used to read and write non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly referred to as a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disk drive for reading and writing a removable non-volatile optical disk (such as a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM) or other optical media) may be provided. In these cases, each drive can be connected to the bus 418 through one or more data media interfaces. The storage device 428 may include at least one program product having a set (for example, at least one) of program modules configured to perform the functions of the various embodiments of the present invention.
A program 436 having a set (at least one) of program modules 426 may be stored, for example, in the storage device 428. Such program modules 426 include, but are not limited to, an operating system, one or more application programs, other program modules and program data, and each or some combination of these examples may include an implementation of a network environment. The program modules 426 generally perform the functions and/or methods of the embodiments described in the present invention.
The terminal device 412 may also communicate with one or more external devices 414 (such as a keyboard, a pointing device, a camera, a display 424, etc.), with one or more devices that enable a user to interact with the terminal device 412, and/or with any device (such as a network card, a modem, etc.) that enables the terminal device 412 to communicate with one or more other computing devices. Such communication can occur via an input/output (I/O) interface 422. Moreover, the terminal device 412 can also communicate, through a network adapter 420, with one or more networks, such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet. As shown, the network adapter 420 communicates with the other modules of the terminal device 412 through the bus 418. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the terminal device 412, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives and data backup storage systems, etc.
The processor 416 executes various functional applications and data processing by running the programs stored in the storage device 428, for example implementing the voice information identification method provided by the above embodiments of the present invention.
That is, when executing the program, the processing unit implements: continuously monitoring and identifying the information to be identified in a set environment region, where the information to be identified includes environment voice information, user face information, user sight-line information and user lip-movement information; and, if it is determined according to the user face information and the user lip-movement information, or according to the user face information, the user sight-line information and the user lip-movement information, that the environment voice information includes voice instruction information issued by a target user, responding to the voice instruction information.
Embodiment 5
Embodiment 5 of the present invention also provides a computer storage medium storing a computer program which, when executed by a computer processor, performs the voice information identification method of any of the above embodiments of the present invention: continuously monitoring and identifying the information to be identified in a set environment region, where the information to be identified includes environment voice information, user face information, user sight-line information and user lip-movement information; and, if it is determined according to the user face information and the user lip-movement information, or according to the user face information, the user sight-line information and the user lip-movement information, that the environment voice information includes voice instruction information issued by a target user, responding to the voice instruction information.
The computer storage medium of the embodiment of the present invention may adopt any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium, other than a computer-readable storage medium, that can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments; without departing from the inventive concept, it may also include many other equivalent embodiments, and the scope of the invention is determined by the scope of the appended claims.

Claims (13)

1. A voice information identification method, characterized by comprising:
continuously monitoring and identifying information to be identified in a set environment region; wherein the information to be identified comprises environment voice information, user face information, user sight-line information and user lip-movement information;
if it is determined, according to the user face information and the user lip-movement information, or according to the user face information, the user sight-line information and the user lip-movement information, that the environment voice information comprises voice instruction information issued by a target user, responding to the voice instruction information.
2. The method according to claim 1, characterized in that the continuously monitoring and identifying information to be identified in a set environment region comprises:
continuously monitoring the environment voice information, the user face information, the user sight-line information and the user lip-movement information, and identifying the environment voice information;
if it is determined that the environment voice information in the set environment region comprises user voice information, identifying the user face information, the user sight-line information and the user lip-movement information;
and the determining, according to the user face information and the user lip-movement information, that the environment voice information comprises voice instruction information issued by a target user comprises:
if it is determined, according to the recognition results of the user face information and the user lip-movement information, that the target user is present in a specific environment region within the set environment region, performing speech recognition on the user voice information of the target user;
determining, according to the speech recognition result, whether the environment voice information comprises voice instruction information issued by the target user.
3. The method according to claim 2, characterized in that the determining, according to the recognition results of the user face information and the user lip-movement information, that the target user is present in a specific environment region within the set environment region comprises:
when it is determined that the specific environment region includes only the user face information and user lip-movement information of the target user, determining that only the target user is present in the specific environment region;
and the determining, according to the speech recognition result, whether the environment voice information comprises voice instruction information issued by the target user comprises:
if the speech recognition result matches a preset voice instruction set, determining that the environment voice information comprises the voice instruction information issued by the target user.
4. The method according to claim 2, characterized in that the determining, according to the recognition results of the user face information and the user lip-movement information, that the target user is present in a specific environment region within the set environment region comprises:
if it is determined that the set environment region includes the user face information of at least two users, and the specific environment region includes the user face information and user lip-movement information of the target user, determining that the target user is present in the specific environment region;
and the determining, according to the user face information, the user sight-line information and the user lip-movement information, that the environment voice information comprises voice instruction information issued by a target user comprises:
if the user sight-line information of the target user matches a first preset sight region, determining that the environment voice information comprises the voice instruction information issued by the target user.
5. The method according to claim 4, characterized in that the method further comprises:
if the user sight-line information of the target user matches a second preset sight region, determining that the environment voice information does not comprise the voice instruction information issued by the target user; or
if the user sight-line information of the target user matches a third preset sight region, performing speech recognition on the user voice information of the target user, and determining, according to the speech recognition result, whether the environment voice information comprises voice instruction information issued by the target user.
6. The method according to claim 2, characterized in that the method further comprises:
when it is determined that only a non-specific environment region within the set environment region includes user face information and user lip-movement information, not performing speech recognition on the user voice information.
7. The method according to any one of claims 2 to 6, characterized in that the preset voice instruction set comprises a preset business voice instruction set and a preset startup voice instruction set;
and the responding to the voice instruction information comprises:
if the speech recognition result matches the preset business voice instruction set, determining that the environment voice information comprises the voice instruction information issued by the target user, and responding to the voice instruction information directly;
if the speech recognition result matches the preset startup voice instruction set, determining that the environment voice information comprises the voice instruction information issued by the target user, and, when determining that the voice instruction information matches the preset business voice instruction set, responding to the voice instruction information directly; otherwise, prompting the target user to re-enter voice instruction information according to the preset business voice instruction set.
8. a kind of voice messaging identification device characterized by comprising
Information monitoring module to be identified, for continuing to monitor and identifying the information to be identified in set environment region;Wherein, described Information to be identified includes that environment voice messaging, user's face information, user's sight information and user's lip move information;
Phonetic order information response module, if for moving information or root according to the user's face information and user's lip Information is moved according to the user's face information, user's sight information and user's lip, determines the environment voice messaging Including the phonetic order information that target user issues, then the phonetic order information is responded.
9. The device according to claim 8, wherein the information monitoring module comprises:
an information monitoring unit, configured to continuously monitor the environment voice information, the user face information, the user gaze information, and the user lip movement information, and to identify the environment voice information;
an information identification unit, configured to identify the user face information, the user gaze information, and the user lip movement information if it is determined that the environment voice information in the set environment region includes user voice information;
and the voice instruction information response module comprises:
a speech recognition unit, configured to perform speech recognition on the user voice information of the target user if it is determined, according to the identification result of the user face information and the user lip movement information, that the target user is present in a specific environment region within the set environment region;
an instruction information determination unit, configured to determine, according to the speech recognition result, whether the environment voice information includes the voice instruction information issued by the target user.
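The chain of units in claim 9 — monitor the region, confirm user speech, confirm the target user via face and lip movement, then run speech recognition and match — can be illustrated as a short pipeline. The observation record and field names below are hypothetical, introduced only to make the control flow concrete.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Hypothetical per-frame observation; field names are illustrative only."""
    has_user_voice: bool             # environment voice information contains user speech
    target_in_specific_region: bool  # face + lip movement found in the specific region
    recognized_text: str             # output of the speech recognition unit

# Hypothetical preset instruction set.
PRESET_INSTRUCTIONS = {"open the window", "play music"}

def contains_instruction(obs: Observation) -> bool:
    """Mirror claim 9's unit chain: monitor -> identify -> recognize -> determine."""
    if not obs.has_user_voice:
        return False  # information identification unit is never triggered
    if not obs.target_in_specific_region:
        return False  # speech recognition unit finds no target user
    # instruction information determination unit: match the recognition result
    return obs.recognized_text in PRESET_INSTRUCTIONS
```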
10. The device according to claim 9, wherein the speech recognition unit is specifically configured to:
determine that only the target user is present in the specific environment region when it is determined that the specific environment region includes only the user face information and the user lip movement information of the target user;
and the instruction information determination unit is specifically configured to:
determine that the environment voice information includes the voice instruction information issued by the target user if the speech recognition result matches a preset voice instruction set.
11. The device according to claim 9, wherein the speech recognition unit is specifically configured to:
determine that the target user is present in the specific environment region if it is determined that the set environment region includes the user face information of at least two users, and the specific environment region includes the user face information and the user lip movement information of the target user;
and the instruction information determination unit is specifically configured to:
determine that the environment voice information includes the voice instruction information issued by the target user if the user gaze information of the target user matches a first preset gaze region.
12. The device according to claim 11, wherein the device further comprises:
an instruction information determination module, configured to determine that the environment voice information does not include the voice instruction information issued by the target user if the user gaze information of the target user matches a second preset gaze region; or,
if the user gaze information of the target user matches a third preset gaze region, to perform speech recognition on the user voice information of the target user, and to determine, according to the speech recognition result, whether the environment voice information includes voice instruction information issued by the target user.
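Claims 11 and 12 together describe a three-way dispatch on the target user's gaze region: the first region always yields an instruction, the second never does, and the third falls back to the speech recognition result. A sketch of that dispatch is below; the region semantics in the comments (device, passenger) are plausible readings, not stated in the patent, and the instruction set is hypothetical.

```python
from enum import Enum

class GazeRegion(Enum):
    FIRST = 1   # e.g. gaze toward the device: treat the speech as an instruction
    SECOND = 2  # e.g. gaze toward another passenger: not an instruction
    THIRD = 3   # ambiguous gaze: fall back to the speech recognition result

# Hypothetical preset instruction set.
PRESET_INSTRUCTIONS = {"open the window", "play music"}

def is_instruction(gaze: GazeRegion, recognized_text: str) -> bool:
    """Sketch of the gaze-region dispatch of claims 11 and 12."""
    if gaze is GazeRegion.FIRST:
        return True
    if gaze is GazeRegion.SECOND:
        return False
    # Third region: decide from the speech recognition result.
    return recognized_text in PRESET_INSTRUCTIONS
```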
13. A terminal device, characterized in that the device comprises:
one or more processors; and
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the voice information identification method according to any one of claims 1-7.
CN201910707528.5A 2019-08-01 2019-08-01 A kind of voice information identification method, device and equipment Pending CN110428838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910707528.5A CN110428838A (en) 2019-08-01 2019-08-01 A kind of voice information identification method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910707528.5A CN110428838A (en) 2019-08-01 2019-08-01 A kind of voice information identification method, device and equipment

Publications (1)

Publication Number Publication Date
CN110428838A true CN110428838A (en) 2019-11-08

Family

ID=68412064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910707528.5A Pending CN110428838A (en) 2019-08-01 2019-08-01 A kind of voice information identification method, device and equipment

Country Status (1)

Country Link
CN (1) CN110428838A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113442941A (en) * 2020-12-04 2021-09-28 安波福电子(苏州)有限公司 Man-vehicle interaction system
CN114348000A (en) * 2022-02-15 2022-04-15 安波福电子(苏州)有限公司 Driver attention management system and method
CN114694349A (en) * 2020-12-30 2022-07-01 上海博泰悦臻网络技术服务有限公司 Interaction method and interaction system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002244841A (en) * 2001-02-21 2002-08-30 Japan Science & Technology Corp Voice indication system and voice indication program
US20130054240A1 (en) * 2011-08-25 2013-02-28 Samsung Electronics Co., Ltd. Apparatus and method for recognizing voice by using lip image
CN104011735A (en) * 2011-12-26 2014-08-27 英特尔公司 Vehicle Based Determination Of Occupant Audio And Visual Input
JP2015071320A (en) * 2013-10-01 2015-04-16 アルパイン株式会社 Conversation support device, conversation support method, and conversation support program
CN106875941A (en) * 2017-04-01 2017-06-20 彭楚奥 A kind of voice method for recognizing semantics of service robot
CN109166575A (en) * 2018-07-27 2019-01-08 百度在线网络技术(北京)有限公司 Exchange method, device, smart machine and the storage medium of smart machine
CN109410939A (en) * 2018-11-29 2019-03-01 中国人民解放军91977部队 General data maintaining method based on phonetic order collection
CN109949812A (en) * 2019-04-26 2019-06-28 百度在线网络技术(北京)有限公司 A kind of voice interactive method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6683234B2 (en) Audio data processing method, device, equipment and program
CN110428838A (en) A kind of voice information identification method, device and equipment
CN105501121A (en) Intelligent awakening method and system
CN107516526B (en) Sound source tracking and positioning method, device, equipment and computer readable storage medium
CN105204628A (en) Voice control method based on visual awakening
JP2017007652A (en) Method for recognizing a speech context for speech control, method for determining a speech control signal for speech control, and apparatus for executing the method
CN112397065A (en) Voice interaction method and device, computer readable storage medium and electronic equipment
US11861265B2 (en) Providing audio information with a digital assistant
CN109346074A (en) A kind of method of speech processing and system
CN109032345A (en) Apparatus control method, device, equipment, server-side and storage medium
JP2022095768A (en) Method, device, apparatus, and medium for dialogues for intelligent cabin
CN109215646A (en) Voice interaction processing method, device, computer equipment and storage medium
CN113486760A (en) Object speaking detection method and device, electronic equipment and storage medium
CN109584871A (en) Method for identifying ID, the device of phonetic order in a kind of vehicle
CN111326152A (en) Voice control method and device
CN114187637A (en) Vehicle control method, device, electronic device and storage medium
WO2023273063A1 (en) Passenger speaking detection method and apparatus, and electronic device and storage medium
CN112083795A (en) Object control method and device, storage medium and electronic equipment
CN111370004A (en) Man-machine interaction method, voice processing method and equipment
CN106980640B (en) Interaction method, device and computer-readable storage medium for photos
CN109243457B (en) Voice-based control method, device, equipment and storage medium
US20230048330A1 (en) In-Vehicle Speech Interaction Method and Device
WO2023231211A1 (en) Voice recognition method and apparatus, electronic device, storage medium, and product
CN115352361A (en) Partition window dialogue method and device for vehicle and vehicle
CN114550720A (en) Voice interaction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20191108