CN107369449A

CN107369449A - Method and device for recognizing valid speech

Info

Publication number: CN107369449A
Application number: CN201710573521.XA
Authority: CN
Inventors: 蒋化冰; 蔡汉嘉; 廖凯; 齐鹏举; 方园; 米万珠; 舒剑; 吴琨; 管伟; 罗璇
Original assignee: Shanghai Muye Robot Technology Co Ltd
Current assignee: Noah robot technology (Shanghai) Co.,Ltd.
Priority date: 2017-07-14
Filing date: 2017-07-14
Publication date: 2017-11-21
Anticipated expiration: 2037-07-14
Also published as: CN107369449B

Abstract

Embodiments of the present invention provide a method and device for recognizing valid speech. The method includes: recording speech data of a sound source object while acquiring face image data of the sound source object; performing ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several items of voice content and their corresponding voice recording time points; performing open-mouth feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame; and checking, for each item of voice content, whether the image acquisition time point of any open-mouth image falls within a preset time range before and after the corresponding voice recording time point; if so, recording that voice content as valid speech. With this method and device, valid speech data can be identified in the ASR recognition result, which effectively improves the practical value of the ASR recognition result.

Description

Method and device for recognizing valid speech

Technical field

The invention belongs to the field of multimedia technology, and in particular relates to a method and device for recognizing valid speech.

Background

With the rapid development of modern science and technology, various electronic devices, such as mobile phones, iPads and intelligent robots, provide recording and automatic speech recognition (Automatic Speech Recognition, ASR) functions. However, a plain recording often captures background noise, environmental noise, echo and the like along with the voice, so sound that is not genuine speech is inevitably recorded as well. When such a recording is passed through ASR, the recognition result inevitably contains both valid and invalid speech data. How to identify the valid speech in an ASR recognition result is therefore a problem that needs to be solved.

Summary of the invention

In view of the above, embodiments of the present invention provide a method and device for recognizing valid speech that can identify valid speech data in an ASR recognition result and effectively improve the practical value of the ASR recognition result.

In a first aspect, an embodiment of the present invention provides a method for recognizing valid speech, characterized by comprising: recording speech data of a sound source object while acquiring face image data of the sound source object; performing ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several items of voice content and their corresponding voice recording time points; performing open-mouth feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame; and checking, for each item of voice content, whether the image acquisition time point of any open-mouth image falls within a preset time range before and after the corresponding voice recording time point; if so, recording that voice content as valid speech.

Further, the voice recording time point corresponding to each item of voice content is: the time point at which recording of that item of voice content starts, a time point in the middle of recording it, or the time point at which recording of it ends.

Further, acquiring the face image data of the sound source object specifically includes: detecting, by a camera, the face of the sound source object; focusing on the face so that the face region occupies a preset proportion of the camera lens; and acquiring the face image data of the sound source object.

Further, performing open-mouth feature recognition on the face image data of the sound source object specifically includes: locating the mouth-shape features in the face image data; and judging whether the ratio of the mouth-opening height to the lip height is greater than or equal to a preset ratio; if it is, identifying the face image data as an open-mouth image. Here, the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip.

Further, the preset time range before and after the voice recording time point corresponding to each item of voice content is 1 second before and after that voice recording time point.

In a second aspect, an embodiment of the present invention provides a device for recognizing valid speech, including: a recording device, configured to record speech data of a sound source object; a camera device, configured to record face image data of the sound source object synchronously with the recording device; an ASR recognition device, configured to perform ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several items of voice content and their corresponding voice recording time points; an image detection device, configured to perform open-mouth feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame; and a valid speech extraction device, configured to check, for each item of voice content, whether the image acquisition time point of any open-mouth image falls within a preset time range before and after the corresponding voice recording time point, and if so, to record that voice content as valid speech.

Further, the camera device is specifically configured to: detect the face of the sound source object; focus on the face so that the face region occupies a preset proportion of the camera lens; and acquire the face image data of the sound source object.

Further, the image detection device is specifically configured to: locate the mouth-shape features in the face image data; and judge whether the ratio of the mouth-opening height to the lip height is greater than or equal to a preset ratio; if it is, identify the face image data as an open-mouth image. Here, the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip.

With the method and device for recognizing valid speech provided by the embodiments of the present invention, face image data of the sound source object is acquired while its speech data is recorded, and the open-mouth images are used to identify the valid speech in the ASR recognition result of the speech data. Background noise, environmental noise and voice content not uttered by the sound source object can thus be accurately filtered out of the ASR recognition result, which effectively improves its practical value.

Brief description of the drawings

To explain the solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a method for recognizing valid speech provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram showing face image data of a sound source object provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of an ASR recognition result provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of a list of the image acquisition time points corresponding to each frame of open-mouth image provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram, provided by an embodiment of the present invention, of judging valid speech by comparing the open-mouth image recognition data stream with the ASR data stream;

Fig. 6 is a schematic diagram of another ASR recognition result provided by an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a device for recognizing valid speech provided by an embodiment of the present invention.

Detailed description of the embodiments

To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them; the drawings show preferred embodiments. The present invention can be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure will be understood more thoroughly and comprehensively. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention. The terms used in the description of the invention are intended only to describe specific embodiments and are not intended to limit the invention. The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to such a process, method, product or device.

Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. The appearances of this phrase in various places in the description do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

Embodiment One

Embodiment One of the present invention provides a method for recognizing valid speech. It acquires face image data of the sound source object while recording its speech data and uses the open-mouth images of the face to identify the valid speech in the ASR recognition result of the speech data, filtering out background noise, environmental noise, voice content uttered by anyone other than the sound source object, and the like.

Refer to Fig. 1, which is a schematic flowchart of a method for recognizing valid speech provided by an embodiment of the present invention.

Step S1001: record the speech data of the sound source object while acquiring the face image data of the sound source object.

In this embodiment, recording the speech data of the sound source object also captures the various other sounds present in the natural environment.

Step S1002: perform ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several items of voice content and their corresponding voice recording time points.

Refer to Fig. 3, a schematic diagram of an ASR recognition result provided by an embodiment of the present invention. In step S1001 the speech data of the sound source object was recorded; ASR is then performed on that speech data to obtain the ASR recognition result. As shown in Fig. 3, the ASR recognition result contains three items of voice content and the voice recording time point corresponding to each. The three items of voice content are "!", "How is the weather in Shanghai?" and "Where is the toilet?"; the corresponding voice recording time points are "2017/06/07 00:00:02.133", "2017/06/07 00:00:05.791" and "2017/06/07 00:00:09.999". The voice recording time point may be the time point at which recording of the item of voice content starts, a time point in the middle of recording it, or the time point at which recording of it ends.
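As a minimal sketch of how such an ASR recognition result might be held in memory, the Python fragment below stores the three items quoted above together with their parsed time points. The class name `VoiceContent`, the field names and the helper `parse_time` are illustrative assumptions; the patent does not prescribe any data format.

```python
from dataclasses import dataclass
from datetime import datetime

# Timestamp layout as quoted in the text, e.g. "2017/06/07 00:00:02.133".
TIME_FORMAT = "%Y/%m/%d %H:%M:%S.%f"


@dataclass(frozen=True)
class VoiceContent:
    """One item of an ASR recognition result (hypothetical structure)."""
    text: str
    recorded_at: datetime  # voice recording time point


def parse_time(stamp: str) -> datetime:
    """Parse a time point written in the format used in the text."""
    return datetime.strptime(stamp, TIME_FORMAT)


# The three items of voice content quoted from Fig. 3.
asr_results = [
    VoiceContent("!", parse_time("2017/06/07 00:00:02.133")),
    VoiceContent("How is the weather in Shanghai?",
                 parse_time("2017/06/07 00:00:05.791")),
    VoiceContent("Where is the toilet?",
                 parse_time("2017/06/07 00:00:09.999")),
]
```

Parsing the time points once, up front, keeps the later comparison against image acquisition time points a plain subtraction of `datetime` values.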

Step S1003: perform open-mouth feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame.

In this embodiment, once the face image data of the sound source object has been acquired, open-mouth feature recognition is performed on it. Referring to Fig. 2, the mouth-shape features in the face image data are located first. It is then judged whether the ratio of the mouth-opening height to the lip height is greater than or equal to a preset ratio; if it is, the face image data is identified as an open-mouth image. Here, the mouth-opening height is the distance between the lower edge 61 of the upper lip and the upper edge 62 of the lower lip, and the lip height is the distance between the upper edge 51 of the upper lip and the lower edge 62 of the lower lip. In general, the ratio of the mouth-opening height to the lip height can be set in advance, for example to 1/4. When the measured ratio is greater than or equal to this value, the sound source object is considered to be speaking. Of course, in other embodiments of the present invention, other reasonable methods may be used to identify whether an image is an open-mouth image. Finally, after the open-mouth images have been identified, the image acquisition time point corresponding to each frame of open-mouth image must also be obtained. Fig. 4 shows a list of the image acquisition time points corresponding to each frame of open-mouth image.
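The ratio test described above can be sketched as follows. This is only an illustration under stated assumptions: the four lip edges are taken to be vertical pixel coordinates (with y increasing downwards) that some landmark detector has already produced, and the function name `is_open_mouth` and its parameters are hypothetical. The default threshold of 1/4 is the example value given in the text.

```python
def is_open_mouth(upper_lip_top: float,
                  upper_lip_bottom: float,
                  lower_lip_top: float,
                  lower_lip_bottom: float,
                  preset_ratio: float = 0.25) -> bool:
    """Return True if the frame counts as an open-mouth image.

    All arguments are y coordinates in image space (larger = lower).
    """
    # Mouth-opening height: lower edge of upper lip to upper edge of lower lip.
    opening_height = lower_lip_top - upper_lip_bottom
    # Lip height: upper edge of upper lip to lower edge of lower lip.
    lip_height = lower_lip_bottom - upper_lip_top
    if lip_height <= 0:
        # Degenerate landmarks; treat as "not open" rather than divide by zero.
        return False
    return opening_height / lip_height >= preset_ratio
```

For instance, with edges at y = 100, 110, 120 and 140 the opening height is 10 and the lip height is 40, so the ratio is exactly 1/4 and the frame is treated as an open-mouth image.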

It should be noted in particular that, in this embodiment, steps S1002 and S1003 are not restricted to any particular order of execution; rather, the speech data and the face image data of the sound source object are analysed separately and in parallel.
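One way to picture the parallel analysis of the two data streams is the sketch below, which submits both analyses to a thread pool and waits for both results. The functions `run_asr` and `detect_open_mouth_frames` are hypothetical placeholders standing in for steps S1002 and S1003; they do not perform real recognition, and the patent does not specify how the parallelism is realised.

```python
from concurrent.futures import ThreadPoolExecutor


def run_asr(speech_data):
    """Placeholder for step S1002: returns (text, time in seconds) pairs."""
    return [("hello", 1.0)]


def detect_open_mouth_frames(face_frames):
    """Placeholder for step S1003: keeps the times of frames flagged as open."""
    return [time for time, is_open in face_frames if is_open]


def analyse(speech_data, face_frames):
    """Run both analyses side by side; neither waits for the other to start."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(run_asr, speech_data)
        mouth_future = pool.submit(detect_open_mouth_frames, face_frames)
        return asr_future.result(), mouth_future.result()
```

The comparison in step S1004 only needs both result lists to exist, so it can start as soon as `analyse` returns.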

Step S1004: check, for each item of voice content, whether the image acquisition time point of any open-mouth image falls within the preset time range before and after the corresponding voice recording time point. If so, perform step S1005; if not, perform step S1006.

In this embodiment, whether an item of voice content is valid speech genuinely uttered by the sound source object is judged by checking whether there is an open-mouth image of the sound source object at the voice recording time point corresponding to that item. However, image recognition and ASR do not run at exactly the same speed and cannot be fully synchronized, so an open-mouth image of the sound source object is allowed to fall anywhere within a preset time range before and after the voice recording time point, for example within 1 second before or after it. Fig. 5 is a schematic diagram of judging valid speech by comparing the open-mouth image recognition data stream with the ASR data stream. Case A is the case in which the open-mouth image recognition data stream is faster; case B is the case in which it is slower. For example, for the voice content "How is the weather in Shanghai?" in the ASR data stream, there is an open-mouth image at 00:00:03.281 (case A) or at 00:00:03.968 (case B), both within 1 second of the time point 00:00:03.938 at which recording of the voice started, which shows that this voice content is valid speech. Alternatively, there is an open-mouth image at 00:00:05.027 (case A) or at 00:00:05.987 (case B), both within 1 second of the time point 00:00:05.791 at which recording of the voice ended, which likewise shows that this voice content is valid speech. Otherwise, the voice content is invalid speech.
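The window test can be sketched as a lookup in a sorted list of open-mouth image acquisition times. The function name `has_open_mouth_near`, the use of seconds as floats and the binary search are illustrative choices, not something the patent specifies; the 1-second default is the example value from the text.

```python
from bisect import bisect_left


def has_open_mouth_near(record_time: float,
                        open_mouth_times: list,
                        window: float = 1.0) -> bool:
    """Check whether any open-mouth frame lies within +/- window seconds.

    open_mouth_times must be sorted in ascending order.
    """
    # Index of the first frame at or after the start of the window.
    index = bisect_left(open_mouth_times, record_time - window)
    # That frame is a match only if it is not past the end of the window.
    return (index < len(open_mouth_times)
            and open_mouth_times[index] <= record_time + window)


# Times from the Fig. 5 example, in seconds after 00:00:00.
start_of_recording = 3.938
case_a_frames = [3.281]  # image recognition stream is faster
case_b_frames = [3.968]  # image recognition stream is slower
```

With these values, both `has_open_mouth_near(start_of_recording, case_a_frames)` and `has_open_mouth_near(start_of_recording, case_b_frames)` hold, matching the conclusion that the voice content is valid in either case.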

In this embodiment, whether each item of voice content is valid speech can be determined by comparing the ASR recognition result list shown in Fig. 3 with the list, shown in Fig. 4, of image acquisition time points corresponding to each frame of open-mouth image.

The voice recording time point of the first item of voice content in Fig. 3, "!", is "2017/06/07 00:00:02.133". Fig. 4 is then searched for any open-mouth image whose image acquisition time point lies within one second before or after "2017/06/07 00:00:02.133". There is none, which shows that this voice content is invalid speech.

The voice recording time point of the second item of voice content in Fig. 3, "How is the weather in Shanghai?", is "2017/06/07 00:00:05.791". Fig. 4 is then searched for any open-mouth image whose image acquisition time point lies within one second before or after "2017/06/07 00:00:05.791". Several frames of open-mouth images have image acquisition time points in this range, for example "2017/06/07 00:00:06.397", "2017/06/07 00:00:06.527" and "2017/06/07 00:00:06.647", which shows that this voice content is valid speech.

The voice recording time point of the third item of voice content in Fig. 3, "Where is the toilet?", is "2017/06/07 00:00:09.999". Fig. 4 is then searched for any open-mouth image whose image acquisition time point lies within one second before or after "2017/06/07 00:00:09.999". There is none, which shows that this voice content is invalid speech.

Step S1005: in the ASR recognition result, record this voice content as valid speech.

Step S1006: in the ASR recognition result, record this voice content as invalid speech.

After steps S1004 to S1006 have been performed, the ASR recognition result shown in Fig. 6 is obtained, in which the remarks column marks each item of voice content as valid speech or invalid speech.
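Steps S1004 to S1006 applied to the example data can be sketched as below. The voice contents and time points are those quoted from Fig. 3; the open-mouth list contains only the three time points that the text quotes from Fig. 4, which is an assumption, since the full list is not reproduced in the text. The function name `label_results` and the tuple layout are likewise illustrative.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y/%m/%d %H:%M:%S.%f"
WINDOW = timedelta(seconds=1)  # example preset range: 1 second either side

# (voice content, voice recording time point), as quoted from Fig. 3.
asr_results = [
    ("!", "2017/06/07 00:00:02.133"),
    ("How is the weather in Shanghai?", "2017/06/07 00:00:05.791"),
    ("Where is the toilet?", "2017/06/07 00:00:09.999"),
]

# Only the open-mouth time points quoted in the text (Fig. 4 may list more).
open_mouth_times = [
    "2017/06/07 00:00:06.397",
    "2017/06/07 00:00:06.527",
    "2017/06/07 00:00:06.647",
]


def label_results(results, frame_stamps, window=WINDOW):
    """Add a remark ("valid" or "invalid") to every item of voice content."""
    frames = [datetime.strptime(stamp, TIME_FORMAT) for stamp in frame_stamps]
    labelled = []
    for text, stamp in results:
        recorded = datetime.strptime(stamp, TIME_FORMAT)
        # S1004: is any open-mouth frame within the window around this point?
        is_valid = any(abs(frame - recorded) <= window for frame in frames)
        # S1005 / S1006: record the outcome as a remark.
        labelled.append((text, stamp, "valid" if is_valid else "invalid"))
    return labelled


labelled_results = label_results(asr_results, open_mouth_times)
```

Under these assumptions the remarks come out as invalid, valid, invalid, in line with the walk-through of the three items above.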

With the method for recognizing valid speech provided by this embodiment, face image data of the sound source object is acquired while its speech data is recorded, and the open-mouth images are used to identify the valid speech in the ASR recognition result of the speech data. Background noise, environmental noise and voice content not uttered by the sound source object can thus be accurately filtered out of the ASR recognition result, which effectively improves its practical value.

Embodiment Two

Embodiment Two of the present invention provides a device for recognizing valid speech. It acquires face image data of the sound source object while recording its speech data and uses the open-mouth images of the face to identify the valid speech in the ASR recognition result of the speech data, filtering out background noise, environmental noise, voice content uttered by anyone other than the sound source object, and the like.

Refer to Fig. 7, which is a schematic structural diagram of a device for recognizing valid speech provided by an embodiment of the present invention. The device can implement, in whole or in part, the speech recognition method provided in Embodiment One above.

As shown in Fig. 7, the device for recognizing valid speech provided by this embodiment includes: a recording device 701, an ASR recognition device 702, a camera device 801, an image detection device 802, and a valid speech extraction device 901.

The recording device 701 is configured to record the speech data of the sound source object.

The camera device 801 is configured to record the face image data of the sound source object synchronously with the recording device.

While the speech data is being recorded, the camera device 801 detects the face of the sound source object, focuses on the face so that the face region occupies a preset proportion of the camera lens (for example, the face region occupies more than 60% of the lens field of view), and acquires the face image data of the sound source object.
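A simple check of the preset proportion might look like the sketch below. It assumes that the proportion is measured as the area of a rectangular face bounding box relative to the area of the whole frame, which is one possible reading of the text; the function name `face_fills_frame` and the box layout `(x, y, width, height)` are hypothetical. The 60% default is the example value from the text.

```python
def face_fills_frame(face_box, frame_size, preset_ratio: float = 0.6) -> bool:
    """Return True if the face box covers at least preset_ratio of the frame.

    face_box is (x, y, width, height); frame_size is (width, height).
    """
    _, _, face_width, face_height = face_box
    frame_width, frame_height = frame_size
    face_area = face_width * face_height
    frame_area = frame_width * frame_height
    return face_area / frame_area >= preset_ratio
```

A camera controller could keep adjusting zoom or focus until this check passes and only then start collecting face image data.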

The ASR recognition device 702 is configured to perform ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several items of voice content and their corresponding voice recording time points. The voice recording time point may be the time point at which recording of the item of voice content starts, a time point in the middle of recording it, or the time point at which recording of it ends.
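The three alternatives for the voice recording time point can be sketched as a small selector. The function name `record_time_point`, the `mode` strings and the choice of the exact midpoint for the "middle" option are assumptions for illustration; the text only says "a time point in the middle". The sample values are the start and end times of the Fig. 5 example in Embodiment One, in seconds.

```python
def record_time_point(start: float, end: float, mode: str = "start") -> float:
    """Pick the time point that represents one item of voice content."""
    if mode == "start":
        return start
    if mode == "middle":
        # Assumption: use the exact midpoint of the recording interval.
        return start + (end - start) / 2
    if mode == "end":
        return end
    raise ValueError(f"unknown mode: {mode!r}")


# Start and end of "How is the weather in Shanghai?" in the Fig. 5 example.
utterance_start, utterance_end = 3.938, 5.791
```

Whichever option is chosen, the same preset range is then applied around the selected time point.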

The image detection device 802 is configured to perform open-mouth feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame.

In this embodiment, once the face image data of the sound source object has been acquired, open-mouth feature recognition is performed on it. The mouth-shape features in the face image data are located first. It is then judged whether the ratio of the mouth-opening height to the lip height is greater than or equal to a preset ratio; if it is, the face image data is identified as an open-mouth image. Here, the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip. In general, the ratio of the mouth-opening height to the lip height can be set in advance, for example to 1/4. When the measured ratio is greater than or equal to this value, the sound source object is considered to be speaking. Of course, in other embodiments of the present invention, other reasonable methods may be used to identify whether an image is an open-mouth image. Finally, after the open-mouth images have been identified, the image acquisition time point corresponding to each frame of open-mouth image must also be obtained.

The valid speech extraction device 901 is configured to check, for each item of voice content, whether the image acquisition time point of any open-mouth image falls within the preset time range before and after the corresponding voice recording time point, and if so, to record that voice content as valid speech.

In this embodiment, whether an item of voice content is valid speech genuinely uttered by the sound source object is judged by checking whether there is an open-mouth image of the sound source object at the voice recording time point corresponding to that item. However, image recognition and ASR do not run at exactly the same speed and cannot be fully synchronized, so an open-mouth image of the sound source object is allowed to fall anywhere within a preset time range before and after the voice recording time point, for example within 1 second before or after it. Whether each item of voice content is valid speech can be determined by comparing the ASR recognition result list shown in Fig. 3 with the list, shown in Fig. 4, of image acquisition time points corresponding to each frame of open-mouth image. The specific comparison and recognition method has been described in detail in steps S1004 to S1006 of Embodiment One and is not repeated here.

With the device for recognizing valid speech provided by this embodiment, the camera device acquires face image data of the sound source object while the recording device records its speech data, and the open-mouth images are used to identify the valid speech in the ASR recognition result of the speech data. Background noise, environmental noise and voice content not uttered by the sound source object can thus be accurately filtered out of the ASR recognition result, which effectively improves its practical value. It is foreseeable that the device provided by this embodiment can be applied to or integrated into various electronic devices with audio and video recording functions, including but not limited to smartphones, iPads and intelligent robots.

In the above embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiment described above is merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed.

The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The above are only embodiments of the present invention and do not limit the scope of its claims. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the description and drawings of the present invention, and applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims

1. A method for recognizing valid speech, characterized by comprising:

recording speech data of a sound source object while acquiring face image data of the sound source object;

performing ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several items of voice content and their corresponding voice recording time points;

performing open-mouth feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame of open-mouth image; and

checking, for each item of voice content, whether the image acquisition time point of any open-mouth image falls within a preset time range before and after the corresponding voice recording time point; if so, recording that voice content as valid speech.
2. The method according to claim 1, characterized in that the voice recording time point corresponding to each item of voice content is: the time point at which recording of that item of voice content starts, a time point in the middle of recording it, or the time point at which recording of it ends.
3. The method according to claim 1, characterized in that acquiring the face image data of the sound source object specifically includes:

detecting, by a camera, the face of the sound source object;

focusing on the face so that the face region occupies a preset proportion of the camera lens; and

acquiring the face image data of the sound source object.
4. The method according to claim 1, characterized in that performing open-mouth feature recognition on the face image data of the sound source object specifically includes:

locating the mouth-shape features in the face image data; and

judging whether the ratio of the mouth-opening height to the lip height is greater than or equal to a preset ratio; if it is, identifying the face image data as an open-mouth image; wherein the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip.
5. The method according to claim 1, characterized in that the preset time range before and after the voice recording time point corresponding to each item of voice content is 1 second before and after that voice recording time point.
6. A device for recognizing valid speech, characterized by comprising:

a recording device, configured to record speech data of a sound source object;

a camera device, configured to record face image data of the sound source object synchronously with the recording device;

an ASR recognition device, configured to perform ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several items of voice content and their corresponding voice recording time points;

an image detection device, configured to perform open-mouth feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame of open-mouth image; and

a valid speech extraction device, configured to check, for each item of voice content, whether the image acquisition time point of any open-mouth image falls within a preset time range before and after the corresponding voice recording time point, and if so, to record that voice content as valid speech.
7. The device according to claim 6, characterized in that the voice recording time point corresponding to each item of voice content is: the time point at which recording of that item of voice content starts, a time point in the middle of recording it, or the time point at which recording of it ends.
8. The device according to claim 6, characterized in that the camera device is specifically configured to: detect the face of the sound source object; focus on the face so that the face region occupies a preset proportion of the camera lens; and acquire the face image data of the sound source object.
9. The device according to claim 6, characterized in that the image detection device is specifically configured to: locate the mouth-shape features in the face image data; and judge whether the ratio of the mouth-opening height to the lip height is greater than or equal to a preset ratio; if it is, identify the face image data as an open-mouth image; wherein the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip.
10. The device according to claim 6, characterized in that the preset time range before and after the voice recording time point corresponding to each item of voice content is 1 second before and after that voice recording time point.