CN107369449A - Valid speech recognition method and device - Google Patents

Valid speech recognition method and device

Info

Publication number
CN107369449A
Authority
CN
China
Prior art keywords
voice
time point
image
mouth opening
sound source object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710573521.XA
Other languages
Chinese (zh)
Other versions
CN107369449B (en)
Inventor
蒋化冰
蔡汉嘉
廖凯
齐鹏举
方园
米万珠
舒剑
吴琨
管伟
罗璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Noah robot technology (Shanghai) Co.,Ltd.
Original Assignee
Shanghai Muye Robot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Muye Robot Technology Co Ltd
Priority to CN201710573521.XA
Publication of CN107369449A
Application granted
Publication of CN107369449B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating

Abstract

Embodiments of the present invention provide a valid speech recognition method and device. The method includes: recording speech data of a sound source object while acquiring face image data of the sound source object; performing ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several pieces of voice content and their corresponding voice recording time points; performing mouth-opening feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame of open-mouth image; and checking, for each piece of voice content, whether the image acquisition time point of any open-mouth image falls within a preset time range before and after the corresponding voice recording time point; if so, recording the corresponding voice content as valid speech. With this method and device, valid speech data can be identified in the ASR recognition result, which effectively improves the practical value of the ASR recognition result.

Description

Valid speech recognition method and device
Technical field
The invention belongs to the field of multimedia technology, and in particular relates to a valid speech recognition method and device.
Background
With the rapid development of modern science and technology, various electronic devices, such as mobile phones, iPads, and intelligent robots, provide recording and automatic speech recognition (Automatic Speech Recognition, ASR) functions. However, a plain recording often captures background noise, environmental noise, echo, and the like along with the intended sound, and sounds that are not genuine speech are inevitably recorded as well. When such recording data is passed through ASR, the ASR recognition result inevitably contains both valid speech data and invalid speech data. How to identify the valid speech in an ASR recognition result is therefore a problem that needs to be solved.
Summary of the invention
In view of the above, embodiments of the present invention provide a valid speech recognition method and device that can identify valid speech data in an ASR recognition result, effectively improving the practical value of the ASR recognition result.
In a first aspect, an embodiment of the present invention provides a valid speech recognition method, characterized by including: recording speech data of a sound source object while acquiring face image data of the sound source object; performing ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several pieces of voice content and their corresponding voice recording time points; performing mouth-opening feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame of open-mouth image; and checking, for each piece of voice content, whether the image acquisition time point of any open-mouth image falls within a preset time range before and after the corresponding voice recording time point; if so, recording the corresponding voice content as valid speech.
Further, the voice recording time point corresponding to each piece of voice content is: the time point at which recording of that voice content starts, a time point in the middle of recording that voice content, or the time point at which recording of that voice content ends.
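The three choices of voice recording time point can be sketched as a small selector. This is only an illustration under assumptions: the names `TimePointMode` and `voice_record_time_point` are hypothetical, and taking the exact midpoint for the "middle" option is an assumption, since the text only says a time point in the middle of the recording.

```python
from datetime import datetime
from enum import Enum


class TimePointMode(Enum):
    """Which instant of a voice segment is used as its recording time point."""
    START = "start"
    MIDDLE = "middle"
    END = "end"


def voice_record_time_point(start: datetime, end: datetime,
                            mode: TimePointMode) -> datetime:
    """Return the voice recording time point for a segment [start, end]."""
    if end < start:
        raise ValueError("end must not precede start")
    if mode is TimePointMode.START:
        return start
    if mode is TimePointMode.END:
        return end
    # Assumption: "middle" is taken as the exact midpoint of the segment.
    return start + (end - start) / 2
```

Whichever option is chosen, it should be applied consistently to every piece of voice content so that the later comparison against image acquisition time points uses the same reference.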
Further, acquiring the face image data of the sound source object specifically includes: detecting the face of the sound source object with a camera; focusing on the face so that the face region occupies a preset proportion of the camera lens range; and acquiring the face image data of the sound source object.
Further, performing mouth-opening feature recognition on the face image data of the sound source object specifically includes: locating the mouth shape feature in the face image data; and determining whether the ratio of the mouth-opening height to the lip height exceeds or equals a preset ratio; when it does, identifying the face image data as an open-mouth image; where the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip.
Further, the preset time range before and after the voice recording time point corresponding to each piece of voice content is 1 second before and after the voice recording time point.
In a second aspect, an embodiment of the present invention provides a valid speech recognition device, including: a recording device, configured to record speech data of a sound source object; a camera device, configured to record face image data of the sound source object synchronously with the recording device; an ASR recognition device, configured to perform ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several pieces of voice content and their corresponding voice recording time points; an image detection device, configured to perform mouth-opening feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame of open-mouth image; and a valid speech extraction device, configured to check, for each piece of voice content, whether the image acquisition time point of any open-mouth image falls within a preset time range before and after the corresponding voice recording time point, and if so, to record the corresponding voice content as valid speech.
Further, the voice recording time point corresponding to each piece of voice content is: the time point at which recording of that voice content starts, a time point in the middle of recording that voice content, or the time point at which recording of that voice content ends.
Further, the camera device is specifically configured to: detect the face of the sound source object; focus on the face so that the face region occupies a preset proportion of the camera lens range; and acquire the face image data of the sound source object.
Further, the image detection device is specifically configured to: locate the mouth shape feature in the face image data; and determine whether the ratio of the mouth-opening height to the lip height exceeds or equals a preset ratio; when it does, identify the face image data as an open-mouth image; where the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip.
Further, the preset time range before and after the voice recording time point corresponding to each piece of voice content is 1 second before and after the voice recording time point.
With the valid speech recognition method and device provided by the embodiments of the present invention, face image data of the sound source object is acquired while its speech data is recorded, and open-mouth images are used to identify the valid speech in the ASR recognition result of the speech data. Background noise, environmental noise, and voice content not produced by the sound source object can thus be accurately filtered out of the ASR recognition result, effectively improving its practical value.
Brief description of the drawings
To illustrate the solutions of the present invention or of the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a valid speech recognition method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram showing face image data of a sound source object provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of an ASR recognition result provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of a list of image acquisition time points corresponding to each frame of open-mouth image provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram, provided by an embodiment of the present invention, of determining valid speech by comparing the open-mouth image recognition data stream with the ASR data stream;
Fig. 6 is a schematic diagram of another ASR recognition result provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a valid speech recognition device provided by an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention; preferred embodiments are shown in the drawings. The present invention can be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure is understood more thoroughly and comprehensively. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention. The terms used in this description are only for the purpose of describing specific embodiments and are not intended to limit the present invention. The terms "first", "second", and the like in the description, the claims, and the drawings are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. Appearances of the phrase in various places in the description do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Embodiment one
Embodiment One of the present invention provides a valid speech recognition method: face image data of the sound source object is acquired while its speech data is recorded, and open-mouth images of the face are used to identify the valid speech in the ASR recognition result of the speech data, filtering out background noise, environmental noise, voice content produced by anything other than the sound source object, and the like.
Fig. 1 is a schematic flowchart of a valid speech recognition method provided by an embodiment of the present invention.
Step S1001: Record the speech data of a sound source object while acquiring the face image data of the sound source object.
In this embodiment, recording the speech data of the sound source object also records the various other sounds in the natural environment.
Step S1002: Perform ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several pieces of voice content and their corresponding voice recording time points.
Fig. 3 is a schematic diagram of an ASR recognition result provided by an embodiment of the present invention. The speech data of the sound source object was recorded in step S1001; ASR is then performed on that speech data to obtain the ASR recognition result. As shown in Fig. 3, the ASR recognition result includes three pieces of voice content and the voice recording time point corresponding to each. The three pieces of voice content are "!", "How is the weather in Shanghai?" and "Where is the toilet?"; the corresponding voice recording time points are "2017/06/07 00:00:02.133", "2017/06/07 00:00:05.791" and "2017/06/07 00:00:09.999". A voice recording time point may be the time point at which recording of the voice content starts, a time point in the middle of recording the voice content, or the time point at which recording of the voice content ends.
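As a rough illustration of how such an ASR recognition result could be held in memory, the sketch below parses the three rows quoted above. The `VoiceContent` type, the `parse_asr_rows` helper, and the row layout are hypothetical; the patent does not specify a data format beyond the timestamp strings shown.

```python
from dataclasses import dataclass
from datetime import datetime

# Timestamp layout as it appears in the description, e.g. "2017/06/07 00:00:02.133".
TIME_FORMAT = "%Y/%m/%d %H:%M:%S.%f"


@dataclass(frozen=True)
class VoiceContent:
    """One piece of voice content with its voice recording time point."""
    text: str
    recorded_at: datetime


def parse_asr_rows(rows):
    """Convert (text, timestamp string) pairs into VoiceContent records."""
    return [VoiceContent(text, datetime.strptime(stamp, TIME_FORMAT))
            for text, stamp in rows]


# The three entries described for Fig. 3.
asr_result = parse_asr_rows([
    ("!", "2017/06/07 00:00:02.133"),
    ("How is the weather in Shanghai?", "2017/06/07 00:00:05.791"),
    ("Where is the toilet?", "2017/06/07 00:00:09.999"),
])
```

Parsing the strings into `datetime` values once up front keeps the later time-window comparison a matter of simple subtraction.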
Step S1003: Perform mouth-opening feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame of open-mouth image.
In this embodiment, after the face image data of the sound source object is acquired, mouth-opening feature recognition is performed on it. Referring to Fig. 2, first, the mouth shape feature in the face image data is located. Then, it is determined whether the ratio of the mouth-opening height to the lip height exceeds or equals a preset ratio; when it does, the face image data is identified as an open-mouth image. Here, the mouth-opening height is the distance between the lower edge 61 of the upper lip and the upper edge 62 of the lower lip, and the lip height is the distance between the upper edge 51 of the upper lip and the lower edge 62 of the lower lip. Normally, the ratio of mouth-opening height to lip height can be preset, for example to 1/4; when the measured ratio exceeds or equals this value, the sound source object is considered to be speaking. Of course, in other embodiments of the present invention, other reasonable methods may be used to identify whether an image is an open-mouth image. Finally, after the open-mouth images are identified, the image acquisition time point corresponding to each frame of open-mouth image must also be obtained. Fig. 4 shows a list of the image acquisition time points corresponding to each frame of open-mouth image.
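The ratio test described above can be sketched in a few lines. This is a minimal illustration, assuming the four lip edges have already been located by some landmark detector and are given as vertical pixel coordinates that grow downward; the `LipEdges` type and `is_open_mouth` function are hypothetical names, and 1/4 is the example threshold from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LipEdges:
    """Vertical pixel coordinates (y grows downward) of the four lip edges."""
    upper_lip_top: float
    upper_lip_bottom: float
    lower_lip_top: float
    lower_lip_bottom: float


def is_open_mouth(edges: LipEdges, threshold: float = 0.25) -> bool:
    """Return True when mouth-opening height / lip height >= threshold."""
    # Mouth-opening height: lower edge of upper lip to upper edge of lower lip.
    opening_height = edges.lower_lip_top - edges.upper_lip_bottom
    # Lip height: upper edge of upper lip to lower edge of lower lip.
    lip_height = edges.lower_lip_bottom - edges.upper_lip_top
    if lip_height <= 0:
        raise ValueError("lip height must be positive")
    return opening_height / lip_height >= threshold
```

Using a ratio rather than an absolute pixel distance makes the test independent of how large the face appears in the frame.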
It should be noted in particular that, in this embodiment, steps S1002 and S1003 are not restricted to any order of execution; the speech data and the face image data of the sound source object are analyzed side by side, in parallel.
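One way to run the two analyses side by side is with a thread pool, as sketched below. The functions `run_asr` and `detect_open_mouth_frames` are hypothetical placeholders standing in for a real ASR engine and a real mouth-opening detector; only the concurrency pattern is the point here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_asr(speech_data):
    """Hypothetical placeholder: a real system would call an ASR engine here."""
    return [("hello", 2.0)]  # (voice content, voice recording time in seconds)


def detect_open_mouth_frames(face_frames):
    """Hypothetical placeholder: keep acquisition times of frames flagged open."""
    return [t for t, is_open in face_frames if is_open]


def analyze_in_parallel(speech_data, face_frames):
    """Run the S1002-style and S1003-style analyses concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(run_asr, speech_data)
        frames_future = pool.submit(detect_open_mouth_frames, face_frames)
        # Neither result depends on the other, so the order of waiting is free.
        return asr_future.result(), frames_future.result()
```

Because neither analysis consumes the other's output, the comparison step only needs to wait until both results are available.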
Step S1004: For each piece of voice content, check whether the image acquisition time point of any open-mouth image falls within the preset time range before and after the corresponding voice recording time point. If so, perform step S1005; if not, perform step S1006.
In this embodiment, whether a piece of voice content is valid speech actually uttered by the sound source object is determined by checking whether an open-mouth image of the sound source object exists at the voice recording time point corresponding to that voice content. However, image recognition and ASR cannot run at exactly the same speed, so the check is allowed to cover a preset time range before and after the voice recording time point, for example 1 second before and after. Fig. 5 is a schematic diagram of determining valid speech by comparing the open-mouth image recognition data stream with the ASR data stream. Situation A is the case where the open-mouth image recognition data stream is faster; situation B is the case where it is slower. For example, for the voice content "How is the weather in Shanghai?" in the ASR data stream, recording starts at 00:00:03.938; an open-mouth image at 00:00:03.281 (situation A) or at 00:00:03.968 (situation B) lies within 1 second before or after that time point, which indicates that this voice content is valid speech. Alternatively, recording of the same voice content ends at 00:00:05.791; an open-mouth image at 00:00:05.027 (situation A) or at 00:00:05.987 (situation B) lies within 1 second before or after that time point, which likewise indicates that this voice content is valid speech. Otherwise, the voice content is invalid speech.
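The window check itself can be sketched as a binary search over sorted image acquisition times. This is an illustration only: the function name `has_open_mouth_near` is hypothetical, times are expressed as seconds for brevity, and treating the window bounds as inclusive is an assumption the text does not settle.

```python
from bisect import bisect_left


def has_open_mouth_near(voice_time_s, open_mouth_times_s, window_s=1.0):
    """Return True if any open-mouth frame lies within +/- window_s seconds.

    open_mouth_times_s must be sorted in ascending order.
    """
    earliest = voice_time_s - window_s
    latest = voice_time_s + window_s
    # Index of the first frame at or after the start of the window.
    i = bisect_left(open_mouth_times_s, earliest)
    # That frame is inside the window only if it is not past the end.
    return i < len(open_mouth_times_s) and open_mouth_times_s[i] <= latest
```

With the Fig. 5 values, a frame at 3.281 s (situation A) or 3.968 s (situation B) both fall inside the window around the start time 3.938 s, so either case marks the voice content as valid.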
In this embodiment, whether each piece of voice content is valid speech can be determined by comparing the ASR recognition result list shown in Fig. 3 with the list of image acquisition time points for each frame of open-mouth image shown in Fig. 4.
The voice recording time point of the first piece of voice content in Fig. 3, "!", is "2017/06/07 00:00:02.133". Fig. 4 is then searched for any open-mouth image whose image acquisition time point lies within one second before or after "2017/06/07 00:00:02.133". There is none, which indicates that this voice content is invalid speech.
The voice recording time point of the second piece of voice content in Fig. 3, "How is the weather in Shanghai?", is "2017/06/07 00:00:05.791". Fig. 4 is then searched for any open-mouth image whose image acquisition time point lies within one second before or after "2017/06/07 00:00:05.791". Several frames of open-mouth images have image acquisition time points within this range, for example "2017/06/07 00:00:06.397", "2017/06/07 00:00:06.527" and "2017/06/07 00:00:06.647", which indicates that this voice content is valid speech.
The voice recording time point of the third piece of voice content in Fig. 3, "Where is the toilet?", is "2017/06/07 00:00:09.999". Fig. 4 is then searched for any open-mouth image whose image acquisition time point lies within one second before or after "2017/06/07 00:00:09.999". There is none, which indicates that this voice content is invalid speech.
Step S1005: In the ASR recognition result, record this voice content as valid speech.
Step S1006: In the ASR recognition result, record this voice content as invalid speech.
Performing steps S1004 to S1006 yields the ASR recognition result shown in Fig. 6, in which a remarks column marks each piece of voice content as valid speech or invalid speech.
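Steps S1004 to S1006 can be traced end to end on the values quoted in this embodiment. The sketch below is illustrative: `label_asr_results` is a hypothetical name, the remark strings are placeholders for the remarks column, and `open_mouth_frames` contains only the three acquisition time points cited in the text, since the full Fig. 4 list is not reproduced here.

```python
from datetime import datetime, timedelta

FMT = "%Y/%m/%d %H:%M:%S.%f"

# Voice content and voice recording time points described for Fig. 3.
asr_results = [
    ("!", "2017/06/07 00:00:02.133"),
    ("How is the weather in Shanghai?", "2017/06/07 00:00:05.791"),
    ("Where is the toilet?", "2017/06/07 00:00:09.999"),
]

# Assumption: only the three open-mouth frames cited in the text are listed.
open_mouth_frames = [
    "2017/06/07 00:00:06.397",
    "2017/06/07 00:00:06.527",
    "2017/06/07 00:00:06.647",
]


def label_asr_results(results, frames, window=timedelta(seconds=1)):
    """Attach a 'valid speech' / 'invalid speech' remark to each ASR entry."""
    frame_times = [datetime.strptime(f, FMT) for f in frames]
    labeled = []
    for text, stamp in results:
        recorded_at = datetime.strptime(stamp, FMT)
        # Valid when any open-mouth frame lies within the window on either side.
        valid = any(abs(ft - recorded_at) <= window for ft in frame_times)
        labeled.append((text, stamp, "valid speech" if valid else "invalid speech"))
    return labeled
```

Calling `label_asr_results(asr_results, open_mouth_frames)` marks only the second entry as valid speech, in line with the walk-through of the three pieces of voice content in this embodiment.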
With the valid speech recognition method provided by this embodiment of the present invention, face image data of the sound source object is acquired while its speech data is recorded, and open-mouth images are used to identify the valid speech in the ASR recognition result of the speech data. Background noise, environmental noise, and voice content not produced by the sound source object can thus be accurately filtered out of the ASR recognition result, effectively improving its practical value.
Embodiment two
Embodiment Two of the present invention provides a valid speech recognition device: face image data of the sound source object is acquired while its speech data is recorded, and open-mouth images of the face are used to identify the valid speech in the ASR recognition result of the speech data, filtering out background noise, environmental noise, voice content produced by anything other than the sound source object, and the like.
Fig. 7 is a schematic structural diagram of a valid speech recognition device provided by an embodiment of the present invention. The device can implement, in whole or in part, the speech recognition method provided in Embodiment One.
As shown in Fig. 7, the valid speech recognition device provided by this embodiment includes: a recording device 701, an ASR recognition device 702, a camera device 801, an image detection device 802, and a valid speech extraction device 901.
The recording device 701 is configured to record the speech data of a sound source object.
In this embodiment, recording the speech data of the sound source object also records the various other sounds in the natural environment.
The camera device 801 is configured to record the face image data of the sound source object synchronously with the recording device.
While the speech data is being recorded, the camera device 801 detects the face of the sound source object; focuses on the face so that the face region occupies a preset proportion of the camera lens range, for example more than 60% of the lens range; and acquires the face image data of the sound source object.
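The framing condition can be checked with a simple area ratio, sketched below. This assumes the face is reported as an axis-aligned bounding box `(x, y, width, height)` in pixels and that "proportion of the lens range" is measured as a fraction of the frame area; both are assumptions, as is the hypothetical name `face_fills_frame`. The 0.6 default mirrors the 60% example, with the boundary treated as inclusive.

```python
def face_fills_frame(face_box, frame_size, min_fraction=0.6):
    """Return True when the visible face box covers at least min_fraction of the frame.

    face_box is (x, y, width, height); frame_size is (frame_width, frame_height).
    """
    x, y, w, h = face_box
    frame_w, frame_h = frame_size
    if frame_w <= 0 or frame_h <= 0:
        raise ValueError("frame dimensions must be positive")
    # Clip the box to the frame so parts outside the image are not counted.
    left, top = max(x, 0), max(y, 0)
    right, bottom = min(x + w, frame_w), min(y + h, frame_h)
    visible_area = max(right - left, 0) * max(bottom - top, 0)
    return visible_area / (frame_w * frame_h) >= min_fraction
```

A capture loop could keep adjusting zoom or focus until this check passes before handing frames to the mouth-opening detector.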
The ASR recognition device 702 is configured to perform ASR on the speech data to obtain an ASR recognition result, the ASR recognition result including several pieces of voice content and their corresponding voice recording time points. A voice recording time point may be the time point at which recording of the voice content starts, a time point in the middle of recording the voice content, or the time point at which recording of the voice content ends.
The image detection device 802 is configured to perform mouth-opening feature recognition on the face image data of the sound source object to obtain several frames of open-mouth images and the image acquisition time point corresponding to each frame of open-mouth image.
In this embodiment, after the face image data of the sound source object is acquired, mouth-opening feature recognition is performed on it. First, the mouth shape feature in the face image data is located. Then, it is determined whether the ratio of the mouth-opening height to the lip height exceeds or equals a preset ratio; when it does, the face image data is identified as an open-mouth image. Here, the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip. Normally, the ratio of mouth-opening height to lip height can be preset, for example to 1/4; when the measured ratio exceeds or equals this value, the sound source object is considered to be speaking. Of course, in other embodiments of the present invention, other reasonable methods may be used to identify whether an image is an open-mouth image. Finally, after the open-mouth images are identified, the image acquisition time point corresponding to each frame of open-mouth image must also be obtained.
The valid speech extraction device 901 is configured to check, for each piece of voice content, whether the image acquisition time point of any open-mouth image falls within the preset time range before and after the corresponding voice recording time point, and if so, to record the corresponding voice content as valid speech.
In this embodiment, whether a piece of voice content is valid speech actually uttered by the sound source object is determined by checking whether an open-mouth image of the sound source object exists at the voice recording time point corresponding to that voice content. However, image recognition and ASR cannot run at exactly the same speed, so the check is allowed to cover a preset time range before and after the voice recording time point, for example 1 second before and after. Whether each piece of voice content is valid speech can be determined by comparing the ASR recognition result list shown in Fig. 3 with the list of image acquisition time points for each frame of open-mouth image shown in Fig. 4. The specific comparison and recognition method is described in detail in steps S1004 to S1006 of Embodiment One and is not repeated here.
With the valid speech recognition device provided by this embodiment of the present invention, the camera device acquires face image data of the sound source object while the recording device records its speech data, and open-mouth images are used to identify the valid speech in the ASR recognition result of the speech data. Background noise, environmental noise, and voice content not produced by the sound source object can thus be accurately filtered out of the ASR recognition result, effectively improving its practical value. Foreseeably, the valid speech recognition device provided by this embodiment can be applied to or integrated into various electronic devices with audio and video recording functions, including but not limited to smartphones, iPads, and intelligent robots.
In the above embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiment described above is merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above are only embodiments of the present invention and do not limit the patent scope of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent replacements for some of their technical features. Any equivalent structure made using the contents of the description and drawings of the present invention, and used directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims (10)

  1. A kind of 1. efficient voice recognition methods, it is characterised in that including:
    The speech data of sound object is recorded, while obtains the face image data of the sound object;
    ASR identifications are carried out to the speech data, obtain ASR recognition results, the ASR recognition results are included in some voices Appearance and corresponding voice record time point;
    Feature recognition of dehiscing is carried out to the face image data of the sound object, some frames is obtained and dehisces image and described per frame Dehisce image acquisition time point corresponding to image;And
    Compare corresponding to every voice content in the front and rear preset time range at voice record time point, if having corresponding Dehisce image image acquisition time point within this range;If so, voice content corresponding to record is efficient voice.
  2. 2. according to the method for claim 1, it is characterised in that voice record time point corresponding to every voice content For:Record time point that every voice content starts, record time point among every voice content or, note Record the time point that every voice content terminates.
  3. The method according to claim 1, characterized in that acquiring the facial image data of the sound source object specifically comprises:
    detecting, by a camera, the face of the sound source object;
    focusing on the face so that the face region occupies a preset value of the camera lens; and
    acquiring the facial image data of the sound source object.
  4. The method according to claim 1, characterized in that performing mouth-opening feature recognition on the facial image data of the sound source object specifically comprises:
    locating the mouth-shape feature in the facial image data; and
    judging whether the ratio of the mouth-opening height to the lip height of the mouth shape is greater than or equal to a preset ratio; if so, identifying the facial image data as a mouth-opening image; wherein the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip.
  5. The method according to claim 1, characterized in that the preset time range before and after the voice recording time point corresponding to each item of voice content is 1 second before and after that voice recording time point.
  6. An effective voice recognition device, characterized by comprising:
    a recording device, for recording voice data of a sound source object;
    a camera device, for recording facial image data of the sound source object in synchronization with the recording device;
    an ASR recognition device, for performing ASR recognition on the voice data to obtain an ASR recognition result, the ASR recognition result including several items of voice content and the corresponding voice recording time points;
    an image detection device, for performing mouth-opening feature recognition on the facial image data of the sound source object to obtain several frames of mouth-opening images and the image acquisition time point corresponding to each frame of mouth-opening image; and
    an effective voice extraction device, for checking, for each item of voice content, whether the image acquisition time point of a mouth-opening image falls within the preset time range before and after the corresponding voice recording time point, and if so, recording the corresponding voice content as effective voice.
  7. The device according to claim 6, characterized in that the voice recording time point corresponding to each item of voice content is: the time point at which recording of that voice content starts, a time point in the middle of recording that voice content, or the time point at which recording of that voice content ends.
  8. The device according to claim 6, characterized in that the camera device is specifically used for: detecting the face of the sound source object; focusing on the face so that the face region occupies a preset value of the camera device's lens; and acquiring the facial image data of the sound source object.
  9. The device according to claim 6, characterized in that the image detection device is specifically used for: locating the mouth-shape feature in the facial image data; and judging whether the ratio of the mouth-opening height to the lip height of the mouth shape is greater than or equal to a preset ratio, and if so, identifying the facial image data as a mouth-opening image; wherein the mouth-opening height is the distance between the lower edge of the upper lip and the upper edge of the lower lip, and the lip height is the distance between the upper edge of the upper lip and the lower edge of the lower lip.
  10. The device according to claim 6, characterized in that the preset time range before and after the voice recording time point corresponding to each item of voice content is 1 second before and after that voice recording time point.
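
For illustration only, the mouth-opening test of claims 4 and 9 and the time-window comparison of claims 1, 5, 6 and 10 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `Lips` record, the function names, the pixel coordinate convention and the default ratio of 0.3 are all hypothetical (the claims only speak of a "preset ratio"), and the 1-second window is taken from claims 5 and 10. Lip landmark detection and the ASR step themselves are assumed to happen elsewhere.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Lips:
    """Hypothetical vertical lip landmarks for one frame (pixels, y grows downward)."""
    upper_top: float     # upper edge of the upper lip
    upper_bottom: float  # lower edge of the upper lip
    lower_top: float     # upper edge of the lower lip
    lower_bottom: float  # lower edge of the lower lip


def is_mouth_open(lips: Lips, min_ratio: float = 0.3) -> bool:
    """Mouth-opening test: opening height / lip height >= preset ratio.

    min_ratio=0.3 is an assumed value chosen for this example only.
    """
    opening = lips.lower_top - lips.upper_bottom      # gap between the lips
    lip_height = lips.lower_bottom - lips.upper_top   # outer lip extent
    if lip_height <= 0:
        return False  # degenerate landmarks, treat as not open
    return opening / lip_height >= min_ratio


def effective_voice(segments, open_times, window: float = 1.0):
    """Keep voice content that has a mouth-opening frame within +/- window seconds.

    segments:   list of (text, voice_recording_time) pairs from the ASR result
    open_times: acquisition times (seconds) of frames judged to be mouth-opening
    """
    times = sorted(open_times)
    kept = []
    for text, t in segments:
        # First mouth-opening frame at or after the start of the window.
        i = bisect_left(times, t - window)
        if i < len(times) and times[i] <= t + window:
            kept.append(text)
    return kept


# Usage: two frames with capture times, two ASR items with recording times.
frames = [(0.5, Lips(100, 108, 120, 130)), (5.0, Lips(100, 110, 111, 121))]
open_times = [ts for ts, lips in frames if is_mouth_open(lips)]
segments = [("hello", 1.2), ("background noise", 5.1)]
result = effective_voice(segments, open_times)
```

In this usage example only the frame at 0.5 s passes the ratio test, so only the voice content recorded at 1.2 s has a mouth-opening frame inside its window. Sorting the frame times once and using a binary search keeps the comparison cheap when there are many frames; a plain linear scan would give the same result.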
CN201710573521.XA 2017-07-14 2017-07-14 A kind of efficient voice recognition methods and device Active CN107369449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710573521.XA CN107369449B (en) 2017-07-14 2017-07-14 A kind of efficient voice recognition methods and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710573521.XA CN107369449B (en) 2017-07-14 2017-07-14 A kind of efficient voice recognition methods and device

Publications (2)

Publication Number Publication Date
CN107369449A true CN107369449A (en) 2017-11-21
CN107369449B CN107369449B (en) 2019-11-26

Family

ID=60307217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710573521.XA Active CN107369449B (en) 2017-07-14 2017-07-14 A kind of efficient voice recognition methods and device

Country Status (1)

Country Link
CN (1) CN107369449B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154878A (en) * 2017-12-12 2018-06-12 北京小米移动软件有限公司 Control the method and device of monitoring device
CN109697976A (en) * 2018-12-14 2019-04-30 北京葡萄智学科技有限公司 A kind of pronunciation recognition methods and device
WO2019227552A1 (en) * 2018-06-01 2019-12-05 深圳市鹰硕技术有限公司 Behavior recognition-based speech positioning method and device
CN115250373A (en) * 2022-06-30 2022-10-28 北京随锐会见科技有限公司 Method for synchronously calibrating audio and video stream and related product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0254409B1 (en) * 1986-07-25 1991-10-30 Smiths Industries Public Limited Company Speech recognition apparatus and methods
CN102013103A (en) * 2010-12-03 2011-04-13 上海交通大学 Method for dynamically tracking lip in real time
CN202329640U (en) * 2011-08-19 2012-07-11 广东好帮手电子科技股份有限公司 System for applying auxiliary voice recognition technology by mouth shape in vehicular navigation
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN104409075A (en) * 2014-11-28 2015-03-11 深圳创维-Rgb电子有限公司 Voice identification method and system
CN106155707A (en) * 2015-03-23 2016-11-23 联想(北京)有限公司 Information processing method and electronic equipment

Also Published As

Publication number Publication date
CN107369449B (en) 2019-11-26

Similar Documents

Publication Publication Date Title
CN107369449A (en) A kind of efficient voice recognition methods and device
US7219062B2 (en) Speech activity detection using acoustic and facial characteristics in an automatic speech recognition system
CN104937519B (en) Device and method for controlling augmented reality equipment
CN105100356B (en) The method and system that a kind of volume automatically adjusts
CN106782585A (en) A kind of sound pick-up method and system based on microphone array
US11227638B2 (en) Method, system, medium, and smart device for cutting video using video content
CN106294774A (en) User individual data processing method based on dialogue service and device
CN109410957A (en) Positive human-computer interaction audio recognition method and system based on computer vision auxiliary
CN109754801A (en) A kind of voice interactive system and method based on gesture identification
CN109905764A (en) Target person voice intercept method and device in a kind of video
CN109558788B (en) Silence voice input identification method, computing device and computer readable medium
CN107545887A (en) Phonetic order processing method and processing device
CN109935226A (en) A kind of far field speech recognition enhancing system and method based on deep neural network
CN106157957A (en) Audio recognition method, device and subscriber equipment
WO2017219450A1 (en) Information processing method and device, and mobile terminal
CN107863098A (en) A kind of voice identification control method and device
CN106648760A (en) Terminal and method thereof for cleaning background application programs based on face recognition
Valbonesi et al. Multimodal signal analysis of prosody and hand motion: Temporal correlation of speech and gestures
CN105869636A (en) Speech recognition apparatus and method thereof, smart television set and control method thereof
US11819996B2 (en) Expression feedback method and smart robot
JPH1173297A (en) Recognition method using timely relation of multi-modal expression with voice and gesture
CN107274895A (en) A kind of speech recognition apparatus and method
Gebre et al. The gesturer is the speaker
CN108600511A (en) The control system and method for intelligent sound assistant's equipment
CN105227744B (en) The method and apparatus of recording call content in communication terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 402, Building 33 Guangshun Road, Changning District, Shanghai, 2003

Applicant after: SHANGHAI MROBOT TECHNOLOGY Co.,Ltd.

Address before: Room 402, Building 33 Guangshun Road, Changning District, Shanghai, 2003

Applicant before: SHANGHAI MUYE ROBOT TECHNOLOGY Co.,Ltd.

GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 200335 402 rooms, No. 33, No. 33, Guang Shun Road, Shanghai

Patentee after: Shanghai zhihuilin Medical Technology Co.,Ltd.

Address before: 200335 402 rooms, No. 33, No. 33, Guang Shun Road, Shanghai

Patentee before: Shanghai Zhihui Medical Technology Co.,Ltd.

Address after: 200335 402 rooms, No. 33, No. 33, Guang Shun Road, Shanghai

Patentee after: Shanghai Zhihui Medical Technology Co.,Ltd.

Address before: 200335 402 rooms, No. 33, No. 33, Guang Shun Road, Shanghai

Patentee before: SHANGHAI MROBOT TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20210812

Address after: 200335 Room 401, floor 4, building 2, No. 33, Guangshun Road, Changning District, Shanghai

Patentee after: Noah robot technology (Shanghai) Co.,Ltd.

Address before: 200335 402 rooms, No. 33, No. 33, Guang Shun Road, Shanghai

Patentee before: Shanghai zhihuilin Medical Technology Co.,Ltd.