WO2019206186A1 - Lip language recognition method and apparatus, augmented reality device, and storage medium - Google Patents
Lip language recognition method and apparatus, augmented reality device, and storage medium
- Publication number
- WO2019206186A1 (PCT application PCT/CN2019/084109)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- identified
- lip
- image
- language recognition
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G02—OPTICS
- G02B—OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
- G02B27/00—Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
- G02B27/01—Head-up displays
- G02B27/017—Head mounted
Definitions
- At least one embodiment of the present disclosure is directed to a lip language recognition method and apparatus thereof, an augmented reality device, and a storage medium.
- Augmented Reality (AR) technology is a new technology that integrates physical and virtual information in a real environment. It is characterized by applying virtual information to the real environment, blending physical objects and virtual information in the real environment into the same picture or space, to achieve a sensory experience that transcends reality.
- the existing virtual reality system mainly simulates a virtual three-dimensional world through a high-performance computing system with a central processing unit, and provides the user with sensory experiences of sight, hearing, and so on, so that the user feels immersed in the scene; human-computer interaction is also possible.
- At least one embodiment of the present disclosure provides a lip language recognition method, including: acquiring a face image sequence of an object to be identified; performing lip language recognition based on the face image sequence to determine semantic information of the speech content of the object to be identified corresponding to the lip motion in the face images; and using the semantic information for presentation.
- lip language recognition is performed based on the sequence of face images to determine semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face images, including: sending the sequence of face images to a server, and performing lip language recognition by the server to determine the semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face images.
- before the semantic information is used for presentation, the lip language recognition method further includes: receiving the semantic information sent by the server.
- the semantic information is semantic text information and/or semantic audio information.
- the method provided by at least one embodiment of the present disclosure further includes displaying the semantic information.
- Displaying the semantic information includes displaying the semantic text information within the field of view of a user wearing the augmented reality device, or playing the semantic audio information, according to the presentation mode instruction.
- acquiring the face image sequence of the object to be identified includes: acquiring an image sequence including the object to be identified; locating an orientation of the object to be identified; and determining, according to the located orientation of the object to be identified, a position of the face region of the object to be identified in each frame image in the image sequence, and extracting images of the face region of the object to be identified from the frame images to generate the face image sequence.
- positioning the orientation of the object to be identified includes: positioning an orientation of the object to be identified according to a voice signal sent when the object to be recognized speaks.
- the method further includes: saving the sequence of the face images.
- sending the sequence of face images to a server includes: transmitting the saved sequence of face images to the server when receiving a sending instruction.
- At least one embodiment of the present disclosure also provides a lip language recognition apparatus, including: a face image sequence acquisition unit, a transmission unit, and a reception unit.
- the face image sequence obtaining unit is configured to acquire the face image sequence of the object to be identified;
- the sending unit is configured to send the face image sequence to the server, and the server performs lip language recognition to determine the semantic information corresponding to the lip motion in the face images;
- the receiving unit is configured to receive the semantic information sent by the server.
- the lip language recognition apparatus provided in at least one embodiment of the present disclosure further includes: a display unit configured to display the semantic information.
- the display unit includes: a presentation mode instruction generation subunit configured to generate a presentation mode instruction, where the presentation mode instruction includes a display mode instruction and an audio mode instruction.
- the semantic information is semantic text information and/or semantic audio information
- the display unit further includes a display subunit and a play subunit.
- the display subunit is configured to display the semantic text information within the field of view of a user wearing the augmented reality device upon receiving the display mode instruction; and the playing subunit is configured to play the semantic audio information upon receiving the audio mode instruction.
- the face image sequence acquisition unit includes an image sequence acquisition subunit, a positioning subunit, and a face image sequence generation subunit;
- the image sequence acquisition subunit is configured to acquire an image sequence including the object to be identified;
- the positioning subunit is configured to locate the orientation of the object to be identified;
- the face image sequence generating subunit is configured to determine, according to the located orientation of the object to be identified, a position of the face region of the object to be identified in each frame image of the image sequence, and to extract images of the face region of the object to be identified from the frame images to generate the face image sequence.
- At least one embodiment of the present disclosure also provides a lip language recognition apparatus, comprising: a processor; and a machine readable storage medium storing one or more computer program modules; the one or more computer program modules are stored in the machine readable storage medium and configured to be executed by the processor, and include instructions for performing the lip language recognition method provided by any of the embodiments of the present disclosure.
- At least one embodiment of the present disclosure also provides an augmented reality device, including the lip language recognition device provided by any embodiment of the present disclosure.
- the augmented reality device provided in at least one embodiment of the present disclosure further includes an imaging device, a display device, or a playback device.
- the camera device is configured to collect an image of the object to be identified;
- the display device is configured to display the semantic information;
- the playback device is configured to play the semantic information.
- At least one embodiment of the present disclosure further provides a lip language recognition method, including: receiving a face image sequence of an object to be recognized transmitted by an augmented reality device; performing lip language recognition based on the face image sequence to determine semantic information of the speech content of the object to be recognized corresponding to the lip action in the face images; and sending the semantic information to the augmented reality device.
- At least one embodiment of the present disclosure also provides a storage medium that non-transitorily stores computer readable instructions that, when executed by a computer, can perform the lip language recognition method provided by any of the embodiments of the present disclosure.
- FIG. 1 is a flowchart of a lip language recognition method according to at least one embodiment of the present disclosure
- FIG. 2A is a flowchart of another lip language recognition method according to at least one embodiment of the present disclosure.
- FIG. 2B is a flowchart of still another lip language recognition method according to at least one embodiment of the present disclosure
- 2C is a system flowchart of a lip language recognition method according to at least one embodiment of the present disclosure
- FIG. 3A is a schematic block diagram of a lip language recognition apparatus according to at least one embodiment of the present disclosure
- FIG. 3B is a schematic block diagram of the display unit 304 shown in FIG. 3A;
- FIG. 3C is a schematic block diagram of the face image sequence obtaining unit 301 shown in FIG. 3A;
- FIG. 3D is a schematic block diagram of another lip language recognition device according to at least one embodiment of the present disclosure.
- FIG. 3E is a schematic block diagram of an augmented reality device according to at least one embodiment of the present disclosure.
- FIG. 3F is a schematic block diagram of an augmented reality device provided by at least one embodiment of the present disclosure.
- FIG. 4 is a schematic structural diagram of an augmented reality device according to at least one embodiment of the present disclosure.
- the AR device may be provided with an imaging device; the imaging device may capture real objects in the real environment in real time, calculate the position and angle of the physical object, and then apply corresponding image processing, thereby achieving fusion with the virtual information.
- At least one embodiment of the present disclosure provides a lip language recognition method, including: acquiring a face image sequence of an object to be recognized; performing lip language recognition based on the face image sequence to determine semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face images; and using the semantic information for presentation.
- At least one embodiment of the present disclosure also provides a lip language recognition device, an augmented reality device, and a storage medium corresponding to the lip language recognition method described above.
- the lip language recognition method provided by at least one embodiment of the present disclosure may, on the one hand, determine the speech content of an object to be identified, display the lip language of the object to be identified, and implement lip language translation of the object to be identified;
- on the other hand, the lip language recognition method can be implemented using components of an existing AR device without separately adding hardware, so that the functions of the AR device can be extended without increasing cost, further enhancing the user experience.
- At least one embodiment of the present disclosure provides a lip language recognition method, which can further expand the function of the augmented reality device and improve the user experience of the device.
- the lip language identification method can be used for an AR device or a VR (Virtual Reality) device, etc.; the embodiments of the present disclosure do not limit this.
- the lip recognition method can be implemented at least partially in software and loaded and executed by a processor in the AR device, or at least partially implemented in hardware or firmware, to extend the functionality of the augmented reality device and to enhance the device's user experience.
- FIG. 1 is a flowchart of a lip language recognition method according to at least one embodiment of the present disclosure. As shown in FIG. 1, the lip language recognition method includes steps S10 to S30. The steps S10 to S30 of the lip language recognition method and their respective exemplary implementations are respectively described below.
- Step S10 Acquire a sequence of face images of the object to be identified.
- Step S20 performing lip language recognition based on the face image sequence to determine semantic information corresponding to the lip motion in the face image.
- Step S30 The semantic information is used for presentation.
- an augmented reality AR device is a head-mounted wearable smart device that utilizes augmented reality technology to enhance a sensory experience that can transcend reality.
- AR devices combine technologies such as image display, image processing, multi-sensor fusion, and 3D modeling to be used in medical, gaming, network video communications, and exhibitions.
- Current AR devices typically include an imaging device (such as a camera), an optical projection device (a device consisting of optical elements such as various lenses, which can project an image into the field of view of the user wearing the AR device), and a sound collection device (such as a microphone), and leave room for functional expansion.
- the image pickup device may include, for example, a CMOS (Complementary Metal Oxide Semiconductor) sensor, a CCD (Charge Coupled Device) sensor, an infrared camera, or the like.
- the camera device can be placed in the plane in which the OLED display is located, such as on the bezel of the AR device.
- an image can be acquired using the imaging device in the AR device. After the user wears the AR device, the camera device can collect images within its field of view. If the user needs to communicate with other objects, for example, during a meeting or a conversation, the user usually faces the object being communicated with. At this time, the camera device can acquire an image of the communication object located within its field of view, and the image includes the image of the communication object.
- the object to be identified described above refers to an object in an image acquired by an image pickup device of the AR device.
- the object may be a person with whom the user communicates, a person appearing in a video, or the like; the embodiments of the present disclosure do not limit this.
- a plurality of frames of images continuously captured by the camera device may be combined into an image sequence. Since each image captured by the camera device includes the object to be identified, it also includes the area where the face of the object to be identified is located; the multiple frames of images of the area where the face is located may constitute the face image sequence of the object to be identified.
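As an illustration of how such a sequence can be assembled, the following is a minimal sketch using OpenCV; the camera index and frame count are assumptions for illustration, not values from this disclosure.

```python
# Minimal sketch: assembling an image sequence from consecutive
# camera frames with OpenCV. The camera index and the number of
# frames are illustrative assumptions.
import cv2

def capture_sequence(n_frames=25, camera_index=0):
    cap = cv2.VideoCapture(camera_index)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames  # consecutive frames forming the image sequence
```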
- a face image sequence acquisition unit may be provided, and the face image sequence of the object to be identified is acquired by the face image sequence acquisition unit; for example, the unit may be implemented by a central processing unit (CPU), an image processor (GPU), a tensor processor (TPU), a field programmable gate array (FPGA), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
- the processing unit may be a general purpose processor or a dedicated processor, and may be an X86 or ARM architecture based processor or the like.
- step S20 may be performed, for example, by a central processing unit (CPU), an image processor (GPU), or a field programmable gate array (FPGA) in the AR device, or by another form of processing unit with data processing capabilities and/or instruction execution capabilities.
- a sequence of face images can also be sent to a server. The server can be a local server, a server on a local area network, or a cloud server, so that the face image sequence can be processed by the server (for example, by a processing unit in the server) for lip language recognition to determine the semantic information of the speech content of the object to be recognized corresponding to the lip action in the face images.
- the face image sequence can be transmitted to the server via wireless communication methods such as Bluetooth or Wi-Fi.
- the server may perform lip language recognition according to the received sequence of face images.
- each frame in the face image sequence includes the area where the face of the object to be identified is located, and that area includes the person's lips. The server may use a face recognition algorithm to recognize the face from each frame of the face image sequence. Since the face image sequence consists of multiple consecutive frames, the lip shape change features of the object to be recognized (i.e., the person) when speaking can be further extracted from the recognized faces. The lip change features can then be input into a lip language recognition model to identify the corresponding pronunciations, and a sentence or phrase expressing the semantics composed of the pronunciations can further be determined from the recognized pronunciations.
- the sentence or phrase may be sent as the semantic information to the augmented reality device, and the augmented reality device may display the semantic information; the user wearing the AR device can then know the content or meaning of the speech of the object to be recognized from the displayed semantic information.
- the lip recognition model described above may be a network model based on deep learning, such as a Convolutional Neural Network (CNN) model or a Recurrent Neural Network (RNN) model. The network model is used to identify corresponding pronunciations according to the lip shape change features when the object to be recognized is speaking; the pronunciations are then matched in a database of preset correspondences between pronunciations and sentences or phrases, and a sentence or phrase expressed by the pronunciations is determined.
- the above semantic information does not necessarily cover all the pronunciations represented by the lip shape changes when the object to be recognized speaks; it may be sufficient to identify the key semantic information of the speech content of the object to be identified. Moreover, the sentence or phrase composed of the pronunciations may be the candidate determined to be most probable.
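The disclosure does not fix a specific network architecture; as one hedged illustration, the sketch below shows a CNN + RNN lip-reading classifier in PyTorch. All layer sizes, the vocabulary size, and the input shape are assumptions for illustration only, not parameters of the patented method.

```python
# Minimal sketch of a CNN + RNN lip-reading classifier (PyTorch).
# All layer sizes, the vocabulary size, and the input shape are
# illustrative assumptions, not values taken from the patent.
import torch
import torch.nn as nn

class LipReadingModel(nn.Module):
    def __init__(self, vocab_size=500, hidden=256):
        super().__init__()
        # Per-frame feature extractor over grayscale mouth crops.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (N*T, 64, 1, 1)
        )
        # Temporal model over the per-frame features.
        self.rnn = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, frames):  # frames: (N, T, 1, H, W)
        n, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (N*T, 64)
        feats = feats.view(n, t, -1)                       # (N, T, 64)
        out, _ = self.rnn(feats)
        return self.classifier(out[:, -1])  # word/phrase logits

# Example: a batch of 2 sequences of 25 frames of 64x64 mouth crops.
logits = LipReadingModel()(torch.randn(2, 25, 1, 64, 64))
```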
- a transmitting unit may be provided, and the face image sequence is transmitted to the server through the transmitting unit so that lip language recognition is performed by the server; for example, the transmitting unit may be implemented by a central processing unit (CPU), an image processor (GPU), a tensor processor (TPU), a field programmable gate array (FPGA), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
- a recognition unit may also be directly disposed in the AR device, and the recognition unit may perform the lip language recognition; for example, the recognition unit may be implemented by a central processing unit (CPU), an image processor (GPU), a tensor processor (TPU), a field programmable gate array (FPGA), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
- step S30 for example, after the speech content of the object to be recognized is determined based on the lip language recognition method, the lip language of the object to be recognized may be displayed, thereby implementing lip language translation of the object to be recognized.
- in this way, the components of an existing AR device can be utilized, and the functions of the AR device can be expanded without further increasing cost, thereby further improving the user experience.
- the algorithms and models for recognizing lip language require chip or hardware support with strong data processing capability and operation speed. Therefore, the lip language recognition algorithm and model described above may not be set on the AR device itself and may instead be processed, for example, by a server; this neither affects the portability of the AR device nor increases its hardware cost.
- alternatively, the processing unit in the AR device can itself implement the above lip language recognition algorithm and model without affecting the portability and hardware cost of the AR device, thereby improving the market competitiveness of the AR device; the embodiments of the present disclosure do not limit this. The following takes implementation of the lip language recognition method by a server as an example, but the embodiments of the present disclosure are not limited thereto.
- the semantic information may be semantic text information in text form or semantic audio information in audio form, or both semantic text information and semantic audio information.
- the lip language recognition method further includes displaying semantic information.
- the server may send the semantic text information and/or the semantic audio information to the AR device, and a presentation mode button or menu may be set on the AR device.
- the presentation mode may include a display mode and an audio mode; the user may select a presentation mode as needed, and a corresponding presentation mode instruction is generated after the selection.
- when the presentation mode instruction is a display mode instruction, the AR device displays the semantic text information within the field of view of the user wearing the augmented reality device; when the presentation mode instruction is an audio mode instruction, the augmented reality device plays the semantic audio information.
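A minimal sketch of this presentation-mode dispatch is shown below; the display() and play() helpers are hypothetical placeholders for the AR device's optical projection and audio playback, not a real device API.

```python
# Sketch of the presentation-mode dispatch described above. The
# display()/play() helpers are placeholders for the AR device's
# optical projection and audio playback, not a real device API.
def display(text):
    print("[field of view]", text)  # placeholder for optical projection

def play(audio):
    pass                            # placeholder for audio playback

def present(semantic_text, semantic_audio, mode_instruction):
    if mode_instruction == "display":
        display(semantic_text)      # display mode instruction
    elif mode_instruction == "audio":
        play(semantic_audio)        # audio mode instruction
```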
- a presentation unit can be provided, and the semantic information can be presented through the presentation unit; for example, the presentation unit may be implemented by a central processing unit (CPU), an image processor (GPU), a tensor processor (TPU), a field programmable gate array (FPGA), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
- the lip language recognition method can convert the recognized lip language of the object to be recognized into text or audio, realizing translation of the lip language, which can help people with special needs communicate better with others. For example, when a person with hearing impairment or an elderly person cannot hear the voice of another person, or finds it inconvenient to communicate with others, daily life becomes inconvenient; by wearing the AR device, the content of another person's speech can be converted into text to help with communication.
- for example, in a meeting, the voice of a speaker may be low, so that others cannot hear the speech clearly; or, in a large lecture hall, participants far away from the speaker cannot hear the speech clearly; or, when communicating in a noisy place, people cannot hear the content of a speaker's words clearly. In the above cases, a person who needs it can wear the AR device, convert the lip language of the speaker, as the object to be recognized, into text or audio, thereby implementing translation of the lip language and effectively improving the fluency of communication.
- FIG. 2A is a flowchart of acquiring a sequence of face images according to at least one embodiment of the present disclosure. That is, FIG. 2A is a flowchart of some examples of step S10 shown in FIG. 1. In some embodiments, as shown in FIG. 2A, the step of acquiring the face image of the object to be identified, which is described in step S10 above, includes steps S11 to S13.
- Step S11 Acquire an image sequence including the object to be identified.
- Step S12 Locating the orientation of the object to be identified.
- Step S13 determining, according to the located orientation of the object to be identified, the position of the face region of the object to be identified in each frame image in the image sequence, and intercepting the image of the face region of the object to be identified from each frame image to generate a face image sequence.
- step S12 may be performed first, and then step S11 is performed, that is, the orientation of the object to be identified is determined first, and then the image sequence of the object to be identified in the orientation is acquired.
- in this case, the face image sequence may be directly collected at the determined orientation.
- step S11 may be performed first, and then step S12 is performed, that is, the image sequence including the object to be identified is acquired first, and then the face image sequence of the object to be identified is accurately and quickly obtained according to the determined orientation of the object to be identified.
- a video of the object to be identified may be collected by the camera device of the AR device (the video is composed of consecutive multi-frame images), or the camera may continuously capture multiple images of the object to be recognized; the multi-frame images may constitute the image sequence. When each frame image includes the object to be identified, and thus the face region of the object to be identified, the image sequence can be directly used as the face image sequence.
- the image in the image sequence may be an original image directly acquired by the camera device, or may be an image obtained after pre-processing the original image, which is not limited in the embodiment of the present disclosure.
- the image pre-processing operation can eliminate extraneous information or noise information in the original image in order to better perform face detection on the acquired image.
- the image pre-processing operation may include image scaling, compression or format conversion, color gamut conversion, gamma correction, image enhancement, or noise reduction filtering on the acquired image.
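The following sketch illustrates such a pre-processing pipeline with OpenCV; the target size, gamma value, and blur kernel size are illustrative assumptions, not parameters from this disclosure.

```python
# Minimal sketch of the image pre-processing mentioned above, using
# OpenCV; the parameter values are illustrative assumptions.
import cv2
import numpy as np

def preprocess(frame, size=(256, 256), gamma=1.2):
    frame = cv2.resize(frame, size)                 # image scaling
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # color gamut conversion
    # Gamma correction via a lookup table.
    lut = np.array([((i / 255.0) ** (1.0 / gamma)) * 255
                    for i in range(256)], dtype=np.uint8)
    gray = cv2.LUT(gray, lut)
    gray = cv2.GaussianBlur(gray, (3, 3), 0)        # noise-reduction filtering
    return gray
```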
- alternatively, the part of each frame image containing the face area of the object to be recognized may be intercepted to generate the face image sequence. In this case, the face image sequence includes multiple frames of face images, and each frame face image is a partial image taken from the whole image of the object to be recognized, the partial image including the face region.
- the orientation of the object to be identified that is, the orientation of the face region of the object to be identified in the space in which the user wearing the AR device is located.
- the user wearing the AR device is in a conference room, and the object to be identified is located at a certain position in the conference room.
- the position of the object to be identified may be captured by the camera device of the AR device: taking the central axis of the field of view of the camera device as a reference, the angle between the position of the object to be identified and the central axis is taken as the orientation of the object to be identified, and the image of the face region of the object to be identified is then further positioned according to this orientation.
- for example, the user wearing the AR device faces the object to be recognized, and the angle between the object to be recognized and the central axis of the field of view of the camera device of the AR device is 30 degrees to the right; these 30 degrees are the orientation of the object to be identified. According to this orientation, it may be initially determined that the position of the object to be identified in the image is within a certain distance from the center of the image; face recognition may then be performed on that area, the face area is further located, and that part of the image is intercepted as the face image.
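Under a simple pinhole-camera assumption, the located azimuth can be mapped to a horizontal region of the frame as sketched below; the field-of-view value is an assumption, not a parameter from this disclosure.

```python
# Sketch: map the located azimuth of the object to a horizontal region
# of the frame, assuming a pinhole camera whose optical axis is the
# central axis of the field of view. The FOV value is an assumption.
import math

def azimuth_to_column(azimuth_deg, image_width, hfov_deg=90.0):
    """Return the approximate pixel column of an object seen at
    `azimuth_deg` from the camera's central axis (positive = right)."""
    f = (image_width / 2) / math.tan(math.radians(hfov_deg / 2))
    return int(image_width / 2 + f * math.tan(math.radians(azimuth_deg)))

# E.g., an object 30 degrees to the right in a 1280-pixel-wide frame:
col = azimuth_to_column(30.0, 1280)  # face detection can then be
                                     # restricted to a window around col
```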
- a large number (for example, 10,000 or more) of images including faces may be collected in advance as a sample library, and feature extraction is performed on the images in the sample library. Then, using the images in the sample library and the extracted feature points, a classification model is trained and tested by machine learning (such as deep learning, or a regression algorithm based on local features) to obtain a classification model for the user's face image.
- the classification model may also be implemented by other conventional algorithms in the art, such as a support vector machine (SVM), etc., which is not limited by the embodiments of the present disclosure.
- the machine learning algorithm can be implemented by using a conventional method in the art, and details are not described herein again.
- the input of the classification model is an acquired image, and the output is the image of the user's face, so that face recognition can be realized.
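As a stand-in for the trained classification model described above, the sketch below uses OpenCV's pre-trained Haar cascade to illustrate the detect-and-crop step; the detection parameters are assumptions, and the patent's own model would replace the cascade.

```python
# Stand-in for the trained face classifier described above, using
# OpenCV's pre-trained Haar cascade instead of a custom model.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_faces(gray_frame):
    # Detect faces and return the cropped face regions.
    faces = cascade.detectMultiScale(gray_frame, scaleFactor=1.1,
                                     minNeighbors=5, minSize=(60, 60))
    return [gray_frame[y:y + h, x:x + w] for (x, y, w, h) in faces]
```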
- an infrared sensor can be disposed on the AR device; the infrared sensor can sense the object to be identified and then locate its orientation. The infrared sensor may sense the orientations of multiple objects to be identified even if only one of them is speaking; for lip language recognition, only the face image of the speaking object needs to be recognized, and the other, non-speaking objects are not needed.
- the orientation of the object to be recognized can be located by means of sound localization, that is, according to a voice signal emitted when the object to be recognized speaks.
- a microphone array, that is, a set of multiple microphones, can be disposed on the AR device, and the position of a sound source can be located through the microphone array.
- the speech signal emitted by the object (person) to be recognized when speaking is also a sound source; accordingly, the orientation of the object that is speaking can be identified. If multiple objects to be recognized are speaking at the same time, the orientations of the multiple speaking objects can also be located. The above positioning does not require an accurate position of the object to be identified; locating the approximate orientation is sufficient.
- this makes the lip language recognition method feasible: the lip shape of an object that is not speaking is basically unchanged, so no semantic information is determined for the non-speaking objects to be recognized, and semantic information is determined only for the object that is speaking.
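Returning to the microphone-array localization above, a minimal sketch of direction estimation with a two-microphone array via the time difference of arrival (TDOA) is given below; the microphone spacing and sample rate are assumptions, and a real AR device would likely use a larger array.

```python
# Minimal sketch of sound-source direction estimation with a
# two-microphone array via the time difference of arrival (TDOA);
# mic spacing and sample rate are illustrative assumptions.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
MIC_SPACING = 0.1       # m, assumed distance between the two microphones
FS = 16000              # Hz, assumed sample rate

def estimate_azimuth(sig_left, sig_right):
    # Cross-correlate the two channels to find the sample lag.
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)
    tdoa = lag / FS
    # Clip to the physically possible range before taking arcsin.
    sin_theta = np.clip(tdoa * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))  # 0 deg = straight ahead
```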
- the user can choose to perform lip language recognition in real time, and the camera device of the AR device collects images of the object to be recognized in real time.
- the AR device obtains a sequence of face images, and sends the sequence of face images to the server in real time.
- the server returns the semantic information according to the lip language recognition, and the AR device displays the semantic information after receiving.
- the user can also choose not to perform lip language recognition in real time, as needed; the camera device of the AR device still collects images of the object to be recognized in real time.
- the face image sequence may be generated by parsing the video directly collected by the camera device (the video is composed of consecutive multi-frame images), or generated from the multiple frames of face images continuously captured by the camera device.
- the sequence of face images is saved.
- the sequence of face images can be saved in an AR device (eg, stored in a register of the AR device).
- a sending button or a menu may be set on the AR device, and the user may select a timing for performing lip language recognition on the saved face image sequence according to the need.
- the user operates the send button or menu to generate a sending instruction; according to the sending instruction, the AR device sends the saved face image sequence to the server, the server returns the semantic information obtained by lip language recognition, and the AR device receives and displays the semantic information.
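The save-then-send flow might look like the following sketch; the server URL, the JPEG upload format, and the response field name are hypothetical assumptions for illustration.

```python
# Sketch of the save-then-send flow: frames are buffered locally and
# uploaded only when a sending instruction arrives. The server URL
# and the "semantic_text" response field are assumptions.
import cv2
import requests

buffer = []  # the saved face image sequence

def save_frame(face_image):
    buffer.append(face_image)

def on_send_instruction(url="http://example-server/lip-recognition"):
    # Encode each frame as JPEG and upload the whole sequence.
    files = [("frames", cv2.imencode(".jpg", f)[1].tobytes())
             for f in buffer]
    resp = requests.post(url, files=files)
    buffer.clear()
    return resp.json().get("semantic_text")  # assumed response field
```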
- the above manner of non-real-time lip language recognition can be applied to scenarios in which the user wearing the AR device does not need real-time bidirectional communication with the object to be recognized.
- for example, the users at the venue have no hearing impairment and can normally hear the speech of the speaker or presenter; in this case, the AR device can be worn, the acquired face image sequence saved first, and then sent to the server for lip language recognition when needed.
- At least one embodiment of the present disclosure also provides a lip language recognition method, for example, the lip language recognition method is implemented by a server.
- the lip recognition method can be implemented at least in part in software and loaded and executed by a processor in the server, or at least partially implemented in hardware or firmware, to extend the functionality of the augmented reality device and enhance the user experience of the device.
- FIG. 2B is a flowchart of still another lip language recognition method according to at least one embodiment of the present disclosure.
- the lip language recognition method includes steps S100 to S300.
- the steps S100 to S300 of the lip language recognition method and their respective exemplary implementations are respectively described below.
- Step S100 Receive a sequence of face images of the object to be identified sent by the augmented reality device.
- the server receives a sequence of face images of an object to be identified, for example, transmitted by the AR device.
- Step S200 performing lip language recognition based on the sequence of face images to determine semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face image.
- lip language recognition can be performed by a processing unit in the server based on a sequence of face images.
- the specific implementation method of the lip language recognition may refer to the related description of step S20, and details are not described herein again.
- Step S300 Send semantic information to the augmented reality device.
- the semantic information is semantic text information and/or semantic audio information.
- the semantic information is sent by the server to, for example, an AR device such that the semantic information can be displayed or played on the AR device.
- FIG. 2C is a system flowchart of a lip recognition method according to at least one embodiment of the present disclosure.
- a lip recognition method provided by at least one embodiment of the present disclosure is systematically described below with reference to FIG. 2C.
- the orientation of the object to be identified can be located according to the infrared sensor or the microphone, and the face image can be acquired by the camera.
- the captured face images can be uploaded in real time for lip language recognition, or uploaded in non-real time. For non-real-time upload, the face image sequence can be saved to a register in the AR device and, upon a sending instruction, read out and sent to the server.
- based on the face image obtained at the located orientation, the position of the lips can be located in the face image, so that semantic information can be acquired by recognizing the lip actions.
- lip action matching can be performed on the server side, and text conversion or audio conversion can be performed on the semantic information corresponding to the lip action, to obtain semantic text information or semantic audio information, respectively.
- the semantic text information can be displayed on the AR device or converted to speech for playback; the semantic audio information can be played directly.
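For illustration, the server side of this flow could be exposed as an HTTP endpoint as sketched below with Flask; the route, field names, and the recognize()/synthesize() stubs are hypothetical placeholders for the lip recognition model and the text/audio conversion described above.

```python
# Sketch of the server side of the flow above, as a Flask endpoint.
# The route, field names, and the recognize()/synthesize() helpers
# are hypothetical placeholders, not an API defined by the patent.
from flask import Flask, request, jsonify

app = Flask(__name__)

def recognize(frames):
    # Placeholder for the lip-language recognition model.
    return "recognized speech content"

def synthesize(text):
    # Placeholder for text-to-audio conversion.
    return None

@app.route("/lip-recognition", methods=["POST"])
def lip_recognition():
    frames = request.files.getlist("frames")  # the face image sequence
    text = recognize(frames)                  # lip actions -> semantic text
    audio = synthesize(text)                  # semantic text -> audio
    return jsonify({"semantic_text": text, "semantic_audio": audio})
```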
- FIG. 3A is a schematic block diagram of a lip language recognition apparatus according to at least one embodiment of the present disclosure.
- the lip language recognition device 03 includes a face image sequence acquisition unit 301, a transmission unit 302, and a reception unit 303.
- the lip recognition device 03 further includes a display unit 304.
- the face image sequence acquisition unit 301 is configured to acquire a sequence of face images of the object to be identified.
- the face image sequence obtaining unit 301 can implement the step S10, and the specific implementation method can refer to the related description of step S10, and details are not described herein again.
- the sending unit 302 is configured to send the face image sequence to the server, and the lip language recognition by the server determines the semantic information corresponding to the lip motion in the face images.
- the face image sequence can be transmitted to the server via wireless communication methods such as Bluetooth or Wi-Fi.
- the sending unit 302 can implement the step S20, and the specific implementation method can refer to the related description of step S20, and details are not described herein again.
- the receiving unit 303 is configured to receive semantic information transmitted by the server; the presentation unit 304 is configured to present semantic information.
- the receiving unit 303 and the displaying unit 304 may implement the step S30, and the specific implementation method may refer to the related description of step S30, and details are not described herein again.
- the semantic information is semantic text information and/or semantic audio information.
- presentation unit 304 can include presentation mode instruction generation sub-unit 3041; in other examples, presentation unit 304 can also include display sub-unit 3042 and play sub-unit 3043.
- the presentation mode instruction generation sub-unit 3041 is configured to generate a presentation mode instruction.
- the presentation mode instructions include display mode instructions and audio mode instructions.
- the display sub-unit 3042 is configured to display the semantic text information within the field of view of the user wearing the augmented reality device upon receiving the display mode command.
- Play subunit 3043 is configured to play semantic audio information upon receiving an audio mode command.
- the face image sequence acquisition unit 301 includes an image sequence acquisition sub-unit 3011, a positioning sub-unit 3012, and a face image sequence generation sub-unit 3013.
- the image sequence acquisition sub-unit 3011 is configured to acquire an image sequence of the object to be identified.
- the positioning sub-unit 3012 is configured to locate the orientation of the object to be identified.
- the face image sequence generation sub-unit 3013 is configured to determine, according to the located orientation of the object to be identified, the position of the face region of the object to be identified in each frame image of the image sequence, and to intercept the images of the face region of the object to be identified from the frame images to generate the face image sequence.
- the AR-device-based lip language recognition apparatus provided by the embodiments of the present disclosure can determine the speech content of the object to be identified and display the lip language of the object to be identified, thereby implementing translation of the lip language of the object to be identified. Moreover, the components of an existing AR device can be utilized, and the functions of the AR device are expanded without further increasing cost, thereby further improving the user experience.
- the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, i.e., they may be located in one place or distributed over multiple network elements; the above units may be combined into one unit, or may be further split into multiple subunits.
- each unit in the apparatus of this embodiment may be implemented by means of software, or by software and hardware, and of course, by general hardware.
- the technical solutions provided by the embodiments of the present disclosure may, in essence, be embodied in the form of a software product. Taking a software implementation as an example, the corresponding computer program instructions in a non-volatile memory are read into memory and executed by the processor of the AR device to which the apparatus is applied.
- the lip language recognition device may include more or less circuits, and the connection relationship between the respective circuits is not limited, and may be determined according to actual needs.
- the specific configuration of each circuit is not limited, and may be composed of an analog device according to the circuit principle, a digital chip, or other suitable manner.
- FIG. 3D is a schematic block diagram of another lip language recognition apparatus according to at least one embodiment of the present disclosure.
- the lip recognition device 200 includes a processor 210, a machine readable storage medium 220, and one or more computer program modules 221.
- processor 210 is coupled to machine readable storage medium 220 via bus system 230.
- one or more computer program modules 221 are stored in machine readable storage medium 220.
- one or more computer program modules 221 include instructions for performing the lip language recognition method provided by any of the embodiments of the present disclosure.
- instructions in one or more computer program modules 221 can be executed by processor 210.
- the bus system 230 can be a conventional serial, parallel communication bus, etc., and embodiments of the present disclosure do not limit this.
- the processor 210 can be a central processing unit (CPU), an image processor (GPU), or another form of processing unit having data processing capabilities and/or instruction execution capabilities; it can be a general purpose processor or a dedicated processor, and can control other components in the lip recognition device 200 to perform the desired functions.
- Machine-readable storage medium 220 can include one or more computer program products, which can include various forms of computer-readable storage media, such as volatile memory and/or nonvolatile memory.
- the volatile memory may include, for example, random access memory (RAM) and/or cache or the like.
- the nonvolatile memory may include, for example, a read only memory (ROM), a hard disk, a flash memory, or the like.
- One or more computer program instructions can be stored on the computer readable storage medium, and the processor 210 can execute the program instructions to implement the functions in the embodiments of the present disclosure and/or other desired functions, such as the lip language recognition method.
- Various applications and various data such as a sequence of face images and various data used and/or generated by the application, etc., may also be stored in the computer readable storage medium.
- for clarity, the embodiment of the present disclosure does not show all the constituent elements of the lip recognition device 200.
- those skilled in the art can provide and set other constituent units not shown according to specific needs, which is not limited by the embodiments of the present disclosure.
- At least one embodiment of the present disclosure also provides an augmented reality device.
- FIGS. 3E to 4 are schematic block diagrams of augmented reality devices according to at least one embodiment of the present disclosure.
- the augmented reality device 1 includes the lip language recognition device 100/200 provided by any embodiment of the present disclosure; for the lip language recognition device 100/200, reference may be made to the related description of FIG. 3A to FIG. 3D, which is not repeated here.
- the augmented reality device 1 further includes an image pickup device, a display device, or a playback device.
- the camera device is used for collecting an image of the object to be identified, the display device is used for displaying the semantic text information, and the playback device is used for playing the semantic audio information.
- the playback device may be a speaker, a loudspeaker, or the like; the following takes a speaker as an example. The embodiments of the present disclosure do not limit this.
- the augmented reality device 1 can be worn over a person's eyes to implement the lip language recognition function for the object to be recognized as needed.
- the AR device 1 includes input/output (I/O) devices: an imaging device 101 (e.g., a camera for acquiring an image of the object to be recognized), a display device 102 (for displaying semantic text information), and a speaker 103 (for playing semantic audio information).
- the AR device 1 further includes a machine readable storage medium 104, a processor 105, a communication interface 106, and a bus 107.
- the camera 101, the display device 102, the speaker 103, the machine readable storage medium 104, the processor 105, and the communication interface 106 complete communication with each other via the bus 107.
- the processor 105 can perform the lip language recognition method described above by reading and executing machine executable instructions in the machine readable storage medium 104 corresponding to the control logic of the lip recognition method.
- the communication interface 106 is coupled to a communication device (not shown).
- the communication device can communicate with networks and other devices via wireless communication, such as the Internet, an intranet, and/or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and/or a metropolitan area network (MAN).
- Wireless communication can use any of a variety of communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Wi-Fi (e.g., based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g and/or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), Wi-MAX, protocols for email, instant messaging, and/or Short Message Service (SMS), or any other suitable communication protocol.
- the machine-readable storage medium 104 referred to in the embodiments of the present disclosure may be any electronic, magnetic, optical, or other physical storage device, and may contain or store information such as executable instructions, data, and the like.
- the machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), any type of storage disk (such as a CD or DVD), or a similar storage medium, or a combination thereof.
- the non-volatile medium 108 can be a non-volatile memory, a flash memory, a storage drive (such as a hard drive), any type of storage disk (such as a compact disc or DVD), or a similar non-volatile storage medium, or a combination thereof.
- for clarity, the embodiment of the present disclosure does not show all the constituent elements of the AR device 1.
- those skilled in the art can provide and set other component units not shown according to specific needs, which is not limited by the embodiments of the present disclosure.
- An embodiment of the present disclosure also provides a storage medium.
- the storage medium non-transitoryly stores computer readable instructions, and when the non-transitory computer readable instructions are executed by a computer (including a processor), the lip language recognition method provided by any of the embodiments of the present disclosure can be performed.
- the storage medium may be any combination of one or more computer readable storage media; for example, one computer readable storage medium contains computer readable program code for acquiring a face image sequence of an object to be identified, and another computer readable storage medium contains computer readable program code for presenting the semantic information.
- the computer can execute the program code stored in the computer storage medium to perform a lip language recognition method such as provided in any embodiment of the present disclosure.
- the storage medium may include a memory card of a smart phone, a storage unit of a tablet, a hard disk of a personal computer, a random access memory (RAM), a read only memory (ROM), an erasable programmable read only memory (EPROM), a portable compact disk read only memory (CD-ROM), a flash memory, any combination of the above storage media, or other suitable storage media.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Theoretical Computer Science (AREA)
- Oral & Maxillofacial Surgery (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- User Interface Of Digital Computer (AREA)
- Processing Or Creating Images (AREA)
Claims (19)
- A lip-language recognition method, comprising: acquiring a face image sequence of an object to be identified; performing lip-language recognition based on the face image sequence to determine semantic information of the speech content of the object to be identified corresponding to lip motions in the face images; and using the semantic information for presentation.
- The method according to claim 1, wherein performing lip-language recognition based on the face image sequence to determine the semantic information of the speech content of the object to be identified corresponding to the lip motions in the face images comprises: sending the face image sequence to a server, the server performing lip-language recognition to determine the semantic information of the speech content of the object to be identified corresponding to the lip motions in the face images.
- The method according to claim 2, wherein before the semantic information is used for presentation, the lip-language recognition method further comprises: receiving the semantic information sent by the server.
- The method according to any one of claims 1-3, wherein the semantic information is semantic text information and/or semantic audio information.
- The method according to claim 4, further comprising presenting the semantic information, wherein presenting the semantic information comprises: displaying, according to a presentation mode instruction, the semantic text information within the field of view of a user wearing an augmented reality device, or playing the semantic audio information.
- The method according to any one of claims 1-5, wherein acquiring the face image sequence of the object to be identified comprises: acquiring an image sequence including the object to be identified; locating the azimuth of the object to be identified; determining, according to the located azimuth, the position of the face region of the object to be identified in each frame of the image sequence; and cropping the images of the face region of the object to be identified from the frames to generate the face image sequence.
- The method according to claim 6, wherein locating the azimuth of the object to be identified comprises: locating the azimuth of the object to be identified according to the voice signal emitted when the object to be identified speaks.
- The method according to any one of claims 2-7, further comprising, after acquiring the face image sequence of the object to be identified: saving the face image sequence.
- The method according to claim 8, wherein sending the face image sequence to the server comprises: sending the saved face image sequence to the server upon receiving a sending instruction.
- A lip-language recognition apparatus, comprising: a face image sequence acquisition unit configured to acquire the face image sequence of the object to be identified; a sending unit configured to send the face image sequence to a server, the server performing lip-language recognition to determine the semantic information corresponding to the lip motions in the face images; and a receiving unit configured to receive the semantic information sent by the server.
- The lip-language recognition apparatus according to claim 10, further comprising: a presentation unit configured to present the semantic information.
- The lip-language recognition apparatus according to claim 11, wherein the presentation unit comprises: a presentation mode instruction generation sub-unit configured to generate a presentation mode instruction, the presentation mode instruction including a display mode instruction and an audio mode instruction.
- The lip-language recognition apparatus according to claim 12, wherein the semantic information is semantic text information and/or semantic audio information, and the presentation unit further comprises: a display sub-unit configured to display, upon receiving the display mode instruction, the semantic text information within the field of view of a user wearing an augmented reality device; and a playing sub-unit configured to play, upon receiving the audio mode instruction, the semantic audio information.
- The lip-language recognition apparatus according to any one of claims 10-13, wherein the face image sequence acquisition unit comprises: an image sequence acquisition sub-unit configured to acquire the image sequence of the object to be identified; a locating sub-unit configured to locate the azimuth of the object to be identified; and a face image sequence generation sub-unit configured to determine, according to the located azimuth, the position of the face region of the object to be identified in each frame of the image sequence, and to crop the images of the face region of the object to be identified from the frames to generate the face image sequence.
- A lip-language recognition apparatus, comprising: a processor; and a machine-readable storage medium storing one or more computer program modules, wherein the one or more computer program modules are stored in the machine-readable storage medium and configured to be executed by the processor, and the one or more computer program modules include instructions for performing the lip-language recognition method according to any one of claims 1-9.
- An augmented reality device, comprising the lip-language recognition apparatus according to any one of claims 10-15.
- The augmented reality device according to claim 16, further comprising an imaging device, a display device, or a playing device, wherein the imaging device is configured to capture images of the object to be identified, the display device is configured to display the semantic information, and the playing device is configured to play the semantic information.
- A lip-language recognition method, comprising: receiving a face image sequence, sent by an augmented reality device, of an object to be identified; performing lip-language recognition based on the face image sequence to determine semantic information of the speech content of the object to be identified corresponding to lip motions in the face images; and sending the semantic information to the augmented reality device.
- A storage medium non-transitorily storing computer-readable instructions which, when executed by a computer, can perform the lip-language recognition method according to any one of claims 1-9 or the lip-language recognition method according to claim 18.
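- As a minimal, non-authoritative sketch of the pipeline in claims 2, 6, and 7 above (it is not part of the claims): the azimuth of the speaking object is approximated here from a two-microphone voice signal by cross-correlation, the face image sequence is generated by cropping each frame around that azimuth, and the sequence is sent to a server that returns the semantic information. The function names, microphone spacing, field-of-view mapping, and the `post` transport stub are all assumptions of this sketch; the claims do not prescribe any particular algorithm or API.

```python
import numpy as np
from typing import Callable, List

SPEED_OF_SOUND_M_S = 343.0
MIC_SPACING_M = 0.14        # assumed spacing of the AR device's two microphones
SAMPLE_RATE_HZ = 16000

def estimate_azimuth(left: np.ndarray, right: np.ndarray) -> float:
    """Claim 7 (sketch): locate the speaker from the voice signal via the
    time difference of arrival between two microphones."""
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)   # inter-channel delay, samples
    tdoa = lag / SAMPLE_RATE_HZ                     # delay in seconds
    # clip to the physically possible range before taking arcsin
    ratio = np.clip(tdoa * SPEED_OF_SOUND_M_S / MIC_SPACING_M, -1.0, 1.0)
    return float(np.arcsin(ratio))                  # azimuth in radians

def crop_face_sequence(frames: List[np.ndarray], azimuth: float,
                       fov_rad: float, face_size: int = 112) -> List[np.ndarray]:
    """Claim 6 (sketch): map the located azimuth to a horizontal position in
    each frame and crop a fixed-size face region there."""
    face_images = []
    for frame in frames:                            # frame: H x W x 3 array
        h, w = frame.shape[:2]
        cx = int((0.5 + azimuth / fov_rad) * w)     # pixel column for the azimuth
        x0 = max(0, min(w - face_size, cx - face_size // 2))
        y0 = max(0, (h - face_size) // 2)           # crude vertical placement
        face_images.append(frame[y0:y0 + face_size, x0:x0 + face_size])
    return face_images

def recognize_via_server(face_images: List[np.ndarray],
                         post: Callable[[dict], str]) -> str:
    """Claims 2/3/18 (sketch): send the face image sequence to a server and
    receive the semantic information it determines by lip-language recognition."""
    return post({"face_image_sequence": [img.tolist() for img in face_images]})
```

- In practice, `post` could be wired to any transport (for example, an HTTP client on the AR device), and a face detector would normally refine the crude azimuth-based crop; the claims leave the transport, the face detector, and the recognition model unspecified.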
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/610,254 US11527242B2 (en) | 2018-04-26 | 2019-04-24 | Lip-language identification method and apparatus, and augmented reality (AR) device and storage medium which identifies an object based on an azimuth angle associated with the AR field of view |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810384886.2A CN108596107A (zh) | 2018-04-26 | 2018-04-26 | AR-device-based lip-language recognition method and apparatus, and AR device |
CN201810384886.2 | 2018-04-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019206186A1 true WO2019206186A1 (zh) | 2019-10-31 |
Family
ID=63609654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/084109 WO2019206186A1 (zh) | 2018-04-26 | 2019-04-24 | Lip-language recognition method and apparatus, augmented reality device, and storage medium |
Country Status (3)
Country | Link |
---|---|
US (1) | US11527242B2 (zh) |
CN (1) | CN108596107A (zh) |
WO (1) | WO2019206186A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111125437A (zh) * | 2019-12-24 | 2020-05-08 | 四川新网银行股份有限公司 | Method for recognizing lip-language pictures in a video |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108596107A (zh) | 2018-04-26 | 2018-09-28 | 京东方科技集团股份有限公司 | AR-device-based lip-language recognition method and apparatus, and AR device |
CN111063344B (zh) * | 2018-10-17 | 2022-06-28 | 青岛海信移动通信技术股份有限公司 | Speech recognition method, mobile terminal, and server |
WO2021076349A1 (en) * | 2019-10-18 | 2021-04-22 | Google Llc | End-to-end multi-speaker audio-visual automatic speech recognition |
CN111738100A (zh) * | 2020-06-01 | 2020-10-02 | 广东小天才科技有限公司 | Mouth-shape-based speech recognition method and terminal device |
CN111739534B (zh) * | 2020-06-04 | 2022-12-27 | 广东小天才科技有限公司 | Processing method and apparatus for assisting speech recognition, electronic device, and storage medium |
CN112672021B (zh) * | 2020-12-25 | 2022-05-17 | 维沃移动通信有限公司 | Language recognition method and apparatus, and electronic device |
CN114842846A (zh) * | 2022-04-21 | 2022-08-02 | 歌尔股份有限公司 | Control method and apparatus for a head-mounted device, and computer-readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102298443A (zh) * | 2011-06-24 | 2011-12-28 | 华南理工大学 | Smart home voice control system combined with a video channel and control method thereof |
CN107340859A (zh) * | 2017-06-14 | 2017-11-10 | 北京光年无限科技有限公司 | Multi-modal interaction method and system for a multi-modal virtual robot |
CN108227903A (zh) * | 2016-12-21 | 2018-06-29 | 深圳市掌网科技股份有限公司 | Virtual reality language interaction system and method |
CN108596107A (zh) * | 2018-04-26 | 2018-09-28 | 京东方科技集团股份有限公司 | AR-device-based lip-language recognition method and apparatus, and AR device |
Family Cites Families (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5680481A (en) * | 1992-05-26 | 1997-10-21 | Ricoh Corporation | Facial feature extraction method and apparatus for a neural network acoustic and visual speech recognition system |
US20030028380A1 (en) | 2000-02-02 | 2003-02-06 | Freeland Warwick Peter | Speech system |
US20040243416A1 (en) | 2003-06-02 | 2004-12-02 | Gardos Thomas R. | Speech recognition |
CN101101752B (zh) * | 2007-07-19 | 2010-12-01 | 华中科技大学 | Monosyllabic language lip-reading recognition system based on visual features |
US20100332229A1 (en) | 2009-06-30 | 2010-12-30 | Sony Corporation | Apparatus control based on visual lip share recognition |
KR101092820B1 (ko) * | 2009-09-22 | 2011-12-12 | 현대자동차주식회사 | Multimodal interface system integrating lip reading and speech recognition |
US8836638B2 (en) | 2010-09-25 | 2014-09-16 | Hewlett-Packard Development Company, L.P. | Silent speech based command to a computing device |
CN102004549B (zh) * | 2010-11-22 | 2012-05-09 | 北京理工大学 | Automatic lip-language recognition system suitable for Chinese |
US8743244B2 (en) * | 2011-03-21 | 2014-06-03 | HJ Laboratories, LLC | Providing augmented reality based on third party information |
JP5776255B2 (ja) * | 2011-03-25 | 2015-09-09 | ソニー株式会社 | Terminal device, object identification method, program, and object identification system |
KR101920020B1 (ko) | 2012-08-07 | 2019-02-11 | 삼성전자 주식회사 | Method for controlling terminal state switching and terminal supporting the same |
CN103853190A (zh) * | 2012-12-03 | 2014-06-11 | 联想(北京)有限公司 | Method for controlling an electronic device and electronic device |
US20140129207A1 (en) * | 2013-07-19 | 2014-05-08 | Apex Technology Ventures, LLC | Augmented Reality Language Translation |
US20150302651A1 (en) * | 2014-04-18 | 2015-10-22 | Sam Shpigelman | System and method for augmented or virtual reality entertainment experience |
CN104409075B (zh) | 2014-11-28 | 2018-09-04 | 深圳创维-Rgb电子有限公司 | Speech recognition method and system |
CN205430338U (zh) | 2016-03-11 | 2016-08-03 | 依法儿环球有限公司 | Smartphone or portable electronic communication device with a VR content acquisition component |
CN106529502B (zh) * | 2016-08-01 | 2019-09-24 | 深圳奥比中光科技有限公司 | Lip-language recognition method and apparatus |
US10817066B2 (en) | 2016-12-05 | 2020-10-27 | Google Llc | Information privacy in virtual reality |
WO2018107489A1 (zh) * | 2016-12-16 | 2018-06-21 | 深圳前海达闼云端智能科技有限公司 | Assistance method and apparatus for deaf-mute people, and electronic device |
JP2018109924A (ja) | 2017-01-06 | 2018-07-12 | ソニー株式会社 | Information processing device, information processing method, and program |
US10657361B2 (en) | 2017-01-18 | 2020-05-19 | International Business Machines Corporation | System to enforce privacy in images on an ad-hoc basis |
- 2018
  - 2018-04-26 CN CN201810384886.2A patent/CN108596107A/zh active Pending
- 2019
  - 2019-04-24 US US16/610,254 patent/US11527242B2/en active Active
  - 2019-04-24 WO PCT/CN2019/084109 patent/WO2019206186A1/zh active Application Filing
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111125437A (zh) * | 2019-12-24 | 2020-05-08 | 四川新网银行股份有限公司 | Method for recognizing lip-language pictures in a video |
CN111125437B (zh) * | 2019-12-24 | 2023-06-09 | 四川新网银行股份有限公司 | Method for recognizing lip-language pictures in a video |
Also Published As
Publication number | Publication date |
---|---|
US20200058302A1 (en) | 2020-02-20 |
CN108596107A (zh) | 2018-09-28 |
US11527242B2 (en) | 2022-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019206186A1 (zh) | Lip-language recognition method and apparatus, augmented reality device, and storage medium | |
US10136043B2 (en) | Speech and computer vision-based control | |
CN108933915B (zh) | Video conference device and video conference management method | |
US10971188B2 (en) | Apparatus and method for editing content | |
US11343446B2 (en) | Systems and methods for implementing personal camera that adapts to its surroundings, both co-located and remote | |
JP5456832B2 (ja) | Apparatus and method for determining the relevance of input speech | |
US10083710B2 (en) | Voice control system, voice control method, and computer readable medium | |
WO2019140161A1 (en) | Systems and methods for decomposing a video stream into face streams | |
US8411130B2 (en) | Apparatus and method of video conference to distinguish speaker from participants | |
US10887548B2 (en) | Scaling image of speaker's face based on distance of face and size of display | |
JP7100824B2 (ja) | Data processing device, data processing method, and program | |
JP7427408B2 (ja) | Information processing device, information processing method, and information processing program | |
US20200380959A1 (en) | Real time speech translating communication system | |
WO2021232875A1 (zh) | Method and apparatus for driving a digital human, and electronic device | |
CN114556469A (zh) | Data processing method and apparatus, electronic device, and storage medium | |
JP2015126451A (ja) | Image recording method, electronic device, and computer program | |
CN110673811B (zh) | Panoramic picture display method and apparatus based on sound information positioning, and storage medium | |
AU2013222959B2 (en) | Method and apparatus for processing information of image including a face | |
JP7400364B2 (ja) | Speech recognition system and information processing method | |
US8913142B2 (en) | Context aware input system for focus control | |
US20150116198A1 (en) | Device and method for displaying multimedia content | |
JP2022120164A (ja) | Speech recognition system, speech recognition method, and speech processing device | |
WO2019082648A1 (ja) | Electronic device, control device, control program, and operation method of electronic device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19792122; Country of ref document: EP; Kind code of ref document: A1
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 19792122; Country of ref document: EP; Kind code of ref document: A1
 | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14/05/2021)