WO2019206186A1 - Lip language recognition method and apparatus, augmented reality device, and storage medium - Google Patents

Lip language recognition method and apparatus, augmented reality device, and storage medium

Info

Publication number
WO2019206186A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
identified
lip
image
language recognition
Prior art date
Application number
PCT/CN2019/084109
Other languages
English (en)
French (fr)
Inventor
武乃福
马希通
寇立欣
冯莎
Original Assignee
京东方科技集团股份有限公司
Application filed by 京东方科技集团股份有限公司
Priority to US16/610,254 (US11527242B2)
Publication of WO2019206186A1

Classifications

    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06V 40/171: Local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • G06V 20/20: Scenes; scene-specific elements in augmented reality scenes
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 15/25: Speech recognition using non-acoustical features, using position of the lips, movement of the lips or face analysis
    • G10L 15/26: Speech to text systems
    • G02B 27/017: Head-up displays; head mounted

Definitions

  • At least one embodiment of the present disclosure is directed to a lip language recognition method and apparatus thereof, an augmented reality device, and a storage medium.
  • Augmented Reality (AR) technology is a new technology that fuses physical objects in a real environment with virtual information. It is characterized by applying virtual information to the real environment, blending the physical objects and the virtual information into the same picture or space, so as to achieve a sensory experience that transcends reality.
  • The existing virtual reality system mainly simulates a virtual three-dimensional world through a high-performance computing system with a central processing unit, and provides the user with sensory experiences of sight, hearing, and the like, so that the user feels immersed in the scene while human-computer interaction is also possible.
  • At least one embodiment of the present disclosure provides a lip language recognition method, including: acquiring a face image sequence of an object to be identified; performing lip language recognition based on the face image sequence to determine semantic information of the speech content of the object to be identified corresponding to the lip motion in the face image; and using the semantic information for presentation.
  • Performing lip language recognition based on the sequence of face images to determine semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face image includes: sending the sequence of face images to the server, and performing lip language recognition by the server to determine the semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face image.
  • Before the semantic information is used for presentation, the lip language recognition method further includes: receiving the semantic information sent by the server.
  • the semantic information is semantic text information and/or semantic audio information.
  • the method provided by at least one embodiment of the present disclosure further includes displaying the semantic information.
  • Displaying the semantic information includes: according to the presentation mode instruction, displaying the semantic text information within the field of view of a user wearing the augmented reality device, or playing the semantic audio information.
  • Acquiring the sequence of face images of the object to be identified includes: acquiring an image sequence including the object to be identified; locating an orientation of the object to be identified; determining, according to the located orientation of the object to be identified, a position of the face region of the object to be identified in each frame image in the image sequence; and cropping an image of the face region of the object to be identified from each frame image to generate the sequence of face images.
  • positioning the orientation of the object to be identified includes: positioning an orientation of the object to be identified according to a voice signal sent when the object to be recognized speaks.
  • the method further includes: saving the sequence of the face images.
  • sending the sequence of face images to a server includes: transmitting the saved sequence of face images to the server when receiving a sending instruction.
  • At least one embodiment of the present disclosure also provides a lip language recognition apparatus, including: a face image sequence acquisition unit, a transmission unit, and a reception unit.
  • the face image sequence obtaining unit is configured to acquire the face image sequence of the object to be identified;
  • the sending unit is configured to send the face image sequence to the server, and the server performs lip language recognition to determine the semantic information corresponding to the lip motion in the face image;
  • the receiving unit is configured to receive the semantic information sent by the server.
  • the lip language recognition apparatus provided in at least one embodiment of the present disclosure further includes: a display unit configured to display the semantic information.
  • the display unit includes: a presentation mode instruction generation subunit configured to generate a presentation mode instruction, where the presentation mode instruction includes a display mode instruction and an audio mode instruction.
  • the semantic information is semantic text information and/or semantic audio information.
  • the display unit further includes a display subunit and a playing subunit.
  • The display subunit is configured to display the semantic text information within the field of view of a user wearing the augmented reality device when the display mode instruction is received; the playing subunit is configured to play the semantic audio information when the audio mode instruction is received.
  • the face image sequence acquisition unit includes an image sequence acquisition subunit, a positioning subunit, and a face image sequence generation subunit.
  • The image sequence acquisition subunit is configured to acquire an image sequence including the object to be identified;
  • the positioning subunit is configured to locate the orientation of the object to be identified;
  • the face image sequence generation subunit is configured to determine, according to the located orientation of the object to be identified, the position of the face region of the object to be identified in each frame image of the image sequence, and to crop the image of the face region of the object to be identified from each frame image to generate the face image sequence.
  • At least one embodiment of the present disclosure also provides a lip language recognition apparatus comprising: a processor; a machine readable storage medium storing one or more computer program modules; the one or more computer program modules being stored in the The machine readable storage medium is configured to be executed by the processor, the one or more computer program modules comprising instructions for performing a lip language recognition method provided by any of the embodiments of the present disclosure.
  • At least one embodiment of the present disclosure also provides an augmented reality device, including the lip language recognition device provided by any embodiment of the present disclosure.
  • the augmented reality device provided in at least one embodiment of the present disclosure further includes an imaging device, a display device, or a playback device.
  • the camera device is configured to collect an image of the object to be identified;
  • the display device is configured to display the semantic information;
  • the playback device is configured to play the semantic information.
  • At least one embodiment of the present disclosure further provides a lip language recognition method, including: receiving a face image sequence of an object to be recognized transmitted by an augmented reality device; performing lip language recognition based on the face image sequence to determine semantic information of the speech content of the object to be identified corresponding to the lip motion in the face image; and sending the semantic information to the augmented reality device.
  • At least one embodiment of the present disclosure also provides a storage medium that non-transitorily stores computer readable instructions which, when executed by a computer, can perform the lip language recognition method provided by any of the embodiments of the present disclosure.
  • FIG. 1 is a flowchart of a lip language recognition method according to at least one embodiment of the present disclosure
  • FIG. 2A is a flowchart of another lip language recognition method according to at least one embodiment of the present disclosure.
  • FIG. 2B is a flowchart of still another lip language recognition method according to at least one embodiment of the present disclosure
  • FIG. 2C is a system flowchart of a lip language recognition method according to at least one embodiment of the present disclosure;
  • FIG. 3A is a schematic block diagram of a lip language recognition apparatus according to at least one embodiment of the present disclosure
  • FIG. 3B is a schematic block diagram of the display unit 304 shown in FIG. 3A;
  • FIG. 3C is a schematic block diagram of the face image sequence obtaining unit 301 shown in FIG. 3A;
  • FIG. 3D is a schematic block diagram of another lip language recognition device according to at least one embodiment of the present disclosure.
  • FIG. 3E is a schematic block diagram of an augmented reality device according to at least one embodiment of the present disclosure.
  • FIG. 3F is a schematic block diagram of an augmented reality device provided by at least one embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of an augmented reality device according to at least one embodiment of the present disclosure.
  • For example, the AR device may be provided with an imaging device; the imaging device may collect physical objects in the real environment in real time, calculate the position and angle of the physical objects, and perform corresponding image processing, thereby achieving fusion with the virtual information.
  • At least one embodiment of the present disclosure provides a lip language recognition method, including: acquiring a face image sequence of an object to be recognized; performing lip language recognition based on the face image sequence to determine semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face image; and using the semantic information for presentation.
  • At least one embodiment of the present disclosure also provides a lip language recognition device, an augmented reality device, and a storage medium corresponding to the lip language recognition method described above.
  • The lip language recognition method provided by at least one embodiment of the present disclosure may, on the one hand, determine the speech content of an object to be identified and present the lip language of the object to be identified, thereby implementing lip language translation of the object to be identified.
  • On the other hand, the lip language recognition method can be implemented using components of an existing AR device without separately adding hardware, so that the functions of the AR device can be extended without increasing cost, further enhancing the user experience.
  • At least one embodiment of the present disclosure provides a lip language recognition method, which can further expand the function of the augmented reality device and improve the user experience of the device.
  • the lip language identification method can be used for an AR device or a VR (Virtual Reality, VR for short) device, etc., and the embodiment of the present disclosure does not limit this.
  • For example, the lip language recognition method can be implemented at least partially in software and loaded and executed by a processor in the AR device, or at least partially implemented in hardware or firmware, so as to extend the functionality of the augmented reality device and enhance the user experience of the device.
  • FIG. 1 is a flowchart of a lip language recognition method according to at least one embodiment of the present disclosure. As shown in FIG. 1, the lip language recognition method includes steps S10 to S30. The steps S10 to S30 of the lip language recognition method and their respective exemplary implementations are respectively described below.
  • Step S10 Acquire a sequence of face images of the object to be identified.
  • Step S20 Perform lip language recognition based on the face image sequence to determine the semantic information corresponding to the lip motion in the face image.
  • Step S30 The semantic information is used for presentation.
  • For example, an augmented reality (AR) device is a head-mounted wearable smart device that uses augmented reality technology to provide a sensory experience that transcends reality.
  • AR devices combine technologies such as image display, image processing, multi-sensor fusion, and 3D modeling to be used in medical, gaming, network video communications, and exhibitions.
  • Current AR devices typically include an imaging device (such as a camera), an optical projection device (a device consisting of optical elements such as various lenses, which can project an image into the field of view of the user wearing the AR device), a sound collection device (such as a speaker or microphone), and the like, and thus have room for functional expansion.
  • the image pickup device may include, for example, a CMOS (Complementary Metal Oxide Semiconductor) sensor, a CCD (Charge Coupled Device) sensor, an infrared camera, or the like.
  • the camera device can be placed in the plane in which the OLED display is located, such as on the bezel of the AR device.
  • For example, an image can be acquired using the imaging device in the AR device. After the user wears the AR device, the camera device can collect images within its field of view. If the user needs to communicate with other objects, for example during a meeting or when the user talks with other objects, the user usually faces the object to be communicated with. At this time, the camera device can acquire an image of the communication partner located within its field of view, and the image includes the image of that communication partner.
  • the object to be identified described above refers to an object in an image acquired by an image pickup device of the AR device.
  • the object may be a person with whom it communicates, a person who is in a video, or the like, and the embodiment of the present disclosure does not limit this.
  • a plurality of frames of images continuously captured by the camera device may be combined into an image sequence. Since the image captured by the camera device includes the object to be identified, the area of the face of the object to be identified may also be included, and the multi-frame image of the area where the face of the object to be recognized is located may be the face image sequence of the object to be identified.
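  • As a purely illustrative sketch of how such an image sequence might be assembled, the Python snippet below buffers consecutive frames from a camera using OpenCV; it assumes OpenCV is available and stands in for the AR device's own imaging pipeline, which the disclosure does not specify.

```python
import cv2  # OpenCV, assumed available

def capture_image_sequence(num_frames=30, camera_index=0):
    """Collect a short sequence of consecutive frames from a camera.

    The returned list of frames plays the role of the 'image sequence'
    described above; on a real AR device the frames would come from the
    device's own imaging hardware rather than cv2.VideoCapture.
    """
    cap = cv2.VideoCapture(camera_index)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break  # camera unavailable or stream ended
        frames.append(frame)
    cap.release()
    return frames

# Example: grab roughly one second of video at 30 fps
image_sequence = capture_image_sequence(num_frames=30)
```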
  • For example, a face image sequence acquisition unit may be provided, and the face image sequence of the object to be identified is acquired by the face image sequence acquisition unit; for example, the face image sequence acquisition unit may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field programmable gate array (FPGA), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
  • the processing unit may be a general purpose processor or a dedicated processor, and may be an X86 or ARM architecture based processor or the like.
  • Step S20 may be performed, for example, by a central processing unit (CPU), a graphics processing unit (GPU), a field programmable gate array (FPGA) in the AR device, or another form of processing unit with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
  • a sequence of face images can also be sent to the server.
  • The server can be a local server, a server on a local area network, or a cloud server, so that the face image sequence can be processed by the server (for example, by a processing unit in the server) for lip language recognition to determine the semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face image.
  • the face image sequence can be transmitted to the server via wireless communication methods such as Bluetooth or Wi-Fi.
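  • A minimal sketch of one way such an upload could look is given below; it is not the transport defined by the disclosure. It assumes the Python `requests` library is available, and the server address and response format are placeholders.

```python
import cv2
import requests  # assumed available; the endpoint below is hypothetical

SERVER_URL = "http://192.168.1.10:8000/lip_recognition"  # placeholder address

def send_face_image_sequence(face_frames):
    """JPEG-encode each face image and upload the sequence in one request.

    Any transport (Bluetooth, Wi-Fi, etc.) could be used; plain HTTP over
    Wi-Fi is shown only because it is easy to illustrate.
    """
    files = []
    for i, frame in enumerate(face_frames):
        ok, buf = cv2.imencode(".jpg", frame)
        if not ok:
            continue  # skip frames that fail to encode
        files.append(("frames", (f"frame_{i:04d}.jpg", buf.tobytes(), "image/jpeg")))
    response = requests.post(SERVER_URL, files=files, timeout=10)
    return response.json()  # e.g. {"semantic_text": "..."} returned by the server
```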
  • the server may perform lip language recognition according to the received sequence of face images.
  • For example, each frame of face image in the face image sequence includes the region where the face of the object to be identified is located, and the region where the face is located includes the person's lips; the server may use a face recognition algorithm to recognize the face from each frame of face image. Because the face image sequence contains multiple consecutive frames, the lip shape change features of the object to be recognized (i.e., the person) when speaking can be further extracted from the recognized faces; the lip shape change features can then be input into the lip language recognition model to identify the corresponding pronunciations, and a sentence or phrase expressing the semantics composed of the pronunciations can further be determined from the recognized pronunciations.
  • The sentence or phrase may be sent as the semantic information to the augmented reality device, and the augmented reality device may display the semantic information; the user wearing the AR device can then know, from the displayed semantic information, the content or meaning of what the object to be recognized said.
  • For example, the lip language recognition model described above may be a network model based on deep learning, such as a Convolutional Neural Network (CNN) model or a Recurrent Neural Network (RNN) model. The network model is used to identify the corresponding pronunciations according to the lip shape change characteristics of the object to be recognized when speaking, and each pronunciation is matched against a database of preset correspondences between pronunciations and sentences or phrases to determine a sentence or phrase that the pronunciations can express.
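  • As an illustration only, the following PyTorch sketch shows the general shape of a CNN-plus-RNN network that maps a sequence of lip-region frames to per-time-step scores over pronunciation units; the layer sizes, vocabulary size, and input resolution are assumptions and this is not the model of the present disclosure.

```python
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    """Toy CNN + RNN lip-reading network (illustrative only).

    Input:  a batch of lip-region frame sequences, shape (B, T, 1, 64, 64).
    Output: per-time-step logits over a vocabulary of pronunciation units.
    """
    def __init__(self, num_units=40, hidden=256):
        super().__init__()
        self.frame_encoder = nn.Sequential(            # per-frame 2D CNN features
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),      # -> (B*T, 64*4*4)
        )
        self.rnn = nn.GRU(64 * 4 * 4, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_units)

    def forward(self, frames):
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1))  # merge batch and time
        feats = feats.view(b, t, -1)
        seq, _ = self.rnn(feats)                          # temporal modelling of lip motion
        return self.classifier(seq)                       # (B, T, num_units)

# Example: 2 sequences of 25 grayscale 64x64 lip crops
logits = LipReadingNet()(torch.randn(2, 25, 1, 64, 64))
```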
  • It should be noted that the above semantic information does not necessarily identify all the pronunciations represented by the lip shape changes when the object to be recognized speaks; the key semantic information of the speech content of the object to be identified may be identified instead.
  • For example, the sentence or phrase composed of the pronunciations may be the sentence or phrase that is determined to be the most likely.
  • For example, a sending unit may be provided, and the face image sequence is transmitted to the server through the sending unit so that the server performs lip language recognition; for example, the sending unit may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field programmable gate array (FPGA), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
  • For example, a recognition unit may also be provided directly in the AR device, and the recognition unit may perform the lip language recognition; for example, the recognition unit may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field programmable gate array (FPGA), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
  • step S30 for example, after the speech content of the object to be recognized is determined based on the lip language recognition method, the lip language of the object to be recognized may be displayed, thereby implementing lip language translation of the object to be recognized.
  • In this way, the components of an existing AR device can be utilized, and the functions of the AR device can be expanded without further increasing the cost, thereby further improving the user experience.
  • It should be noted that the algorithms and models for recognizing lip language require chip or hardware support with strong data processing capability and computation speed; therefore, the above lip language recognition algorithms and models may not be deployed on the AR device but instead processed, for example, by a server, which neither affects the portability of the AR device nor increases the hardware cost of the AR device.
  • Alternatively, the processing unit in the AR device can also implement the above lip language recognition algorithms and models without affecting the portability and hardware cost of the AR device, thereby improving the market competitiveness of the AR device; the embodiments of the present disclosure do not limit this. The following description takes implementation of the lip language recognition method by a server as an example, but the embodiments of the present disclosure are not limited thereto.
  • the semantic information may be semantic text information in text form or semantic audio information in audio form, or both semantic text information and semantic audio information.
  • the lip language recognition method further includes displaying semantic information.
  • For example, the server may send the semantic text information and/or the semantic audio information to the AR device, and a presentation mode button or menu may be provided on the AR device.
  • The presentation mode may include a display mode and an audio mode; the user may select a presentation mode as needed, and a corresponding presentation mode instruction is generated after the selection.
  • When the presentation mode instruction is a display mode instruction, the AR device displays the semantic text information within the field of view of the user wearing the augmented reality device according to the instruction; when the presentation mode instruction is an audio mode instruction, the augmented reality device plays the semantic audio information.
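  • A small sketch of such a presentation-mode dispatch is shown below. It assumes the offline text-to-speech library `pyttsx3` is installed (any TTS engine could be substituted), and the `renderer` callable is a hypothetical stand-in for whatever API the AR device uses to draw text into the wearer's field of view.

```python
import pyttsx3  # offline text-to-speech; one possible choice, assumed installed

def present_semantic_information(semantic_text, mode="display", renderer=None):
    """Dispatch the recognized semantic text according to a presentation mode.

    `renderer` stands in for whatever the AR device uses to draw text into
    the wearer's field of view; it is a hypothetical callable here.
    """
    if mode == "display" and renderer is not None:
        renderer(semantic_text)          # e.g. overlay the text as a caption
    elif mode == "audio":
        engine = pyttsx3.init()
        engine.say(semantic_text)        # synthesize and queue the utterance
        engine.runAndWait()              # block until playback finishes
    else:
        print(semantic_text)             # fallback when no AR output is available
```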
  • For example, a presentation unit can be provided, and the semantic information can be presented through the presentation unit; for example, the presentation unit may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field programmable gate array (FPGA), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
  • The lip language recognition method can convert the recognized lip language of the object to be recognized into text or audio, realizing translation of the lip language, which can help people with special needs communicate better with others. For example, when a person with hearing impairment or an elderly person cannot hear the voices of others, or finds it inconvenient to communicate with others, this brings inconvenience to their life; by wearing the AR device, the content spoken by others can be converted into text, helping them communicate with others.
  • For another example, the voices of participants in a meeting may be low, and others may not hear the speaker's speech clearly; in a large lecture hall, participants far away from the speaker cannot hear the speaker's speech clearly; or, when communicating in a place with loud noise, people cannot clearly hear the content of each other's speech. In the above cases, the person in need can wear the AR device, convert the lip language of the speaker, as the object to be recognized, into text or audio, realize translation of the lip language, and effectively improve the fluency of communication.
  • FIG. 2A is a flowchart of acquiring a sequence of face images according to at least one embodiment of the present disclosure. That is, FIG. 2A is a flowchart of some examples of step S10 shown in FIG. 1. In some embodiments, as shown in FIG. 2A, the step of acquiring the face image of the object to be identified, which is described in step S10 above, includes steps S11 to S13.
  • Step S11 Acquire an image sequence including the object to be identified.
  • Step S12 Locate the orientation of the object to be identified.
  • Step S13 Determine, according to the located orientation of the object to be identified, the position of the face region of the object to be identified in each frame image in the image sequence, and crop the image of the face region of the object to be identified from each frame image to generate the face image sequence.
  • step S12 may be performed first, and then step S11 is performed, that is, the orientation of the object to be identified is determined first, and then the image sequence of the object to be identified in the orientation is acquired.
  • the sequence of the face image may be directly collected.
  • step S11 may be performed first, and then step S12 is performed, that is, the image sequence including the object to be identified is acquired first, and then the face image sequence of the object to be identified is accurately and quickly obtained according to the determined orientation of the object to be identified.
  • For example, a video of the object to be identified may be collected by the camera device of the AR device (the video is composed of consecutive multi-frame images), or the camera may continuously capture multiple images of the object to be recognized; the multi-frame images may constitute the image sequence. Each frame image includes the object to be identified and also includes the face region of the object to be identified, and the image sequence can be directly used as the face image sequence.
  • the image in the image sequence may be an original image directly acquired by the camera device, or may be an image obtained after pre-processing the original image, which is not limited in the embodiment of the present disclosure.
  • the image pre-processing operation can eliminate extraneous information or noise information in the original image in order to better perform face detection on the acquired image.
  • the image pre-processing operation may include image scaling, compression or format conversion, color gamut conversion, gamma correction, image enhancement, or noise reduction filtering on the acquired image.
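  • Purely as an illustration of the operations listed above, the snippet below applies a generic pre-processing chain with OpenCV and NumPy; the specific operations and parameter values are assumptions and would depend on the camera and the recognition model actually used.

```python
import cv2
import numpy as np

def preprocess_frame(frame, size=(256, 256), gamma=1.2):
    """Example pre-processing: resize, grayscale, gamma correction, denoising."""
    img = cv2.resize(frame, size)                       # image scaling
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # color space conversion
    table = ((np.arange(256) / 255.0) ** (1.0 / gamma) * 255).astype("uint8")
    corrected = cv2.LUT(gray, table)                    # gamma correction
    denoised = cv2.GaussianBlur(corrected, (3, 3), 0)   # noise-reduction filtering
    return denoised
```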
  • In other examples, the part of each frame image in which the face region of the object to be recognized is located may be cropped to generate a face image sequence; the face image sequence includes multiple frames of face images, and each frame of face image is a partial image cropped from the whole image of the object to be recognized, the partial image including the face region.
  • The orientation of the object to be identified is the orientation of the face region of the object to be identified in the space in which the user wearing the AR device is located.
  • the user wearing the AR device is in a conference room, and the object to be identified is located at a certain position in the conference room.
  • For example, the orientation of the object to be identified may be determined with the central axis of the field of view of the camera device of the AR device as a reference: the angle between the position of the object to be identified and the central axis is taken as the orientation of the object to be identified, and the image of the face region of the object to be identified is then further located according to that orientation.
  • For example, the user wearing the AR device faces the object to be recognized, and the angle between the object to be recognized and the central axis of the field of view of the camera device of the AR device is 30 degrees to the right; these 30 degrees are the orientation of the object to be identified. According to this orientation, it can be initially determined that the position of the object to be identified in the image is within a certain distance from the center of the image; face recognition can then be performed on that area, the face region is further located, and that part of the image is cropped as the face image.
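  • One possible sketch of this step is shown below: the angular offset from the optical axis is mapped to an approximate horizontal band of the image, a face detector is run only inside that band, and the face is cropped. The 60-degree horizontal field of view and the use of OpenCV's stock Haar cascade are assumptions for illustration, not the detector of the disclosure.

```python
import cv2

# Haar cascade shipped with OpenCV; the 60-degree field of view is an assumption
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
HORIZONTAL_FOV_DEG = 60.0

def crop_face_by_orientation(frame, azimuth_deg):
    """Use the located orientation to narrow the search, then detect and crop the face."""
    h, w = frame.shape[:2]
    # Map the angle from the optical axis to an approximate image column.
    cx = int(w * (0.5 + azimuth_deg / HORIZONTAL_FOV_DEG))
    left, right = max(0, cx - w // 4), min(w, cx + w // 4)   # search band
    band = frame[:, left:right]
    gray = cv2.cvtColor(band, cv2.COLOR_BGR2GRAY)
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, fw, fh = faces[0]
    return frame[y:y + fh, left + x:left + x + fw]           # cropped face image
```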
  • For example, a large number (for example, 10,000 or more) of images including faces may be collected in advance as a sample library, and feature extraction is performed on the images in the sample library. Then, using the images in the sample library and the extracted feature points, a classification model is trained and tested by machine learning (such as deep learning, or a regression algorithm based on local features) to obtain a classification model of the user's face image.
  • the classification model may also be implemented by other conventional algorithms in the art, such as a support vector machine (SVM), etc., which is not limited by the embodiments of the present disclosure.
  • SVM support vector machine
  • the machine learning algorithm can be implemented by using a conventional method in the art, and details are not described herein again.
  • For example, the input of the classification model is an acquired image, and the output is the image of the user's face, so that face recognition can be realized.
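  • As a small, hedged example of the kind of classical training mentioned above (an SVM alternative to a deep model), the sketch below trains a face / non-face classifier on HOG features; it assumes scikit-image and scikit-learn are available, and the `images` and `labels` arrays stand in for the sample library, which is not provided here.

```python
from skimage.feature import hog                 # assumed available (scikit-image)
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def hog_features(images):
    """Compute HOG descriptors for a list of equal-sized grayscale images."""
    return [hog(im, pixels_per_cell=(8, 8), cells_per_block=(2, 2)) for im in images]

def train_face_classifier(images, labels):
    """Train a simple face / non-face SVM on a pre-collected sample library.

    `images` and `labels` stand for the sample library described above; real
    systems would use far more data and often a deep-learning model instead.
    """
    X = hog_features(images)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
    clf = SVC(kernel="rbf")
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    return clf
```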
  • an infrared sensor can be disposed on the AR device, and the infrared sensor can sense the object to be identified, and then locate the orientation of the object to be identified.
  • For example, the infrared sensor can sense the orientations of multiple objects to be identified; however, if only one of the objects to be recognized is speaking, then for lip language recognition only the face image of the object to be recognized that is speaking needs to be recognized, and the other objects to be recognized that are not speaking are not needed.
  • the orientation of the object to be recognized can be located by means of sound localization, that is, according to a voice signal emitted when the object to be recognized speaks.
  • For example, a microphone array can be disposed on the AR device; a microphone array is a set of multiple microphones, and the position of a sound source can be located through the microphone array.
  • The voice signal emitted when the object (person) to be recognized speaks is also a sound source; accordingly, the orientation of the object to be recognized that is speaking can be located. If multiple objects to be recognized are speaking at the same time, the orientations of the multiple speaking objects can also be located. The above positioning does not require accurate localization of the exact position of the object to be identified; it is sufficient to locate the approximate orientation.
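  • To illustrate the principle of such sound-source localization, the sketch below estimates a rough direction of arrival from two microphone channels via cross-correlation; the microphone spacing, sample rate, and two-microphone setup are illustrative assumptions, and a real microphone array would typically use more channels and a more robust method such as GCC-PHAT.

```python
import numpy as np

def estimate_azimuth(sig_left, sig_right, fs=16000, mic_distance=0.1, c=343.0):
    """Rough direction-of-arrival estimate from two microphone signals.

    The delay between the channels is found by cross-correlation and converted
    to an angle; this only illustrates the time-difference-of-arrival idea.
    """
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)      # delay in samples
    tdoa = lag / fs                                   # delay in seconds
    # Clamp to the physically possible range before taking arcsin.
    ratio = np.clip(tdoa * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(ratio))               # 0 degrees = straight ahead
```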
  • This makes the lip language recognition method feasible. The lip shape of an object to be recognized that is not speaking is basically unchanged; therefore, semantic information is not determined for objects to be recognized that are not speaking, and only the semantic information of the object to be recognized that is speaking is determined.
  • For example, the user can choose to perform lip language recognition in real time, and the camera device of the AR device collects images of the object to be recognized in real time.
  • the AR device obtains a sequence of face images, and sends the sequence of face images to the server in real time.
  • the server returns the semantic information according to the lip language recognition, and the AR device displays the semantic information after receiving.
  • For example, the user can also choose not to perform lip language recognition in real time as needed, while the camera device of the AR device still collects images of the object to be recognized in real time.
  • For example, the face image sequence may be generated by parsing the video directly collected by the camera device (the video is composed of consecutive multi-frame images), or generated from multiple frames of face images captured by the camera device using a snapshot method.
  • the sequence of face images is saved.
  • the sequence of face images can be saved in an AR device (eg, stored in a register of the AR device).
  • a sending button or a menu may be set on the AR device, and the user may select a timing for performing lip language recognition on the saved face image sequence according to the need.
  • For example, the user operates the send button or menu to generate a sending instruction; according to the sending instruction, the AR device sends the saved face image sequence to the server, the server returns the semantic information according to the lip language recognition, and the AR device receives the semantic information and displays it.
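  • A minimal sketch of this save-then-send behaviour is given below; the buffer class and its method names are hypothetical, and `send_fn` stands for an upload helper such as the `send_face_image_sequence` sketch shown earlier, or any equivalent transport.

```python
class FaceSequenceBuffer:
    """Save face images locally and upload them only on an explicit instruction."""

    def __init__(self, send_fn):
        self._frames = []
        self._send_fn = send_fn

    def save(self, face_image):
        self._frames.append(face_image)          # corresponds to saving to the register

    def on_send_instruction(self):
        if not self._frames:
            return None
        result = self._send_fn(self._frames)     # upload the saved sequence to the server
        self._frames.clear()                     # start a fresh recording afterwards
        return result                            # semantic information returned by the server
```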
  • For example, the above manner of not performing real-time lip language recognition can be applied to a scene in which the user wearing the AR device does not need real-time bidirectional communication with the object to be recognized.
  • For example, the users at the venue do not have hearing impairments and can normally hear the speech of the speaker or presenter; in this case, the AR device can still be worn, the acquired face image sequence is saved first, and it is then sent to the server for lip language recognition when needed.
  • At least one embodiment of the present disclosure also provides a lip language recognition method, for example, the lip language recognition method is implemented by a server.
  • For example, the lip language recognition method can be implemented at least partially in software and loaded and executed by a processor in the server, or at least partially implemented in hardware or firmware, so as to extend the functionality of the augmented reality device and enhance the user experience of the device.
  • FIG. 2B is a flowchart of still another lip language recognition method according to at least one embodiment of the present disclosure.
  • the lip language recognition method includes steps S100 to S300.
  • the steps S100 to S300 of the lip language recognition method and their respective exemplary implementations are respectively described below.
  • Step S100 Receive a sequence of face images of the object to be identified sent by the augmented reality device.
  • the server receives a sequence of face images of an object to be identified, for example, transmitted by the AR device.
  • Step S200 Perform lip language recognition based on the face image sequence to determine the semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face image.
  • lip language recognition can be performed by a processing unit in the server based on a sequence of face images.
  • the specific implementation method of the lip language recognition may refer to the related description of step S20, and details are not described herein again.
  • Step S300 Send semantic information to the augmented reality device.
  • the semantic information is semantic text information and/or semantic audio information.
  • the semantic information is sent by the server to, for example, an AR device such that the semantic information can be displayed or played on the AR device.
  • FIG. 2C is a system flowchart of a lip recognition method according to at least one embodiment of the present disclosure.
  • a lip recognition method provided by at least one embodiment of the present disclosure is systematically described below with reference to FIG. 2C.
  • the orientation of the object to be identified can be located according to the infrared sensor or the microphone, and the face image can be acquired by the camera.
  • the captured face image can be uploaded in real time for lip recognition, or can be uploaded in non-real time.
  • For example, in the case of non-real-time upload, the face image sequence can be saved to a register in the AR device and read out according to a sending instruction, so that the face image sequence is sent to the server.
  • For example, based on the face image located at the located orientation, the position of the lips can be located in the face image, so that the semantic information can be acquired by recognizing the action of the lips.
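  • One common way to locate the lip position, shown here only as an assumption-laden sketch rather than the method of the disclosure, is facial landmark detection with dlib; it requires dlib and its standard 68-point landmark model file, in which the mouth corresponds to points 48 through 67.

```python
import dlib  # assumed available, together with its 68-point landmark model file

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def locate_lip_region(gray_face_image):
    """Locate the lip area inside a face image using facial landmarks.

    In dlib's 68-point scheme the mouth corresponds to points 48-67; the
    bounding box of those points gives the lip region whose frame-to-frame
    changes the lip language recognition operates on.
    """
    faces = detector(gray_face_image)
    if not faces:
        return None
    shape = predictor(gray_face_image, faces[0])
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    return min(xs), min(ys), max(xs), max(ys)     # lip bounding box (x1, y1, x2, y2)
```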
  • lip action matching can be performed on the server side to perform text conversion or audio conversion on the semantic information corresponding to the lip action to respectively obtain semantic text information or semantic audio information.
  • the semantic text information can be displayed on the AR device or played for voice; the semantic audio information can be played back.
  • FIG. 3A is a schematic block diagram of a lip language recognition apparatus according to at least one embodiment of the present disclosure.
  • the lip language recognition device 03 includes a face image sequence acquisition unit 301, a transmission unit 302, and a reception unit 303.
  • the lip recognition device 03 further includes a display unit 304.
  • the face image sequence acquisition unit 301 is configured to acquire a sequence of face images of the object to be identified.
  • the face image sequence obtaining unit 301 can implement the step S10, and the specific implementation method can refer to the related description of step S10, and details are not described herein again.
  • For example, the sending unit 302 is configured to send the face image sequence to the server, and the server performs lip language recognition to determine the semantic information corresponding to the lip motion in the face image.
  • the face image sequence can be transmitted to the server via wireless communication methods such as Bluetooth or Wi-Fi.
  • the sending unit 302 can implement the step S20, and the specific implementation method can refer to the related description of step S20, and details are not described herein again.
  • the receiving unit 303 is configured to receive semantic information transmitted by the server; the presentation unit 304 is configured to present semantic information.
  • the receiving unit 303 and the displaying unit 304 may implement the step S30, and the specific implementation method may refer to the related description of step S30, and details are not described herein again.
  • the semantic information is semantic text information and/or semantic audio information.
  • presentation unit 304 can include presentation mode instruction generation sub-unit 3041; in other examples, presentation unit 304 can also include display sub-unit 3042 and play sub-unit 3043.
  • the presentation mode instruction generation sub-unit 3041 is configured to generate a presentation mode instruction.
  • the presentation mode instructions include display mode instructions and audio mode instructions.
  • the display sub-unit 3042 is configured to display the semantic text information within the field of view of the user wearing the augmented reality device upon receiving the display mode command.
  • Play subunit 3043 is configured to play semantic audio information upon receiving an audio mode command.
  • the face image sequence acquisition unit 301 includes an image sequence acquisition sub-unit 3011, a positioning sub-unit 3012, and a face image sequence generation sub-unit 3013.
  • the image sequence acquisition sub-unit 3011 is configured to acquire an image sequence of the object to be identified.
  • the positioning sub-unit 3012 is configured to locate the orientation of the object to be identified.
  • The face image sequence generation sub-unit 3013 is configured to determine, according to the located orientation of the object to be identified, the position of the face region of the object to be identified in each frame image of the image sequence, and to crop the image of the face region of the object to be identified from each frame image to generate the face image sequence.
  • On the one hand, the AR-device-based lip language recognition device provided by the embodiments of the present disclosure can determine the speech content of the object to be identified, present the lip language of the object to be identified, and realize translation of the lip language of the object to be identified; on the other hand, the components of an existing AR device can be utilized, and the functions of the AR device are expanded without further increasing the cost, thereby further improving the user experience.
  • It should be noted that the device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network elements; the above units may be combined into one unit, or may be further split into a plurality of subunits.
  • each unit in the apparatus of this embodiment may be implemented by means of software, or by software and hardware, and of course, by general hardware.
  • The technical solutions provided by the embodiments of the present disclosure may essentially be embodied in the form of a software product. Taking a software implementation as an example, the corresponding computer program instructions in a non-volatile memory are read into memory and executed by the processor included in the AR device to which the device is applied.
  • It should be noted that the lip language recognition device may include more or fewer circuits, and the connection relationship between the respective circuits is not limited and may be determined according to actual needs.
  • the specific configuration of each circuit is not limited, and may be composed of an analog device according to the circuit principle, a digital chip, or other suitable manner.
  • FIG. 3D is a schematic block diagram of another lip language recognition apparatus according to at least one embodiment of the present disclosure.
  • the lip recognition device 200 includes a processor 210, a machine readable storage medium 220, and one or more computer program modules 221.
  • processor 210 is coupled to machine readable storage medium 220 via bus system 230.
  • one or more computer program modules 221 are stored in machine readable storage medium 220.
  • one or more computer program modules 221 include instructions for performing the lip language recognition method provided by any of the embodiments of the present disclosure.
  • instructions in one or more computer program modules 221 can be executed by processor 210.
  • the bus system 230 can be a conventional serial, parallel communication bus, etc., and embodiments of the present disclosure do not limit this.
  • For example, the processor 210 can be a central processing unit (CPU), a graphics processing unit (GPU), or another form of processing unit having data processing capabilities and/or instruction execution capabilities; it can be a general purpose processor or a dedicated processor, and can control other components in the lip language recognition device 200 to perform the desired functions.
  • Machine-readable storage medium 220 can include one or more computer program products, which can include various forms of computer-readable storage media, such as volatile memory and/or nonvolatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache or the like.
  • the nonvolatile memory may include, for example, a read only memory (ROM), a hard disk, a flash memory, or the like.
  • One or more computer program instructions can be stored on the computer readable storage medium, and the processor 210 can execute the program instructions to implement the functions of the embodiments of the present disclosure (as implemented by the processor 210) and/or other desired functions, for example, the lip language recognition method and the like.
  • Various applications and various data such as a sequence of face images and various data used and/or generated by the application, etc., may also be stored in the computer readable storage medium.
  • the embodiment of the present disclosure does not give all the constituent elements of the lip recognition device 200.
  • those skilled in the art can provide and set other constituent units not shown according to specific needs, which is not limited by the embodiments of the present disclosure.
  • At least one embodiment of the present disclosure also provides an augmented reality device.
  • FIGS. 3E to 4 are schematic diagrams of an augmented reality device according to at least one embodiment of the present disclosure.
  • The augmented reality device 1 includes the lip language recognition device 100/200 provided by any embodiment of the present disclosure; for the lip language recognition device 100/200, reference may be made to the related description of FIG. 3A to FIG. 3D, which is not repeated here.
  • the augmented reality device 1 further includes an image pickup device, a display device, or a playback device.
  • For example, the camera is used to collect an image of the object to be identified, the display device is used to display the semantic text information, and the playback device is used to play the semantic audio information.
  • For example, the playback device may be a speaker, a loudspeaker, or the like; the following takes the playback device being a speaker as an example, and the embodiments of the present disclosure do not limit this.
  • For example, the augmented reality device 1 can be worn over a person's eyes to implement the lip language recognition function for the object to be recognized as needed.
  • For example, the AR device 1 includes an imaging device 101 (e.g., a camera, for acquiring an image of the object to be recognized), a display device 102 (for displaying the semantic text information), and a speaker 103 (for playing the semantic audio information), that is, input/output (I/O) devices.
  • the AR device 1 further includes a machine readable storage medium 104, a processor 105, a communication interface 106, and a bus 107.
  • the camera 101, the display device 102, the speaker 103, the machine readable storage medium 104, the processor 105, and the communication interface 106 complete communication with each other via the bus 107.
  • the processor 105 can perform the lip language recognition method described above by reading and executing machine executable instructions in the machine readable storage medium 104 corresponding to the control logic of the lip recognition method.
  • the communication interface 106 is coupled to a communication device (not shown).
  • The communication device can communicate with the network and other devices via wireless communication, such as the Internet, an intranet, and/or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and/or a metropolitan area network (MAN).
  • The wireless communication can use any of a variety of communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Wi-Fi (e.g., based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), Internet Protocol-based voice transmission (VoIP), Wi-MAX, protocols for email, instant messaging, and/or short message service (SMS), or any other suitable communication protocol.
  • the machine-readable storage medium 104 referred to in the embodiments of the present disclosure may be any electronic, magnetic, optical, or other physical storage device, and may contain or store information such as executable instructions, data, and the like.
  • For example, the machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), any type of storage disk (such as a compact disc, DVD, etc.), or a similar storage medium, or a combination thereof.
  • The non-volatile medium 108 can be a non-volatile memory, a flash memory, a storage drive (such as a hard disk drive), any type of storage disk (such as a compact disc, DVD, etc.), or a similar non-volatile storage medium, or a combination thereof.
  • the embodiment of the present disclosure does not give all the constituent elements of the AR device 1.
  • those skilled in the art can provide and set other component units not shown according to specific needs, which is not limited by the embodiments of the present disclosure.
  • An embodiment of the present disclosure also provides a storage medium.
  • the storage medium non-transitoryly stores computer readable instructions, and when the non-transitory computer readable instructions are executed by a computer (including a processor), the lip language recognition method provided by any of the embodiments of the present disclosure can be performed.
  • For example, the storage medium may be any combination of one or more computer readable storage media; for example, one computer readable storage medium contains computer readable program code for acquiring the face image sequence of the object to be identified, and another computer readable storage medium contains computer readable program code for presenting the semantic information.
  • the computer can execute the program code stored in the computer storage medium to perform a lip language recognition method such as provided in any embodiment of the present disclosure.
  • For example, the storage medium may include a memory card of a smart phone, a storage unit of a tablet computer, a hard disk of a personal computer, a random access memory (RAM), a read only memory (ROM), an erasable programmable read only memory (EPROM), a portable compact disk read only memory (CD-ROM), a flash memory, any combination of the above storage media, or other suitable storage media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • User Interface Of Digital Computer (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A lip language recognition method and apparatus, an augmented reality device, and a storage medium. The lip language recognition method includes: acquiring a face image sequence of an object to be recognized; performing lip language recognition based on the face image sequence to determine semantic information of the speech content of the object to be recognized corresponding to the lip motion in the face image; and using the semantic information for presentation. The lip language recognition method can convert the recognized lip language of the object to be recognized into text or audio, realizing translation of the lip language; in addition, it can utilize components of an existing AR device without separately adding hardware, so that the functions of the AR device are extended without increasing cost, further improving the user experience.

Description

Lip language recognition method and apparatus, augmented reality device, and storage medium
This application claims priority to Chinese Patent Application No. 201810384886.2 filed on April 26, 2018, the entire disclosure of which is incorporated herein by reference as a part of this application.
Technical Field
At least one embodiment of the present disclosure relates to a lip language recognition method and apparatus thereof, an augmented reality device, and a storage medium.
Background Art
Augmented Reality (AR) technology is a new technology that fuses physical objects in a real environment with virtual information. It is characterized by applying virtual information to the real environment, so that physical objects and virtual information in the real environment are blended into the same picture or space, thereby achieving a sensory experience that transcends reality.
Existing virtual reality systems mainly simulate a virtual three-dimensional world through a high-performance computing system with a central processing unit, and provide the user with sensory experiences such as sight and hearing, so that the user feels immersed in the scene while human-computer interaction is also possible.
Summary
At least one embodiment of the present disclosure provides a lip-language recognition method, including: acquiring a facial image sequence of an object to be identified; performing lip-language recognition based on the facial image sequence to determine semantic information of the speech content of the object to be identified corresponding to lip movements in the facial images; and using the semantic information for presentation.
For example, in a method provided by at least one embodiment of the present disclosure, performing lip-language recognition based on the facial image sequence to determine the semantic information of the speech content of the object to be identified corresponding to the lip movements in the facial images includes: sending the facial image sequence to a server, and performing, by the server, lip-language recognition to determine the semantic information of the speech content of the object to be identified corresponding to the lip movements in the facial images.
For example, in a method provided by at least one embodiment of the present disclosure, before the semantic information is used for presentation, the lip-language recognition method further includes: receiving the semantic information sent by the server.
For example, in a method provided by at least one embodiment of the present disclosure, the semantic information is semantic text information and/or semantic audio information.
For example, a method provided by at least one embodiment of the present disclosure further includes presenting the semantic information. Presenting the semantic information includes: displaying, according to a presentation mode instruction, the semantic text information within the field of view of a user wearing an augmented reality device, or playing the semantic audio information.
For example, in a method provided by at least one embodiment of the present disclosure, acquiring the facial image sequence of the object to be identified includes: acquiring an image sequence including the object to be identified; locating an azimuth of the object to be identified; and determining, according to the located azimuth of the object to be identified, the position of the face region of the object to be identified in each frame of the image sequence, and cropping the image of the face region of the object to be identified from each frame to generate the facial image sequence.
For example, in a method provided by at least one embodiment of the present disclosure, locating the azimuth of the object to be identified includes: locating the azimuth of the object to be identified according to a voice signal emitted by the object to be identified when speaking.
For example, in a method provided by at least one embodiment of the present disclosure, after the facial image sequence of the object to be identified is acquired, the method further includes: saving the facial image sequence.
For example, in a method provided by at least one embodiment of the present disclosure, sending the facial image sequence to the server includes: sending the saved facial image sequence to the server upon receiving a sending instruction.
At least one embodiment of the present disclosure further provides a lip-language recognition apparatus, including: a facial image sequence acquisition unit, a sending unit, and a receiving unit. The facial image sequence acquisition unit is configured to acquire the facial image sequence of the object to be identified; the sending unit is configured to send the facial image sequence to a server, which performs lip-language recognition to determine semantic information corresponding to lip movements in the facial images; and the receiving unit is configured to receive the semantic information sent by the server.
For example, a lip-language recognition apparatus provided by at least one embodiment of the present disclosure further includes: a presentation unit configured to present the semantic information.
For example, in a lip-language recognition apparatus provided by at least one embodiment of the present disclosure, the presentation unit includes: a presentation mode instruction generation sub-unit configured to generate a presentation mode instruction, where the presentation mode instruction includes a display mode instruction and an audio mode instruction.
For example, in a lip-language recognition apparatus provided by at least one embodiment of the present disclosure, the semantic information is semantic text information and/or semantic audio information, and the presentation unit further includes a display sub-unit and a playback sub-unit. The display sub-unit is configured to display the semantic text information within the field of view of a user wearing an augmented reality device upon receiving the display mode instruction; the playback sub-unit is configured to play the semantic audio information upon receiving the audio mode instruction.
For example, in a lip-language recognition apparatus provided by at least one embodiment of the present disclosure, the facial image sequence acquisition unit includes an image sequence acquisition sub-unit, a locating sub-unit, and a facial image sequence generation sub-unit. The image sequence acquisition sub-unit is configured to acquire the image sequence of the object to be identified; the locating sub-unit is configured to locate the azimuth of the object to be identified; and the facial image sequence generation sub-unit is configured to determine, according to the located azimuth of the object to be identified, the position of the face region of the object to be identified in each frame of the image sequence, and crop the image of the face region of the object to be identified from each frame to generate the facial image sequence.
At least one embodiment of the present disclosure further provides a lip-language recognition apparatus, including: a processor; and a machine-readable storage medium storing one or more computer program modules. The one or more computer program modules are stored in the machine-readable storage medium and configured to be executed by the processor, and the one or more computer program modules include instructions for performing the lip-language recognition method provided by any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides an augmented reality device including the lip-language recognition apparatus provided by any embodiment of the present disclosure.
For example, an augmented reality device provided by at least one embodiment of the present disclosure further includes a camera device, a display device, or a playback device. The camera device is configured to capture images of the object to be identified; the display device is configured to display the semantic information; and the playback device is configured to play the semantic information.
At least one embodiment of the present disclosure further provides a lip-language recognition method, including: receiving a facial image sequence of an object to be identified sent by an augmented reality device; performing lip-language recognition based on the facial image sequence to determine semantic information of the speech content of the object to be identified corresponding to lip movements in the facial images; and sending the semantic information to the augmented reality device.
At least one embodiment of the present disclosure further provides a storage medium that non-transitorily stores computer-readable instructions, and when the non-transitory computer-readable instructions are executed by a computer, the lip-language recognition method provided by any embodiment of the present disclosure can be performed.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings of the embodiments are briefly introduced below. Obviously, the drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure.
FIG. 1 is a flowchart of a lip-language recognition method provided by at least one embodiment of the present disclosure;
FIG. 2A is a flowchart of another lip-language recognition method provided by at least one embodiment of the present disclosure;
FIG. 2B is a flowchart of still another lip-language recognition method provided by at least one embodiment of the present disclosure;
FIG. 2C is a system flowchart of a lip-language recognition method provided by at least one embodiment of the present disclosure;
FIG. 3A is a schematic block diagram of a lip-language recognition apparatus provided by at least one embodiment of the present disclosure;
FIG. 3B is a schematic block diagram of the presentation unit 304 shown in FIG. 3A;
FIG. 3C is a schematic block diagram of the facial image sequence acquisition unit 301 shown in FIG. 3A;
FIG. 3D is a schematic block diagram of another lip-language recognition apparatus provided by at least one embodiment of the present disclosure;
FIG. 3E is a schematic block diagram of an augmented reality device provided by at least one embodiment of the present disclosure;
FIG. 3F is a schematic block diagram of an augmented reality device provided by at least one embodiment of the present disclosure; and
FIG. 4 is a schematic structural diagram of an augmented reality device provided by at least one embodiment of the present disclosure.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure are described clearly and completely below with reference to the drawings of the embodiments of the present disclosure. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the described embodiments of the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
Unless otherwise defined, the technical or scientific terms used in the present disclosure shall have the usual meanings understood by a person with ordinary skill in the field to which the present disclosure belongs. The words "first", "second", and the like used in the present disclosure do not denote any order, quantity, or importance, but are only used to distinguish different components. Similarly, words such as "a", "an", or "the" do not denote a limitation of quantity but rather the presence of at least one. Words such as "include" or "comprise" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connect" or "couple" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", and the like are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.
Exemplary embodiments will be described in detail here, examples of which are shown in the drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
For example, a camera device may be provided in an AR device. The camera device can capture real objects in the real environment in real time, and, by computing the position and angle of the real objects and applying corresponding image processing, fusion with virtual information can be achieved. The functions of conventional augmented reality devices still leave considerable room for extension.
At least one embodiment of the present disclosure provides a lip-language recognition method, including: acquiring a facial image sequence of an object to be identified; performing lip-language recognition based on the facial image sequence to determine semantic information of the speech content of the object to be identified corresponding to lip movements in the facial images; and using the semantic information for presentation.
At least one embodiment of the present disclosure further provides a lip-language recognition apparatus, an augmented reality device, and a storage medium corresponding to the above lip-language recognition method.
In the lip-language recognition method provided by at least one embodiment of the present disclosure, on the one hand, the speech content of the object to be identified can be determined and the lip language of the object to be identified can be presented, thereby realizing translation of the lip language of the object to be identified; on the other hand, the lip-language recognition method can be implemented using existing components of an AR device without adding separate hardware, so that the functions of the AR device can be extended without increasing cost, further improving the user experience.
The embodiments of the present disclosure are described in detail below with reference to the drawings.
At least one embodiment of the present disclosure provides a lip-language recognition method, which can further extend the functions of an augmented reality device and improve the user experience of the device. For example, the lip-language recognition method may be used for an AR device or a VR (Virtual Reality) device, etc., which is not limited by the embodiments of the present disclosure. For example, the lip-language recognition method may be implemented at least partially in software and loaded and executed by a processor in the AR device, or at least partially in hardware, firmware, or the like, so as to extend the functions of the augmented reality device and improve the user experience of the device.
FIG. 1 is a flowchart of a lip-language recognition method provided by at least one embodiment of the present disclosure. As shown in FIG. 1, the lip-language recognition method includes steps S10 to S30. Steps S10 to S30 of the lip-language recognition method and their respective exemplary implementations are described below.
Step S10: acquiring a facial image sequence of an object to be identified.
Step S20: performing lip-language recognition based on the facial image sequence to determine semantic information corresponding to lip movements in the facial images.
Step S30: using the semantic information for presentation.
For example, an augmented reality (AR) device is a head-mounted wearable smart device that uses augmented reality technology to achieve an enhanced sensory experience beyond reality.
For example, an AR device combines technologies such as image display, image processing, multi-sensor fusion, and three-dimensional modeling, and can be applied in fields such as medical care, gaming, network video communication, and exhibitions.
Current AR devices usually include a camera device (e.g., a camera), an optical projection device (a device composed of optical elements such as various lenses, which can project images into the field of view of the user wearing the AR device), a sound collection device (e.g., a speaker or a microphone), and the like, and there is room for functional extension.
The camera device may include, for example, a CMOS (complementary metal-oxide-semiconductor) sensor, a CCD (charge-coupled device) sensor, an infrared camera, etc. For example, the camera device may be arranged in the plane where the OLED display is located, for example on the frame of the AR device.
For example, the camera device in the AR device may be used to capture images. After the user wears the AR device, the camera device can capture images within its field of view. If the user needs to communicate with another person, for example in a meeting or in a conversation, the user will usually face the person with whom he or she is communicating; at this time, the camera device can capture images of the communication partner located within its field of view, and these images include the communication partner.
For step S10, for example, the above-mentioned object to be identified refers to a subject in the images captured by the camera device of the AR device. For example, the object may be a person communicating with the user, or a person appearing in a video, etc., which is not limited by the embodiments of the present disclosure. For example, a plurality of frames continuously captured by the camera device may compose an image sequence. Since the images captured by the camera device contain the object to be identified, they also include the region where the face of the object to be identified is located, so the plurality of frames including the face region of the object to be identified may serve as the facial image sequence of the object to be identified.
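As an illustrative sketch that is not part of the original disclosure, the following Python code shows one possible way to collect a burst of camera frames into an image sequence; the device index, frame count, and the use of OpenCV are assumptions introduced here for illustration only.

```python
import cv2  # OpenCV is assumed to be available on the capture side

def capture_image_sequence(device_index=0, num_frames=30):
    """Grab a short burst of frames from the camera as the image sequence.

    device_index and num_frames are illustrative assumptions, not values
    taken from the disclosure.
    """
    cap = cv2.VideoCapture(device_index)
    frames = []
    try:
        while len(frames) < num_frames:
            ok, frame = cap.read()
            if not ok:
                break  # camera not available or stream ended
            frames.append(frame)
    finally:
        cap.release()
    return frames  # list of BGR frames forming the image sequence
```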
For example, a specific implementation example of acquiring the facial image sequence is described in detail below and is not repeated here.
For example, a facial image sequence acquisition unit may be provided, and the facial image sequence of the object to be identified may be acquired by the facial image sequence acquisition unit; for example, the facial image sequence acquisition unit may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field-programmable gate array (FPGA), or another form of processing unit with data processing capability and/or instruction execution capability, together with corresponding computer instructions. For example, the processing unit may be a general-purpose processor or a special-purpose processor, and may be a processor based on the X86 or ARM architecture, etc.
For step S20, for example, in one example, the facial image sequence may be processed for lip-language recognition by a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), or another form of processing unit with data processing capability and/or instruction execution capability in the AR device. For example, in another example, the facial image sequence may also be sent to a server. For example, the server may be a local server, a server in a local area network, or a cloud server, so that the server (for example, a processing unit in the server) processes the facial image sequence for lip-language recognition to determine the semantic information of the speech content of the object to be identified corresponding to the lip movements in the facial images. For example, the facial image sequence may be sent to the server via a wireless communication method such as Bluetooth or Wi-Fi.
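As a minimal sketch of the upload path, and not the disclosed implementation, the following Python code bundles the face crops and posts them to a server over Wi-Fi using HTTP; the endpoint URL, the ZIP payload format, and the use of the `requests` library are assumptions made for illustration.

```python
import io
import zipfile

import cv2
import requests  # assumed HTTP client; the disclosure only requires a Bluetooth/Wi-Fi transport

SERVER_URL = "http://example-server/lip-recognition"  # hypothetical endpoint

def send_face_sequence(face_frames):
    """Encode each face crop as JPEG, bundle the crops, and POST them to the server."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i, frame in enumerate(face_frames):
            ok, jpeg = cv2.imencode(".jpg", frame)
            if ok:
                zf.writestr(f"frame_{i:04d}.jpg", jpeg.tobytes())
    buf.seek(0)
    resp = requests.post(
        SERVER_URL,
        files={"sequence": ("faces.zip", buf, "application/zip")},
    )
    resp.raise_for_status()
    return resp.json()  # expected to carry the semantic information returned by the server
```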
For example, the server may perform lip-language recognition according to the received facial image sequence. Each facial image in the facial image sequence includes the region where the face of the object to be identified is located, and the face region includes the person's lips. The server may use a face recognition algorithm to recognize the face from each facial image. Since the facial image sequence contains multiple consecutive frames, the lip-shape change features of the object to be identified (i.e., the person) while speaking can be further extracted from the recognized faces. These lip-change features can be input into a lip-language recognition model to recognize the corresponding pronunciations, and sentences or phrases capable of expressing semantics and composed of the recognized pronunciations can be further determined. The sentences or phrases can be sent to the augmented reality device as semantic information; after receiving the semantic information, the augmented reality device can present it, so that the user wearing the AR device can learn the content or meaning of the speech of the object to be identified from the presented semantic information.
It should be noted that the face recognition algorithm can be implemented using conventional algorithms in the field, and details are not repeated here.
For example, the above lip-language recognition model may be a network model based on deep learning, such as a convolutional neural network (CNN) model or a recurrent neural network (RNN) model. The network model recognizes the corresponding pronunciations according to the lip-shape change features of the object to be identified while speaking; then, using a preset database of correspondences between pronunciations and sentences or phrases, the pronunciations are matched to determine the sentences or phrases capable of expressing semantics that are composed of the pronunciations.
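To make the CNN/RNN idea concrete, the following PyTorch sketch shows one generic way such a model could be structured; it is not the patented model, and the layer sizes, vocabulary size, and grayscale lip-crop input format are assumptions introduced only for illustration.

```python
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    """Minimal CNN + RNN sketch: per-frame lip features -> temporal model -> class scores.

    Layer sizes and the output vocabulary are illustrative assumptions.
    """
    def __init__(self, num_classes=500, feat_dim=256, hidden_dim=256):
        super().__init__()
        # 2D CNN applied frame by frame to the cropped lip region
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(64 * 4 * 4, feat_dim)
        # RNN over the frame features captures the lip-shape changes over time
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, frames):
        # frames: (batch, time, 1, H, W) grayscale lip crops
        b, t = frames.shape[:2]
        x = self.frontend(frames.flatten(0, 1))      # (b*t, 64, 4, 4)
        x = self.proj(x.flatten(1)).view(b, t, -1)   # (b, t, feat_dim)
        out, _ = self.rnn(x)                         # (b, t, 2*hidden_dim)
        return self.classifier(out)                  # per-frame scores over pronunciations/words
```

The per-frame scores would then be matched against a pronunciation-to-phrase database, as described above, to produce the final sentence or phrase.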
For example, the above semantic information does not necessarily require recognizing all pronunciations represented by the lip-shape changes of the object to be identified while speaking; the key semantic information or essential semantic information of the speech content of the object to be identified may be recognized instead. For example, the sentence or phrase composed of the pronunciations may be the sentence or phrase determined to be the most probable.
For example, a sending unit may be provided, and the facial image sequence may be sent to the server by the sending unit so that the server performs lip-language recognition; for example, the sending unit may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field-programmable gate array (FPGA), or another form of processing unit with data processing capability and/or instruction execution capability, together with corresponding computer instructions.
For example, a recognition unit may also be provided directly in the AR device, and lip-language recognition may be performed by the recognition unit; for example, the recognition unit may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field-programmable gate array (FPGA), or another form of processing unit with data processing capability and/or instruction execution capability, together with corresponding computer instructions.
For step S30, for example, after the speech content of the object to be identified is determined based on the lip-language recognition method, the lip language of the object to be identified can be presented, thereby realizing translation of the lip language of the object to be identified.
In the lip-language recognition method provided by at least one embodiment of the present disclosure, existing components of an AR device can be used without adding separate hardware, so that the functions of the AR device are extended without increasing cost, further improving the user experience.
It should be noted that the algorithms and models for lip-language recognition require the support of chips or hardware with complex data processing capability and high computing speed. Therefore, the above lip-language recognition algorithms and models may not be provided on the AR device and may instead be processed by the server, so that the portability of the AR device is not affected and the hardware cost of the AR device is not increased. Of course, as technology improves, the processing unit in the AR device may also implement the above lip-language recognition algorithms and models without affecting the portability and hardware cost of the AR device, thereby improving the market competitiveness of the AR device, which is not limited by the embodiments of the present disclosure. The following description takes implementation of the lip-language recognition method by the server as an example, but the embodiments of the present disclosure are not limited thereto.
For example, the semantic information may be semantic text information in text form or semantic audio information in audio form, or may include both semantic text information and semantic audio information. For example, the lip-language recognition method further includes presenting the semantic information. For example, the server may send the semantic text information and/or the semantic audio information to the AR device, and a presentation mode button, menu, or the like may be provided on the AR device. For example, the presentation modes may include a display mode and an audio mode, and the user may select a presentation mode as needed; after the user selects a presentation mode, a corresponding presentation mode instruction is generated. For example, when the presentation mode instruction is a display mode instruction, the AR device may, according to the instruction, display the semantic text information within the field of view of the user wearing the augmented reality device; when the presentation mode instruction is an audio mode instruction, the augmented reality device plays the semantic audio information.
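As a simple sketch of how the presentation mode instruction could be dispatched (not part of the original disclosure), the Python code below branches between the display mode and the audio mode; the `pyttsx3` text-to-speech backend and the `display_fn` placeholder for the AR projection interface are assumptions introduced for illustration.

```python
import pyttsx3  # assumed offline text-to-speech backend for the audio mode

def present_semantic_info(semantic_text, mode_instruction, display_fn=print):
    """Dispatch on the presentation mode instruction.

    display_fn stands in for the AR device's projection API, which is
    device-specific and not specified here.
    """
    if mode_instruction == "display":
        display_fn(semantic_text)      # overlay the text within the wearer's field of view
    elif mode_instruction == "audio":
        engine = pyttsx3.init()
        engine.say(semantic_text)      # synthesize and play the semantic information
        engine.runAndWait()
    else:
        raise ValueError(f"unknown presentation mode: {mode_instruction}")
```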
For example, a presentation unit may be provided, and the semantic information may be presented by the presentation unit; for example, the presentation unit may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field-programmable gate array (FPGA), or another form of processing unit with data processing capability and/or instruction execution capability, together with corresponding computer instructions.
The lip-language recognition method provided by at least one embodiment of the present disclosure can convert the recognized lip language of the object to be identified into text or audio, thereby realizing translation of the lip language, which can help people with special needs communicate better with others. For example, people with hearing impairment or the elderly cannot hear the voice of others speaking, or find it inconvenient to communicate with others, which brings inconvenience to their lives; by wearing the AR device, the speech content of others can be converted into text, helping them communicate with others.
Alternatively, in special scenarios, for example in occasions requiring quietness (such as a conference room requiring confidentiality), the participants may speak in low voices and others cannot clearly hear the speaker's speech content; or, in a large lecture hall, participants far away from the speaker cannot clearly hear the speaker's speech content; or, when communicating in a place with loud ambient noise, the people communicating cannot clearly hear each other's speech content. For example, in the above cases, the people who need it can wear the AR device to convert the lip language of the speaker, as the object to be identified, into text or audio, thereby realizing translation of the lip language and effectively improving the fluency of communication.
FIG. 2A is a flowchart of acquiring a facial image sequence provided by at least one embodiment of the present disclosure; that is, FIG. 2A is a flowchart of some examples of step S10 shown in FIG. 1. In some implementations, as shown in FIG. 2A, acquiring the facial image sequence of the object to be identified described in step S10 includes steps S11 to S13.
Step S11: acquiring an image sequence including the object to be identified.
Step S12: locating the azimuth of the object to be identified.
Step S13: determining, according to the located azimuth of the object to be identified, the position of the face region of the object to be identified in each frame of the image sequence, and cropping the image of the face region of the object to be identified from each frame to generate the facial image sequence.
For example, the embodiments of the present disclosure do not limit the order of steps S11 and S12. For example, step S12 may be executed before step S11, that is, the azimuth of the object to be identified is determined first, and then the image sequence of the object to be identified in that azimuth is captured; for example, the facial image sequence may be captured directly. For example, step S11 may also be executed before step S12, that is, the image sequence including the object to be identified is acquired first, and then the facial image sequence of the object to be identified is acquired accurately and quickly according to the determined azimuth of the object to be identified.
For example, a video of the object to be identified may be captured by the camera device of the AR device, the video being composed of consecutive frames; or the camera device continuously snaps multiple frames of images of the object to be identified, and these frames may compose the image sequence. Each frame includes the object to be identified and also includes the face region of the object to be identified, so the image sequence may be used directly as the facial image sequence. For example, the images in the image sequence may be original images directly captured by the camera device, or images obtained after preprocessing the original images, which is not limited by the embodiments of the present disclosure.
For example, the image preprocessing operation can eliminate irrelevant information or noise information in the original images, so as to better perform face detection on the captured images. For example, the image preprocessing operation may include image scaling, compression or format conversion, color gamut conversion, gamma correction, image enhancement, or noise-reduction filtering of the captured images.
For example, for lip-language recognition, only the face region of the object to be identified needs to be included. In order to further improve the recognition speed, the portion of each frame containing the face region of the object to be identified may be cropped to generate the facial image sequence. For example, the facial image sequence includes multiple frames of facial images, and each frame of facial image is a partial image cropped from the entire image of the object to be identified, the partial image including the face region.
For example, when cropping the facial image from the image, the azimuth of the object to be identified needs to be located, that is, the azimuth of the face region of the object to be identified in the space where the user wearing the AR device is located. For example, the user wearing the AR device is in a conference room, and the object to be identified is located at a certain position in the conference room. With respect to the field of view of the camera device of the AR device, the central axis of the field of view of the camera device may be taken as the reference position, and the angle between the position of the object to be identified and the central axis is taken as the azimuth of the object to be identified; the position of the face region of the object to be identified in the image is then further located according to the azimuth of the object to be identified.
For example, the user wearing the AR device faces the object to be identified, and the angle between the object to be identified and the central axis of the field of view of the camera device of the AR device is 30 degrees to the right; these 30 degrees are the azimuth of the object to be identified. According to the azimuth, it can be preliminarily determined that the position of the object to be identified in the image is within a certain region at a certain distance from the image center; face recognition can then be performed on that region to further locate the face region, and that portion of the image is cropped as the facial image.
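The sketch below (not part of the original disclosure) illustrates this azimuth-guided cropping in Python: the azimuth is mapped to a pixel column under an assumed horizontal field of view, a search window around that column is scanned with OpenCV's pretrained frontal-face Haar cascade (used here only as a stand-in for the classification model described below), and the detected face region is cropped; the field-of-view value and window width are assumptions for illustration.

```python
import cv2

# Pretrained frontal-face detector shipped with OpenCV, standing in for the
# face classification model described in the disclosure.
_FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face_by_azimuth(frame, azimuth_deg, horizontal_fov_deg=90.0, window_ratio=0.4):
    """Use the located azimuth to pick a search window, then detect and crop the face.

    horizontal_fov_deg and window_ratio are illustrative assumptions.
    """
    h, w = frame.shape[:2]
    # Map the azimuth (angle from the camera's central axis) to a pixel column.
    cx = int(w / 2 + (azimuth_deg / (horizontal_fov_deg / 2)) * (w / 2))
    half = int(w * window_ratio / 2)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    window = frame[:, x0:x1]

    gray = cv2.cvtColor(window, cv2.COLOR_BGR2GRAY)
    faces = _FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, fw, fh = faces[0]
    # Return the face-region crop expressed in full-frame coordinates.
    return frame[y:y + fh, x0 + x:x0 + x + fw]
```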
For example, a large number (for example, 10,000 or more) of images including faces may be collected in advance as a sample library, and feature extraction may be performed on the images in the sample library. Then, a classification model is trained and tested using the images in the sample library and the extracted feature points through algorithms such as machine learning (for example, deep learning, or a regression algorithm based on local features), so as to obtain a classification model for acquiring the facial image of the user. For example, the classification model may also be implemented by other conventional algorithms in the field, such as a support vector machine (SVM), which is not limited by the embodiments of the present disclosure. It should be noted that the machine learning algorithm can be implemented using conventional methods in the field and is not repeated here. For example, the input of the classification model is the captured image and the output is the image of the user's face, so that face recognition can be realized.
For example, there may be multiple ways of locating the azimuth of the object to be identified, and the embodiments of the present disclosure are not limited to the above locating method. For example, an infrared sensor may be provided on the AR device; the infrared sensor can sense the object to be identified and thereby locate its azimuth. When there are multiple objects to be identified, the azimuths of the multiple objects to be identified can be sensed by the infrared sensor. However, if only one of the objects to be identified is speaking, then for lip-language recognition only the facial image of the speaking object needs to be recognized; the other objects that are not speaking are not needed.
Since the object to be identified that is speaking cannot be located by the infrared sensor, sound localization may be used, that is, the azimuth of the object to be identified is located according to the voice signal emitted by the object to be identified while speaking. Specifically, a microphone array may be provided on the AR device. A microphone array is a cluster of microphones, i.e., a set composed of multiple microphones, by which the position of a sound source can be located. For example, the voice signal of the object to be identified (a person) while speaking is also a sound source; therefore, the azimuth of the speaking object to be identified can be identified accordingly. If multiple objects to be identified are speaking at the same time, the azimuths of the multiple speaking objects can also be located. The above locating does not require accurately determining the exact position of the object to be identified; it suffices to locate an approximate azimuth.
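As an illustration of how a microphone pair could yield such an approximate azimuth, the following Python sketch estimates the direction of the speaker from the time difference of arrival between two microphones using GCC-PHAT cross-correlation; this specific technique, the sample rate, and the microphone spacing are assumptions introduced here and are not prescribed by the disclosure.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, room-temperature approximation

def estimate_azimuth(mic_left, mic_right, fs=16000, mic_distance=0.14):
    """Estimate the speaker's azimuth from a two-microphone pair via GCC-PHAT.

    fs and mic_distance are illustrative assumptions about the AR device's
    microphone array, not values taken from the disclosure.
    """
    n = len(mic_left) + len(mic_right)
    L = np.fft.rfft(mic_left, n=n)
    R = np.fft.rfft(mic_right, n=n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12                 # PHAT weighting
    corr = np.fft.irfft(cross, n=n)
    max_shift = int(fs * mic_distance / SPEED_OF_SOUND)
    corr = np.concatenate((corr[-max_shift:], corr[:max_shift + 1]))
    delay = (np.argmax(np.abs(corr)) - max_shift) / fs   # time difference of arrival (s)
    sin_theta = np.clip(delay * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))       # azimuth relative to the array's axis
```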
Of course, even if the speaking object to be identified is not located, the lip-language recognition method is still feasible: during subsequent lip-language recognition, the lip shape of an object to be identified that is not speaking is essentially unchanged; therefore, no semantic information will be determined for that object, and only the semantic information of the speaking object to be identified will be determined.
For example, the user may choose to perform lip-language recognition in real time, and the camera device of the AR device may capture images of the object to be identified in real time. For example, the AR device acquires the facial image sequence and sends it to the server in real time; the server performs lip-language recognition accordingly and returns the semantic information, and the AR device presents the semantic information after receiving it.
For example, the user may also choose, as needed, not to perform lip-language recognition in real time, while the camera device of the AR device still captures images of the object to be identified in real time. For example, the facial image sequence may be generated by parsing the video directly captured by the camera device (the video being composed of consecutive frames), or generated from multiple frames of facial images snapped by the camera device in snapshot mode. For example, after the facial image sequence is acquired, the facial image sequence is saved. For example, the facial image sequence may be saved in the AR device (for example, in a register of the AR device). For example, a sending button, menu, or the like may be provided on the AR device, and the user may choose when to perform lip-language recognition on the saved facial image sequence. At that time, the user operates the sending button or menu to generate a sending instruction; according to the sending instruction, the AR device sends the saved facial image sequence to the server, the server performs lip-language recognition accordingly and returns the semantic information, and the AR device presents the semantic information after receiving it.
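A minimal sketch of this deferred (non-real-time) mode, under the assumption that the upload helper from the earlier sketch is reused as `send_fn`, could look as follows; the class and method names are hypothetical and introduced only for illustration.

```python
class DeferredUploader:
    """Buffer the facial image sequence on the device and upload it only when a
    sending instruction arrives (non-real-time mode). `send_fn` stands in for
    the transport described above (e.g., the Wi-Fi upload sketch)."""

    def __init__(self, send_fn):
        self._send_fn = send_fn
        self._buffer = []

    def add_frame(self, face_frame):
        self._buffer.append(face_frame)   # corresponds to saving the sequence on the device

    def on_send_instruction(self):
        if not self._buffer:
            return None
        sequence, self._buffer = self._buffer, []
        return self._send_fn(sequence)    # returns the semantic information from the server
```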
For example, the above manner of not performing real-time lip-language recognition is applicable to scenarios in which the user wearing the AR device does not need to communicate bidirectionally with the object to be identified in real time. For example, when attending certain speeches or reports, a user at the venue without hearing impairment can hear the speaker or presenter normally; in order to organize or review the speech content later, the user may wear the AR device, which may first save the acquired facial image sequence and send it to the server for lip-language recognition later when needed.
At least one embodiment of the present disclosure further provides a lip-language recognition method, which is implemented, for example, by a server. For example, the lip-language recognition method may be implemented at least partially in software and loaded and executed by a processor in the server, or at least partially in hardware, firmware, or the like, so as to extend the functions of the augmented reality device and improve the user experience of the device.
FIG. 2B is a flowchart of still another lip-language recognition method provided by at least one embodiment of the present disclosure. As shown in FIG. 2B, the lip-language recognition method includes steps S100 to S300. Steps S100 to S300 of the lip-language recognition method and their respective exemplary implementations are described below.
Step S100: receiving a facial image sequence of an object to be identified sent by an augmented reality device.
For example, the server receives the facial image sequence of the object to be identified sent by, for example, an AR device. For the specific acquisition method of the facial image sequence, reference may be made to the description of step S10, which is not repeated here.
Step S200: performing lip-language recognition based on the facial image sequence to determine semantic information of the speech content of the object to be identified corresponding to lip movements in the facial images.
For example, lip-language recognition may be performed based on the facial image sequence by a processing unit in the server. For example, for the specific implementation method of the lip-language recognition, reference may be made to the description of step S20, which is not repeated here.
Step S300: sending the semantic information to the augmented reality device.
For example, the semantic information is semantic text information and/or semantic audio information. The server sends the semantic information to, for example, the AR device, so that the semantic information can be displayed or played on the AR device.
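A server-side sketch of steps S100 to S300 (not the disclosed implementation) is given below in Python using the Flask web framework; the route, the ZIP payload format matching the earlier upload sketch, and the placeholder recognizer function are assumptions introduced for illustration.

```python
import io
import zipfile

import cv2
import numpy as np
from flask import Flask, jsonify, request  # assumed lightweight web framework

app = Flask(__name__)

def recognize_lip_language(frames):
    """Placeholder for the lip-language recognition model (see the CNN/RNN sketch above)."""
    return {"text": "...recognized speech content...", "audio_url": None}

@app.route("/lip-recognition", methods=["POST"])
def lip_recognition():
    # Step S100: receive the facial image sequence sent by the AR device.
    payload = request.files["sequence"].read()
    frames = []
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        for name in sorted(zf.namelist()):
            data = np.frombuffer(zf.read(name), dtype=np.uint8)
            frames.append(cv2.imdecode(data, cv2.IMREAD_COLOR))
    # Step S200: perform lip-language recognition on the sequence.
    semantic_info = recognize_lip_language(frames)
    # Step S300: return the semantic information to the AR device.
    return jsonify(semantic_info)
```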
For the technical effects of the lip-language recognition method in the embodiments of the present disclosure, reference may be made to the technical effects of the lip-language recognition method provided in the above embodiments of the present disclosure, which are not repeated here.
FIG. 2C is a system flowchart of a lip-language recognition method provided by at least one embodiment of the present disclosure. The lip-language recognition method provided by at least one embodiment of the present disclosure is systematically introduced below with reference to FIG. 2C.
First, the azimuth of the object to be identified (for example, the person speaking) may be located by the infrared sensor or the microphone, and facial images may be captured by the camera. For example, the captured facial images may be uploaded in real time for lip recognition, or uploaded non-real-time; for example, in the non-real-time case, the facial image sequence may first be saved in a register in the AR device and then read and sent to the server according to a sending instruction.
For example, after the facial image information is sent to the server, the position of the lips may be located in the facial images based on the facial images located at the azimuth, so that semantic information can be obtained by recognizing the lip movements. For example, lip movement matching may be performed on the server side, so that the semantic information corresponding to the lip movements is converted into text or audio to obtain semantic text information or semantic audio information, respectively. For example, the semantic text information may be displayed on the AR device or played by voice; the semantic audio information may be played by voice.
At least one embodiment of the present disclosure further provides a lip-language recognition apparatus. FIG. 3A is a schematic block diagram of a lip-language recognition apparatus provided by at least one embodiment of the present disclosure. As shown in FIG. 3A, in some examples, the lip-language recognition apparatus 03 includes a facial image sequence acquisition unit 301, a sending unit 302, and a receiving unit 303. In other examples, the lip-language recognition apparatus 03 further includes a presentation unit 304.
The facial image sequence acquisition unit 301 is configured to acquire the facial image sequence of the object to be identified. For example, the facial image sequence acquisition unit 301 may implement step S10; for its specific implementation method, reference may be made to the description of step S10, which is not repeated here.
The sending unit 302 is configured to send the facial image sequence to the server, and the server performs lip-language recognition to determine the semantic information corresponding to the lip movements in the facial images. For example, the facial image sequence may be sent to the server via a wireless communication method such as Bluetooth or Wi-Fi. For example, the sending unit 302 may implement step S20; for its specific implementation method, reference may be made to the description of step S20, which is not repeated here.
The receiving unit 303 is configured to receive the semantic information sent by the server; the presentation unit 304 is configured to present the semantic information. For example, the receiving unit 303 and the presentation unit 304 may implement step S30; for their specific implementation method, reference may be made to the description of step S30, which is not repeated here.
For example, in some implementations, the semantic information is semantic text information and/or semantic audio information. For example, in some examples, the presentation unit 304 may include a presentation mode instruction generation sub-unit 3041; in other examples, the presentation unit 304 may further include a display sub-unit 3042 and a playback sub-unit 3043.
The presentation mode instruction generation sub-unit 3041 is configured to generate a presentation mode instruction. For example, the presentation mode instruction includes a display mode instruction and an audio mode instruction.
The display sub-unit 3042 is configured to display the semantic text information within the field of view of the user wearing the augmented reality device upon receiving the display mode instruction.
The playback sub-unit 3043 is configured to play the semantic audio information upon receiving the audio mode instruction.
For example, in some examples, as shown in FIG. 3C, the facial image sequence acquisition unit 301 includes an image sequence acquisition sub-unit 3011, a locating sub-unit 3012, and a facial image sequence generation sub-unit 3013.
The image sequence acquisition sub-unit 3011 is configured to acquire the image sequence of the object to be identified.
The locating sub-unit 3012 is configured to locate the azimuth of the object to be identified.
The facial image sequence generation sub-unit 3013 is configured to determine, according to the located azimuth of the object to be identified, the position of the face region of the object to be identified in each frame of the image sequence, and crop the image of the face region of the object to be identified from each frame to generate the facial image sequence.
Corresponding to the foregoing embodiments of the AR-device-based lip-language recognition method, the AR-device-based recognition apparatus provided by the embodiments of the present disclosure can determine the speech content of the object to be identified and present the lip language of the object to be identified, thereby realizing translation of the lip language of the object to be identified; moreover, existing components of an AR device can be used without adding separate hardware, so that the functions of the AR device are extended without increasing cost, further improving the user experience.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network units; the above units may be combined into one unit or further split into multiple sub-units.
For example, each unit in the apparatus of this embodiment may be implemented in software, or in a combination of software and hardware, or of course in general-purpose hardware. Based on this understanding, the technical solutions provided by the embodiments of the present disclosure, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. Taking software implementation as an example, as an apparatus in the logical sense, the apparatus is formed by the processor of the AR device using the apparatus reading the corresponding computer program instructions from a non-volatile memory into a memory and running them.
It should be noted that the lip-language recognition apparatus provided by the embodiments of the present disclosure may include more or fewer circuits, and the connection relationship between the circuits is not limited and may be determined according to actual needs. The specific configuration of each circuit is not limited; each circuit may be composed of analog devices according to circuit principles, or of digital chips, or in other suitable ways.
FIG. 3D is a schematic block diagram of another lip-language recognition apparatus provided by at least one embodiment of the present disclosure. As shown in FIG. 3D, the lip-language recognition apparatus 200 includes a processor 210, a machine-readable storage medium 220, and one or more computer program modules 221.
For example, the processor 210 and the machine-readable storage medium 220 are connected via a bus system 230. For example, the one or more computer program modules 221 are stored in the machine-readable storage medium 220. For example, the one or more computer program modules 221 include instructions for performing the lip-language recognition method provided by any embodiment of the present disclosure. For example, the instructions in the one or more computer program modules 221 may be executed by the processor 210. For example, the bus system 230 may be a commonly used serial or parallel communication bus, etc., which is not limited by the embodiments of the present disclosure.
For example, the processor 210 may be a central processing unit (CPU), a graphics processing unit (GPU), or another form of processing unit with data processing capability and/or instruction execution capability; it may be a general-purpose processor or a special-purpose processor, and may control other components in the lip-language recognition apparatus 200 to perform desired functions.
The machine-readable storage medium 220 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache, etc. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 210 may run the program instructions to implement the functions (implemented by the processor 210) in the embodiments of the present disclosure and/or other desired functions, such as the lip-language recognition method. Various applications and various data, such as the facial image sequence and the various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
It should be noted that, for clarity and conciseness, not all constituent units of the lip-language recognition apparatus 200 are given in the embodiments of the present disclosure. To implement the necessary functions of the lip-language recognition apparatus 200, those skilled in the art may provide and set other constituent units not shown according to specific needs, which is not limited by the embodiments of the present disclosure.
For the technical effects of the lip-language recognition apparatus 100 and the lip-language recognition apparatus 200 in different embodiments, reference may be made to the technical effects of the lip-language recognition method provided in the embodiments of the present disclosure, which are not repeated here.
At least one embodiment of the present disclosure further provides an augmented reality device. FIG. 3E to FIG. 4 are each a schematic block diagram or structural diagram of an augmented reality device provided by at least one embodiment of the present disclosure.
As shown in FIG. 3E, in one example, the augmented reality device 1 includes the lip-language recognition apparatus 100/200 provided by any embodiment of the present disclosure; for the lip-language recognition apparatus 100/200, reference may be made to the description of FIG. 3A to FIG. 3D, which is not repeated here. For example, the augmented reality device 1 further includes a camera device, a display device, or a playback device. For example, a camera is used to capture images of the object to be identified; the display device is used to display the semantic text information; and the playback device is used to play the semantic audio information. For example, the playback device may be a speaker, a loudspeaker box, etc.; the following description takes the playback device being a speaker as an example, which is not limited by the embodiments of the present disclosure.
As shown in FIG. 3F, the augmented reality device 1 can be worn over a person's eyes, so that the lip-language recognition function for the object to be identified can be realized as needed.
For example, in another example, referring to FIG. 4, the AR device 1 includes input/output (I/O) devices such as a camera device 101 (for example, a camera for capturing images of the object to be identified), a display device 102 (for displaying the semantic text information), and a speaker 103 (for playing the semantic audio information).
For example, the AR device 1 further includes: a machine-readable storage medium 104, a processor 105, a communication interface 106, and a bus 107. For example, the camera device 101, the display device 102, the speaker 103, the machine-readable storage medium 104, the processor 105, and the communication interface 106 communicate with each other via the bus 107. The processor 105 can perform the lip-language recognition method described above by reading and executing the machine-executable instructions in the machine-readable storage medium 104 corresponding to the control logic of the lip-language recognition method.
For example, the communication interface 106 is connected to a communication device (not shown in the figure). The communication device may communicate with a network and other devices through wireless communication, the network being, for example, the Internet, an intranet, and/or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and/or a metropolitan area network (MAN). The wireless communication may use any of a variety of communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (for example, based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), Wi-MAX, protocols for e-mail, instant messaging, and/or Short Message Service (SMS), or any other suitable communication protocol.
The machine-readable storage medium 104 mentioned in the embodiments of the present disclosure may be any electronic, magnetic, optical, or other physical storage device, and may contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: RAM (random access memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), any type of storage disk (such as an optical disc, DVD, etc.), a similar storage medium, or a combination thereof.
The non-volatile medium 108 may be a non-volatile memory, flash memory, a storage drive (such as a hard disk drive), any type of storage disk (such as an optical disc, DVD, etc.), a similar non-volatile storage medium, or a combination thereof.
It should be noted that, for clarity and conciseness, not all constituent units of the AR device 1 are given in the embodiments of the present disclosure. To implement the necessary functions of the AR device 1, those skilled in the art may provide and set other constituent units not shown according to specific needs, which is not limited by the embodiments of the present disclosure.
An embodiment of the present disclosure further provides a storage medium. For example, the storage medium non-transitorily stores computer-readable instructions, and when the non-transitory computer-readable instructions are executed by a computer (including a processor), the lip-language recognition method provided by any embodiment of the present disclosure can be performed.
For example, the storage medium may be any combination of one or more computer-readable storage media; for example, one computer-readable storage medium contains computer-readable program code for acquiring the facial image sequence of the object to be identified, and another computer-readable storage medium contains computer-readable program code for presenting the semantic information. For example, when the program code is read by a computer, the computer may execute the program code stored in the computer storage medium to perform, for example, the lip-language recognition method provided by any embodiment of the present disclosure.
For example, the storage medium may include a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, or any combination of the above storage media, and may also be other suitable storage media.
The following points need to be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure; other structures may refer to common designs.
(2) The embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments, provided there is no conflict.
The above are merely exemplary implementations of the present disclosure and are not intended to limit the protection scope of the present disclosure, which is determined by the appended claims.

Claims (19)

  1. A lip-language recognition method, comprising:
    acquiring a facial image sequence of an object to be identified;
    performing lip-language recognition based on the facial image sequence to determine semantic information of speech content of the object to be identified corresponding to lip movements in facial images; and
    using the semantic information for presentation.
  2. The method according to claim 1, wherein performing lip-language recognition based on the facial image sequence to determine the semantic information of the speech content of the object to be identified corresponding to the lip movements in the facial images comprises:
    sending the facial image sequence to a server, and performing, by the server, lip-language recognition to determine the semantic information of the speech content of the object to be identified corresponding to the lip movements in the facial images.
  3. The method according to claim 2, wherein before the semantic information is used for presentation, the lip-language recognition method further comprises:
    receiving the semantic information sent by the server.
  4. The method according to any one of claims 1-3, wherein the semantic information is semantic text information and/or semantic audio information.
  5. The method according to claim 4, further comprising presenting the semantic information, wherein presenting the semantic information comprises:
    displaying, according to a presentation mode instruction, the semantic text information within a field of view of a user wearing an augmented reality device, or playing the semantic audio information.
  6. The method according to any one of claims 1-5, wherein acquiring the facial image sequence of the object to be identified comprises:
    acquiring an image sequence including the object to be identified;
    locating an azimuth of the object to be identified; and
    determining, according to the located azimuth of the object to be identified, a position of a face region of the object to be identified in each frame of the image sequence, and cropping an image of the face region of the object to be identified from each frame to generate the facial image sequence.
  7. The method according to claim 6, wherein locating the azimuth of the object to be identified comprises:
    locating the azimuth of the object to be identified according to a voice signal emitted by the object to be identified when speaking.
  8. The method according to any one of claims 2-7, further comprising, after acquiring the facial image sequence of the object to be identified:
    saving the facial image sequence.
  9. The method according to claim 8, wherein sending the facial image sequence to the server comprises:
    sending the saved facial image sequence to the server upon receiving a sending instruction.
  10. A lip-language recognition apparatus, comprising:
    a facial image sequence acquisition unit configured to acquire a facial image sequence of an object to be identified;
    a sending unit configured to send the facial image sequence to a server, the server performing lip-language recognition to determine semantic information corresponding to lip movements in facial images; and
    a receiving unit configured to receive the semantic information sent by the server.
  11. The lip-language recognition apparatus according to claim 10, further comprising:
    a presentation unit configured to present the semantic information.
  12. The lip-language recognition apparatus according to claim 11, wherein the presentation unit comprises:
    a presentation mode instruction generation sub-unit configured to generate a presentation mode instruction, the presentation mode instruction comprising a display mode instruction and an audio mode instruction.
  13. The lip-language recognition apparatus according to claim 12, wherein the semantic information is semantic text information and/or semantic audio information, and the presentation unit further comprises:
    a display sub-unit configured to display the semantic text information within a field of view of a user wearing an augmented reality device upon receiving the display mode instruction; and
    a playback sub-unit configured to play the semantic audio information upon receiving the audio mode instruction.
  14. The lip-language recognition apparatus according to any one of claims 10-13, wherein the facial image sequence acquisition unit comprises:
    an image sequence acquisition sub-unit configured to acquire an image sequence of the object to be identified;
    a locating sub-unit configured to locate an azimuth of the object to be identified; and
    a facial image sequence generation sub-unit configured to determine, according to the located azimuth of the object to be identified, a position of a face region of the object to be identified in each frame of the image sequence, and crop an image of the face region of the object to be identified from each frame to generate the facial image sequence.
  15. A lip-language recognition apparatus, comprising:
    a processor; and
    a machine-readable storage medium storing one or more computer program modules;
    wherein the one or more computer program modules are stored in the machine-readable storage medium and configured to be executed by the processor, and the one or more computer program modules comprise instructions for performing the lip-language recognition method according to any one of claims 1-9.
  16. An augmented reality device, comprising the lip-language recognition apparatus according to any one of claims 10-15.
  17. The augmented reality device according to claim 16, further comprising a camera device, a display device, or a playback device, wherein:
    the camera device is configured to capture images of the object to be identified;
    the display device is configured to display the semantic information; and
    the playback device is configured to play the semantic information.
  18. A lip-language recognition method, comprising:
    receiving a facial image sequence of an object to be identified sent by an augmented reality device;
    performing lip-language recognition based on the facial image sequence to determine semantic information of speech content of the object to be identified corresponding to lip movements in facial images; and
    sending the semantic information to the augmented reality device.
  19. A storage medium non-transitorily storing computer-readable instructions, wherein, when the non-transitory computer-readable instructions are executed by a computer, the lip-language recognition method according to any one of claims 1-9 or the lip-language recognition method according to claim 18 can be performed.
PCT/CN2019/084109 2018-04-26 2019-04-24 唇语识别方法及其装置、增强现实设备以及存储介质 WO2019206186A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/610,254 US11527242B2 (en) 2018-04-26 2019-04-24 Lip-language identification method and apparatus, and augmented reality (AR) device and storage medium which identifies an object based on an azimuth angle associated with the AR field of view

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810384886.2A CN108596107A (zh) 2018-04-26 2018-04-26 基于ar设备的唇语识别方法及其装置、ar设备
CN201810384886.2 2018-04-26

Publications (1)

Publication Number Publication Date
WO2019206186A1 true WO2019206186A1 (zh) 2019-10-31

Family

ID=63609654

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/084109 WO2019206186A1 (zh) 2018-04-26 2019-04-24 唇语识别方法及其装置、增强现实设备以及存储介质

Country Status (3)

Country Link
US (1) US11527242B2 (zh)
CN (1) CN108596107A (zh)
WO (1) WO2019206186A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125437A (zh) * 2019-12-24 2020-05-08 四川新网银行股份有限公司 对视频中唇语图片识别的方法

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596107A (zh) 2018-04-26 2018-09-28 京东方科技集团股份有限公司 基于ar设备的唇语识别方法及其装置、ar设备
CN111063344B (zh) * 2018-10-17 2022-06-28 青岛海信移动通信技术股份有限公司 一种语音识别方法、移动终端以及服务器
WO2021076349A1 (en) * 2019-10-18 2021-04-22 Google Llc End-to-end multi-speaker audio-visual automatic speech recognition
CN111738100A (zh) * 2020-06-01 2020-10-02 广东小天才科技有限公司 一种基于口型的语音识别方法及终端设备
CN111739534B (zh) * 2020-06-04 2022-12-27 广东小天才科技有限公司 一种辅助语音识别的处理方法、装置、电子设备及存储介质
CN112672021B (zh) * 2020-12-25 2022-05-17 维沃移动通信有限公司 语言识别方法、装置及电子设备
CN114842846A (zh) * 2022-04-21 2022-08-02 歌尔股份有限公司 头戴设备的控制方法、装置及计算机可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298443A (zh) * 2011-06-24 2011-12-28 华南理工大学 结合视频通道的智能家居语音控制系统及其控制方法
CN107340859A (zh) * 2017-06-14 2017-11-10 北京光年无限科技有限公司 多模态虚拟机器人的多模态交互方法和系统
CN108227903A (zh) * 2016-12-21 2018-06-29 深圳市掌网科技股份有限公司 一种虚拟现实语言交互系统与方法
CN108596107A (zh) * 2018-04-26 2018-09-28 京东方科技集团股份有限公司 基于ar设备的唇语识别方法及其装置、ar设备

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5680481A (en) * 1992-05-26 1997-10-21 Ricoh Corporation Facial feature extraction method and apparatus for a neural network acoustic and visual speech recognition system
US20030028380A1 (en) 2000-02-02 2003-02-06 Freeland Warwick Peter Speech system
US20040243416A1 (en) 2003-06-02 2004-12-02 Gardos Thomas R. Speech recognition
CN101101752B (zh) * 2007-07-19 2010-12-01 华中科技大学 基于视觉特征的单音节语言唇读识别系统
US20100332229A1 (en) 2009-06-30 2010-12-30 Sony Corporation Apparatus control based on visual lip share recognition
KR101092820B1 (ko) * 2009-09-22 2011-12-12 현대자동차주식회사 립리딩과 음성 인식 통합 멀티모달 인터페이스 시스템
US8836638B2 (en) 2010-09-25 2014-09-16 Hewlett-Packard Development Company, L.P. Silent speech based command to a computing device
CN102004549B (zh) * 2010-11-22 2012-05-09 北京理工大学 一种适用于中文的自动唇语识别系统
US8743244B2 (en) * 2011-03-21 2014-06-03 HJ Laboratories, LLC Providing augmented reality based on third party information
JP5776255B2 (ja) * 2011-03-25 2015-09-09 ソニー株式会社 端末装置、物体識別方法、プログラム及び物体識別システム
KR101920020B1 (ko) 2012-08-07 2019-02-11 삼성전자 주식회사 단말기 상태 전환 제어 방법 및 이를 지원하는 단말기
CN103853190A (zh) * 2012-12-03 2014-06-11 联想(北京)有限公司 一种控制电子设备的方法及电子设备
US20140129207A1 (en) * 2013-07-19 2014-05-08 Apex Technology Ventures, LLC Augmented Reality Language Translation
US20150302651A1 (en) * 2014-04-18 2015-10-22 Sam Shpigelman System and method for augmented or virtual reality entertainment experience
CN104409075B (zh) 2014-11-28 2018-09-04 深圳创维-Rgb电子有限公司 语音识别方法和系统
CN205430338U (zh) 2016-03-11 2016-08-03 依法儿环球有限公司 带vr内容采集组件的智能手机或便携式电子通讯装置
CN106529502B (zh) * 2016-08-01 2019-09-24 深圳奥比中光科技有限公司 唇语识别方法以及装置
US10817066B2 (en) 2016-12-05 2020-10-27 Google Llc Information privacy in virtual reality
WO2018107489A1 (zh) * 2016-12-16 2018-06-21 深圳前海达闼云端智能科技有限公司 一种聋哑人辅助方法、装置以及电子设备
JP2018109924A (ja) 2017-01-06 2018-07-12 ソニー株式会社 情報処理装置、情報処理方法、及びプログラム
US10657361B2 (en) 2017-01-18 2020-05-19 International Business Machines Corporation System to enforce privacy in images on an ad-hoc basis

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298443A (zh) * 2011-06-24 2011-12-28 华南理工大学 结合视频通道的智能家居语音控制系统及其控制方法
CN108227903A (zh) * 2016-12-21 2018-06-29 深圳市掌网科技股份有限公司 一种虚拟现实语言交互系统与方法
CN107340859A (zh) * 2017-06-14 2017-11-10 北京光年无限科技有限公司 多模态虚拟机器人的多模态交互方法和系统
CN108596107A (zh) * 2018-04-26 2018-09-28 京东方科技集团股份有限公司 基于ar设备的唇语识别方法及其装置、ar设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125437A (zh) * 2019-12-24 2020-05-08 四川新网银行股份有限公司 对视频中唇语图片识别的方法
CN111125437B (zh) * 2019-12-24 2023-06-09 四川新网银行股份有限公司 对视频中唇语图片识别的方法

Also Published As

Publication number Publication date
US20200058302A1 (en) 2020-02-20
CN108596107A (zh) 2018-09-28
US11527242B2 (en) 2022-12-13

Similar Documents

Publication Publication Date Title
WO2019206186A1 (zh) 唇语识别方法及其装置、增强现实设备以及存储介质
US10136043B2 (en) Speech and computer vision-based control
CN108933915B (zh) 视频会议装置与视频会议管理方法
US10971188B2 (en) Apparatus and method for editing content
US11343446B2 (en) Systems and methods for implementing personal camera that adapts to its surroundings, both co-located and remote
JP5456832B2 (ja) 入力された発話の関連性を判定するための装置および方法
US10083710B2 (en) Voice control system, voice control method, and computer readable medium
WO2019140161A1 (en) Systems and methods for decomposing a video stream into face streams
US8411130B2 (en) Apparatus and method of video conference to distinguish speaker from participants
US10887548B2 (en) Scaling image of speaker's face based on distance of face and size of display
JP7100824B2 (ja) データ処理装置、データ処理方法及びプログラム
JP7427408B2 (ja) 情報処理装置、情報処理方法、及び情報処理プログラム
US20200380959A1 (en) Real time speech translating communication system
WO2021232875A1 (zh) 一种驱动数字人的方法、装置及电子设备
CN114556469A (zh) 数据处理方法、装置、电子设备和存储介质
JP2015126451A (ja) 画像の記録方法、電子機器およびコンピュータ・プログラム
CN110673811B (zh) 基于声音信息定位的全景画面展示方法、装置及存储介质
AU2013222959B2 (en) Method and apparatus for processing information of image including a face
JP7400364B2 (ja) 音声認識システム及び情報処理方法
US8913142B2 (en) Context aware input system for focus control
US20150116198A1 (en) Device and method for displaying multimedia content
JP2022120164A (ja) 音声認識システム、音声認識方法、及び音声処理装置
WO2019082648A1 (ja) 電子機器、制御装置、制御プログラム及び電子機器の動作方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19792122

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19792122

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19792122

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14/05/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19792122

Country of ref document: EP

Kind code of ref document: A1