CN110750152B - Man-machine interaction method and system based on lip actions - Google Patents

Man-machine interaction method and system based on lip actions

Info

Publication number
CN110750152B
CN110750152B (application CN201910859039.1A)
Authority
CN
China
Prior art keywords
lip
target object
face
information
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910859039.1A
Other languages
Chinese (zh)
Other versions
CN110750152A (en)
Inventor
刘青松
李旭滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd
Priority to CN201910859039.1A
Publication of CN110750152A
Application granted
Publication of CN110750152B
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a man-machine interaction method based on lip actions, which comprises the following steps: step (1), acquiring video information about a target object, and performing face recognition processing about the target object on the video information; step (2), determining the lip action state of the target object based on the result of the face recognition processing; and (3) adjusting a recording action mode and/or a recording data processing mode of the target object based on the lip action state.

Description

Man-machine interaction method and system based on lip actions
Technical Field
The invention relates to the technical field of man-machine interaction, in particular to a man-machine interaction method and system based on lip actions.
Background
Voice recognition technology is widely applied in many settings; in particular, IoT intelligent terminal devices can perform active wake-up operations through voice recognition, which is mainly realized through offline VAD sentence breaking, cloud VAD sentence breaking, mixed offline and cloud VAD sentence breaking, or cloud ASR sentence breaking, with the voice recognition result finally obtained from the corresponding VAD engine or ASR engine. In practical applications, most IoT intelligent terminal devices are placed in noisy environments such as administrative halls, office areas, malls, roads, stations or airports. In such environments the cloud VAD and cloud ASR cannot work normally because of interference from external noise; the external noise not only causes false wake-up operations of the IoT intelligent terminal devices but also causes sentence breaking to end abnormally, so that voice recognition interaction between human and machine cannot proceed normally. It can be seen that there is a great need in the art for a method and system that enable accurate and rapid speech recognition and human-computer interaction in noisy environments.
Disclosure of Invention
Aiming at the defects existing in the prior art, the invention provides a man-machine interaction method and system based on lip actions. The man-machine interaction method based on lip actions comprises the following steps: step (1), acquiring video information about a target object, and performing face recognition processing about the target object on the video information; step (2), determining the lip action state of the target object based on the result of the face recognition processing; step (3), based on the lip action state, adjusting a recording action mode and/or a recording data processing mode of the target object. In addition, the man-machine interaction system based on lip actions comprises a camera module, a face recognition module, a lip action acquisition module and a recording mode adjustment module; the camera module is used for acquiring video information about a target object; the face recognition module is used for performing face recognition processing about the target object on the video information; the lip action acquisition module is used for determining the lip action state of the target object according to the face recognition processing result; the recording mode adjustment module is used for adjusting the recording action mode and/or recording data processing mode of the target object according to the lip action state. It can be seen that, unlike the existing voice recognition technology, which performs recognition only by receiving and processing external voice information, the method and system also use the lip actions of the user to determine the start and end of the user's speech, and employ multi-modal information about the user together with lip action capture technology to overcome the inability of the prior art to extract the user's voice signal from a noisy environment.
The invention provides a man-machine interaction method based on lip actions, which is characterized by comprising the following steps of:
step (1), acquiring video information about a target object, and performing face recognition processing about the target object on the video information;
step (2), determining the lip action state of the target object based on the result of the face recognition processing;
step (3), based on the lip action state, adjusting a recording action mode and/or a recording data processing mode of the target object;
further, in the step (1), acquiring video information about a target object, performing face recognition processing on the video information about the target object specifically includes,
step (101), acquiring an initial image about the target object, so as to determine face characteristic information of the target object;
step (102), obtaining video information about the target object, and judging the validity of the video information according to the face characteristic information;
step (103), according to the validity judgment result of the video information, the shooting mode of the video information of the target object is adjusted and acquired in real time;
further, in the step (101), acquiring an initial image about the target object, thereby determining face feature information of the target object specifically includes,
acquiring a plurality of initial images of different azimuth angles of the face of the target object, and analyzing and processing the initial images through a preset face structure recognition model to obtain face characteristic information, wherein the face characteristic information at least comprises the position information of the lips of the target object on the face of the target object;
or,
in the step (102), acquiring video information about the target object, and judging the validity of the video information according to the face characteristic information specifically comprises,
step (1021), extracting a plurality of frames of different images from the video information according to a preset time interval, and determining the position information of the face and/or the lip of the target object in the frames of different images;
step (1022), the position information and the face characteristic information are matched, so that the validity of the video information is judged;
step (1023), if the position information is matched with the face feature information, determining that the video information is valid, and if the position information is not matched with the face feature information, determining that the video information is not valid;
or,
in the step (103), adjusting in real time a shooting mode of acquiring the target object video information based on a result of the validity judgment regarding the video information specifically includes,
if the video information has validity, maintaining a shooting mode of the video information of the target object to be currently acquired unchanged;
if the video information does not have the validity, adjusting at least one of shooting angle, shooting resolution and shooting exposure of the video information of the target object;
further, in the step (2), determining the lip motion state of the target object based on the result of the face recognition processing specifically includes,
step (201), based on the result of the face recognition processing, determining the position of the lip of the target object in the video information corresponding to the video image, so as to perform positioning and tracking processing on the lip;
a step (202) of determining a lip motion state of the target object based on a result of the positioning tracking process, wherein the lip motion state includes at least opening, closing, and trembling of the lips;
step (203), obtaining the change frequency of the lip action state within a preset time length, so as to judge whether the lip action state meets a preset voice interaction triggering condition about the target object;
further, in the step (3), adjusting a recording action mode and/or a recording data processing mode for the target object based on the lip action state specifically includes,
step (301), if the change frequency of the lip action state within a preset time length exceeds a frequency threshold, determining that the lip action state meets a preset voice interaction triggering condition, and if the change frequency of the lip action state within the preset time length does not exceed the frequency threshold, determining that the lip action state does not meet the preset voice interaction triggering condition;
step (302), when the lip action state is determined to meet a preset voice interaction triggering condition, indicating to start recording action and/or recording data processing on the target object;
step (303), when the lip action state is determined not to meet a preset voice interaction triggering condition, indicating to stop executing recording action and/or recording data processing on the target object;
the invention also provides a man-machine interaction system based on the lip action, which is characterized in that:
the man-machine interaction system based on the lip action comprises a camera module, a face recognition module, a lip action acquisition module and a recording mode adjustment module; wherein,
the camera module is used for acquiring video information about a target object;
the face recognition module is used for performing face recognition processing about the target object on the video information;
the lip action acquisition module is used for determining the lip action state of the target object according to the face recognition processing result;
the recording mode adjusting module is used for adjusting the recording action mode and/or the recording data processing mode of the target object according to the lip action state,
or,
the man-machine interaction system based on the lip action comprises a video stream acquisition module, a target object detection module, a lip key point extraction module, a lip action state estimation module and a state estimation post-processing module; wherein,
the video stream acquisition module is used for acquiring a real-time video stream, dividing it into frames to extract image information, filtering and selecting the extracted images according to preset frame rate and frame skipping parameters, sending the filtered and selected images to the target object detection module, and at the same time recording the relative timestamp information between different images;
the target object detection module is used for receiving the output of the video stream acquisition module and carrying out the following processing on each frame of picture:
a1, detecting whether the current frame picture contains face information, if so, carrying out size segmentation based on a face detection frame on the face information in the current frame picture, extracting a face image in the picture, and if not, terminating subsequent processing operation;
a2, screening face information corresponding to the current frame picture, selecting a target object, and selecting a face with the largest area as the target object based on the face size of the face detection frame so as to eliminate surrounding interference objects;
the lip key point extraction module is used for extracting the key point information of the five facial features from the face information, extracting the key points related to the lips from this key point information, and taking the extracted key points as the reference for lip action judgment, wherein the key points related to the lips comprise 20 key points related to the upper lip and 8 key points related to the lower lip;
the lip motion state estimation module adopts the key points related to the lips in continuous multi-frame picture information as corresponding input, and estimates the lip motion state by using a modeling classification mode or a trajectory prediction mode; wherein,
estimating the lip motion state using the modeling classification mode specifically includes,
B1, constructing a face mouth region enclosed by the 20 key points related to the upper lip and the 8 key points related to the lower lip, and cutting the face mouth region out of the current frame picture;
b2, splicing the mouth regions of the faces of the continuous multi-frame picture information to be used as the integral input of the modeling classification model;
b3, modeling and classifying the face mouth region of the continuous multi-frame picture information by using a deep learning classification model;
b4, judging the lip movement action state of the mouth region of the face of the continuous multi-frame picture information by using a deep learning classification model, and giving out the confidence coefficient corresponding to the lip movement action state judgment;
estimating the lip movement state using the trajectory prediction mode specifically includes,
c1, calculating a first average value of 3 key points related to an upper lip, taking the first average value as a standard point of the lower edge position of the upper lip, calculating a second average value of 3 key points related to a lower lip, taking the second average value as a standard point of the upper edge position of the lower lip, and calculating a difference value between the first average value and the second average value and taking the difference value as a distance of a lip movement action state of a current frame picture;
c2, calculating the distances of lip movement action states corresponding to continuous multi-frame pictures according to the step C1, so as to calculate track changes of a plurality of different distances;
c3, estimating two turning points of the track change by using a training set, wherein the two turning points represent turning point parameters of the lip from opening to closing and turning point parameters from closing to opening respectively;
c4, judging the lip movement state of the lip movement track of the target object by utilizing the two turning point parameters, and giving out the confidence coefficient of the judgment of the lip movement state;
the state estimation post-processing module is used for combining the confidences of the lip motion states calculated by the modeling classification mode and the trajectory prediction mode into a core parameter through a confidence-weighted comprehensive decision, performing smoothing-window processing and robustness judgment on the core parameter calculation result, and adjusting the recording action mode and/or recording data processing mode of the target object according to the finally determined lip motion state;
further, the camera module is further used for acquiring a plurality of initial images of different azimuth angles of the face of the target object;
the face recognition module is further used for analyzing and processing the plurality of initial images through a preset face structure recognition model so as to obtain face feature information, wherein the face feature information at least comprises position information of lips of the target object on the face of the target object;
further, the man-machine interaction system based on the lip action further comprises a frame image extraction module, a validity judgment module and a video shooting adjustment module;
the frame image extraction module is used for extracting a plurality of frames of different images from the video information according to a preset time interval;
the face recognition module is also used for determining the position information of the face and/or lips of the target object in the frames of different images;
the effectiveness judging module is used for carrying out matching processing on the position information and the face characteristic information so as to judge the effectiveness of the video information;
the video shooting adjustment module is used for adjusting and acquiring the shooting mode of the target object video information in real time according to the validity judgment result of the video information;
further, the lip action acquisition module comprises a positioning and tracking sub-module, a lip movement determining sub-module and a lip action triggering judging sub-module; wherein,
the positioning and tracking sub-module is used for determining the position of the lip of the target object corresponding to the video image in the video information according to the face recognition processing result, so as to perform positioning and tracking processing on the lip;
the lip movement determining submodule is used for determining the lip movement of the target object according to the result of the positioning tracking processing, wherein the lip movement at least comprises opening, closing and trembling of the lip;
the lip action triggering judging submodule is used for judging whether the lip movement meets the preset voice interaction triggering condition about the target object according to the change frequency of the lip movement in the preset time length;
further, the recording mode adjusting module comprises a triggering condition judging sub-module and a recording related control sub-module; wherein,
the trigger condition judging sub-module is used for comparing the change frequency of the lip action state in the preset time length with a preset frequency threshold value, determining that the lip action state meets a preset voice interaction trigger condition if the change frequency of the lip action state in the preset time length exceeds the frequency threshold value, and determining that the lip action state does not meet the preset voice interaction trigger condition if the change frequency of the lip action state in the preset time length does not exceed the frequency threshold value;
the recording related control submodule is used for indicating to start to execute recording action and/or recording data processing on the target object when the lip action state is determined to meet the preset voice interaction triggering condition, or is used for indicating to stop executing recording action and/or recording data processing on the target object when the lip action state is determined to not meet the preset voice interaction triggering condition.
Compared with the prior art, in which voice recognition relies only on receiving and processing external voice information, the man-machine interaction method and system based on lip actions also use the lip actions of the user to determine the start and end of the user's speech, and employ multi-modal information about the user together with lip action capture technology to overcome the inability of the prior art to extract the user's voice signal from a noisy environment.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a man-machine interaction method based on lip actions.
Fig. 2 is a schematic structural diagram of a man-machine interaction system based on lip actions according to the present invention.
Fig. 3 is a schematic structural diagram of another man-machine interaction system based on lip actions according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flow chart of a man-machine interaction method based on lip actions according to an embodiment of the present invention is shown. The man-machine interaction method based on the lip action comprises the following steps:
and (1) acquiring video information about the target object, and performing face recognition processing on the target object on the video information.
Preferably, in the step (1), video information about a target object is acquired, the performing face recognition processing about the target object on the video information specifically includes,
step (101), acquiring an initial image about the target object, so as to determine the face characteristic information of the target object;
step (102), obtaining video information about the target object, and judging the validity of the video information according to the face characteristic information;
and (103) adjusting and acquiring the shooting mode of the target object video information in real time according to the validity judgment result of the video information.
Preferably, in the step (101), an initial image about the target object is acquired, whereby the determination of the face feature information of the target object specifically includes,
acquiring a plurality of initial images of different azimuth angles of the face of the target object, and analyzing and processing the initial images through a preset face structure recognition model to obtain the face characteristic information, wherein the face characteristic information at least comprises the position information of the lips of the target object on the face of the target object.
Preferably, in the step (102), acquiring video information about the target object, and judging the validity of the video information based on the face feature information specifically includes,
step (1021), extracting a plurality of frames of different images from the video information according to a preset time interval, and determining the position information of the face and/or the lip of the target object in the frames of different images;
step (1022), the position information and the face feature information are matched so as to judge the validity of the video information;
and step (1023) determining that the video information is valid if the position information is matched with the face feature information, and determining that the video information is not valid if the position information is not matched with the face feature information.
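As an illustrative aid to steps (1021)-(1023), the sketch below shows one way the lip position found in each sampled frame could be matched against the stored face feature information. The bounding-box representation, the intersection-over-union rule, the 0.5 threshold and the helper names `iou` and `is_video_valid` are assumptions introduced for this sketch and are not taken from the embodiment.

```python
# Illustrative sketch of steps (1021)-(1023): the lip position found in each
# sampled frame is compared with the stored reference lip position.
# The bounding-box representation, the IoU rule and the 0.5 threshold are
# assumptions made for this sketch only.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_video_valid(sampled_lip_boxes, reference_lip_box, threshold=0.5):
    """The video information is treated as valid only if the lip region in
    every sampled frame overlaps the reference lip position sufficiently."""
    return all(iou(box, reference_lip_box) >= threshold
               for box in sampled_lip_boxes)
```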
Preferably, in the step (103), adjusting the shooting mode of acquiring the target object video information in real time based on the result of the validity judgment on the video information specifically includes,
if the video information has validity, maintaining the shooting mode of the video information of the target object to be obtained at present unchanged;
and if the video information is not effective, adjusting at least one of shooting angle, shooting resolution and shooting exposure of the video information of the target object.
And (2) determining the lip action state of the target object based on the result of the face recognition processing.
Preferably, in the step (2), determining the lip motion state of the target object based on the result of the face recognition process specifically includes,
step (201), based on the result of the face recognition processing, determining the position of the lip of the target object corresponding to the video image in the video information, so as to perform positioning tracking processing on the lip;
a step (202) of determining a lip action state of the target object based on a result of the positioning tracking process, wherein the lip action state includes at least opening, closing, and trembling of the lips;
and step (203), obtaining the change frequency of the lip action state within a preset time length, so as to judge whether the lip action state meets a preset voice interaction triggering condition related to the target object.
And (3) adjusting a recording action mode and/or a recording data processing mode of the target object based on the lip action state.
Preferably, in the step (3), adjusting the recording action mode and/or recording data processing mode for the target object based on the lip action state specifically includes,
step (301), if the change frequency of the lip action state within the preset time length exceeds a frequency threshold, determining that the lip action state meets the preset voice interaction triggering condition, and if the change frequency of the lip action state within the preset time length does not exceed the frequency threshold, determining that the lip action state does not meet the preset voice interaction triggering condition;
step (302), when the lip action state is determined to meet the preset voice interaction triggering condition, indicating to start recording action and/or recording data processing on the target object;
and (303) when the lip action state is determined not to meet the preset voice interaction triggering condition, indicating to stop executing the recording action and/or recording data processing on the target object.
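A minimal sketch of the trigger logic of steps (301)-(303), assuming the lip action state arrives as a per-frame label with a timestamp. The two-second window, the threshold of four state changes and the class name `LipTriggerGate` are illustrative assumptions rather than values prescribed by the method.

```python
# Sketch of steps (301)-(303): count lip action state changes inside a
# sliding time window and gate the recording action on a frequency
# threshold. The 2-second window and the threshold of 4 changes are
# illustrative assumptions, not values prescribed by the method.

from collections import deque

class LipTriggerGate:
    def __init__(self, window_s=2.0, freq_threshold=4):
        self.window_s = window_s
        self.freq_threshold = freq_threshold
        self.changes = deque()     # timestamps of observed state changes
        self.last_state = None
        self.recording = False

    def update(self, lip_state, timestamp):
        """lip_state: per-frame label such as 'open', 'closed' or 'trembling'."""
        if self.last_state is not None and lip_state != self.last_state:
            self.changes.append(timestamp)
        self.last_state = lip_state

        # Keep only the changes that fall inside the preset time length.
        while self.changes and timestamp - self.changes[0] > self.window_s:
            self.changes.popleft()

        triggered = len(self.changes) >= self.freq_threshold
        if triggered and not self.recording:
            self.recording = True      # step (302): start recording
        elif not triggered and self.recording:
            self.recording = False     # step (303): stop recording
        return self.recording
```

In use, `update` would be called once per processed frame, and the returned flag would drive the recording action and/or recording data processing.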
Referring to fig. 2, a schematic structural diagram of a man-machine interaction system based on lip actions according to an embodiment of the present invention is provided. The man-machine interaction system based on the lip action comprises a camera module, a face recognition module, a lip action acquisition module and a recording mode adjustment module; wherein,
the camera module is used for acquiring video information about a target object;
the face recognition module is used for performing face recognition processing about the target object on the video information;
the lip action acquisition module is used for determining the lip action state of the target object according to the face recognition processing result;
the recording mode adjusting module is used for adjusting a recording action mode and/or a recording data processing mode of the target object according to the lip action state.
Preferably, the camera module is further used for acquiring a plurality of initial images of different azimuth angles of the face of the target object;
preferably, the face recognition module is further configured to analyze the plurality of initial images through a preset face structure recognition model, so as to obtain the face feature information;
preferably, the face feature information at least comprises position information of lips of the target object on the face of the target object;
preferably, the man-machine interaction system based on lip actions further comprises a frame image extraction module, a validity judgment module and a video shooting adjustment module;
preferably, the frame image extracting module is used for extracting a plurality of frames of different images from the video information according to a preset time interval;
preferably, the face recognition module is further configured to determine positional information of the face and/or lips of the target object in the several frames of different images;
preferably, the validity judging module is used for carrying out matching processing on the position information and the face characteristic information so as to judge the validity of the video information;
preferably, the video shooting adjustment module is used for adjusting the shooting mode for acquiring the video information of the target object in real time according to the validity judgment result of the video information;
preferably, the lip action acquisition module comprises a positioning tracking sub-module, a lip movement determining sub-module and a lip action triggering judging sub-module; wherein,
Preferably, the positioning and tracking sub-module is configured to determine, according to a result of the face recognition processing, a position of a lip of the target object in the video information corresponding to the video image, so as to perform positioning and tracking processing on the lip;
preferably, the lip movement determining submodule is used for determining the lip movement of the target object according to the result of the positioning tracking processing, wherein the lip movement at least comprises opening, closing and trembling of the lips;
preferably, the lip action triggering judging submodule is used for judging whether the lip movement meets the preset voice interaction triggering condition about the target object according to the change frequency of the lip movement within the preset time length;
preferably, the recording mode adjustment module comprises a triggering condition judgment sub-module and a recording related control sub-module;
preferably, the triggering condition judging submodule is used for comparing the change frequency of the lip action state in the preset time length with a preset frequency threshold value, determining that the lip action state meets the preset voice interaction triggering condition if the change frequency of the lip action state in the preset time length exceeds the frequency threshold value, and determining that the lip action state does not meet the preset voice interaction triggering condition if the change frequency of the lip action state in the preset time length does not exceed the frequency threshold value;
preferably, the recording-related control submodule is used for indicating to start to execute recording action and/or recording data processing on the target object when the lip action state is determined to meet the preset voice interaction triggering condition;
preferably, the recording-related control sub-module is further configured to instruct stopping performing the recording action and/or recording data processing on the target object when it is determined that the lip action state does not meet the preset voice interaction triggering condition.
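Before turning to the second embodiment, the cooperation of the camera module, face recognition module, lip action acquisition module and recording mode adjustment module can be pictured with the following skeleton; every class and method name is a placeholder chosen for illustration, and the concrete detection, recognition and recording logic is deliberately left abstract.

```python
# Structural skeleton of this embodiment: the four modules and the data
# that flows between them. Every class and method name is a placeholder;
# the concrete detection, recognition and recording logic is left abstract.

class LipInteractionSystem:
    def __init__(self, camera, face_recognizer, lip_analyzer, recorder):
        self.camera = camera                    # camera module
        self.face_recognizer = face_recognizer  # face recognition module
        self.lip_analyzer = lip_analyzer        # lip action acquisition module
        self.recorder = recorder                # recording mode adjustment module

    def step(self, timestamp):
        frame = self.camera.capture()                            # video information
        faces = self.face_recognizer.process(frame)              # face recognition result
        lip_state = self.lip_analyzer.update(faces, timestamp)   # lip action state
        self.recorder.adjust(lip_state)                          # recording mode adjustment
```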
Referring to fig. 3, a schematic structural diagram of another man-machine interaction system based on lip actions according to the present invention is provided. The man-machine interaction system based on the lip action comprises a video stream acquisition module, a target object detection module, a lip key point extraction module, a lip action state estimation module and a state estimation post-processing module; wherein,
the video stream acquisition module is used for acquiring a real-time video stream, dividing it into frames to extract image information, filtering and selecting the extracted images according to preset frame rate and frame skipping parameters, sending the filtered and selected images to the target object detection module, and at the same time recording the relative timestamp information between different images;
the target object detection module is used for receiving the output of the video stream acquisition module and carrying out the following processing on each frame of picture:
a1, detecting whether the current frame picture contains face information, if so, carrying out size segmentation based on a face detection frame on the face information in the current frame picture, extracting a face image in the picture, and if not, terminating subsequent processing operation;
a2, screening face information corresponding to the current frame picture, selecting a target object, and selecting a face with the largest area as the target object based on the face size of the face detection frame so as to eliminate surrounding interference objects;
the lip key point extraction module is used for extracting the key point information of the five facial features from the face information, extracting the key points related to the lips from this key point information, and taking the extracted key points as the reference for lip action judgment, wherein the key points related to the lips comprise 20 key points related to the upper lip and 8 key points related to the lower lip;
the lip motion state estimation module adopts the key points related to the lips in continuous multi-frame picture information as corresponding input, and estimates the lip motion state by using a modeling classification mode or a trajectory prediction mode; wherein,
estimating the lip motion state using the modeling classification mode specifically includes,
B1, constructing a face mouth region enclosed by the 20 key points related to the upper lip and the 8 key points related to the lower lip, and cutting the face mouth region out of the current frame picture;
b2, splicing the mouth regions of the faces of the continuous multi-frame picture information to be used as the integral input of the modeling classification model;
b3, modeling and classifying the face mouth region of the continuous multi-frame picture information by using a deep learning classification model;
b4, judging the lip movement action state of the mouth region of the face of the continuous multi-frame picture information by using a deep learning classification model, and giving out the confidence coefficient corresponding to the lip movement action state judgment;
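One possible realization of the modeling classification mode of steps B1-B4 is sketched below with PyTorch; the 64x64 mouth crop, the stack of 8 consecutive frames and the small convolutional network are assumptions made purely for illustration and are not specified by the embodiment.

```python
# Possible realization of steps B1-B4: stack cropped mouth regions from
# consecutive frames and classify the stack with a small CNN. The 64x64
# crop size, the 8-frame stack and the network shape are assumptions.

import torch
import torch.nn as nn

NUM_FRAMES, CROP = 8, 64   # 8 consecutive grayscale mouth crops of 64x64

class LipMotionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(NUM_FRAMES, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (CROP // 4) * (CROP // 4), 2)

    def forward(self, mouth_stack):
        # mouth_stack: (batch, NUM_FRAMES, CROP, CROP) tensor of mouth crops
        x = self.features(mouth_stack)
        logits = self.classifier(x.flatten(1))
        # Class probabilities; the probability of the "moving" class can be
        # reported as the confidence required in step B4.
        return torch.softmax(logits, dim=1)
```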
estimating the lip movement state using the trajectory prediction mode specifically includes,
c1, calculating a first average value of 3 key points related to an upper lip, taking the first average value as a standard point of the lower edge position of the upper lip, calculating a second average value of 3 key points related to a lower lip, taking the second average value as a standard point of the upper edge position of the lower lip, and calculating a difference value between the first average value and the second average value and taking the difference value as a distance of a lip movement action state of a current frame picture;
c2, calculating the distances of lip movement action states corresponding to continuous multi-frame pictures according to the step C1, so as to calculate track changes of a plurality of different distances;
c3, estimating two turning points of the track change by using a training set, wherein the two turning points represent turning point parameters of the lip from opening to closing and turning point parameters from closing to opening respectively;
c4, judging the lip movement state of the lip movement track of the target object by utilizing the two turning point parameters, and giving out the confidence coefficient of the judgment of the lip movement state;
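A minimal sketch of the trajectory prediction mode of steps C1-C4, assuming the lip key points are available as (x, y) coordinates; which three key points are averaged, and the use of two fixed thresholds in place of the turning-point parameters learned from a training set in step C3, are simplifying assumptions.

```python
# Sketch of steps C1-C4: derive a per-frame lip opening distance from
# averaged key points, then detect open/close transitions on the distance
# trajectory. The choice of averaged points and the fixed thresholds are
# assumptions made for this sketch only.

import numpy as np

def lip_distance(upper_pts, lower_pts):
    """upper_pts / lower_pts: (3, 2) arrays holding the three key points used
    for the lower edge of the upper lip and the upper edge of the lower lip
    (step C1); returns the distance of the lip movement state."""
    upper_ref = np.mean(upper_pts, axis=0)   # first average value
    lower_ref = np.mean(lower_pts, axis=0)   # second average value
    return float(np.linalg.norm(lower_ref - upper_ref))

def detect_transitions(distances, open_thresh, close_thresh):
    """Scan the distance trajectory of consecutive frames (step C2) and
    report closed-to-open and open-to-closed transitions; open_thresh and
    close_thresh stand in for the two turning-point parameters that step C3
    estimates from a training set."""
    events = []
    is_open = distances[0] > open_thresh
    for i, d in enumerate(distances[1:], start=1):
        if not is_open and d > open_thresh:
            events.append((i, "open"))
            is_open = True
        elif is_open and d < close_thresh:
            events.append((i, "close"))
            is_open = False
    return events
```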
the state estimation post-processing module is used for combining the confidences of the lip motion states calculated by the modeling classification mode and the trajectory prediction mode into a core parameter through a confidence-weighted comprehensive decision, performing smoothing-window processing and robustness judgment on the core parameter calculation result, and adjusting the recording action mode and/or recording data processing mode of the target object according to the finally determined lip motion state.
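The confidence-weighted comprehensive decision and smoothing-window processing performed by the state estimation post-processing module could look like the sketch below; the 0.6/0.4 weights, the five-frame window and the 0.5 decision threshold are assumptions, not parameters given in the embodiment.

```python
# Sketch of the state estimation post-processing: fuse the two confidences
# by a weighted sum, smooth the fused score over a sliding window, and
# apply a decision threshold. The 0.6/0.4 weights, the 5-frame window and
# the 0.5 threshold are assumptions made for illustration.

from collections import deque

class LipStatePostProcessor:
    def __init__(self, w_cls=0.6, w_traj=0.4, window=5, threshold=0.5):
        self.w_cls, self.w_traj = w_cls, w_traj
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def update(self, conf_classification, conf_trajectory):
        """Per-frame confidences (0..1) from the modeling classification
        mode and the trajectory prediction mode."""
        fused = self.w_cls * conf_classification + self.w_traj * conf_trajectory
        self.scores.append(fused)
        smoothed = sum(self.scores) / len(self.scores)
        # True -> lips judged to be moving; this flag then drives the
        # recording action mode and/or recording data processing mode.
        return smoothed >= self.threshold
```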
As can be seen from the above embodiments, unlike existing speech recognition technology, which performs recognition only by receiving and processing external speech information, the human-computer interaction method and system based on lip actions also use the lip actions of the user to determine the start and end of the user's speech, and employ multi-modal information about the user together with lip action capture technology to overcome the inability of the prior art to extract the user's speech signal from a noisy environment.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (8)

1. The man-machine interaction method based on the lip action is characterized by comprising the following steps of:
step (1), acquiring video information about a target object, and performing face recognition processing about the target object on the video information;
step (2), determining the lip action state of the target object based on the result of the face recognition processing;
step (3), based on the lip action state, adjusting a recording action mode and/or a recording data processing mode of the target object;
in the step (2), determining the lip motion state of the target object based on the result of the face recognition processing specifically includes,
step (201), based on the result of the face recognition processing, determining the position of the lip of the target object in the video information corresponding to the video image, so as to perform positioning and tracking processing on the lip;
a step (202) of determining a lip motion state of the target object based on a result of the positioning tracking process, wherein the lip motion state includes at least opening, closing, and trembling of the lips; and step (203) of obtaining the change frequency of the lip action state within a preset time length so as to judge whether the lip action state meets a preset voice interaction triggering condition about the target object.
2. The lip action-based human-machine interaction method according to claim 1, wherein:
in the step (1), obtaining video information about a target object, performing face recognition processing on the video information about the target object specifically includes,
step (101), acquiring an initial image about the target object, so as to determine face characteristic information of the target object;
step (102), obtaining video information about the target object, and judging the validity of the video information according to the face characteristic information;
and (103) adjusting and acquiring the shooting mode of the target object video information in real time according to the validity judgment result of the video information.
3. A lip action-based human-machine interaction method as claimed in claim 2, wherein:
in the step (101), acquiring an initial image about the target object, thereby determining face feature information of the target object specifically includes,
acquiring a plurality of initial images of different azimuth angles of the face of the target object, and analyzing and processing the initial images through a preset face structure recognition model to obtain face characteristic information, wherein the face characteristic information at least comprises the position information of the lips of the target object on the face of the target object;
or,
in the step (102), acquiring video information about the target object, and judging the validity of the video information according to the face characteristic information specifically comprises,
step (1021), extracting a plurality of frames of different images from the video information according to a preset time interval, and determining the position information of the face and/or the lip of the target object in the frames of different images;
step (1022), the position information and the face characteristic information are matched, so that the validity of the video information is judged;
step (1023), if the position information is matched with the face feature information, determining that the video information is valid, and if the position information is not matched with the face feature information, determining that the video information is not valid;
or,
in the step (103), adjusting in real time a shooting mode of acquiring the target object video information based on a result of the validity judgment regarding the video information specifically includes,
if the video information has validity, maintaining a shooting mode of the video information of the target object to be currently acquired unchanged;
and if the video information does not have the validity, adjusting at least one of shooting angle, shooting resolution and shooting exposure of the video information of the target object.
4. The lip action-based human-machine interaction method according to claim 1, wherein:
in the step (3), adjusting a recording action mode and/or a recording data processing mode for the target object based on the lip action state specifically includes,
step (301), if the change frequency of the lip action state within a preset time length exceeds a frequency threshold, determining that the lip action state meets a preset voice interaction triggering condition, and if the change frequency of the lip action state within the preset time length does not exceed the frequency threshold, determining that the lip action state does not meet the preset voice interaction triggering condition;
step (302), when the lip action state is determined to meet a preset voice interaction triggering condition, indicating to start recording action and/or recording data processing on the target object;
and (303) when the lip action state is determined not to meet the preset voice interaction triggering condition, indicating to stop executing the recording action and/or recording data processing on the target object.
5. A man-machine interaction system based on lip actions is characterized in that:
the man-machine interaction system based on the lip action comprises a camera module, a face recognition module, a lip action acquisition module and a recording mode adjustment module; wherein,
the camera module is used for acquiring video information about a target object;
the face recognition module is used for performing face recognition processing about the target object on the video information;
the lip action acquisition module is used for determining the lip action state of the target object according to the face recognition processing result;
the recording mode adjusting module is used for adjusting a recording action mode and/or a recording data processing mode of the target object according to the lip action state;
or,
the man-machine interaction system based on the lip action comprises a video stream acquisition module, a target object detection module, a lip key point extraction module, a lip action state estimation module and a state estimation post-processing module; wherein,
the video stream acquisition module is used for acquiring a real-time video stream, dividing it into frames to extract image information, filtering and selecting the extracted images according to preset frame rate and frame skipping parameters, sending the filtered and selected images to the target object detection module, and at the same time recording the relative timestamp information between different images;
the target object detection module is used for receiving the output of the video stream acquisition module and carrying out the following processing on each frame of picture:
a1, detecting whether the current frame picture contains face information, if so, carrying out size segmentation based on a face detection frame on the face information in the current frame picture, extracting a face image in the picture, and if not, terminating subsequent processing operation;
a2, screening face information corresponding to the current frame picture, selecting a target object, and selecting a face with the largest area as the target object based on the face size of the face detection frame so as to eliminate surrounding interference objects;
the lip key point extraction module is used for extracting the key point information of the five facial features from the face information, extracting the key points related to the lips from this key point information, and taking the extracted key points as the reference for lip action judgment, wherein the key points related to the lips comprise 20 key points related to the upper lip and 8 key points related to the lower lip;
the lip motion state estimation module adopts the key points related to the lips in continuous multi-frame picture information as corresponding input, and estimates the lip motion state by using a modeling classification mode or a trajectory prediction mode; wherein,
estimating the lip motion state using the modeling classification mode specifically includes,
B1, constructing a face mouth region enclosed by the 20 key points related to the upper lip and the 8 key points related to the lower lip, and cutting the face mouth region out of the current frame picture;
b2, splicing the mouth regions of the faces of the continuous multi-frame picture information to be used as the integral input of the modeling classification model;
b3, modeling and classifying the face mouth region of the continuous multi-frame picture information by using a deep learning classification model;
b4, judging the lip movement action state of the mouth region of the face of the continuous multi-frame picture information by using a deep learning classification model, and giving out the confidence coefficient corresponding to the lip movement action state judgment;
estimating the lip movement state using the trajectory prediction mode specifically includes,
c1, calculating a first average value of 3 key points related to an upper lip, taking the first average value as a standard point of the lower edge position of the upper lip, calculating a second average value of 3 key points related to a lower lip, taking the second average value as a standard point of the upper edge position of the lower lip, and calculating a difference value between the first average value and the second average value and taking the difference value as a distance of a lip movement action state of a current frame picture;
c2, calculating the distances of lip movement action states corresponding to continuous multi-frame pictures according to the step C1, so as to calculate track changes of a plurality of different distances;
c3, estimating two turning points of the track change by using a training set, wherein the two turning points represent turning point parameters of the lip from opening to closing and turning point parameters from closing to opening respectively;
c4, judging the lip movement state of the lip movement track of the target object by utilizing the two turning point parameters, and giving out the confidence coefficient of the judgment of the lip movement state;
the state estimation post-processing module is used for combining the confidences of the lip motion states calculated by the modeling classification mode and the trajectory prediction mode into a core parameter through a confidence-weighted comprehensive decision, performing smoothing-window processing and robustness judgment on the core parameter calculation result, and adjusting the recording action mode and/or recording data processing mode of the target object according to the finally determined lip motion state;
the lip action acquisition module comprises a positioning tracking sub-module, a lip movement determination sub-module and a lip action triggering judgment sub-module; wherein,
the positioning and tracking sub-module is used for determining the position of the lip of the target object corresponding to the video image in the video information according to the face recognition processing result, so as to perform positioning and tracking processing on the lip;
the lip movement determining submodule is used for determining the lip movement of the target object according to the result of the positioning tracking processing, wherein the lip movement at least comprises opening, closing and trembling of the lip;
the lip action triggering judging sub-module is used for judging whether the lip movement meets the preset voice interaction triggering condition about the target object according to the change frequency of the lip movement within the preset time length.
6. A lip action based human-machine interaction system according to claim 5, wherein:
the camera module is also used for acquiring a plurality of initial images of different azimuth angles of the face of the target object;
the face recognition module is further used for analyzing and processing the plurality of initial images through a preset face structure recognition model so as to obtain face feature information, wherein the face feature information at least comprises position information of lips of the target object on the face of the target object.
7. A lip action based human-machine interaction system according to claim 5, wherein:
the man-machine interaction system based on the lip action further comprises a frame image extraction module, a validity judgment module and a video shooting adjustment module;
the frame image extraction module is used for extracting a plurality of frames of different images from the video information according to a preset time interval;
the face recognition module is also used for determining the position information of the face and/or lips of the target object in the frames of different images;
the effectiveness judging module is used for carrying out matching processing on the position information and the face characteristic information so as to judge the effectiveness of the video information;
the video shooting adjustment module is used for adjusting and acquiring the shooting mode of the target object video information in real time according to the effectiveness judgment result of the video information.
8. A lip action based human-machine interaction system according to claim 5, wherein:
the recording mode adjusting module comprises a triggering condition judging sub-module and a recording related control sub-module; wherein,
the trigger condition judging sub-module is used for comparing the change frequency of the lip action state in the preset time length with a preset frequency threshold value, determining that the lip action state meets a preset voice interaction trigger condition if the change frequency of the lip action state in the preset time length exceeds the frequency threshold value, and determining that the lip action state does not meet the preset voice interaction trigger condition if the change frequency of the lip action state in the preset time length does not exceed the frequency threshold value;
the recording related control submodule is used for indicating to start to execute recording action and/or recording data processing on the target object when the lip action state is determined to meet the preset voice interaction triggering condition, or is used for indicating to stop executing recording action and/or recording data processing on the target object when the lip action state is determined to not meet the preset voice interaction triggering condition.
CN201910859039.1A 2019-09-11 2019-09-11 Man-machine interaction method and system based on lip actions Active CN110750152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910859039.1A CN110750152B (en) 2019-09-11 2019-09-11 Man-machine interaction method and system based on lip actions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910859039.1A CN110750152B (en) 2019-09-11 2019-09-11 Man-machine interaction method and system based on lip actions

Publications (2)

Publication Number Publication Date
CN110750152A CN110750152A (en) 2020-02-04
CN110750152B (en) 2023-08-29

Family

ID=69276346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910859039.1A Active CN110750152B (en) 2019-09-11 2019-09-11 Man-machine interaction method and system based on lip actions

Country Status (1)

Country Link
CN (1) CN110750152B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428672A (en) * 2020-03-31 2020-07-17 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111918106A (en) * 2020-07-07 2020-11-10 胡飞青 Multimedia playing system and method for application scene recognition
CN112015364A (en) * 2020-08-26 2020-12-01 广州视源电子科技股份有限公司 Method and device for adjusting pickup sensitivity
CN112597467A (en) * 2020-12-15 2021-04-02 中标慧安信息技术股份有限公司 Multimedia-based resident authentication method and system
CN112966654B (en) * 2021-03-29 2023-12-19 深圳市优必选科技股份有限公司 Lip movement detection method, lip movement detection device, terminal equipment and computer readable storage medium
CN113393833B (en) * 2021-06-16 2024-04-02 中国科学技术大学 Audio and video awakening method, system, equipment and storage medium
CN113486760A (en) * 2021-06-30 2021-10-08 上海商汤临港智能科技有限公司 Object speaking detection method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2566844A1 (en) * 2005-04-13 2006-10-26 Pixel Instruments, Corp. Method, system, and program product for measuring audio video synchronization using lip and teeth charateristics
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character
WO2012128382A1 (en) * 2011-03-18 2012-09-27 Sharp Kabushiki Kaisha Device and method for lip motion detection
CN104966053A (en) * 2015-06-11 2015-10-07 腾讯科技(深圳)有限公司 Face recognition method and recognition system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101092820B1 (en) * 2009-09-22 2011-12-12 현대자동차주식회사 Lipreading and Voice recognition combination multimodal interface system
US10360441B2 (en) * 2015-11-25 2019-07-23 Tencent Technology (Shenzhen) Company Limited Image processing method and apparatus
US20170161553A1 (en) * 2015-12-08 2017-06-08 Le Holdings (Beijing) Co., Ltd. Method and electronic device for capturing photo

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2566844A1 (en) * 2005-04-13 2006-10-26 Pixel Instruments, Corp. Method, system, and program product for measuring audio video synchronization using lip and teeth charateristics
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character
WO2012128382A1 (en) * 2011-03-18 2012-09-27 Sharp Kabushiki Kaisha Device and method for lip motion detection
CN104966053A (en) * 2015-06-11 2015-10-07 腾讯科技(深圳)有限公司 Face recognition method and recognition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ma Xinjun; Wu Chenchen; Zhong Qianyuan; Li Yuanyuan. Speaker lip movement recognition based on SIFT. Journal of Computer Applications, 2017, (09), full text. *

Also Published As

Publication number Publication date
CN110750152A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110750152B (en) Man-machine interaction method and system based on lip actions
US7472063B2 (en) Audio-visual feature fusion and support vector machine useful for continuous speech recognition
US8635066B2 (en) Camera-assisted noise cancellation and speech recognition
TWI780366B (en) Facial recognition system, facial recognition method and facial recognition program
US20160140959A1 (en) Speech recognition system adaptation based on non-acoustic attributes
CN105844659B (en) The tracking and device of moving component
WO2012128382A1 (en) Device and method for lip motion detection
CN103605969A (en) Method and device for face inputting
CN111933136B (en) Auxiliary voice recognition control method and device
CN115131821A (en) Improved YOLOv5+ Deepsort-based campus personnel crossing warning line detection method
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
Ponce-López et al. Multi-modal social signal analysis for predicting agreement in conversation settings
CN115527158A (en) Method and device for detecting abnormal behaviors of personnel based on video monitoring
CN114299953B (en) Speaker role distinguishing method and system combining mouth movement analysis
CN111241922A (en) Robot, control method thereof and computer-readable storage medium
KR20210066774A (en) Method and Apparatus for Distinguishing User based on Multimodal
CN114282621B (en) Multi-mode fused speaker role distinguishing method and system
CN114299952B (en) Speaker role distinguishing method and system combining multiple motion analysis
CN110892412A (en) Face recognition system, face recognition method, and face recognition program
CN115188081B (en) Complex scene-oriented detection and tracking integrated method
CN106599765B (en) Method and system for judging living body based on video-audio frequency of object continuous pronunciation
CN116705016A (en) Control method and device of voice interaction equipment, electronic equipment and medium
CN112132865A (en) Personnel identification method and system
CN112183165A (en) Face recognition method based on accumulated monitoring video
Yoshinaga et al. Audio-visual speech recognition using new lip features extracted from side-face images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant