CN110750152A - Human-computer interaction method and system based on lip action

Human-computer interaction method and system based on lip action

Info

Publication number
CN110750152A
Authority
CN
China
Prior art keywords
lip
target object
face
information
module
Prior art date
Legal status
Granted
Application number
CN201910859039.1A
Other languages
Chinese (zh)
Other versions
CN110750152B (en)
Inventor
刘青松
李旭滨
Current Assignee
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd
Priority to CN201910859039.1A
Publication of CN110750152A
Application granted
Publication of CN110750152B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a human-computer interaction method based on lip actions, which comprises the following steps: step (1), acquiring video information about a target object, and performing face recognition processing on the video information with respect to the target object; step (2), determining the lip action state of the target object based on the result of the face recognition processing; and step (3), adjusting the recording action mode and/or the recording data processing mode for the target object based on the lip action state.

Description

Human-computer interaction method and system based on lip action
Technical Field
The invention relates to the technical field of human-computer interaction, in particular to a human-computer interaction method and system based on lip actions.
Background
At present, voice recognition technology is widely applied in many different settings; in particular, IoT intelligent terminal devices can perform wake-up operations through voice recognition. The recognition is mainly realized in different modes such as offline VAD sentence break, cloud VAD sentence break, a mixture of offline and cloud VAD sentence break, or ASR cloud sentence break, and the voice recognition result is finally obtained from the corresponding VAD engine or ASR engine. In practical applications, most IoT intelligent terminal devices are placed in noisy environments such as administrative halls, office areas, shopping malls, roads, stations or airports. In such environments, the cloud VAD and cloud ASR cannot work normally because of interference from external noise: the external noise not only causes false wake-up operations of the IoT intelligent terminal devices, but can also cause the sentence-break operation to end abnormally, so that voice recognition interaction between human and machine cannot proceed normally. Therefore, there is a need in the art for a method and system that can accurately and quickly perform speech recognition and human-computer interaction in a noisy environment.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a human-computer interaction method and system based on lip actions. The human-computer interaction method based on lip actions comprises the following steps: step (1), acquiring video information about a target object, and performing face recognition processing on the video information with respect to the target object; step (2), determining the lip action state of the target object based on the result of the face recognition processing; and step (3), adjusting the recording action mode and/or the recording data processing mode for the target object based on the lip action state. In addition, the human-computer interaction system based on lip actions comprises a camera module, a face recognition module, a lip action acquisition module and a recording mode adjustment module. The camera module is used for acquiring video information about a target object; the face recognition module is used for performing face recognition processing on the video information with respect to the target object; the lip action acquisition module is used for determining the lip action state of the target object according to the result of the face recognition processing; and the recording mode adjustment module is used for adjusting the recording action mode and/or the recording data processing mode for the target object according to the lip action state. Unlike existing voice recognition technology, which relies only on receiving and processing external voice information, the method and system also combine the user's lip actions to determine the start and end of the user's voice activity; by using multi-modal information about the user together with lip motion capture, they overcome the prior-art difficulty of extracting the user's voice signal from a noisy environment.
The invention provides a human-computer interaction method based on lip actions, which is characterized by comprising the following steps of:
step (1), acquiring video information about a target object, and performing face recognition processing on the video information with respect to the target object;
step (2), determining the lip action state of the target object based on the result of the face recognition processing;
step (3), based on the lip action state, adjusting a recording action mode and/or a recording data processing mode of the target object;
further, in the step (1), acquiring video information about a target object, and performing face recognition processing about the target object on the video information specifically includes,
a step (101) of acquiring an initial image of the target object so as to determine the facial feature information of the target object;
step (102), video information about the target object is obtained, and the effectiveness of the video information is judged according to the face feature information;
step (103), adjusting the shooting mode of the video information of the target object in real time according to the effectiveness judgment result of the video information;
further, in the step (101), acquiring an initial image of the target object, thereby determining the facial feature information of the target object specifically includes,
acquiring a plurality of initial images related to different azimuth angles of the face of the target object, and analyzing and processing the initial images through a preset face structure recognition model to obtain the face feature information, wherein the face feature information at least comprises position information of lips of the target object on the face;
alternatively,
in the step (102), acquiring video information about the target object, and judging the validity of the video information according to the face feature information specifically comprises,
step (1021), extracting a plurality of different frames of images from the video information according to a preset time interval, and determining position information of the face and/or lip of the target object in the plurality of different frames of images;
step (1022), the position information and the face feature information are matched, so that the validity of the video information is judged;
step (1023), if the position information matches with the face feature information, determining that the video information has validity, and if the position information does not match with the face feature information, determining that the video information does not have validity;
alternatively,
in the step (103), adjusting in real time a shooting mode for acquiring the video information of the target object according to the validity judgment result on the video information specifically includes,
if the video information has validity, maintaining the shooting mode of the current video information of the target object unchanged;
if the video information does not have validity, adjusting at least one of a shooting angle, a shooting resolution and a shooting exposure of the target object video information;
further, in the step (2), determining the lip motion state of the target object based on the result of the face recognition processing specifically includes,
step (201), based on the result of the face recognition processing, determining the position of the lip of the target object in the video information corresponding to the video image, so as to perform positioning tracking processing on the lip;
a step (202) of determining lip motion states of the target object based on a result of the localization tracking process, wherein the lip motion states include at least opening, closing and tremor of lips;
step (203), obtaining the change frequency of the lip action state in a preset time length, so as to judge whether the lip action state meets a preset voice interaction triggering condition about the target object;
further, in the step (3), adjusting the recording action mode and/or the recording data processing mode for the target object based on the lip action state specifically includes,
step (301), if the change frequency of the lip action state in a preset time length exceeds a frequency threshold, determining that the lip action state meets a preset voice interaction triggering condition, and if the change frequency of the lip action state in the preset time length does not exceed the frequency threshold, determining that the lip action state does not meet the preset voice interaction triggering condition;
step (302), when the lip action state is determined to meet a preset voice interaction triggering condition, indicating to start executing a recording action and/or recording data processing on the target object;
step (303), when the lip action state is determined not to meet the preset voice interaction triggering condition, indicating to stop executing the recording action and/or recording data processing on the target object;
the invention also provides a human-computer interaction system based on lip action, which is characterized in that:
the human-computer interaction system based on the lip action comprises a camera module, a face recognition module, a lip action acquisition module and a recording mode adjustment module; wherein,
the camera module is used for acquiring video information about a target object;
the face recognition module is used for performing face recognition processing on the video information with respect to the target object;
the lip action acquisition module is used for determining the lip action state of the target object according to the result of the face recognition processing;
the recording mode adjusting module is used for adjusting a recording action mode and/or a recording data processing mode of the target object according to the lip action state,
alternatively,
the human-computer interaction system based on the lip action comprises a video stream acquisition module, a target object detection module, a lip key point extraction module, a lip action state estimation module and a state estimation post-processing module; wherein,
the video stream acquisition module is used for acquiring a real-time video stream, dividing it into frames and extracting image information, filtering and selecting the extracted image information according to preset frame rate and frame-skipping parameters, sending the filtered and selected image information to the target object detection module, and simultaneously recording the relative timestamp information among the different images;
the target object detection module is used for receiving the output of the video stream acquisition module and processing each frame of picture as follows:
a1, detecting whether the current frame picture contains face information, if so, performing size segmentation based on a face detection frame on the face information in the current frame picture, extracting a face image in the picture, and if not, terminating the subsequent processing operation;
a2, screening face information corresponding to the current frame picture, selecting a target object, and selecting a face with the largest area as the target object to eliminate surrounding interference objects based on the face size of the face detection frame;
the lip key point extraction module is used for extracting facial landmark key point information from the face information, extracting the key points related to the lips from the facial landmark key point information, and using these key points as the reference for lip action judgment, wherein the key points related to the lips comprise 20 key points related to the upper lip and 8 key points related to the lower lip;
the lip motion state estimation module adopts key points related to lips in continuous multi-frame picture information as corresponding input and estimates the lip motion state by utilizing a modeling classification mode or a track prediction mode; wherein,
the estimation of the lip movement state by using the modeling classification mode specifically comprises,
B1, constructing a surrounded human face mouth region based on the 20 key points related to the upper lip and the 8 key points related to the lower lip, and cutting the human face mouth region from the current frame picture;
b2, splicing the face mouth regions of continuous multi-frame picture information to be used as the integral input of a modeling classification model;
b3, modeling and classifying the face mouth region of continuous multi-frame picture information by using a deep learning classification model;
b4, carrying out lip movement action state judgment on the face mouth region of continuous multi-frame picture information by using the deep learning classification model, and giving a confidence coefficient corresponding to the lip movement action state judgment;
the estimation of the lip motion state using the trajectory prediction mode specifically includes,
c1, calculating a first average value of 3 key points related to the upper lip, taking the first average value as a calibration point of the lower edge position of the upper lip, calculating a second average value of 3 key points related to the lower lip, taking the second average value as a calibration point of the upper edge position of the lower lip, calculating a difference value between the first average value and the second average value, and taking the difference value as the distance of the lip movement action state of the current frame picture;
c2, according to the step C1, calculating the distance of the lip movement action state corresponding to the continuous multiframe pictures, and calculating the track change of a plurality of different distances;
c3, estimating two turning points of the track change by using the training set, wherein the two turning points respectively represent the turning point parameter of the lip from opening to closing and the turning point parameter from closing to opening;
c4, judging the lip movement state of the lip movement track of the target object by using the above two turning point parameters, and giving the confidence of the judgment of the lip movement state;
the state estimation post-processing module is used for performing core parameter calculation, in a confidence-weighted comprehensive decision mode, on the confidence coefficients of the lip movement state obtained from the modeling classification mode and the track prediction mode, applying smoothing-window processing and robustness judgment to the core parameter calculation result, and adjusting the recording action mode and/or the recording data processing mode for the target object according to the finally determined lip movement state;
further, the camera module is also used for acquiring a plurality of initial images related to different azimuth angles of the face of the target object;
the face recognition module is further configured to analyze the plurality of initial images through a preset face structure recognition model so as to obtain the face feature information, where the face feature information at least includes position information of lips of the target object on the face;
further, the human-computer interaction system based on the lip action further comprises a frame image extraction module, an effectiveness judgment module and a video shooting adjustment module;
the frame image extraction module is used for extracting a plurality of different frames of images from the video information according to a preset time interval;
the face recognition module is further used for determining the position information of the face and/or the lip of the target object in the plurality of different frames of images;
the validity judging module is used for matching the position information with the face feature information so as to judge the validity of the video information;
the video shooting adjusting module is used for adjusting the shooting mode for acquiring the video information of the target object in real time according to the effectiveness judgment result of the video information;
further, the lip action acquisition module comprises a positioning tracking sub-module, a lip movement determination sub-module and a lip action triggering judgment sub-module; wherein,
the positioning and tracking submodule is used for determining the position of the lip of the target object in the video information corresponding to the video image according to the result of the face recognition processing, so as to perform positioning and tracking processing on the lip;
the lip movement determination submodule is used for determining lip movement of the target object according to the positioning and tracking processing result, wherein the lip movement at least comprises opening, closing and shaking of lips;
the lip action triggering judgment submodule is used for judging whether the lip movement meets a preset voice interaction triggering condition about the target object according to the change frequency of the lip movement within a preset time length;
further, the recording mode adjusting module comprises a triggering condition judging submodule and a recording related control submodule; wherein,
the trigger condition judgment submodule is used for comparing the change frequency of the lip action state within a preset time length with a preset frequency threshold, if the change frequency of the lip action state within the preset time length exceeds the frequency threshold, the lip action state is determined to meet the preset voice interaction trigger condition, and if the change frequency of the lip action state within the preset time length does not exceed the frequency threshold, the lip action state is determined not to meet the preset voice interaction trigger condition;
and the recording related control submodule is used for indicating to start to execute recording action and/or recording data processing on the target object when the lip action state is determined to meet the preset voice interaction triggering condition, or indicating to stop executing the recording action and/or recording data processing on the target object when the lip action state is determined to not meet the preset voice interaction triggering condition.
Compared with the prior art, the human-computer interaction method and system based on lip actions differ from existing voice recognition technology, which relies only on receiving and processing external voice information: the system and method further combine the user's lip actions to determine the start and end of the user's voice activity, and use multi-modal information about the user together with lip motion capture to overcome the prior-art difficulty of extracting the user's voice signal from a noisy environment.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a human-computer interaction method based on lip action according to the present invention.
Fig. 2 is a schematic structural diagram of a human-computer interaction system based on lip action according to the present invention.
Fig. 3 is a schematic structural diagram of another human-computer interaction system based on lip action according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a human-computer interaction method based on lip movement according to an embodiment of the present invention. The human-computer interaction method based on the lip action comprises the following steps:
the method comprises the following steps of (1) acquiring video information about a target object, and carrying out face recognition processing about the target object on the video information.
Preferably, in the step (1), acquiring video information about the target object and performing face recognition processing on the video information with respect to the target object specifically includes,
step (101), obtaining an initial image of the target object, and determining the face feature information of the target object;
step (102), video information about the target object is obtained, and the effectiveness of the video information is judged according to the face feature information;
and (103) adjusting the shooting mode for acquiring the video information of the target object in real time according to the effectiveness judgment result of the video information.
Preferably, in the step (101), obtaining an initial image of the target object to determine the facial feature information of the target object specifically includes,
acquiring a plurality of initial images related to different azimuth angles of the face of the target object, and analyzing and processing the plurality of initial images through a preset face structure recognition model so as to obtain the face feature information, wherein the face feature information at least comprises position information of the lip of the target object on the face.
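As an illustrative sketch of step (101): the embodiment does not specify the face structure recognition model, so the snippet below assumes a hypothetical `landmark_model` callable that returns a face bounding box and a set of 2D facial key points per image, and assumes a 68-point landmark convention in which indices 48-67 cover the lips. It aggregates a lip-position reference over several azimuth views.

```python
import numpy as np

LIP_LANDMARK_IDS = list(range(48, 68))  # assumed 68-point convention; the lip index range is an assumption

def build_face_feature_info(initial_images, landmark_model):
    """Aggregate the lip position (normalised by the face box) over several azimuth views."""
    lip_centers = []
    for img in initial_images:
        face_box, landmarks = landmark_model(img)   # hypothetical detector: (x, y, w, h), (N, 2) points
        x0, y0, w, h = face_box
        lips = landmarks[LIP_LANDMARK_IDS]
        # normalise the lip centre by the face box so it is comparable across azimuth angles
        lip_centers.append((lips.mean(axis=0) - np.array([x0, y0])) / np.array([w, h]))
    return {"lip_center": np.mean(lip_centers, axis=0)}
```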
Preferably, in the step (102), acquiring video information about the target object, and determining validity of the video information based on the face feature information specifically includes,
step (1021), extracting several frames of different images from the video information according to a preset time interval, and determining the position information of the face and/or lip of the target object in the several frames of different images;
step (1022), match the position information with the face feature information, so as to determine the validity of the video information;
and (1023) if the position information is matched with the face feature information, determining that the video information has validity, and if the position information is not matched with the face feature information, determining that the video information does not have validity.
Preferably, in the step (103), adjusting in real time the shooting mode for acquiring the video information of the target object according to the judgment result on the validity of the video information specifically includes,
if the video information has validity, the shooting mode for currently acquiring the video information of the target object is maintained unchanged;
and if the video information does not have validity, adjusting at least one of the shooting angle, the shooting resolution and the shooting exposure of the video information of the target object.
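Continuing the sketch above for steps (102)-(103), again under stated assumptions: the same hypothetical `landmark_model`, an illustrative matching tolerance, and a placeholder camera interface whose `pan`/`set_exposure` methods are not part of the embodiment.

```python
import numpy as np

def video_is_valid(sampled_frames, face_feature_info, landmark_model, tol=0.15):
    """Step (102): match the per-frame lip position against the reference face feature information."""
    ref = face_feature_info["lip_center"]
    for img in sampled_frames:                       # frames already extracted at the preset time interval
        face_box, landmarks = landmark_model(img)
        if face_box is None:
            return False                             # no face found: the position cannot match
        x0, y0, w, h = face_box
        center = (landmarks[48:68].mean(axis=0) - np.array([x0, y0])) / np.array([w, h])
        if np.linalg.norm(center - ref) > tol:       # illustrative tolerance
            return False
    return True

def adjust_shooting_mode(camera, valid):
    """Step (103): leave the capture settings unchanged while the video remains valid."""
    if not valid:
        camera.pan(degrees=5)                        # placeholder adjustments; the embodiment only
        camera.set_exposure(camera.exposure * 1.2)   # names angle, resolution and exposure
```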
Step (2): determining the lip action state of the target object based on the result of the face recognition processing.
Preferably, in the step (2), determining the lip motion state of the target object based on the result of the face recognition processing specifically includes,
step (201), based on the result of the face recognition processing, determining the position of the lip of the target object in the video information corresponding to the video image, so as to perform positioning tracking processing on the lip;
a step (202) of determining lip motion states of the target object based on a result of the localization tracking process, wherein the lip motion states at least include opening, closing and tremor of lips;
and (203) acquiring the change frequency of the lip action state within a preset time length, so as to judge whether the lip action state meets a preset voice interaction triggering condition about the target object.
Step (3): adjusting the recording action mode and/or the recording data processing mode for the target object based on the lip action state.
Preferably, in the step (3), adjusting the recording action mode and/or the recording data processing mode for the target object based on the lip action state specifically includes,
step (301), if the change frequency of the lip action state in the preset time length exceeds a frequency threshold, determining that the lip action state meets a preset voice interaction triggering condition, and if the change frequency of the lip action state in the preset time length does not exceed the frequency threshold, determining that the lip action state does not meet the preset voice interaction triggering condition;
step (302), when the lip action state is determined to meet the preset voice interaction triggering condition, indicating to start executing the recording action and/or recording data processing on the target object;
and (303) when the lip action state is determined not to meet the preset voice interaction triggering condition, indicating to stop executing the recording action and/or recording data processing on the target object.
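A minimal sketch of steps (301)-(303), assuming the lip action states observed within the preset time length are available as a list of discrete labels; the frequency threshold value and the recorder interface are placeholders, not values or APIs taken from the embodiment.

```python
def count_state_changes(lip_states):
    """Number of state transitions (e.g. open -> closed) within the observation window."""
    return sum(1 for a, b in zip(lip_states, lip_states[1:]) if a != b)

def update_recording(recorder, lip_states, window_seconds, freq_threshold_hz=2.0):
    """Steps (301)-(303): start or stop recording based on the lip-state change frequency."""
    change_freq = count_state_changes(lip_states) / window_seconds
    if change_freq > freq_threshold_hz:   # voice interaction trigger condition satisfied
        recorder.start()                  # begin the recording action and/or recording data processing
    else:
        recorder.stop()                   # stop the recording action and/or recording data processing
```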
Fig. 2 is a schematic structural diagram of a human-computer interaction system based on lip movement according to an embodiment of the present invention. The human-computer interaction system based on the lip action comprises a camera module, a face recognition module, a lip action acquisition module and a recording mode adjustment module; wherein,
the camera module is used for acquiring video information about a target object;
the face recognition module is used for performing face recognition processing on the video information with respect to the target object;
the lip action acquisition module is used for determining the lip action state of the target object according to the result of the face recognition processing;
the recording mode adjusting module is used for adjusting a recording action mode and/or a recording data processing mode of the target object according to the lip action state.
Preferably, the camera module is further configured to acquire a plurality of initial images of different azimuth angles of the face of the target object;
preferably, the face recognition module is further configured to analyze and process the plurality of initial images through a preset face structure recognition model, so as to obtain the face feature information;
preferably, the facial feature information at least includes position information of lips of the target object on the face;
preferably, the human-computer interaction system based on lip action further comprises a frame image extraction module, an effectiveness judgment module and a video shooting adjustment module;
preferably, the frame image extraction module is configured to extract a plurality of different frames of images from the video information according to a preset time interval;
preferably, the face recognition module is further configured to determine position information of the face and/or lips of the target object in the several different images;
preferably, the validity judging module is configured to perform matching processing on the position information and the face feature information, so as to judge validity of the video information;
preferably, the video shooting adjustment module is configured to adjust a shooting mode for acquiring the video information of the target object in real time according to a result of the validity determination regarding the video information;
preferably, the lip action acquisition module comprises a positioning tracking sub-module, a lip movement determination sub-module and a lip action triggering judgment sub-module;
Preferably, the positioning and tracking sub-module is configured to determine, according to the result of the face recognition processing, a position of a lip of the target object in the video information corresponding to the video image, so as to perform positioning and tracking processing on the lip;
preferably, the lip movement determination submodule is configured to determine lip movements of the target object according to a result of the localization tracking process, wherein the lip movements include at least opening, closing, and tremor of lips; preferably, the lip action triggering and judging submodule is configured to judge whether the lip movement meets a preset voice interaction triggering condition for the target object according to a change frequency of the lip movement within a preset time length;
preferably, the recording mode adjusting module comprises a triggering condition judging submodule and a recording related control submodule;
preferably, the triggering condition determining submodule is configured to compare a change frequency of the lip action state within a preset time length with a preset frequency threshold, determine that the lip action state satisfies a preset voice interaction triggering condition if the change frequency of the lip action state within the preset time length exceeds the frequency threshold, and determine that the lip action state does not satisfy the preset voice interaction triggering condition if the change frequency of the lip action state within the preset time length does not exceed the frequency threshold;
preferably, the recording-related control submodule is configured to instruct the target object to start to perform a recording action and/or record data processing when it is determined that the lip action state satisfies a preset voice interaction trigger condition;
preferably, the recording-related control sub-module is further configured to instruct the target object to stop performing the recording action and/or recording data processing when it is determined that the lip action state does not satisfy the preset voice interaction trigger condition.
Fig. 3 is a schematic structural diagram of another human-computer interaction system based on lip movement according to the present invention. The human-computer interaction system based on the lip action comprises a video stream acquisition module, a target object detection module, a lip key point extraction module, a lip action state estimation module and a state estimation post-processing module; wherein,
the video stream acquisition module is used for acquiring a real-time video stream, dividing it into frames and extracting image information, filtering and selecting the extracted image information according to preset frame rate and frame-skipping parameters, sending the filtered and selected image information to the target object detection module, and simultaneously recording the relative timestamp information among the different images;
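An illustrative sketch of the video stream acquisition module using OpenCV; the frame-skip value, the frame cap and the use of a wall-clock timestamp are assumptions, since the embodiment only states that frames are filtered by preset frame rate and frame-skipping parameters and tagged with relative timestamps.

```python
import time
import cv2

def acquire_frames(source=0, frame_skip=3, max_frames=100):
    """Decode a real-time stream, keep every `frame_skip`-th frame, and record relative timestamps."""
    cap = cv2.VideoCapture(source)
    kept, index, t0 = [], 0, time.monotonic()
    while cap.isOpened() and len(kept) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_skip == 0:                      # filter by the preset frame-skip parameter
            kept.append((frame, time.monotonic() - t0))  # relative timestamp between images
        index += 1
    cap.release()
    return kept
```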
the target object detection module is used for receiving the output of the video stream acquisition module and processing each frame of picture as follows:
a1, detecting whether the current frame picture contains face information, if so, performing size segmentation based on a face detection frame on the face information in the current frame picture, extracting a face image in the picture, and if not, terminating the subsequent processing operation;
a2, screening face information corresponding to the current frame picture, selecting a target object, and selecting a face with the largest area as the target object to eliminate surrounding interference objects based on the face size of the face detection frame;
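A sketch of steps A1-A2, using OpenCV's Haar cascade face detector as a stand-in for the embodiment's unspecified face detector; the face with the largest detection box is kept as the target object and the remaining detections are discarded.

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_target_face(frame):
    """A1/A2: detect faces, abort if none are found, otherwise crop the largest face box."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None                                      # no face: terminate subsequent processing
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])   # largest area rejects surrounding interference
    return frame[y:y + h, x:x + w]                       # face image cut by the detection box
```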
the lip key point extraction module is used for extracting facial landmark key point information from the face information, extracting the key points related to the lips from the facial landmark key point information, and using these key points as the reference for lip action judgment, wherein the key points related to the lips comprise 20 key points related to the upper lip and 8 key points related to the lower lip;
the lip motion state estimation module adopts key points related to lips in continuous multi-frame picture information as corresponding input and estimates the lip motion state by utilizing a modeling classification mode or a track prediction mode; wherein,
the estimation of the lip movement state by using the modeled classification model specifically comprises
B1, constructing a surrounded human face mouth region based on the 20 key points related to the upper lip and the 8 key points related to the lower lip, and cutting the human face mouth region from the current frame picture;
b2, splicing the face mouth regions of continuous multi-frame picture information to be used as the integral input of a modeling classification model;
b3, modeling and classifying the face mouth region of continuous multi-frame picture information by using a deep learning classification model;
b4, carrying out lip movement action state judgment on the face mouth region of continuous multi-frame picture information by using the deep learning classification model, and giving a confidence coefficient corresponding to the lip movement action state judgment;
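A sketch of B1-B2 under stated assumptions: the 20 upper-lip and 8 lower-lip key points arrive as one (28, 2) array of pixel coordinates, and each mouth crop is resized before consecutive crops are spliced into a single clip; the deep-learning classification model of B3-B4 is only indicated as an assumed placeholder interface.

```python
import cv2
import numpy as np

def crop_mouth_region(frame, lip_points, size=(64, 64)):
    """B1: bound the 28 lip key points and cut the mouth region out of the current frame."""
    xs, ys = lip_points[:, 0].astype(int), lip_points[:, 1].astype(int)
    crop = frame[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(crop, size)

def stack_mouth_clip(frames, lip_points_per_frame):
    """B2: splice the mouth regions of consecutive frames into one input clip."""
    crops = [crop_mouth_region(f, p) for f, p in zip(frames, lip_points_per_frame)]
    return np.stack(crops, axis=0)        # (T, H, W, C) clip for the classification model

# B3/B4 (assumed interface): state, confidence = model.predict(stack_mouth_clip(frames, points))
```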
the estimation of the lip motion state using the trajectory prediction mode specifically includes,
c1, calculating a first average value of 3 key points related to the upper lip, taking the first average value as a calibration point of the lower edge position of the upper lip, calculating a second average value of 3 key points related to the lower lip, taking the second average value as a calibration point of the upper edge position of the lower lip, calculating a difference value between the first average value and the second average value, and taking the difference value as the distance of the lip movement action state of the current frame picture;
c2, according to the step C1, calculating the distance of the lip movement action state corresponding to the continuous multiframe pictures, and calculating the track change of a plurality of different distances;
c3, estimating two turning points of the track change by using the training set, wherein the two turning points respectively represent the turning point parameter of the lip from opening to closing and the turning point parameter from closing to opening;
c4, judging the lip movement state of the lip movement track of the target object by using the above two turning point parameters, and giving the confidence of the judgment of the lip movement state;
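A sketch of C1-C4, assuming the three key points used for each lip edge are passed in explicitly and that the two turning-point thresholds (open-to-closed and closed-to-open) have already been estimated from a training set; the event confidence formula is illustrative.

```python
import numpy as np

def lip_gap(upper_edge_pts, lower_edge_pts):
    """C1: distance between the mean of 3 upper-lip and 3 lower-lip edge key points."""
    return float(np.mean(lower_edge_pts[:, 1]) - np.mean(upper_edge_pts[:, 1]))

def classify_track(gaps, open_to_close_thr, close_to_open_thr):
    """C2-C4: follow the gap trajectory over consecutive frames and report state changes."""
    events = []
    for prev, cur in zip(gaps, gaps[1:]):
        if prev > open_to_close_thr > cur:       # crossed the open-to-closed turning point
            events.append(("closing", min(1.0, (prev - cur) / open_to_close_thr)))
        elif prev < close_to_open_thr < cur:     # crossed the closed-to-open turning point
            events.append(("opening", min(1.0, (cur - prev) / close_to_open_thr)))
    return events
```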
the state estimation post-processing module is used for performing core parameter calculation on the confidence coefficient of the lip movement state calculated by utilizing the modeling classification mode and the track prediction mode according to a confidence coefficient weighting comprehensive decision mode, performing processing on a smooth window and judging robustness on a core parameter calculation result, and adjusting a recording movement mode and/or a recording data processing mode of a target object according to the finally determined lip movement state.
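A sketch of this post-processing step, assuming both estimators produce one confidence value per frame; the fusion weights, smoothing-window length and decision threshold are illustrative, not values taken from the embodiment.

```python
import numpy as np

def fuse_and_smooth(cls_conf, traj_conf, w_cls=0.6, w_traj=0.4, window=5, decide_thr=0.5):
    """Confidence-weighted comprehensive decision followed by a moving-average smoothing window."""
    fused = w_cls * np.asarray(cls_conf, dtype=float) + w_traj * np.asarray(traj_conf, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(fused, kernel, mode="same")   # smoothing window for robust judgment
    return smoothed > decide_thr                          # True where the lips are judged to be moving
```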
It can be seen from the above embodiments that the human-computer interaction method and system based on lip actions differ from existing voice recognition technology, which relies only on receiving and processing external voice information: the system and method further combine the user's lip actions to determine the start and end of the user's voice activity, and use multi-modal information about the user together with lip motion capture to overcome the prior-art difficulty of extracting the user's voice signal from a noisy environment.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A human-computer interaction method based on lip actions is characterized by comprising the following steps:
step (1), acquiring video information about a target object, and performing face recognition processing on the video information with respect to the target object;
step (2), determining the lip action state of the target object based on the result of the face recognition processing;
and (3) adjusting a recording action mode and/or a recording data processing mode of the target object based on the lip action state.
2. A human-computer interaction method based on lip action according to claim 1, characterized in that:
in the step (1), acquiring video information about a target object, and performing face recognition processing about the target object on the video information specifically includes,
a step (101) of acquiring an initial image of the target object so as to determine the facial feature information of the target object;
step (102), video information about the target object is obtained, and the effectiveness of the video information is judged according to the face feature information;
and (103) adjusting the shooting mode for acquiring the video information of the target object in real time according to the judgment result about the effectiveness of the video information.
3. A human-computer interaction method based on lip action according to claim 2, characterized in that:
in the step (101), acquiring an initial image of the target object, so as to determine the face feature information of the target object specifically comprises,
acquiring a plurality of initial images related to different azimuth angles of the face of the target object, and analyzing and processing the initial images through a preset face structure recognition model to obtain the face feature information, wherein the face feature information at least comprises position information of lips of the target object on the face;
alternatively,
in the step (102), acquiring video information about the target object, and judging the validity of the video information according to the face feature information specifically comprises,
step (1021), extracting a plurality of different frames of images from the video information according to a preset time interval, and determining position information of the face and/or lip of the target object in the plurality of different frames of images;
step (1022), the position information and the face feature information are matched, so that the validity of the video information is judged;
step (1023), if the position information matches with the face feature information, determining that the video information has validity, and if the position information does not match with the face feature information, determining that the video information does not have validity;
alternatively,
in the step (103), adjusting in real time a shooting mode for acquiring the video information of the target object according to the validity judgment result on the video information specifically includes,
if the video information has validity, maintaining the shooting mode of the current video information of the target object unchanged;
and if the video information does not have validity, adjusting at least one of the shooting angle, the shooting resolution and the shooting exposure of the target object video information.
4. A human-computer interaction method based on lip action according to claim 1, characterized in that:
in the step (2), determining the lip motion state of the target object based on the result of the face recognition processing specifically includes,
step (201), based on the result of the face recognition processing, determining the position of the lip of the target object in the video information corresponding to the video image, so as to perform positioning tracking processing on the lip;
a step (202) of determining lip motion states of the target object based on a result of the localization tracking process, wherein the lip motion states include at least opening, closing and tremor of lips;
and (203) acquiring the change frequency of the lip action state in a preset time length, so as to judge whether the lip action state meets a preset voice interaction triggering condition about the target object.
5. A human-computer interaction method based on lip action according to claim 1, characterized in that:
in the step (3), adjusting the recording action mode and/or the recording data processing mode for the target object based on the lip action state specifically includes,
step (301), if the change frequency of the lip action state in a preset time length exceeds a frequency threshold, determining that the lip action state meets a preset voice interaction triggering condition, and if the change frequency of the lip action state in the preset time length does not exceed the frequency threshold, determining that the lip action state does not meet the preset voice interaction triggering condition;
step (302), when the lip action state is determined to meet a preset voice interaction triggering condition, indicating to start executing a recording action and/or recording data processing on the target object;
and (303) when the lip action state is determined not to meet the preset voice interaction triggering condition, indicating to stop executing the recording action and/or recording data processing on the target object.
6. A human-computer interaction system based on lip action is characterized in that:
the human-computer interaction system based on the lip action comprises a camera module, a face recognition module, a lip action acquisition module and a recording mode adjustment module; wherein,
the camera module is used for acquiring video information about a target object;
the face recognition module is used for performing face recognition processing on the video information with respect to the target object;
the lip action acquisition module is used for determining the lip action state of the target object according to the result of the face recognition processing;
the recording mode adjusting module is used for adjusting a recording action mode and/or a recording data processing mode of the target object according to the lip action state;
alternatively,
the human-computer interaction system based on the lip action comprises a video stream acquisition module, a target object detection module, a lip key point extraction module, a lip action state estimation module and a state estimation post-processing module; wherein,
the video stream acquisition module is used for acquiring a real-time video stream, dividing it into frames and extracting image information, filtering and selecting the extracted image information according to preset frame rate and frame-skipping parameters, sending the filtered and selected image information to the target object detection module, and simultaneously recording the relative timestamp information among the different images;
the target object detection module is used for receiving the output of the video stream acquisition module and processing each frame of picture as follows:
a1, detecting whether the current frame picture contains face information, if so, performing size segmentation based on a face detection frame on the face information in the current frame picture, extracting a face image in the picture, and if not, terminating the subsequent processing operation;
a2, screening face information corresponding to the current frame picture, selecting a target object, and selecting a face with the largest area as the target object to eliminate surrounding interference objects based on the face size of the face detection frame;
the lip key point extraction module is used for extracting facial landmark key point information from the face information, extracting the key points related to the lips from the facial landmark key point information, and using these key points as the reference for lip action judgment, wherein the key points related to the lips comprise 20 key points related to the upper lip and 8 key points related to the lower lip;
the lip motion state estimation module adopts key points related to lips in continuous multi-frame picture information as corresponding input and estimates the lip motion state by utilizing a modeling classification mode or a track prediction mode; wherein,
the estimation of the lip movement state by using the modeling classification mode specifically comprises,
B1, constructing a surrounded human face mouth region based on the 20 key points related to the upper lip and the 8 key points related to the lower lip, and cutting the human face mouth region from the current frame picture;
b2, splicing the face mouth regions of continuous multi-frame picture information to be used as the integral input of a modeling classification model;
b3, modeling and classifying the face mouth region of continuous multi-frame picture information by using a deep learning classification model;
b4, carrying out lip movement action state judgment on the face mouth region of continuous multi-frame picture information by using the deep learning classification model, and giving a confidence coefficient corresponding to the lip movement action state judgment;
the estimation of the lip motion state using the trajectory prediction mode specifically includes,
c1, calculating a first average value of 3 key points related to the upper lip, taking the first average value as a calibration point of the lower edge position of the upper lip, calculating a second average value of 3 key points related to the lower lip, taking the second average value as a calibration point of the upper edge position of the lower lip, calculating a difference value between the first average value and the second average value, and taking the difference value as the distance of the lip movement action state of the current frame picture;
c2, according to the step C1, calculating the distance of the lip movement action state corresponding to the continuous multiframe pictures, and calculating the track change of a plurality of different distances;
c3, estimating two turning points of the track change by using the training set, wherein the two turning points respectively represent the turning point parameter of the lip from opening to closing and the turning point parameter from closing to opening;
c4, judging the lip movement state of the lip movement track of the target object by using the above two turning point parameters, and giving the confidence of the judgment of the lip movement state;
the state estimation post-processing module is used for performing core parameter calculation, in a confidence-weighted comprehensive decision mode, on the confidence coefficients of the lip movement state obtained from the modeling classification mode and the track prediction mode, applying smoothing-window processing and robustness judgment to the core parameter calculation result, and adjusting the recording action mode and/or the recording data processing mode for the target object according to the finally determined lip movement state.
7. A human-computer interaction system based on lip action according to claim 6, characterized in that:
the camera module is also used for acquiring a plurality of initial images related to different azimuth angles of the face of the target object;
the face recognition module is further configured to analyze the plurality of initial images through a preset face structure recognition model, so as to obtain the face feature information, where the face feature information at least includes position information of a lip of the target object on a face of the target object.
8. A human-computer interaction system based on lip action according to claim 6, characterized in that:
the human-computer interaction system based on the lip action further comprises a frame image extraction module, an effectiveness judgment module and a video shooting adjustment module;
the frame image extraction module is used for extracting a plurality of different frames of images from the video information according to a preset time interval;
the face recognition module is further used for determining the position information of the face and/or the lip of the target object in the plurality of different frames of images;
the validity judging module is used for matching the position information with the face feature information so as to judge the validity of the video information;
and the video shooting adjusting module is used for adjusting the shooting mode for acquiring the video information of the target object in real time according to the effectiveness judgment result of the video information.
9. A human-computer interaction system based on lip action according to claim 6, characterized in that:
the lip action acquisition module comprises a positioning tracking sub-module, a lip movement determination sub-module and a lip action triggering judgment sub-module; wherein,
the positioning and tracking submodule is used for determining the position of the lip of the target object in the video information corresponding to the video image according to the result of the face recognition processing, so as to perform positioning and tracking processing on the lip;
the lip movement determination submodule is used for determining lip movement of the target object according to the positioning and tracking processing result, wherein the lip movement at least comprises opening, closing and shaking of lips;
the lip action triggering judgment submodule is used for judging whether the lip movement meets a preset voice interaction triggering condition about the target object according to the change frequency of the lip movement within a preset time length.
10. A human-computer interaction system based on lip action according to claim 6, characterized in that:
the recording mode adjusting module comprises a triggering condition judging submodule and a recording related control submodule; wherein,
the trigger condition judgment submodule is used for comparing the change frequency of the lip action state within a preset time length with a preset frequency threshold, if the change frequency of the lip action state within the preset time length exceeds the frequency threshold, the lip action state is determined to meet the preset voice interaction trigger condition, and if the change frequency of the lip action state within the preset time length does not exceed the frequency threshold, the lip action state is determined not to meet the preset voice interaction trigger condition;
and the recording related control submodule is used for indicating to start to execute recording action and/or recording data processing on the target object when the lip action state is determined to meet the preset voice interaction triggering condition, or indicating to stop executing the recording action and/or recording data processing on the target object when the lip action state is determined to not meet the preset voice interaction triggering condition.
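The sketch below illustrates the trigger-condition comparison and the recording start/stop indication of claim 10. The window length, frequency threshold and `recorder` interface are assumptions, not elements defined by the patent.

```python
# Sketch of claim 10: the change frequency of the lip action state within a
# preset time window is compared against a preset frequency threshold, and the
# outcome drives a start/stop indication for recording.
from collections import deque


class RecordingController:
    def __init__(self, recorder, window_s=2.0, freq_threshold=3):
        self.recorder = recorder                # hypothetical object with start()/stop()
        self.window_s = window_s                # preset time length
        self.freq_threshold = freq_threshold    # preset frequency threshold
        self.changes = deque()                  # timestamps of lip-state changes
        self.recording = False

    def on_state_change(self, timestamp):
        self.changes.append(timestamp)

    def evaluate(self, now):
        # Keep only the state changes inside the preset time window.
        while self.changes and now - self.changes[0] > self.window_s:
            self.changes.popleft()
        triggered = len(self.changes) > self.freq_threshold
        if triggered and not self.recording:
            self.recording = True
            self.recorder.start()   # indicate start of recording / recording data processing
        elif not triggered and self.recording:
            self.recording = False
            self.recorder.stop()    # indicate stop of recording / recording data processing
```

In this sketch, calling `evaluate` periodically with the current timestamp would start recording while the lip state keeps changing and stop it once the lips become still.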
CN201910859039.1A 2019-09-11 2019-09-11 Man-machine interaction method and system based on lip actions Active CN110750152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910859039.1A CN110750152B (en) 2019-09-11 2019-09-11 Man-machine interaction method and system based on lip actions

Publications (2)

Publication Number Publication Date
CN110750152A true CN110750152A (en) 2020-02-04
CN110750152B CN110750152B (en) 2023-08-29

Family

ID=69276346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910859039.1A Active CN110750152B (en) 2019-09-11 2019-09-11 Man-machine interaction method and system based on lip actions

Country Status (1)

Country Link
CN (1) CN110750152B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2566844A1 (en) * 2005-04-13 2006-10-26 Pixel Instruments, Corp. Method, system, and program product for measuring audio video synchronization using lip and teeth characteristics
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character
US20110071830A1 (en) * 2009-09-22 2011-03-24 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
WO2012128382A1 (en) * 2011-03-18 2012-09-27 Sharp Kabushiki Kaisha Device and method for lip motion detection
CN104966053A (en) * 2015-06-11 2015-10-07 腾讯科技(深圳)有限公司 Face recognition method and recognition system
US20170344811A1 (en) * 2015-11-25 2017-11-30 Tencent Technology (Shenzhen) Company Limited Image processing method and apparatus
US20170161553A1 (en) * 2015-12-08 2017-06-08 Le Holdings (Beijing) Co., Ltd. Method and electronic device for capturing photo

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
任玉强; 田国栋; 周祥东; 吕江靖; 周曦: "Research on a lip-reading recognition algorithm for high-security face recognition systems" *
马新军; 吴晨晨; 仲乾元; 李园园: "Speaker lip-motion recognition based on SIFT" *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021196648A1 (en) * 2020-03-31 2021-10-07 北京市商汤科技开发有限公司 Method and apparatus for driving interactive object, device and storage medium
CN111428672A (en) * 2020-03-31 2020-07-17 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
JP2022531055A (en) * 2020-03-31 2022-07-06 北京市商▲湯▼科技▲開▼▲發▼有限公司 Interactive target drive methods, devices, devices, and recording media
CN111918106A (en) * 2020-07-07 2020-11-10 胡飞青 Multimedia playing system and method for application scene recognition
CN112015364A (en) * 2020-08-26 2020-12-01 广州视源电子科技股份有限公司 Method and device for adjusting pickup sensitivity
CN112597467A (en) * 2020-12-15 2021-04-02 中标慧安信息技术股份有限公司 Multimedia-based resident authentication method and system
CN112966654A (en) * 2021-03-29 2021-06-15 深圳市优必选科技股份有限公司 Lip movement detection method and device, terminal equipment and computer readable storage medium
WO2022205843A1 (en) * 2021-03-29 2022-10-06 深圳市优必选科技股份有限公司 Lip movement detection method and apparatus, terminal device, and computer readable storage medium
CN112966654B (en) * 2021-03-29 2023-12-19 深圳市优必选科技股份有限公司 Lip movement detection method, lip movement detection device, terminal equipment and computer readable storage medium
CN113393833A (en) * 2021-06-16 2021-09-14 中国科学技术大学 Audio and video awakening method, system, device and storage medium
CN113393833B (en) * 2021-06-16 2024-04-02 中国科学技术大学 Audio and video awakening method, system, equipment and storage medium
CN113486760A (en) * 2021-06-30 2021-10-08 上海商汤临港智能科技有限公司 Object speaking detection method and device, electronic equipment and storage medium
WO2023273064A1 (en) * 2021-06-30 2023-01-05 上海商汤临港智能科技有限公司 Object speaking detection method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN110750152B (en) 2023-08-29

Similar Documents

Publication Publication Date Title
CN110750152B (en) Man-machine interaction method and system based on lip actions
CN110543867B (en) Crowd density estimation system and method under condition of multiple cameras
US7343289B2 (en) System and method for audio/video speaker detection
US7472063B2 (en) Audio-visual feature fusion and support vector machine useful for continuous speech recognition
JP4616702B2 (en) Image processing
US8314854B2 (en) Apparatus and method for image recognition of facial areas in photographic images from a digital camera
WO2019023921A1 (en) Gesture recognition method, apparatus, and device
CN105844659B (en) The tracking and device of moving component
CN111881726B (en) Living body detection method and device and storage medium
JP2001092974A (en) Speaker recognizing method, device for executing the same, method and device for confirming audio generation
JP2008501172A (en) Image comparison method
WO2012128382A1 (en) Device and method for lip motion detection
CN112215155A (en) Face tracking method and system based on multi-feature fusion
CN103605969A (en) Method and device for face inputting
TWI780366B (en) Facial recognition system, facial recognition method and facial recognition program
US10964326B2 (en) System and method for audio-visual speech recognition
Tiawongsombat et al. Robust visual speakingness detection using bi-level HMM
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
CN112286364A (en) Man-machine interaction method and device
KR20190009006A (en) Real time multi-object tracking device and method by using global motion
CN114282621B (en) Multi-mode fused speaker role distinguishing method and system
CN110892412A (en) Face recognition system, face recognition method, and face recognition program
CN106599765B (en) Method and system for judging living body based on video-audio frequency of object continuous pronunciation
CN112132865A (en) Personnel identification method and system
CN114299952B (en) Speaker role distinguishing method and system combining multiple motion analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant