CN111325139A - Lip language identification method and device - Google Patents

Lip language identification method and device

Info

Publication number
CN111325139A
CN111325139A (application CN202010099127.9A)
Authority
CN
China
Prior art keywords
face
video frame
visible light
library
target group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010099127.9A
Other languages
Chinese (zh)
Other versions
CN111325139B (en)
Inventor
刘晓成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202010099127.9A priority Critical patent/CN111325139B/en
Publication of CN111325139A publication Critical patent/CN111325139A/en
Application granted granted Critical
Publication of CN111325139B publication Critical patent/CN111325139B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a lip language identification method and device. The method comprises the following steps: receiving a visible light video frame collected by a terminal and a thermal imaging video frame corresponding to the visible light video frame; determining an identification area in the visible light video frame according to the thermal imaging video frame, and identifying a human face in the identification area; matching the recognized face with the faces in the target group face library, and if the recognized face is matched with at least one face in the target group face library, determining that the recognized face is effective; and performing lip language recognition on the face determined to be effective according to the visible light video frame containing the face determined to be effective.

Description

Lip language identification method and device
Technical Field
The application relates to the technical field of image processing, in particular to a lip language identification method and device.
Background
In the fields of artificial intelligence and image processing, various functions (such as analyzing a user's facial expressions and actions) can be realized by using image information of a target. Image acquisition and recognition have long been popular research topics, touching many aspects of daily life and scientific research. For example, the accuracy of somatosensory interaction and semantic recognition can be improved by recognizing the lip language of a user's face, bringing a more comfortable interaction experience.
At present, a typical lip language identification process is as follows: image information of a target human body is acquired using a depth camera, an infrared camera, or a combination of the two; the position of the face is determined from the synthesized image; and lip action features are extracted for lip language recognition. This process mainly focuses on how to train a lip recognition model and accurately recognize lips in the video, but pays no attention to the person to whom the lips belong, so the usability of the lip language recognition device is poor.
How to improve the usability and applicability of lip language recognition in complex dynamic environments is an urgent problem for the industry to research and solve.
Disclosure of Invention
The embodiment of the application provides a lip language identification method and device, which improve the complex dynamic environment adaptability of lip language identification and ensure the consistency of the lip language identification from beginning to end.
In a first aspect, an embodiment of the present application provides a lip language identification method, including:
receiving a visible light video frame and a thermal imaging video frame corresponding to the visible light video frame, which are collected by a terminal;
determining an identification area in the visible light video frame according to the thermal imaging video frame, and identifying the face in the identification area;
matching the recognized face with the faces in the target group face library, and if the recognized face is matched with at least one face in the target group face library, determining that the recognized face is effective;
and performing lip language recognition on the face determined to be effective according to the visible light video frame containing the face determined to be effective.
Optionally, determining an identification area in the visible light video frame according to the thermal imaging video frame, and identifying a face in the identification area, includes:
determining a background area in a thermal imaging video frame, and shielding the background;
carrying out differential operation on the visible light video frame and the thermal imaging video frame with the shielded background area to obtain a differential video frame;
and determining an identification area in the visible light video frame according to the difference video frame, and identifying the face in the identification area.
Optionally, if the identified face matches at least one face in the face library of the target group, determining that the identified face is valid includes:
comparing the recognized face with the faces in the face library of the target group;
and if the similarity between the recognized face and at least one face in the target group face library is greater than or equal to a first threshold value, determining that the recognized face is effective.
Optionally, the method in the embodiment of the present application further includes: if the similarity between the recognized face and the faces in the target group face library is smaller than the first threshold but greater than or equal to a second threshold, adding the recognized face to the target group face library, or using the recognized face to replace the library face with the highest similarity to it; wherein the first threshold is greater than the second threshold.
Optionally, if a plurality of faces are recognized and determined to be valid, performing lip language recognition on the faces determined to be valid includes: performing lip language recognition on each of the plurality of faces determined to be valid.
In a second aspect, an embodiment of the present application provides a server, including:
the receiving module is used for receiving the visible light video frames collected by the terminal and the thermal imaging video frames corresponding to the visible light video frames;
the face recognition module is used for determining a recognition area in the visible light video frame according to the thermal imaging video frame and recognizing a face in the recognition area;
the effective face determining module is used for matching the face obtained by recognition with the faces in the target group face library, and if the face obtained by recognition is matched with at least one face in the target group face library, determining that the recognized face is effective;
and the lip language recognition module is used for carrying out lip language recognition on the face determined to be effective according to the visible light video frame containing the face determined to be effective.
Optionally, the face recognition module is specifically configured to:
determining a background area in a thermal imaging video frame, and shielding the background;
carrying out differential operation on the visible light video frame and the thermal imaging video frame with the shielded background area to obtain a differential video frame;
and determining an identification area in the visible light video frame according to the difference video frame, and identifying the face in the identification area.
Optionally, the valid face determination module is specifically configured to: comparing the recognized face with the faces in the face library of the target group; and if the similarity between the recognized face and at least one face in the target group face library is greater than or equal to a first threshold value, determining that the recognized face is effective.
Optionally, the apparatus in the embodiment of the present application further includes:
the target group face library updating module is used for adding the recognized face into the target group face library or replacing the face with the highest similarity to the recognized face in the target group library by using the recognized face under the condition that the similarity between the recognized face and the face in the target group face library is smaller than a first threshold value but larger than or equal to a second threshold value; wherein the first threshold is greater than the second threshold.
Optionally, the lip language recognition module is specifically configured to: if the valid face determination module determines that a plurality of the faces identified by the face recognition module are valid, perform lip language recognition on each of the plurality of faces determined to be valid.
In a third aspect, an embodiment of the present application provides a server, including a processor and a memory; the memory is coupled to the processor and configured to store computer instructions, and the processor is configured to execute the computer instructions to cause the server to perform the method of any one of the first aspects described above.
In a fourth aspect, embodiments of the present application provide a computer storage medium having computer program instructions stored therein, which when executed on a computer, cause the computer to perform the method of any of the above first aspects.
In the embodiment of the application, when lip language is identified, based on the target group face library, lip language identification can be performed on members in a specific group, face interference of people who do not need to be concerned is eliminated, the complex dynamic environment adaptability of lip language identification is improved, and the consistency of the target from beginning to end in the lip language identification is ensured.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a diagram illustrating a lip language recognition system architecture provided by an embodiment of the present application;
Fig. 2 is a flowchart illustrating a lip language identification method provided by an embodiment of the present application;
Fig. 3 schematically illustrates a structural diagram of a server 300 provided in an embodiment of the present application;
Fig. 4 schematically shows a structural diagram of a server 400 according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 schematically shows a lip language recognition system architecture diagram provided in an embodiment of the present application. As shown in fig. 1, the architecture includes: a terminal 101, a server 102, and a network 103. The terminal is provided with a thermal imaging camera 1011 and a visible light camera 1012, and is used for acquiring a thermal imaging video frame sequence and a visible light video frame sequence of a monitored place in real time and sending them to the server. The server 102 may be a common web server, an enterprise server, or the like, and is used for implementing the lip language identification method. The network 103 may be the internet, a local area network, or the like, and connects the terminal and the server for data communication.
For clarity of description of the embodiments of the present application, a detailed description of the thermal imaging camera and the visible light camera will be provided below.
A thermal imaging camera: a thermal imaging camera detects infrared energy (heat) without contact, converts it into an electrical signal, and generates a thermal image and temperature values, which can then be read out. Its principle is as follows: the human body is a natural source of infrared radiation and continuously radiates and absorbs infrared energy. The temperature of each part of a normal human body is stable and characteristic, and different temperatures produce different thermal fields; when a part becomes diseased or abnormal, the blood flow there changes, so the local temperature changes. Based on this principle, infrared thermal imaging collects the infrared radiation of the human body through a thermal imager, converts it into a digital signal, and generates a color thermal image. For example, experts in a physical examination center can analyze such heat maps to judge the location, nature, and degree of a lesion of the human body.
Visible light camera: visible light imaging technology works within the range of human vision, operates in the visible light band, and depends on natural illumination. In terms of the visual effect on the human eye, visible radiation of different wavelengths is perceived as different colors; a color corresponding to a single wavelength of light radiation is called monochromatic light or a spectral color, and the perceived color varies mostly with the intensity of the light. The visible light camera is built on visible light imaging technology, supports real-time transmission and information processing, and offers high resolution.
Fig. 2 is a flowchart illustrating a lip language identification method provided in an embodiment of the present application, where the flowchart includes the following steps:
s201: the server receives the visible light video frames collected by the terminal and the thermal imaging video frames corresponding to the visible light video frames.
The visible light video frame and the corresponding thermal imaging video frame are acquired synchronously, that is, the acquisition time of the visible light video frame is the same as the acquisition time of the thermal imaging video frame.
In the step, the thermal imaging video frame is collected and sent by the thermal imaging camera, and the visible light video frame is collected and sent by the visible light camera. Wherein, the video frame frequency of the thermal imaging camera can be set to be the same as that of the visible light camera.
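The synchronous acquisition described above can be sketched as pairing the two streams by capture timestamp. In this minimal sketch, the `Frame` type, its field names, and the 5 ms tolerance are illustrative assumptions, not details from the patent:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp_ms: int   # capture time reported by the camera
    data: bytes         # raw frame payload (placeholder)

def pair_frames(visible, thermal, tolerance_ms=5):
    """Pair each visible-light frame with the thermal frame captured at
    (approximately) the same time. Both lists are assumed sorted by time."""
    pairs = []
    j = 0
    for v in visible:
        # advance the thermal cursor while the next thermal frame is closer
        while j + 1 < len(thermal) and \
                abs(thermal[j + 1].timestamp_ms - v.timestamp_ms) <= \
                abs(thermal[j].timestamp_ms - v.timestamp_ms):
            j += 1
        if abs(thermal[j].timestamp_ms - v.timestamp_ms) <= tolerance_ms:
            pairs.append((v, thermal[j]))
    return pairs
```

When the two cameras are configured with the same frame rate, as the step suggests, nearly every visible frame finds a thermal partner within the tolerance.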
S202: and the server determines an identification area in the visible light video frame according to the thermal imaging video frame and identifies the face in the identification area.
In an actual application scenario, the visible light video frame may contain some invalid faces, such as a face in a poster or a face on an advertisement screen, which interfere with face recognition. In the thermal imaging video frame collected by the thermal imaging camera, faces in posters, on advertisement screens, and the like do not appear, because they radiate no infrared signal outward. In addition, although a human face far from the thermal imaging camera does radiate an infrared signal, the signal is weak because of the distance, so such a face is usually ignored. Therefore, according to the thermal imaging video frame, the regions containing invalid faces or faces with weak infrared signals can be treated as background and excluded from identification, while the regions with strong infrared signals serve as the identification area. In this way invalid faces are filtered out, reducing face recognition overhead and eliminating interference factors.
In the step, a background area in a thermal imaging video frame can be determined firstly, and the background is shielded; then, carrying out differential operation on the visible light video frame and the thermal imaging video frame with the shielded background area to obtain a differential video frame; and then determining an identification area in the visible light video frame according to the difference video frame, and identifying the face in the identification area.
Specifically, in the above process, the non-sensitive or weakly sensitive regions in the thermal imaging video frame may be determined as the background region according to the characteristics of the heat source. The visible light video frame is taken as a static image of the current environment and the thermal imaging video frame as an active source image, and a differential operation on the two yields a difference image. The distribution of the differences is then estimated to determine the faces to be recognized in the image.
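The background-shielding and differential-operation steps above can be sketched as follows. The array representation, the `heat_thresh` value, and the function name are illustrative assumptions; a real system would also first register (align) the thermal frame to the visible one:

```python
import numpy as np

def recognition_region(visible, thermal, heat_thresh=0.3):
    """Derive the recognition area of a visible-light frame from the
    corresponding thermal frame.

    visible, thermal: float arrays in [0, 1] of the same H x W, assumed
    already registered to each other.
    Returns (roi_mask, diff): a boolean mask of the recognition area and
    the differential video frame.
    """
    # 1. Background = pixels with no or only weak infrared response.
    background = thermal < heat_thresh
    # 2. Shield the background in the thermal frame.
    masked_thermal = np.where(background, 0.0, thermal)
    # 3. Differential operation between the visible frame and the
    #    background-shielded thermal frame.
    diff = np.abs(visible - masked_thermal)
    # 4. Only warm regions are searched for faces; faces printed on posters
    #    or shown on screens fall into the cold background and are filtered.
    roi_mask = ~background
    return roi_mask, diff
```

Face detection would then run only inside `roi_mask`, which is how the invalid faces described above are skipped.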
S203: and the server matches the recognized face with the faces in the target group face library, and if the recognized face is matched with at least one face in the target group face library, the recognized face is determined to be effective.
The faces of the members in the target group (target crowd) may be collected in advance, and the collected face data may be stored in the target group face library. The members of the target group are persons who need lip language recognition.
In this step, whether the identified face matches a face in the target group face library can be judged as follows: the recognized face is compared with the faces in the target group face library to obtain similarity scores; if the similarity between the recognized face and at least one face in the library is greater than or equal to a first threshold, the recognized face is determined to be valid, and lip language recognition is performed on it.
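A minimal sketch of this validity check, assuming faces are compared as embedding vectors under cosine similarity (the embedding representation and the threshold value 0.8 are assumptions; the patent specifies neither):

```python
import numpy as np

FIRST_THRESHOLD = 0.8   # illustrative value; the patent does not fix one

def cosine_sim(a, b):
    # similarity between two face embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_valid_face(face_emb, library):
    """The face is valid if it matches at least one library face, i.e. its
    similarity to some library entry reaches the first threshold."""
    return any(cosine_sim(face_emb, ref) >= FIRST_THRESHOLD
               for ref in library)
```

Any face-embedding model could supply the vectors; the decision rule itself is independent of how the embeddings are produced.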
The usage scene of the lip language recognition device is not necessarily a stable environment: people may move, and other people may pass through the lens, so it cannot be guaranteed that the face in the video always belongs to the same person. Lip language identification is a continuous process, and its consistency from beginning to end needs to be maintained. Based on the target group face library, attention can be restricted to a specific group, i.e., lip language recognition is performed only on members of that group, eliminating interference.
For example, when the server is performing lip language identification analysis on the person a, the person B passes through the lens and speaks a few words, but the face of the person B is not recorded in the face library of the target group, at this time, the server filters the face of the person B and does not identify the lip language of the person B, so that the lip language identification result of the person B is prevented from polluting the lip language identification result of the person a, and the consistency of the target from beginning to end in the lip language identification process is ensured.
Optionally, if the similarity between the identified face and the faces in the target group face library is smaller than the first threshold but greater than or equal to a second threshold, the identified face is added to the target group face library, or is used to replace the library face with the highest similarity to it, where the first threshold is greater than the second threshold.
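The two-threshold update rule can be sketched as follows, again assuming embedding vectors under cosine similarity and illustrative threshold values; the patent leaves both the similarity measure and the concrete thresholds open:

```python
import numpy as np

def cosine_sim(a, b):
    # standard cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def update_library(face_emb, library, first=0.8, second=0.6, replace=False):
    """Two-threshold rule of the method (threshold values illustrative):
    best >= first            -> face valid, library unchanged
    second <= best < first   -> add the face, or replace its closest entry
    best < second            -> treated as a stranger, library unchanged
    Returns (valid, library).
    """
    sims = [cosine_sim(face_emb, ref) for ref in library]
    best = max(sims) if sims else 0.0
    if best >= first:
        return True, library
    if best >= second:
        if replace:
            library = list(library)
            library[sims.index(best)] = face_emb   # replace the closest face
        else:
            library = library + [face_emb]         # add as a new sample
    return False, library
```

The `replace` flag selects between the two variants the method describes: growing the library with a new sample or overwriting the most similar existing entry.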
In an actual application scene, a human face changes over time and with age, so the faces in the target group face library need to be updated according to the recognized faces, thereby maintaining attention to the specific group.
For example, suppose the face of person C stored in the target group face library has bangs, and when person C later appears in a captured video frame the hairstyle has changed and the former bangs have been combed back. When the recognized face of person C without bangs is compared with the stored face of person C, if the similarity is smaller than the first threshold but greater than or equal to the second threshold, the recognized face of person C is added to the target group face library, or is used to replace the library face with the highest similarity to it, thereby maintaining attention to the specific group.
S204: and the server performs lip language recognition on the face determined to be effective according to the visible light video frame containing the face determined to be effective.
In the actual lip language identification process, a conversation may involve two or more people. If a separate device were used to identify each person's lip language, multiple devices would need to work simultaneously while ensuring that no second person appears in any lens; since real scenes are complex and changeable, such operation is very inconvenient.
Optionally, in the above flow, if a plurality of faces are recognized and determined to be valid, lip language recognition may be performed on each of them. Specifically, when a plurality of faces are recognized, whether each recognized face is a valid face can be judged in turn; when a plurality of valid faces are found, each valid face is numbered, and lip language recognition is performed on each separately.
In this step, a pre-trained lip language recognition model can be used to perform lip language recognition and output an analysis result; the model may be a deep neural network model. Specifically, the video frames input to the model may be preprocessed to form input data, and then the trained lip language recognition model predicts the spoken content of the input data and outputs an analysis result. Finally, the analysis result is converted into storable text information and recorded under the corresponding person's name according to the face number information.
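A schematic sketch of this per-face pipeline, with the model and the preprocessing left as stubs (all names here are illustrative; the patent does not define this API):

```python
def preprocess(clip):
    """Placeholder preprocessing; a real system would crop, align, and
    normalize the lip region of every frame in the clip."""
    return clip

def recognize_all(clips_by_face, names_by_face, model):
    """Run lip language recognition per numbered valid face.

    clips_by_face: {face_number: video clip of that valid face}
    names_by_face: {face_number: person name from the target group library}
    model:         callable clip -> recognized text (in practice a trained
                   deep neural network; here any stub will do)
    """
    transcript = {}
    for face_no, clip in clips_by_face.items():
        text = model(preprocess(clip))
        # record the text under the corresponding person's name,
        # keyed by the face number information
        transcript[names_by_face[face_no]] = text
    return transcript
```

Keying the output by person name rather than by raw face number is what lets the results stay attributed correctly across a whole session.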
Performing lip language identification on the lips of a plurality of valid faces improves the usability of the system and its adaptability to complex dynamic environments.
In the embodiment of the application, before lip language recognition, the thermal imaging technology and the face recognition technology are combined to filter invalid faces, so that the consistency of the target from beginning to end in the lip language recognition is ensured.
Based on the same technical concept, the embodiment of the application also provides a server, and the server executes the method in the embodiment.
Referring to fig. 3, a server structure provided in the embodiment of the present application is shown. As shown in fig. 3, the server 300 includes a receiving module 301, a face recognition module 302, a valid face determination module 303, and a lip language recognition module 304.
The receiving module 301 is configured to receive a visible light video frame and a thermal imaging video frame corresponding to the visible light video frame, which are acquired by a terminal;
the face recognition module 302 is configured to determine a recognition area in the visible light video frame according to the thermal imaging video frame, and recognize a face in the recognition area;
an effective face determining module 303, configured to match the identified face with faces in a target group face library, and determine that the identified face is effective if the identified face is matched with at least one face in the target group face library;
and the lip language recognition module 304 is configured to perform lip language recognition on the face determined to be valid according to the visible light video frame containing the face determined to be valid.
Optionally, the embodiment of the present application further includes a target group face library updating module 305, where the target group face library updating module 305 is configured to, if the similarity between the identified face and the face in the target group face library is smaller than a first threshold but greater than or equal to a second threshold, add the identified face to the target group face library, or replace the face with the highest similarity between the identified face and the face in the target group library with the identified face; wherein the first threshold is greater than the second threshold.
Optionally, the face recognition module 302 is specifically configured to: determining a background area in a thermal imaging video frame, and shielding the background; carrying out differential operation on the visible light video frame and the thermal imaging video frame with the shielded background area to obtain a differential video frame; and determining an identification area in the visible light video frame according to the difference video frame, and identifying the face in the identification area.
Optionally, the effective face determining module 303 is specifically configured to: comparing the recognized face with the faces in the face library of the target group; and if the similarity between the recognized face and at least one face in the target group face library is greater than or equal to a first threshold value, determining that the recognized face is effective.
Optionally, the lip language recognition module 304 is further configured to perform lip language recognition on a plurality of faces determined to be valid if the valid face determination module determines that a plurality of faces identified by the face recognition module are valid.
Based on the same technical concept, the embodiment of the application also provides a server, and the server executes the method in the embodiment.
Fig. 4 shows a schematic structural diagram of a server 400 provided in an embodiment of the present application. Referring to fig. 4, the server 400 includes a processor 401 and a network interface 402. The processor 401 may also be a controller. The processor 401 is configured to perform the functions referred to in fig. 2. The network interface 402 is configured to support messaging functionality. The server 400 may also include a memory 403, the memory 403 being coupled to the processor 401 and storing program instructions and data necessary for the device. The processor 401, the network interface 402 and the memory 403 are connected, the memory 403 is used for storing instructions, and the processor 401 is used for executing the instructions stored in the memory 403 to control the network interface 402 to send and receive messages, so as to complete the steps of the method for executing corresponding functions.
In the embodiment of the present application, for concepts, explanations, details, and other steps related to the technical solution provided in the embodiment of the present application, reference is made to the description of the foregoing method or the related steps in other embodiments, and details are not described herein.
It should be noted that the processor referred to in the embodiments of the present application may be a Central Processing Unit (CPU), a general purpose processor, a Digital Signal Processor (DSP), an application-specific integrated circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic devices, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. A processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, a DSP and a microprocessor, or the like. Wherein the memory may be integrated in the processor or may be provided separately from the processor.
Embodiments of the present application also provide a computer storage medium for storing instructions that, when executed, may perform the method of the foregoing embodiments.
The embodiments of the present application also provide a computer program product comprising a computer program, where the computer program is used to execute the method of the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (12)

1. A lip language identification method is characterized by comprising the following steps:
receiving a visible light video frame collected by a terminal and a thermal imaging video frame corresponding to the visible light video frame;
determining an identification area in the visible light video frame according to the thermal imaging video frame, and identifying a human face in the identification area;
matching the recognized face with the faces in the target group face library, and if the recognized face matches at least one face in the target group face library, determining that the recognized face is effective;
and performing lip language recognition on the face determined to be effective according to the visible light video frame containing the face determined to be effective.
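For illustration only (this sketch is not part of the claims), the steps of claim 1 might be arranged as below. All names, thresholds, and the choice of a mean-based thermal mask are hypothetical assumptions; the claim does not fix any concrete detector, embedding, or similarity measure:

```python
import numpy as np

def locate_recognition_region(visible, thermal):
    """Keep only the visible-light pixels whose thermal counterpart is
    warmer than average -- an assumed stand-in for the claimed step of
    determining the recognition area from the thermal imaging frame."""
    foreground = thermal > thermal.mean()
    return np.where(foreground, visible, 0)

def is_valid(face_vec, library, similarity, threshold=0.8):
    """Claim 1: the recognized face is effective (valid) if it matches
    at least one face in the target group face library."""
    return any(similarity(face_vec, g) >= threshold for g in library)
```

A face found valid would then be passed, together with the visible-light frames containing it, to whatever lip-reading model the implementation uses.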
2. The method of claim 1, wherein determining a recognition area in the visible light video frame from the thermal imaging video frame and recognizing a human face in the recognition area comprises:
determining a background area in the thermal imaging video frame, and masking the background area;
performing a difference operation on the visible light video frame and the thermal imaging video frame whose background area has been masked, to obtain a difference video frame;
and determining an identification area in the visible light video frame according to the difference video frame, and identifying the face in the identification area.
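Not part of the claim, but as a rough numerical sketch of the masking and difference steps above (the `background_temp` threshold and the use of a simple per-pixel absolute difference are assumptions made for the example):

```python
import numpy as np

def masked_difference(visible, thermal, background_temp=30):
    """Mask the thermal background (claim 2, step 1), then take the
    per-pixel absolute difference with the visible-light frame
    (claim 2, step 2) to obtain the difference video frame."""
    masked = np.where(thermal > background_temp, thermal, 0)  # background -> 0
    diff = np.abs(visible.astype(np.int32) - masked.astype(np.int32))
    return diff.astype(np.uint8)
```

The identification area would then be taken from the high-response region of `diff`, and face detection run only there.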
3. The method of claim 1, wherein determining that the recognized face is effective if the recognized face matches at least one face in the target group face library comprises:
comparing the recognized face with the faces in the target group face library;
and if the similarity between the recognized face and at least one face in the target group face library is greater than or equal to a first threshold value, determining that the recognized face is effective.
4. The method of claim 3, further comprising:
if the similarity between the recognized face and a face in the target group face library is smaller than the first threshold but greater than or equal to a second threshold, adding the recognized face to the target group face library, or replacing, with the recognized face, the face in the target group face library that has the highest similarity to the recognized face; wherein the first threshold is greater than the second threshold.
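Illustratively (not part of the claims), the two-threshold logic of claims 3 and 4 can be sketched as follows; the concrete threshold values and the scalar similarity function are invented for the example:

```python
def classify_and_update(face, library, similarity, first=0.9, second=0.7):
    """Claims 3-4: similarity >= first threshold -> the face is valid;
    between the two thresholds -> the face is added to the target group
    face library (one of the two claimed update options); below the
    second threshold -> the face is neither valid nor enrolled."""
    best = max(similarity(face, g) for g in library)
    if best >= first:
        return "valid", library
    if best >= second:
        return "added", library + [face]
    return "rejected", library
```

The alternative update option of claim 4, replacing the most similar library face instead of appending, would swap the `library + [face]` line for an in-place replacement of the best-matching entry.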
5. The method according to any one of claims 1 to 4, wherein if a plurality of faces are recognized and determined to be valid, performing lip language recognition on the faces determined to be valid comprises:
and respectively carrying out lip language recognition on a plurality of faces determined to be effective.
6. A server, comprising:
the receiving module is used for receiving a visible light video frame collected by a terminal and a thermal imaging video frame corresponding to the visible light video frame;
the face recognition module is used for determining a recognition area in the visible light video frame according to the thermal imaging video frame and recognizing a face in the recognition area;
the effective face determining module is used for matching the recognized face with the faces in the target group face library, and, if the recognized face matches at least one face in the target group face library, determining that the recognized face is effective;
and the lip language recognition module is used for carrying out lip language recognition on the face determined to be effective according to the visible light video frame containing the face determined to be effective.
7. The server of claim 6, wherein the face recognition module is specifically configured to:
determining a background area in the thermal imaging video frame, and masking the background area;
performing a difference operation on the visible light video frame and the thermal imaging video frame whose background area has been masked, to obtain a difference video frame;
and determining an identification area in the visible light video frame according to the difference video frame, and identifying the face in the identification area.
8. The server of claim 6, wherein the effective face determining module is specifically configured to:
comparing the recognized face with the faces in the target group face library;
and if the similarity between the recognized face and at least one face in the target group face library is greater than or equal to a first threshold value, determining that the recognized face is effective.
9. The server of claim 6, further comprising:
a target group face library updating module, configured to: when the similarity between the recognized face and a face in the target group face library is smaller than the first threshold but greater than or equal to a second threshold, add the recognized face to the target group face library, or replace, with the recognized face, the face in the target group face library that has the highest similarity to the recognized face; wherein the first threshold is greater than the second threshold.
10. The server according to any one of claims 6 to 9, wherein the lip language recognition module is specifically configured to:
if the effective face determining module determines that a plurality of the faces recognized by the face recognition module are effective, perform lip language recognition on each of the faces determined to be effective.
11. A server, comprising: a processor and a memory;
the memory is coupled to the processor and configured to store computer instructions; the processor is coupled to the memory and configured to execute the computer instructions to cause the server to:
receiving a visible light video frame collected by a terminal and a thermal imaging video frame corresponding to the visible light video frame;
determining an identification area in the visible light video frame according to the thermal imaging video frame, and identifying a human face in the identification area;
matching the recognized face with the faces in the target group face library, and if the recognized face matches at least one face in the target group face library, determining that the recognized face is effective;
and performing lip language recognition on the face determined to be effective according to the visible light video frame containing the face determined to be effective.
12. A computer storage medium having computer program instructions stored therein, which when run on a computer, cause the computer to perform the method of any one of claims 1-5.
CN202010099127.9A 2020-02-18 2020-02-18 Lip language identification method and device Active CN111325139B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010099127.9A CN111325139B (en) 2020-02-18 2020-02-18 Lip language identification method and device


Publications (2)

Publication Number Publication Date
CN111325139A (en) 2020-06-23
CN111325139B CN111325139B (en) 2023-08-04

Family

ID=71172135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010099127.9A Active CN111325139B (en) 2020-02-18 2020-02-18 Lip language identification method and device

Country Status (1)

Country Link
CN (1) CN111325139B (en)

Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080279425A1 (en) * 2007-04-13 2008-11-13 Mira Electronics Co., Ltd. Human face recognition and user interface system for digital camera and video camera
CN102982321A (en) * 2012-12-05 2013-03-20 深圳Tcl新技术有限公司 Acquisition method and device for face database
CN103605969A (en) * 2013-11-28 2014-02-26 Tcl集团股份有限公司 Method and device for face inputting
CN104050449A (en) * 2014-06-13 2014-09-17 无锡天脉聚源传媒科技有限公司 Face recognition method and device
CN104361276A (en) * 2014-11-18 2015-02-18 新开普电子股份有限公司 Multi-mode biometric authentication method and multi-mode biometric authentication system
CN104966086A (en) * 2014-11-14 2015-10-07 深圳市腾讯计算机系统有限公司 Living body identification method and apparatus
US20160125404A1 (en) * 2014-10-31 2016-05-05 Xerox Corporation Face recognition business model and method for identifying perpetrators of atm fraud
WO2016197765A1 (en) * 2015-06-11 2016-12-15 腾讯科技(深圳)有限公司 Human face recognition method and recognition system
CN106774856A (en) * 2016-08-01 2017-05-31 深圳奥比中光科技有限公司 Exchange method and interactive device based on lip reading
CN106778518A (en) * 2016-11-24 2017-05-31 汉王科技股份有限公司 A kind of human face in-vivo detection method and device
CN106874871A (en) * 2017-02-15 2017-06-20 广东光阵光电科技有限公司 A kind of recognition methods of living body faces dual camera and identifying device
US20170213074A1 (en) * 2016-01-27 2017-07-27 Intel Corporation Decoy-based matching system for facial recognition
CN107133608A (en) * 2017-05-31 2017-09-05 天津中科智能识别产业技术研究院有限公司 Identity authorization system based on In vivo detection and face verification
CN108090888A (en) * 2018-01-04 2018-05-29 北京环境特性研究所 The infrared image of view-based access control model attention model and the fusion detection method of visible images
CN108470169A (en) * 2018-05-23 2018-08-31 国政通科技股份有限公司 Face identification system and method
CN108875546A (en) * 2018-04-13 2018-11-23 北京旷视科技有限公司 Face auth method, system and storage medium
CN208351494U (en) * 2018-05-23 2019-01-08 国政通科技股份有限公司 Face identification system
CN109190561A (en) * 2018-09-04 2019-01-11 四川长虹电器股份有限公司 Face identification method and system in a kind of video playing
CN109325413A (en) * 2018-08-17 2019-02-12 深圳市中电数通智慧安全科技股份有限公司 A kind of face identification method, device and terminal
US20190141297A1 (en) * 2017-11-07 2019-05-09 Ooma, Inc. Systems and Methods of Activity Based Recording for Camera Applications
WO2019128362A1 (en) * 2017-12-28 2019-07-04 北京京东尚科信息技术有限公司 Human facial recognition method, apparatus and system, and medium
CN110163806A (en) * 2018-08-06 2019-08-23 腾讯科技(深圳)有限公司 A kind of image processing method, device and storage medium
CN110245630A (en) * 2019-06-18 2019-09-17 广东中安金狮科创有限公司 Monitoring data processing method, device and readable storage medium storing program for executing
CN110268419A (en) * 2019-05-08 2019-09-20 深圳市汇顶科技股份有限公司 A kind of face identification method, face identification device and computer readable storage medium
CN110443109A (en) * 2019-06-11 2019-11-12 万翼科技有限公司 Abnormal behaviour monitor processing method, device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN111325139B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
US10936919B2 (en) Method and apparatus for detecting human face
CN108197592B (en) Information acquisition method and device
Durga et al. A ResNet deep learning based facial recognition design for future multimedia applications
Ruminski et al. Interactions with recognized patients using smart glasses
CN111626126A (en) Face emotion recognition method, device, medium and electronic equipment
CN110941978B (en) Face clustering method and device for unidentified personnel and storage medium
CN109583364A (en) Image-recognizing method and equipment
Revanur et al. Instantaneous physiological estimation using video transformers
CN112699758A (en) Sign language translation method and device based on dynamic gesture recognition, computer equipment and storage medium
CN111654694A (en) Quality evaluation method and device of image processing algorithm and electronic equipment
Maiano et al. Depthfake: a depth-based strategy for detecting deepfake videos
CN108460364B (en) Method and apparatus for generating information
Viedma et al. Relevant features for gender classification in NIR periocular images
Reddi et al. CNN Implementing Transfer Learning for Facial Emotion Recognition
Liang et al. Real time hand movement trajectory tracking for enhancing dementia screening in ageing deaf signers of British sign language
Farooq et al. ChildGAN: Large Scale Synthetic Child Facial Data Using Domain Adaptation in StyleGAN
CN111814738A (en) Human face recognition method, human face recognition device, computer equipment and medium based on artificial intelligence
Wang et al. Heart rate estimation from facial videos with motion interference using T-SNE-based signal separation
Kwaśniewska et al. Real-time facial features detection from low resolution thermal images with deep classification models
CN111325139B (en) Lip language identification method and device
KR101126704B1 (en) Online client diagnosis system and method thereof
Malgheet et al. MS-net: Multi-segmentation network for the iris region using deep learning in an unconstrained environment
Sadhana et al. Prediction of Skin Cancer using Convolutional Neural Network
CN114550249A (en) Face image generation method and device, computer readable medium and electronic equipment
Pranathi et al. A review on various facial expression recognition techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant