CN112633208A - Lip language identification method, service equipment and storage medium

Info

Publication number
CN112633208A
Authority
CN
China
Prior art keywords
lip
frame
image
images
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011599830.2A
Other languages
Chinese (zh)
Inventor
李绪送
成刚
杨善松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202011599830.2A
Publication of CN112633208A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V40/172 Classification, e.g. identification
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Geometry (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a lip language identification method, a service device and a storage medium. The service device first performs video acquisition on a target object and then executes the following steps for each frame of target image from which lip information needs to be extracted: extracting a lip image of the target object from the target image, and classifying the lip image as either a voiced frame or a silent frame. If the classification results of consecutive lip-image frames follow the change pattern of silent frame to voiced frame to silent frame, the start and end positions of the lip language are located within the consecutive frames based on that pattern. After the lip image sequence between the start and end positions is obtained, a coarse binary classification first screens out sequences that are coupled with, but not among, the supported lip-language commands, and lip language recognition is then performed on the remaining sequences to obtain a lip language recognition result. In this way, a multi-modal signal based on lip language recognition can be added alongside voice interaction, improving the applicability and stability of human-computer interaction.

Description

Lip language identification method, service equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a lip language identification method, a service device, and a storage medium.
Background
At present, intelligent interaction devices are regarded as the interaction portal of the Internet of Everything, which has driven the rapid development of voice interaction, and many voice interaction systems have appeared in daily life. However, a single-modality voice interaction system has poor anti-interference capability: its performance drops noticeably in scenes with background noise, and it is difficult to wake up in far-field scenes where the user is far away. In addition, when the sound monitoring channel is occupied, a single-modality voice interaction system fails completely.
To address these defects, in the related art, on the one hand, the influence of background noise and far-field environments on the interaction process can be reduced as much as possible by technologies such as voice noise reduction and microphone arrays; on the other hand, multi-modal interaction technology can give the interaction device multi-modal signals and thus additional processing paths, so as to solve the problems that single-modality voice interaction encounters in specific scenes and enhance the applicability and stability of the interaction system.
However, in the related art, how to use multi-modal signals to improve the applicability and stability of human-computer interaction remains to be solved.
Disclosure of Invention
The embodiments of the application provide a lip language identification method, a service device and a storage medium, which improve the applicability and stability of human-computer interaction by adopting multi-modal signals.
In a first aspect, an embodiment of the present application provides a service device, including: a memory and a controller;
the memory for storing a computer program;
the controller is coupled to the memory and configured to perform, based on the computer program:
carrying out video acquisition on a target object;
respectively performing the following steps on each frame of target image from which lip information needs to be extracted: extracting a lip image of the target object from the target image; and classifying the lip image as a voiced frame or a silent frame, wherein a voiced frame indicates that the lips of the target object are in a speaking state and a silent frame indicates that the lips of the target object are in a silent state;
if the classification results of consecutive lip-image frames follow the change pattern of silent frame to voiced frame to silent frame, locating the start and end positions of the lip language within the consecutive frames based on the change pattern;
and acquiring the lip image sequence between the start and end positions and performing lip language recognition to obtain a lip language recognition result.
In an embodiment of the present application, the controller is further configured to:
classifying the lip image as a voiced frame or a silent frame based on a pre-trained lip image classification model;
wherein the lip image classification model is obtained according to the following method:
obtaining a lip sample image, wherein the lip sample image is associated with a corresponding class label, and the class label is labeled according to a voice signal corresponding to the lip sample image;
inputting the lip sample image into a lip image classification model to be trained to obtain a predicted class label of the lip sample image output by the lip image classification model to be trained;
and determining the loss between the predicted class label and the class label according to a preset loss function, and training the parameters of the lip image classification model to be trained to obtain the lip image classification model.
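As a concrete illustration, the following is a minimal training sketch for such a binary lip-image classifier. The PyTorch framework, the small CNN layout, the input size and all hyperparameters are illustrative assumptions and are not prescribed by the application.

```python
# Illustrative sketch only: a small CNN trained to classify a lip image as
# voiced (1) or silent (0). Framework, architecture and hyperparameters are
# assumptions, not part of the claimed method.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class LipFrameClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 16 * 16, 64), nn.ReLU(), nn.Linear(64, 2)
        )

    def forward(self, x):                 # x: (N, 1, 64, 64) grayscale lip crops
        return self.head(self.features(x))

def train(model, images, labels, epochs=5):
    """images: float tensor (N, 1, 64, 64); labels: 0 = silent frame, 1 = voiced frame."""
    loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)
    loss_fn = nn.CrossEntropyLoss()       # the "preset loss function"
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)   # loss between predicted and annotated labels
            loss.backward()               # update parameters of the model to be trained
            opt.step()
    return model

if __name__ == "__main__":
    # Random stand-in data just to show the training loop runs end to end.
    x = torch.rand(128, 1, 64, 64)
    y = torch.randint(0, 2, (128,))
    train(LipFrameClassifier(), x, y, epochs=1)
```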
In an embodiment of the present application, the controller is further configured to:
performing voice activity detection on the voice signal corresponding to the lip sample image to obtain a voice detection result, wherein the voice signal is the voice segment covering the lip sample image and a specified number of frames before it;
if the voice detection result indicates that the voice signal is not a speech signal, labeling the lip sample image as a silent frame; if the voice detection result indicates that the voice signal is a speech signal but the normalized energy value of the voice signal is smaller than or equal to a preset threshold, labeling the lip sample image as a silent frame;
and if the voice detection result indicates that the voice signal is a speech signal and the normalized energy value of the voice signal is greater than the preset threshold, labeling the lip sample image as a voiced frame.
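A minimal sketch of this labeling rule is given below; the stand-in energy-based voice activity detector, the normalization against a reference maximum energy, and the threshold value are all assumptions for illustration.

```python
# Illustrative sketch of the labeling rule: the detector, the normalization and
# the threshold value are assumptions for demonstration purposes.
import numpy as np

VOICED, SILENT = "voiced_frame", "silent_frame"

def vad_is_speech(segment: np.ndarray, energy_floor: float = 1e-4) -> bool:
    """Stand-in voice activity detector: treats the segment as speech if its
    mean energy exceeds a small floor. A real system would use a proper VAD."""
    return float(np.mean(segment ** 2)) > energy_floor

def label_lip_sample(segment: np.ndarray, max_energy: float, threshold: float = 0.3) -> str:
    """segment: audio samples covering the lip sample image and the specified
    number of frames before it; max_energy: used to normalize the energy value."""
    if not vad_is_speech(segment):
        return SILENT                      # not a speech signal -> silent frame
    energy = float(np.mean(segment ** 2))
    normalized = energy / max_energy if max_energy > 0 else 0.0
    # speech detected but too weak -> still labeled as a silent frame
    return VOICED if normalized > threshold else SILENT

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    loud = 0.5 * rng.standard_normal(1600)     # ~0.1 s at 16 kHz, "loud" speech
    quiet = 0.01 * rng.standard_normal(1600)
    peak = float(np.mean(loud ** 2))
    print(label_lip_sample(loud, peak), label_lip_sample(quiet, peak))
```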
In an embodiment of the present application, each frame of target image from which lip information needs to be extracted is each frame of the captured video; alternatively, each frame of target image from which lip information needs to be extracted is a frame obtained by sampling the video.
In an embodiment of the present application, the controller is further configured to:
performing face detection on the target image to obtain face key points of the target image;
and cropping the lip image of the target object from the target image according to the lip key points among the face key points.
In an embodiment of the present application, the controller is further configured to:
aligning the lip images of different frames.
In an embodiment of the present application, the aligning the lip images of different frames includes any one or a combination of the following:
adjusting the lip boundaries by translation and/or rotation transformation so that the lip boundaries of lip images of different frames are parallel to a specified direction;
scaling the lip images of different frames to a specified size;
and processing the lip images of different frames by affine transformation so that the lip images of different frames have a preset orientation relative to the lens used to acquire the video data.
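The following sketch illustrates one possible combination of these alignment operations using OpenCV (rotation so the mouth-corner line is horizontal, an optional affine warp toward a canonical orientation, and scaling to a fixed size); the landmark inputs and output size are assumptions.

```python
# Illustrative alignment sketch using OpenCV: rotate so the lip-corner line is
# horizontal, optionally apply an affine warp toward a canonical (frontal) lip
# layout, then scale to a fixed size. Corner/landmark coordinates are assumed
# to come from the face key-point step.
import cv2
import numpy as np

def align_lip(lip_img, left_corner, right_corner, out_size=(64, 64),
              canonical_pts=None, src_pts=None):
    h, w = lip_img.shape[:2]

    # 1) rotation: make the segment between the two mouth corners horizontal
    dx, dy = (right_corner[0] - left_corner[0], right_corner[1] - left_corner[1])
    angle = np.degrees(np.arctan2(dy, dx))
    M_rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(lip_img, M_rot, (w, h))

    # 2) optional affine warp toward a preset orientation relative to the lens,
    #    given three matching landmark points (src_pts -> canonical_pts)
    if canonical_pts is not None and src_pts is not None:
        M_aff = cv2.getAffineTransform(np.float32(src_pts), np.float32(canonical_pts))
        rotated = cv2.warpAffine(rotated, M_aff, (w, h))

    # 3) scaling: bring every frame's lip image to the same specified size
    return cv2.resize(rotated, out_size)

if __name__ == "__main__":
    frame = (np.random.rand(48, 80, 3) * 255).astype(np.uint8)  # stand-in lip crop
    aligned = align_lip(frame, left_corner=(10, 30), right_corner=(70, 22))
    print(aligned.shape)   # (64, 64, 3)
```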
In an embodiment of the present application, the controller is further configured to:
determining the lip image that is first classified as a voiced frame among the consecutive lip-image frames as the start frame of the lip language; and,
determining the lip image that is last classified as a voiced frame among the consecutive lip-image frames as the end frame of the lip language.
In an embodiment of the present application, the controller is further configured to:
determining the lip image last classified as a voiced frame according to the following method:
detecting the first silent frame occurring after a voiced frame;
detecting whether a voiced frame exists within a preset number of frames after the first-occurring silent frame;
if no voiced frame exists, determining the frame preceding the first-occurring silent frame as the lip image last classified as a voiced frame;
and if a voiced frame exists, returning, from that voiced frame, to the step of detecting the first silent frame occurring after a voiced frame.
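The end-frame search described above can be sketched as follows; the look-ahead window size is an assumed parameter.

```python
# Illustrative sketch of the end-frame search described above. `labels` is the
# per-frame classification result (True = voiced frame, False = silent frame);
# the look-ahead window size is an assumed parameter.
def find_end_frame(labels, lookahead=5):
    """Return the index of the lip image last classified as a voiced frame,
    or None if the utterance has not finished inside `labels`."""
    i = 0
    while i < len(labels):
        if not labels[i]:                      # skip leading silence
            i += 1
            continue
        # inside a voiced run: find the first silent frame after it
        while i < len(labels) and labels[i]:
            i += 1
        if i == len(labels):
            return None                        # still voiced at the end of the buffer
        window = labels[i + 1:i + 1 + lookahead]
        if not any(window):
            return i - 1                       # frame before the first silent frame
        # a voiced frame re-appears within the window: resume the search from it
        i = i + 1 + window.index(True)
    return None

if __name__ == "__main__":
    #        silence     voiced      brief pause   voiced     long silence
    labels = [False]*3 + [True]*6 + [False]*2 + [True]*4 + [False]*8
    print(find_end_frame(labels, lookahead=5))  # index of the last voiced frame (14)
```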
In an embodiment of the present application, the controller is further configured to:
according to a binary classification model, performing binary classification processing on the lip image sequence, and determining whether the lip image sequence is a noise sequence;
if the sequence is not a noise sequence, executing a step of lip language recognition on the lip image sequence;
and if the lip image sequence is a noise sequence, discarding the lip image sequence.
In an embodiment of the present application, the controller is further configured to:
performing two-dimensional feature extraction on each frame of lip image in the lip image sequence to obtain a two-dimensional lip feature corresponding to each frame of lip image;
determining a three-dimensional lip feature of the lip image sequence based on the associations between the two-dimensional lip features of the individual frames;
performing multi-class recognition on the three-dimensional lip feature to obtain a lip language recognition result; or,
performing block processing on the lip image sequence, wherein each block contains three-dimensional information comprising image-width information, image-height information and temporal information; extracting features from the plane formed by each pair of dimensions of the three-dimensional information to obtain LBP-TOP features; and performing multi-class recognition based on the LBP-TOP features of the lip image sequence to obtain a lip language recognition result.
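For the first option (per-frame two-dimensional features combined into a sequence-level representation and then multi-class recognition), a minimal sketch is given below; the use of a 2D CNN per frame with a 1D temporal convolution, and all layer sizes, are illustrative assumptions rather than the specific network of the application.

```python
# Illustrative sketch of the first recognition option: a 2D CNN extracts a
# feature vector per lip frame, a 1D temporal convolution models the relations
# between frames, and a linear head performs multi-class recognition over the
# supported lip-language commands. All layer sizes are assumptions.
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    def __init__(self, num_classes=10, feat_dim=128):
        super().__init__()
        self.frame_encoder = nn.Sequential(          # 2D features per frame
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(32 * 4 * 4, feat_dim),
        )
        self.temporal = nn.Sequential(               # relations across frames
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):                         # clip: (N, T, 1, 64, 64)
        n, t = clip.shape[:2]
        frames = clip.reshape(n * t, *clip.shape[2:])
        feats = self.frame_encoder(frames).reshape(n, t, -1)   # (N, T, feat_dim)
        pooled = self.temporal(feats.transpose(1, 2)).squeeze(-1)
        return self.classifier(pooled)               # logits over lip-language classes

if __name__ == "__main__":
    model = LipReadingNet(num_classes=10)
    clip = torch.rand(2, 16, 1, 64, 64)              # 2 sequences of 16 lip frames
    print(model(clip).shape)                         # torch.Size([2, 10])
```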
In a second aspect, an embodiment of the present application provides a method for lip language recognition, including:
carrying out video acquisition on a target object;
respectively performing the following steps on each frame of target image from which lip information needs to be extracted: extracting a lip image of the target object from the target image; and classifying the lip image as a voiced frame or a silent frame, wherein a voiced frame indicates that the lips of the target object are in a speaking state and a silent frame indicates that the lips of the target object are in a silent state;
if the classification results of consecutive lip-image frames follow the change pattern of silent frame to voiced frame to silent frame, locating the start and end positions of the lip language within the consecutive frames based on the change pattern;
and acquiring the lip image sequence between the start and end positions and performing lip language recognition to obtain a lip language recognition result.
In an embodiment of the present application, the classifying of the lip image as a voiced frame or a silent frame includes:
classifying the lip image as a voiced frame or a silent frame based on a pre-trained lip image classification model;
wherein the lip image classification model is obtained according to the following method:
obtaining a lip sample image, wherein the lip sample image is associated with a corresponding class label, and the class label is labeled according to a voice signal corresponding to the lip sample image;
inputting the lip sample image into a lip image classification model to be trained to obtain a predicted class label of the lip sample image output by the lip image classification model to be trained;
and determining the loss between the predicted class label and the class label according to a preset loss function, and training the parameters of the lip image classification model to be trained to obtain the lip image classification model.
In an embodiment of the present application, labeling the lip sample image according to a voice signal corresponding to the lip sample image includes:
performing voice activity detection on the voice signal corresponding to the lip sample image to obtain a voice detection result, wherein the voice signal is the voice segment covering the lip sample image and a specified number of frames before it;
if the voice detection result indicates that the voice signal is not a speech signal, labeling the lip sample image as a silent frame; if the voice detection result indicates that the voice signal is a speech signal but the normalized energy value of the voice signal is smaller than or equal to a preset threshold, labeling the lip sample image as a silent frame;
and if the voice detection result indicates that the voice signal is a speech signal and the normalized energy value of the voice signal is greater than the preset threshold, labeling the lip sample image as a voiced frame.
In an embodiment of the present application, each frame of target image from which lip information needs to be extracted is each frame of the captured video; alternatively, each frame of target image from which lip information needs to be extracted is a frame obtained by sampling the video.
In an embodiment of the present application, the extracting a lip image of the target object from the target image includes:
performing face detection on the target image to obtain face key points of the target image;
and cropping the lip image of the target object from the target image according to the lip key points among the face key points.
In an embodiment of the application, before the classifying and identifying the lip image, the method further includes:
aligning the lip images of different frames.
In an embodiment of the present application, the aligning the lip images of different frames includes any one or a combination of the following:
adjusting the lip boundaries by translation and/or rotation transformation so that the lip boundaries of lip images of different frames are parallel to a specified direction;
scaling the lip images of different frames to a specified size;
and processing the lip images of different frames by affine transformation so that the lip images of different frames have a preset orientation relative to the lens used to acquire the video data.
In an embodiment of the present application, the locating of the start and end positions of the lip language within the consecutive lip-image frames based on the change pattern includes:
determining the lip image that is first classified as a voiced frame among the consecutive lip-image frames as the start frame of the lip language; and,
determining the lip image that is last classified as a voiced frame among the consecutive lip-image frames as the end frame of the lip language.
In an embodiment of the present application, the method further includes:
determining the lip image last classified as a voiced frame according to the following method:
detecting the first silent frame occurring after a voiced frame;
detecting whether a voiced frame exists within a preset number of frames after the first-occurring silent frame;
if no voiced frame exists, determining the frame preceding the first-occurring silent frame as the lip image last classified as a voiced frame;
and if a voiced frame exists, returning, from that voiced frame, to the step of detecting the first silent frame occurring after a voiced frame.
In an embodiment of the application, after the obtaining of the sequence of lip images between the start-stop positions, the method further includes:
according to a binary classification model, performing binary classification processing on the lip image sequence, and determining whether the lip image sequence is a noise sequence;
if the sequence is not a noise sequence, executing a step of lip language recognition on the lip image sequence;
and if the lip image sequence is a noise sequence, discarding the lip image sequence.
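The gating role of this binary classification step can be sketched as follows; both models are placeholders and the heuristic used in the example is purely illustrative.

```python
# Illustrative sketch of the filtering step: a binary model first decides
# whether the located lip image sequence is a noise sequence (a mouth movement
# that is coupled with, but not among, the supported commands); only non-noise
# sequences are passed to the multi-class lip language recognizer. Both models
# are placeholders here.
from typing import Callable, Optional, Sequence

def recognize_if_supported(
    lip_sequence: Sequence,
    is_noise: Callable[[Sequence], bool],        # binary classification model
    recognize: Callable[[Sequence], str],        # multi-class lip language model
) -> Optional[str]:
    if is_noise(lip_sequence):
        return None                              # noise sequence: discard it
    return recognize(lip_sequence)               # otherwise run lip language recognition

if __name__ == "__main__":
    # Toy stand-ins: a "sequence" is a list of frame labels for demonstration.
    result = recognize_if_supported(
        ["frame"] * 12,
        is_noise=lambda seq: len(seq) < 5,       # assumed heuristic placeholder
        recognize=lambda seq: "louder",
    )
    print(result)
```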
In an embodiment of the present application, the obtaining of the lip image sequence between the start and end positions and performing lip language recognition to obtain a lip language recognition result includes:
performing two-dimensional feature extraction on each frame of lip image in the lip image sequence to obtain a two-dimensional lip feature corresponding to each frame of lip image;
determining a three-dimensional lip feature of the lip image sequence based on the associations between the two-dimensional lip features of the individual frames;
performing multi-class recognition on the three-dimensional lip feature to obtain a lip language recognition result; or,
performing block processing on the lip image sequence, wherein each block contains three-dimensional information comprising image-width information, image-height information and temporal information; extracting features from the plane formed by each pair of dimensions of the three-dimensional information to obtain LBP-TOP features; and performing multi-class recognition based on the LBP-TOP features of the lip image sequence to obtain a lip language recognition result.
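For the second option, a simplified sketch of LBP-TOP feature extraction over blocks of the lip image sequence is given below; a full LBP-TOP computes LBP codes for every pixel on all three orthogonal planes, whereas this reduced version samples only the central XY, XT and YT slices of each block, and the block size is an assumption.

```python
# Simplified illustrative sketch of LBP-TOP features: the lip image sequence is
# treated as an X-Y-T volume, split into blocks, and for each block a basic LBP
# histogram is computed on its central XY, XT and YT slices and concatenated.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_hist(plane, p=8, r=1, bins=256):
    codes = local_binary_pattern(plane, P=p, R=r, method="default")
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / max(hist.sum(), 1)

def lbp_top_features(volume, block=(8, 16, 16)):
    """volume: (T, H, W) grayscale lip image sequence; returns one feature vector."""
    t, h, w = volume.shape
    bt, bh, bw = block
    feats = []
    for t0 in range(0, t - bt + 1, bt):
        for y0 in range(0, h - bh + 1, bh):
            for x0 in range(0, w - bw + 1, bw):
                cube = volume[t0:t0 + bt, y0:y0 + bh, x0:x0 + bw]
                xy = cube[bt // 2, :, :]        # spatial appearance
                xt = cube[:, bh // 2, :]        # horizontal motion over time
                yt = cube[:, :, bw // 2]        # vertical motion over time
                feats.append(np.concatenate([lbp_hist(p) for p in (xy, xt, yt)]))
    return np.concatenate(feats)

if __name__ == "__main__":
    seq = (np.random.rand(16, 32, 48) * 255).astype(np.uint8)   # stand-in lip sequence
    print(lbp_top_features(seq).shape)
```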
In a third aspect, an embodiment of the present application provides a computer storage medium having computer program instructions stored thereon, which, when run on a computer, cause the computer to perform any of the methods of the second aspect.
In the embodiments of the application, the service device first performs video acquisition on a target object, and then executes the following steps for each frame of target image from which lip information needs to be extracted: extracting a lip image of the target object from the target image, and classifying the lip image as a voiced frame or a silent frame. If the classification results of consecutive lip-image frames follow the change pattern of silent frame to voiced frame to silent frame, the start and end positions of the lip language are located within the consecutive frames based on that pattern, and after the lip image sequence between the start and end positions is obtained, lip language recognition is performed to obtain a lip language recognition result. In this way, a multi-modal signal based on lip language recognition can be added alongside voice interaction, improving the applicability and stability of human-computer interaction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments of the present application are briefly described below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram illustrating a usage scenario of a human-computer interaction device, according to some embodiments;
FIG. 2 is a block diagram of the hardware configuration of the control device 100 according to some embodiments;
FIG. 3 is a block diagram of the hardware configuration of the service device 200 according to some embodiments;
FIG. 4 is a software configuration diagram of the service device 200 according to some embodiments;
FIG. 5 is a schematic diagram illustrating the icon control interface display of an application in the service device 200, according to some embodiments;
FIG. 6 is a schematic diagram illustrating an application of a service device in a voice interaction scenario according to an embodiment of the present application;
FIG. 7 is a diagram of an application scenario of a human-computer interaction method according to an embodiment of the present application;
FIG. 8a is an overall flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 8b is another flowchart of a human-computer interaction method according to an embodiment of the present application;
FIG. 9 is a schematic view of face key points in a human-computer interaction method provided in an embodiment of the present application;
FIG. 10a is a schematic diagram of lip alignment processing in a human-computer interaction method according to an embodiment of the present application;
FIG. 10b is a schematic diagram of lip alignment processing in a human-computer interaction method according to an embodiment of the present application;
FIG. 10c is a schematic diagram of lip alignment processing in a human-computer interaction method according to an embodiment of the present application;
FIG. 11 is a schematic flow chart illustrating labeling of a lip sample image according to the speech signal corresponding to the lip sample image;
FIG. 12 is an exemplary schematic diagram of lip image labels;
FIG. 13 is a flow chart illustrating locating the position of an end frame;
FIG. 14a is a schematic diagram illustrating a plurality of video frames input into a 2D convolutional neural network model;
FIG. 14b is a schematic diagram illustrating a plurality of video frames input into a 3D convolutional neural network model;
FIG. 14c is a schematic diagram of an application scenario of the service device 200 according to some embodiments.
Detailed Description
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the functionality associated with that element.
Fig. 1 is a schematic diagram of a usage scenario of a human-computer interaction device according to an embodiment. As shown in fig. 1, the human-computer interaction device 200 is also in data communication with a server 400, and a user can operate the human-computer interaction device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the communication between the remote controller and the human-computer interaction device includes at least one of an infrared protocol communication or a bluetooth protocol communication, and other short-distance communication methods, and the human-computer interaction device 200 is controlled by a wireless or wired method. The user may control the human-computer interaction device 200 by inputting a user command through at least one of a button on a remote controller, a voice input, a control panel input, and the like.
In some embodiments, the smart device 300 may include any of a mobile terminal, a tablet, a computer, a laptop, an AR/VR device, and the like.
In some embodiments, the smart device 300 may also be used to control the human-computer interaction device 200. For example, the human interaction device 200 is controlled using an application running on the smart device.
In some embodiments, the intelligent device 300 and the human-computer interaction device may also be used for data communication.
In some embodiments, the human-computer interaction device 200 may also be controlled by a manner other than the control apparatus 100 and the intelligent device 300, for example, the human-computer interaction device 200 may receive a voice instruction control of a user directly through a module configured inside the device for obtaining a voice instruction, or may receive a voice instruction control of a user through a voice control apparatus provided outside the device of the human-computer interaction device 200.
In some embodiments, human interaction device 200 is also in data communication with server 400. The human interaction device 200 may be allowed to make communication connection through a Local Area Network (LAN), a Wireless Local Area Network (WLAN), and other networks. The server 400 may provide various contents and interactions to the human interaction device 200. The server 400 may be a cluster or a plurality of clusters, and may include one or more types of servers.
In some embodiments, software steps executed by one step execution agent may be migrated on demand to another step execution agent in data communication therewith for execution. Illustratively, software steps executed by the server may be migrated to be executed on the human-computer interaction device in data communication therewith, and vice versa, as needed.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control device 100 can receive an input operation instruction of a user, and convert the operation instruction into an instruction which can be recognized and responded by the human-computer interaction device 200, and plays a role in interaction mediation between the user and the human-computer interaction device 200.
In some embodiments, the communication interface 130 is used for external communication, and includes at least one of a WIFI chip, a bluetooth module, NFC, or an alternative module.
In some embodiments, the user input/output interface 140 includes at least one of a microphone, a touchpad, a sensor, a key, or an alternative module.
Fig. 3 shows a hardware configuration block diagram of the human-computer interaction device 200 according to the exemplary embodiment.
In some embodiments, human-computer interaction device 200 comprises at least one of a tuner 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments the controller comprises a central processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, a first interface to an nth interface for input/output.
In some embodiments, the display 260 includes a display screen component for displaying pictures and a driving component for driving image display; it receives image signals output from the controller and displays video content, image content, menu manipulation interfaces, user manipulation UI interfaces, and the like.
In some embodiments, the display 260 may be at least one of a liquid crystal display, an OLED display, and a projection display, and may also be a projection device and a projection screen.
In some embodiments, the tuner demodulator 210 receives broadcast television signals via wired or wireless reception, and demodulates audio/video signals, such as EPG data signals, from a plurality of wireless or wired broadcast television signals.
In some embodiments, communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The human interaction device 200 may establish transmission and reception of control signals and data signals with the control apparatus 100 or the server 400 through the communicator 220.
In some embodiments, the detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for collecting ambient light intensity; alternatively, the detector 230 includes an image collector, such as a camera, which may be used to collect external environment scenes, attributes of the user, or user interaction gestures, or the detector 230 includes a sound collector, such as a microphone, which is used to receive external sounds.
In some embodiments, the external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, and the like. The interface may be a composite input/output interface formed by the plurality of interfaces.
In some embodiments, the controller 250 and the modem 210 may be located in different separate devices, that is, the modem 210 may also be located in an external device of the main device where the controller 250 is located, such as an external set-top box.
In some embodiments, controller 250 controls the operation of the human interaction device and responds to user actions through various software control programs stored in memory. The controller 250 controls the overall operation of the human interaction device 200. For example: in response to receiving a user command for selecting a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the object may be any one of selectable objects, such as a hyperlink, an icon, or other actionable control. The operations related to the selected object are: displaying an operation connected to a hyperlink page, document, image, or the like, or performing an operation of a program corresponding to the icon.
In some embodiments the controller comprises at least one of a Central Processing Unit (CPU), a video processor, an audio processor, a Graphics Processing Unit (GPU), a RAM Random Access Memory (RAM), a ROM (Read-Only Memory), a first to nth interface for input/output, a communication Bus (Bus), and the like.
A CPU processor is used for executing operating system and application program instructions stored in the memory, and for executing various applications, data and content according to interactive instructions received from external input, so as to finally display and play various audio and video content. The CPU processor may include a plurality of processors, for example a main processor and one or more sub-processors.
In some embodiments, a graphics processor for generating various graphics objects, such as: at least one of an icon, an operation menu, and a user input instruction display figure. The graphic processor comprises an arithmetic unit, which performs operation by receiving various interactive instructions input by a user and displays various objects according to display attributes; the system also comprises a renderer for rendering various objects obtained based on the arithmetic unit, wherein the rendered objects are used for being displayed on a display.
In some embodiments, the video processor is configured to receive an external video signal, and perform at least one of video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, and image synthesis according to a standard codec protocol of the input signal, so as to obtain a signal that can be displayed or played on the human-computer interaction device 200 directly.
In some embodiments, the video processor includes at least one of a demultiplexing module, a video decoding module, an image composition module, a frame rate conversion module, a display formatting module, and the like. The demultiplexing module is used for demultiplexing the input audio and video data stream. And the video decoding module is used for processing the video signal after demultiplexing, including decoding, scaling and the like. And the image synthesis module is used for carrying out superposition mixing processing on the GUI signal input by the user or generated by the user and the video image after the zooming processing by the graphic generator so as to generate an image signal for display. And the frame rate conversion module is used for converting the frame rate of the input video. And the display formatting module is used for converting the received video output signal after the frame rate conversion, and changing the signal to be in accordance with the signal of the display format, such as an output RGB data signal.
In some embodiments, the audio processor is configured to receive an external audio signal, decompress and decode the received audio signal according to a standard codec protocol of the input signal, and perform at least one of noise reduction, digital-to-analog conversion, and amplification processing to obtain a sound signal that can be played in the speaker.
In some embodiments, a user may enter user commands on a Graphical User Interface (GUI) displayed on display 260, and the user input interface receives the user input commands through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
In some embodiments, a "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables conversion between an internal form of information and a form that is acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the electronic device, where the control may include at least one of an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc. visual interface elements.
In some embodiments, the user interface 280 is an interface (e.g., physical buttons on the main body of the human-computer interaction device, or the like) that can be used to receive control inputs.
In some embodiments, the system of human-computer interaction devices may include a Kernel (Kernel), a command parser (shell), a file system, and an application program. The kernel, shell, and file system together make up the basic operating system structure that allows users to manage files, run programs, and use the system. After power-on, the kernel is started, kernel space is activated, hardware is abstracted, hardware parameters are initialized, and virtual memory, a scheduler, signals and interprocess communication (IPC) are operated and maintained. And after the kernel is started, loading the Shell and the user application program. The application program is compiled into machine code after being started, and a process is formed.
Referring to fig. 4, in some embodiments, the system is divided into four layers, which are, from top to bottom, an Application (Applications) layer (referred to as an "Application layer"), an Application Framework (Application Framework) layer (referred to as a "Framework layer"), an Android runtime (Android runtime) layer and a system library layer (referred to as a "system runtime library layer"), and a kernel layer.
In some embodiments, at least one application program runs in the application program layer, and the application programs may be windows (windows) programs carried by an operating system, system setting programs, clock programs or the like; or an application developed by a third party developer. In particular implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an Application Programming Interface (API) and a programming framework for the application programs of the application layer. The application framework layer includes a number of predefined functions. The application framework layer acts as a processing center that decides the actions of the applications in the application layer. Through the API interface, an application program can access the resources in the system and obtain the services of the system during execution.
As shown in fig. 4, in the embodiment of the present application, the application framework layer includes a manager (Managers), a Content Provider (Content Provider), and the like, where the manager includes at least one of the following modules: an Activity Manager (Activity Manager) is used for interacting with all activities running in the system; the Location Manager (Location Manager) is used for providing the system service or application with the access of the system Location service; a Package Manager (Package Manager) for retrieving various information related to an application Package currently installed on the device; a Notification Manager (Notification Manager) for controlling display and clearing of Notification messages; a Window Manager (Window Manager) is used to manage icons, windows, toolbars, wallpapers, and desktop components on a user interface.
In some embodiments, the activity manager is used to manage the lifecycle of the various applications as well as general navigational fallback functions, such as controlling exit, opening, fallback, etc. of the applications. The window manager is used for managing all window programs, such as obtaining the size of a display screen, judging whether a status bar exists, locking the screen, intercepting the screen, controlling the change of the display window (for example, reducing the display window, displaying a shake, displaying a distortion deformation, and the like), and the like.
In some embodiments, the system runtime layer provides support for the upper layer, i.e., the framework layer, and when the framework layer is used, the Android operating system runs the C/C++ library included in the system runtime layer to implement the functions to be implemented by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4, the kernel layer includes at least one of the following drivers: an audio driver, a display driver, a Bluetooth driver, a camera driver, a WiFi driver, a USB driver, an HDMI driver, sensor drivers (such as fingerprint sensor, temperature sensor, pressure sensor, etc.), a power driver, and the like.
In some embodiments, the human-computer interaction device may directly enter an interface of a preset video-on-demand program after being started, and the interface of the video-on-demand program may include at least a navigation bar 510 and a content display area located below the navigation bar 510, where content displayed in the content display area may change according to a change of a selected control in the navigation bar, as shown in fig. 5. The programs in the application program layer can be integrated in the video-on-demand program and displayed through one control of the navigation bar, and can also be further displayed after the application control in the navigation bar is selected.
In some embodiments, after the human-computer interaction device is started, the user may directly enter a display interface of a signal source selected last time, or a signal source selection interface, where the signal source may be a preset video-on-demand program, or may be at least one of an HDMI interface, a live tv interface, and the like, and after the user selects different signal sources, the display may display application programs in contents obtained from different signal sources.
In some embodiments, fig. 6 is an application schematic diagram of a human-computer interaction device in a lip language interaction scenario, where a user 1 may speak an instruction that the human-computer interaction device 200 desires to execute through lip language, and then the human-computer interaction device 200 may collect user image data in real time, recognize the instruction of the user 1 included in the user image data, and directly execute the instruction after recognizing the instruction of the user 1, and in the whole process, the user 1 does not actually operate the human-computer interaction device 200 or other devices, but simply speaks the instruction.
In some embodiments, when the human-computer interaction device 200 shown in fig. 2 is applied in the scene shown in fig. 6, the human-computer interaction device 200 may collect user image data in real time through the image collector 232 thereof, and then the image collector 232 sends the collected user image data to the controller 250, and finally the controller 250 identifies an instruction included in the user image data.
At present, intelligent interaction devices are regarded as the interaction portal of the Internet of Everything, which has driven the rapid development of voice interaction, and many voice interaction systems have appeared in daily life. However, a single-modality voice interaction system has poor anti-interference capability: its performance drops noticeably in scenes with background noise, and it is difficult to wake up in far-field scenes where the user is far away. In addition, when the sound monitoring channel is occupied, a single-modality voice interaction system fails completely. To address these defects, on the one hand, the influence of background noise and far-field environments on the interaction process can be reduced as much as possible by technologies such as voice noise reduction and microphone arrays; on the other hand, multi-modal interaction technology can give the interaction device multi-modal signals and thus additional processing paths, so as to solve the problems that single-modality voice interaction encounters in specific scenes and enhance the applicability and stability of the interaction system.
The above-mentioned modalities correspond to the senses, and multi-modality means that several senses are fused. Besides voice interaction, a multi-modal system may also include machine vision (natural object recognition, face recognition, body action recognition, lip language recognition, etc.) and sensor intelligence (Artificial Intelligence (AI) reading and understanding of heat, infrared capture signals and spatial signals). In the related art, it is desirable to give intelligent interaction devices more "human-like" attributes through multi-modal interaction combined with advanced AI algorithms, increasing the machine's understanding capability and its "awareness" of active services.
How to combine multi-modal interaction to improve the applicability and stability of human-computer interaction is a constant concern in the industry.
Among the many modalities, lip language recognition is, besides voice recognition, a relatively simple and effective way for a machine to recognize the content of human speech. Introducing lip language recognition into the intelligent interaction process effectively alleviates the problems that single-modality voice interaction faces in noisy scenes, far-field scenes and scenes where the voice monitoring channel is occupied.
However, lip language recognition still faces challenges at present. For example, the accuracy of lip language recognition in the related art is not high, which limits its application scenarios. The inventors have found through research that the reasons for the low accuracy of lip language recognition are complex and varied.
First, lip recognition is strongly affected by the environment: the lips of different speakers, the position of the speaker relative to the camera, the illumination intensity and the like can all significantly influence the recognition result. In the present application, the lip images are therefore aligned so that they are consistent in horizontal orientation, scale and shooting angle, minimizing errors caused by lip shape and camera position.
Second, determining the speaker's pronunciation position and the position where the keyword occurs remains a great challenge when relying purely on image information. In the present application, after the lip images are classified, the image frames corresponding to the start and end positions of the lip language are located, so that a lip image sequence containing the lip language is obtained; combined with subsequent processing, this alleviates incorrect recognition caused by variation in the speaker's pronunciation position and the position where the keyword appears.
Third, lip language recognition depends on the temporal characteristics of the lip images, but different utterances may produce strongly coupled lip-movement sequences. In this application, the accuracy of lip language recognition is improved by locating the pronunciation position in the lip images, then judging whether the sequence is a noise sequence using a binary classification model, performing subsequent lip language recognition only if it is not a noise sequence, and selecting an appropriate lip language recognition model. In the whole process, the pronunciation position is located, the supported lip language is further located based on the binary classification model, lip sequences that are coupled with but not among the supported commands are screened out, and the lip language recognition model then compatibly recognizes lip languages with different pronunciations, further improving the recognition of strongly coupled lip language.
Based on the above inventive concept, another application scenario of the lip language identification method provided in the embodiment of the present application is described. As shown in fig. 7, the application scenario includes: a server 701, a target object 702 and a human-computer interaction device 703;
the server 701 may be a server, a server cluster formed by a plurality of servers, or a cloud computing center. The server 701 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, an intermediate service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.
The human-computer interaction device 703 may be any of various intelligent devices, including but not limited to computers, laptop computers, smart phones, tablets, smart refrigerators, smart air conditioners, smart televisions, and the like.
The target object 702 may perform a virtual conference through the human-computer interaction device 703, and if the target object 702 needs to adjust the volume up but the sound monitoring channel is occupied, the target object 702 may say "louder" through lip language, and the human-computer interaction device 703 acquires video data of the target object 702 and may send the video data to the server 701 for lip language recognition. Certainly, in specific implementation, lip language recognition can be completed by the human-computer interaction device. The following description takes lip language recognition by the server as an example, as shown in fig. 8 a:
In step 701: the human-computer interaction device collects video data of a target object, wherein the target object is a participant in the virtual conference;
In step 702: the server extracts lip images of the target object from the multiple frames of images and stores the extracted lip images into a lip image set in time order;
In step 703: a sliding window is used to intercept multiple frames of lip images from the lip image set to obtain a lip image sequence to be processed;
In step 704: lip language recognition is performed on the lip image sequence to be processed to obtain a lip language recognition result, and a corresponding operation is executed according to the lip language recognition result.
Therefore, in the embodiment of the application, lip images are extracted from the video so that the lip characteristics can be analyzed; a lip image sequence containing the lip language can be intercepted through the sliding window, so that the image sequence in which the lip language is located is effectively obtained; lip language recognition is then applied to obtain a lip language recognition result used to control the human-computer interaction device.
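A minimal sketch of the sliding-window step is given below; the window length and stride are assumed parameters.

```python
# Illustrative sketch of the sliding-window step: consecutive lip images are
# kept in a time-ordered set, and fixed-length windows with a chosen stride are
# cut out as candidate lip image sequences. Window length and stride are
# assumed parameters.
def sliding_windows(lip_image_set, window=24, stride=8):
    """lip_image_set: time-ordered list of lip images; yields lists of `window` frames."""
    for start in range(0, len(lip_image_set) - window + 1, stride):
        yield lip_image_set[start:start + window]

if __name__ == "__main__":
    frames = [f"lip_{i:03d}" for i in range(60)]     # stand-in lip image set
    for seq in sliding_windows(frames, window=24, stride=8):
        print(seq[0], "...", seq[-1])                # each candidate sequence to process
```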
In another embodiment, besides using a sliding window to obtain the lip image sequence, the lip language recognition method applied to the application scenario of fig. 7 may proceed as shown in fig. 8b:
in step 801: carrying out video acquisition on a target object; where the target object may be any one of the participants.
In step 802: the following steps are performed on each frame of target image from which lip information needs to be extracted: extracting a lip image of the target object from the target image; and classifying the lip image as a voiced frame or a silent frame.
In step 803: if the classification results of consecutive lip-image frames follow the change pattern of silent frame to voiced frame to silent frame, the start and end positions of the lip language are located within the consecutive frames based on that pattern.
In step 804: the lip image sequence between the start and end positions is acquired, and lip language recognition is performed to obtain a lip language recognition result.
Thus, in the embodiment of the application, the lip images are extracted from the video so that the lip features can be analyzed, the lip image sequence is obtained by locating the start and end positions of the lip language, and lip language recognition is then applied to obtain the recognition result used to control the human-computer interaction device.
For example, when the server performs the lip language recognition, the server generates an instruction for controlling the human-computer interaction device according to the lip language recognition result, such as "louder"; or, when the human-computer interaction device performs the lip language recognition itself, it directly executes the instruction corresponding to the lip language recognition result.
The embodiments of the application provide a lip language recognition method and a service device, which, in addition to voice interaction, add a multi-modal signal based on lip language recognition results to improve the applicability and stability of human-computer interaction. The lip language recognition method in the application may include the following parts: extracting the lip image sequence with a sliding window; and, to improve the accuracy of lip language recognition, aligning the lip images, locating the start and end positions of the lip language, and performing the subsequent lip language recognition. These main parts are described below:
First, extracting the lip image sequence using a sliding window
This part comprises two steps: acquiring the lip images, and intercepting the lip image sequence with a sliding window:
1. Obtaining lip images
Lip images of a target object are extracted from multi-frame images of video data, and the extracted lip images are stored in a lip image set in time series.
In practice, the lip image region may be located based on classification recognition and position detection techniques. Of course, to improve the accuracy of lip recognition, in the embodiment of the present application the lip images may be extracted based on face key points, which can be implemented by performing the following operations on each frame of image in the video data:
Face detection is performed on each frame of image in the video data to obtain the face key points of that image. In one embodiment, a conventional face alignment algorithm, the Supervised Descent Method (SDM), uses a Histogram of Oriented Gradients (HOG) feature extractor and a linear regressor to extract the face key points; in the present application, a Convolutional Neural Network (CNN) replaces the HOG feature extractor and the linear regressor in SDM, so that the face key points are extracted by the CNN within SDM. The algorithm marks the face with 68 key points which, as shown in fig. 9, locate the basic contour of the face and feature contours such as the eyebrows, eyes, nose and lips. Then, according to the lip key points among the face key points, the lip image of the target object can be cropped from each frame of image. As shown in fig. 9, if 20 key points locate the mouth contour, the lip image can be cropped from the image according to these 20 key points. The lip images are then stored into the lip image set according to the time-sequence positions of the images in the video.
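For illustration only, the following sketch (not part of the original disclosure) shows one way the lip image could be cropped from the mouth key points, assuming the publicly available dlib 68-point landmark predictor (points 48-67 cover the mouth) and OpenCV; the model file name and the padding ratio are illustrative assumptions.

```python
import cv2
import dlib
import numpy as np

# Assumption: the standard dlib 68-point landmark model file is available locally.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lip(frame, pad_ratio=0.15):
    """Return the lip image of the first detected face in a BGR frame, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 outline the outer and inner lip contours (20 points).
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    pad = int(pad_ratio * max(w, h))      # small margin around the lips
    x0, y0 = max(x - pad, 0), max(y - pad, 0)
    return frame[y0:y + h + pad, x0:x + w + pad]
```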
It should be understood by those skilled in the art that the total number of key points above is not a fixed value and may be set according to the specific situation; other face key point recognition techniques may use more or fewer than 68 key points to express the facial-feature regions of a face image, and the lip language recognition method provided in the present application applies equally to them.
In one embodiment, if multiple faces exist in the video data, key point marking is performed on all of them with the above key point method and the state of the lips is detected. In another embodiment, since the target object faces the camera during human-computer interaction, its visual focus usually falls on the human-computer interaction device, so the target object can be identified by visual focus positioning: for example, the positional relationship between the gaze focus of each face in the video data and the human-computer interaction device can be computed, and the user whose visual focus falls on the human-computer interaction device is taken as the target object.
In addition, in another embodiment, the positional relationship between the multiple faces and the human-computer interaction device can be judged. For example, if only one of the faces is facing the human-computer interaction device, the user corresponding to that face is taken as the target object; if several faces are facing the human-computer interaction device, the users corresponding to those faces are all taken as target objects, and the target object giving the lip language instruction can then be screened out through the subsequent lip language recognition.
2. Intercepting the lip image sequence using a sliding window
This part can be divided into two sub-steps: acquiring the image sequence, and downsampling. Downsampling is optional.
2-1. Acquiring the image sequence
In one embodiment, because lip language has a certain continuity, an image sequence covering a short time contains the lip language. A sliding window can therefore be used to intercept multiple frames of lip images from the lip image set to obtain the lip image sequence to be processed. The size of the sliding window can be determined experimentally; experiments show that the sliding-window approach provides a convenient and simple way to acquire the image sequence containing the lip language and thereby locate the lip language.
In another embodiment, in order to adapt to different situations and improve the accuracy of locating the lip language, the sliding window may have two dynamically adjustable parameters: one is the number of frames extracted, and the other is the step size.
The number of frames extracted determines how many lip image frames the window contains, and the step size determines the number of frames between two adjacent sliding windows. In implementation, the number of lip image frames intercepted from the lip image set can therefore be set by adjusting the number of frames extracted by the sliding window as required, and the frame interval between two adjacent sliding windows can be adjusted through the step size; that is, if the lip image sequence intercepted by one sliding window is regarded as one sampling of the video, adjusting the step size of the sliding window adjusts the sampling frequency of the video.
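As a non-limiting sketch of the sliding-window interception described above (the function name and defaults are assumptions, and the step size is read here as the number of frames the window start advances each time):

```python
def sliding_windows(lip_images, num_frames=20, step=5):
    """Yield lip image sequences of length num_frames from the lip image set,
    advancing the window start by `step` frames each time."""
    for start in range(0, len(lip_images) - num_frames + 1, step):
        yield lip_images[start:start + num_frames]
```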
In implementation, the number of sampled frames appropriate for the later lip language recognition can differ with speech rate; for example, the number of sampled frames can be reduced appropriately when the speech rate is faster. Therefore, in order to intercept the lip image sequence reasonably, the number of frames extracted by the sliding window can be adjusted dynamically based on the speech rate; in practice, the number of frames extracted may be inversely related to the speech rate.
To adjust the number of frames extracted according to the speech rate, the embodiments of the present application provide the following illustrative methods:
In one embodiment, the speech rate of the user is determined from the audio and video data of the target object. For example, audio and video data of a number of users are collected in advance, the facial features of each user are analyzed from the video, and the speech rate of the user is analyzed from the audio (for example, the number of characters spoken per unit time represents the speech rate). The facial features of the user are then associated with the user's speech rate and stored in a database. At recognition time, face detection is performed on the user, face recognition is performed based on the facial features stored in the database, the stored facial features matching the user are obtained, and the speech rate associated with the matched facial features is retrieved. The number of frames to extract for that user is then determined from a preset correspondence between speech rate and number of frames extracted, which can be obtained experimentally.
In another embodiment, the speech rate is determined with visual techniques. For example, when the target object is in a video conference, the change frequency of the lip movement is analyzed visually; the higher the change frequency, the faster the speech rate, and the number of frames to extract is then determined from the speech rate.
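Purely for illustration, a possible inverse mapping from a measured speech rate to the number of frames to extract might look as follows; every constant here is an assumption rather than a value from this application.

```python
def frames_for_speech_rate(chars_per_second, base_frames=20, base_rate=4.0,
                           min_frames=10, max_frames=30):
    """Inverse relation between speech rate and number of frames extracted:
    the faster the speech, the fewer frames are taken (values are illustrative)."""
    frames = int(round(base_frames * base_rate / max(chars_per_second, 1e-3)))
    return max(min_frames, min(max_frames, frames))
```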
Regarding the step size of the sliding window, for example: if the lip image set comprises 30 frames of images, the number of frames extracted is 20 and the step size is 5, the first sliding window extracts frames 1-20 and the second sliding window extracts 20 frames starting from the 26th frame. In the embodiment of the present application, the step size may be dynamically adjusted with the following methods:
one embodiment is that the step length is positively correlated with the accuracy of lip identification, the smaller the step length, the more accurate the accuracy, and the higher the calculation amount, which can be determined by the staff in the field according to the empirical value after many tests; the user can set the step length according to the precision requirement of the user.
In another embodiment, the step size is adjusted according to the age and pronunciation of the user; for example, the more accurate the user's pronunciation, the larger the step size, and the older the user, the smaller the step size.
2-2. Downsampling
In the embodiment of the application, after multiple frames of lip images have been intercepted from the lip image set according to the preset sliding window and the lip image sequence to be processed has been obtained, the sequence may, to simplify computation, be sampled at equal intervals from a specified sampling start point, yielding a target number of lip images to be processed for lip language recognition.
For example, 20 frames of lip images are intercepted from the lip image sequence in the lip image set according to the preset sliding window. When these 20 frames are sampled at equal intervals, if the starting point is designated as the 2nd frame, the sampling is performed at equal intervals starting from the 2nd of the 20 frames.
If the starting point is designated as the 1st frame, the sampling is performed at equal intervals from the 1st frame to the 20th frame.
The choice of the designated starting point is arbitrary, i.e. the equal-interval sampling may start from the 1st frame or from the 2nd frame.
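A minimal sketch of the equal-interval downsampling described above, assuming the window is a list of frames and the start index is given explicitly:

```python
import numpy as np

def downsample(window, target_num, start=0):
    """Pick `target_num` frames at (approximately) equal intervals,
    beginning from the designated starting frame index `start`."""
    idx = np.linspace(start, len(window) - 1, num=target_num)
    return [window[int(round(i))] for i in idx]
```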
Second, aligning the lip images
As an optional scheme, lip image alignment may be performed. The order of acquiring the image sequence with the sliding window and performing the alignment is not limited: the image sequence may be acquired with the sliding window first and then aligned, or the alignment may be performed first and the image sequence acquired with the sliding window afterwards.
In the embodiment of the present application, downsampling is performed only when it is needed and may be skipped otherwise. When downsampling is performed, its order relative to the alignment process is not limited, but downsampling takes place after the image sequence has been acquired with the sliding window.
Because the target object is not stationary relative to the camera, the size and position of the lip images of the target object vary across the video data, so the lip images need to be aligned to reduce or even eliminate the deviations introduced when the target object is captured at different shooting angles. In the embodiment of the application: to handle deviation caused by the target object shaking left-right or up-down in front of the camera, the lips are made parallel to a specified direction; to handle deviation caused by the varying distance between the target object and the camera, the scale of the lip images is adjusted; and to handle deviation caused by the target object turning its head left or right relative to the camera, the angle of the lip images is adjusted. The three modes are explained below:
(1) Making the lips parallel to a specified direction
As shown in fig. 10a before processing, if the target object shakes left-right or up-down relative to the camera, the lip images of the target object do not lie in the same horizontal direction.
In the embodiment of the present application, one frame is selected in advance from the lip images as the reference frame, and the direction through the left and right end points of the lips in the reference frame is taken as the specified direction; of course, the direction through the upper and lower end points of the lips may also be used. Translation transformation is then applied to adjust the lip boundaries in the lip images of the other frames to be parallel to the specified direction. Taking the direction through the left and right lip end points as the specified direction as an example, after translation transformation of the lip images of different frames, the directions through the left and right lip end points of different frames are nearly parallel, as shown in fig. 10a after processing.
In implementation, the distances between the lip key points of the reference frame and those of a non-reference frame can be calculated, a translation transformation matrix determined from these distances, and the translation applied to the lip key points of the other frames, so that the directions through the left and right lip end points of the different frames become nearly parallel.
(2) Adjusting the scale
As shown in fig. 10b before processing, when the target object moves back and forth relative to the camera while speaking, the proportion of the designated range occupied by the lip image differs;
In the embodiment of the application, one frame is selected in advance from the lip images as the reference frame, the lip images of the non-reference frames are scaled according to a preset proportion, and a scaling transformation is applied so that the lip images of different frames are scaled to a specified size. For example, after the lip key points are identified, the lip images are cropped from the original images based on the top, bottom, left and right vertices of the lips, and the lip images of different frames are then scaled to the same size; the effect before and after processing is shown in fig. 10b.
(3) Adjusting the angle
As shown in fig. 10c before processing, when the target object turns its head left or right in front of the camera, the angle of the lip image relative to the lens changes.
In the embodiment of the present application, one frame is selected in advance from the lip images as the reference frame; the distances and angles between the lip key points of the reference frame and those of the other lip images are calculated, an affine matrix is determined from these distances and angles, and the lip images of the other frames are adjusted with the affine matrix so that their orientation relative to the lens that captured the video data is the same as that of the reference frame, as shown in fig. 10c after processing.
In the present application, the lip images are aligned by using the above-described alignment method, so that the size and the angle of the lip images in the lip image set are the same.
Aligning the lip images acquired from the different frames and using the processed lip images for lip language recognition improves the accuracy of the lip language recognition.
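For illustration only, translation, scaling and rotation to a reference frame can be folded into a single similarity transform estimated from the lip key points; the sketch below uses OpenCV for this purpose and assumes the reference key points are already expressed in the coordinates of the output canvas.

```python
import cv2
import numpy as np

def align_to_reference(lip_img, lip_pts, ref_pts, out_size=(64, 64)):
    """Warp one frame's lip image so its key points match the reference frame's
    key points (translation, uniform scale and in-plane rotation in one step)."""
    M, _ = cv2.estimateAffinePartial2D(np.float32(lip_pts), np.float32(ref_pts))
    if M is None:                      # fallback when the transform cannot be estimated
        return cv2.resize(lip_img, out_size)
    return cv2.warpAffine(lip_img, M, out_size)
```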
Third, locating the start and end positions of the lip language
In general, when a user interacts through lip language, the user needs to open the mouth to convey it, so in the images conveying the lip language the lips are generally in an open rather than a closed state. In view of this, in the embodiment of the present application, the start and end positions of the lip language can be located by detecting the open and closed states of the lips.
In one embodiment, after the acquired lip images are aligned, the open and closed states of the lips can be detected through classification recognition so as to divide the lip images into pronunciation frames or silent frames.
A pronunciation frame indicates that the lips of the target object are in a speaking state in the image conveying the lip language, while a silent frame indicates that the lips are in a silent state. In practice, to improve classification accuracy, both pronunciation frames and silent frames are identified from an image sequence containing the relevant frames; for example, whether the current frame is a pronunciation frame is decided from the current frame together with at least one historical frame before it. In practice, image dynamics of the current frame and at least one historical frame can be extracted and then classified, where the at least one historical frame may be a run of consecutive frames; alternatively, optical flow features of the current frame and at least one historical frame can be extracted for classification, where the at least one historical frame may be the frame immediately before the current frame or a historical frame several frames earlier.
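As one hedged illustration of the optical-flow option mentioned above, the mean flow magnitude between the current lip frame and a historical lip frame could serve as a simple motion feature for the pronunciation/silent decision; the downstream classifier and any threshold are left to the application.

```python
import cv2
import numpy as np

def lip_motion_feature(prev_lip, cur_lip):
    """Mean dense optical-flow magnitude between two BGR lip frames of equal size."""
    prev_g = cv2.cvtColor(prev_lip, cv2.COLOR_BGR2GRAY)
    cur_g = cv2.cvtColor(cur_lip, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_g, cur_g, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.linalg.norm(flow, axis=2).mean())
```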
Alternatively, in another embodiment, a 2D CNN (two-dimensional convolutional neural network) combined with a shift operation, typically in the form of a Temporal Shift Module (TSM), may be used to classify the multi-frame sequence containing the current frame and divide the current frame into a pronunciation frame or a silent frame.
Since a video is itself a sequence of images, the image sequence that conveys the lip language can be located by dividing the frames into pronunciation frames and silent frames.
In one embodiment, in order to classify the lip images accurately, the lip images may be divided into pronunciation frames or silent frames based on a pre-trained lip image classification model in the embodiment of the present application. The lip image classification model can be obtained as follows:
First, lip sample images are obtained and class labels are annotated for them, a class label indicating whether the lip sample image is a pronunciation frame or a silent frame.
In practice, the lip sample images may be labeled manually. To improve labeling efficiency, the embodiment of the application adopts automatic labeling of pronunciation positions through voice detection; for example, the labeling can be performed automatically through audio-video cooperation. One or more video segments are recorded, lip images are extracted from each frame of the video as lip sample images, and the audio is recorded at the same time, so that each frame of lip sample image has a corresponding speech signal. Whether the corresponding lip sample image contains lip motion can then be determined from the speech signal.
As shown in fig. 11, the lip sample image can be labeled according to the voice signal corresponding to the lip sample image by the following method:
S1101: perform audio voice activity detection (A-VAD) on the speech signal corresponding to the lip sample image to obtain a voice detection result.
To improve detection accuracy, the speech signal of each lip sample image is a speech segment comprising the audio of the lip sample image itself and of several frames of images before it.
S1102: determine whether the voice detection result is speech; if not, perform step S1103; if so, perform step S1104.
S1103: if no speech is contained, label the lip sample image as a silent frame.
In one embodiment, if speech is contained, the corresponding lip sample image could simply be labeled as a pronunciation frame. However, VAD has the drawback that a relatively noisy speech signal is easily judged to be speech, so in the embodiment of the present application an energy detection technique is further used to identify pronunciation frames accurately and improve the labeling accuracy. If the voice detection result is speech, the normalization result of the energy value of the speech signal is calculated in step S1104.
In an embodiment of the present application, after all energy values of a speech signal are determined by an energy detection technique, each energy value is divided by a maximum value of the energy values to obtain a normalization result of the energy values of the speech signal.
After step S1104 is executed, the following steps are performed:
S1105: judge whether the normalization result of the energy value of the speech signal is greater than a preset threshold; if not, perform step S1106; if so, perform step S1107.
S1106: label the lip sample image as a silent frame.
S1107: label the lip sample image as a pronunciation frame.
For example, fig. 12 shows labeled lip sample images, where reference numerals 1, 2, 10 and 11 are silent frames and reference numerals 3 to 9 are pronunciation frames.
The preset threshold here may be determined through repeated experimental analysis. Further, multiple preset thresholds may be set for different timbres; for example, corresponding preset thresholds may be set for the voices of men, women, the elderly and children.
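A minimal sketch of the labeling logic of steps S1101-S1107, assuming `is_speech` is any VAD callable returning a boolean for an audio segment and that the 0.3 threshold is purely illustrative:

```python
import numpy as np

def label_lip_frames(speech_segments, is_speech, threshold=0.3):
    """Label each lip sample frame 'pronunciation' or 'silent' (S1101-S1107).
    `speech_segments` holds one audio segment per lip sample image."""
    energies = np.array([float(np.sum(np.square(np.asarray(s, dtype=np.float64))))
                         for s in speech_segments])
    norm = energies / (energies.max() + 1e-8)        # S1104: divide by the maximum
    labels = []
    for seg, e in zip(speech_segments, norm):
        if not is_speech(seg):                        # S1102 -> S1103
            labels.append("silent")
        elif e > threshold:                           # S1105 -> S1107
            labels.append("pronunciation")
        else:                                         # S1105 -> S1106
            labels.append("silent")
    return labels
```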
Accurately determining the pronunciation frames and silent frames of the lip images in this way avoids misjudgments during lip pronunciation and improves the accuracy of lip language recognition.
After the lip sample images and their associated class labels are obtained, the lip sample images are input into the lip image classification model to be trained to obtain the predicted class labels output by the model; the loss between the predicted class labels and the class labels is determined according to a preset loss function, and the parameters of the model to be trained are trained accordingly to obtain the lip image classification model.
The loss between the predicted class label and the class label can be computed with measures such as cross entropy. The lip image classification model may be a convolutional neural network model; its specific structure is not limited here and may be adjusted according to the actual application.
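Purely as a sketch of such training with a cross-entropy loss (PyTorch is assumed here; the optimizer, learning rate and epoch count are illustrative choices, not values from this application):

```python
import torch
import torch.nn as nn

def train_classifier(model, loader, epochs=10, lr=1e-3):
    """Train any CNN mapping a lip image batch to 2 logits (pronunciation/silent).
    `loader` is assumed to yield (images, labels) batches."""
    criterion = nn.CrossEntropyLoss()                 # loss between predicted and true labels
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```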
Optionally, after the lip image classification model is trained, classification recognition of multiple frames of lip images can be performed. Because of the way lip language interaction works, a user usually expresses a complete intention through lip language within a short time, so the image sequence containing the lip language should generally consist of a run of consecutive pronunciation frames. To locate the start and end of the lip language accurately, the change rule of an image sequence containing lip language can therefore be defined as: the classification results first show silent frames, then a run of consecutive pronunciation frames, and then silent frames again because the user has stopped speaking. Accordingly, if the classification results of continuous multi-frame lip images satisfy the change rule from silent frame to pronunciation frame and back to silent frame, the start and end positions of the lip language can be located from the continuous multi-frame lip images based on this rule: the change from silent frame to pronunciation frame marks the start position of the lip language, and the change from pronunciation frame to silent frame marks the end position.
Based on the above change rule, in one embodiment, the lip image first classified as a pronunciation frame in the continuous multi-frame lip images can be determined as the start frame of the lip language, and the lip image last classified as a pronunciation frame in the continuous multi-frame lip images can be determined as the end frame of the lip language.
Illustratively, in fig. 12 the lip image first classified as a pronunciation frame (i.e., the image corresponding to reference numeral 3) is determined as the start frame of the lip language, and the lip image last classified as a pronunciation frame (i.e., the image corresponding to reference numeral 9) is determined as the end frame of the lip language.
In addition, because of the randomness of the user's speech, a frame may be misjudged as silent, or the user may speak slowly, so that silent frames appear between the pronunciation frames of the lip language. To identify the end frame of the lip language accurately, its position can therefore be confirmed with the help of the classification results of the several frames that follow it, as shown in fig. 13:
S1301: a silent frame appears for the first time after a pronunciation frame has been detected.
S1302: detect whether a pronunciation frame exists within a preset number of frames after the first-appearing silent frame; if not, perform step S1303; if so, return to step S1301 starting from that pronunciation frame.
S1303: the frame preceding the first-appearing silent frame is determined to be the lip image last classified as a pronunciation frame.
Based on the accurately located start and end positions of the lip language, the lip image sequence containing the lip language can be obtained for subsequent lip language recognition.
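A non-authoritative sketch of locating the start and end frames from per-frame labels, including the tolerance of steps S1301-S1303 (the `patience` value is an assumption):

```python
def locate_lip_span(labels, patience=3):
    """Return (start, end) indices of the lip-language segment in a list of
    'pronunciation'/'silent' labels, tolerating up to `patience` silent frames
    inside a run of pronunciation frames."""
    start = end = None
    silent_run = 0
    for i, label in enumerate(labels):
        if label == "pronunciation":
            if start is None:
                start = i            # first pronunciation frame: start of lip language
            end = i                  # latest pronunciation frame seen so far
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > patience:   # no pronunciation frame within the window: stop
                break
    return start, end
```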
Fourth, lip language recognition
In general, the lip movements of different utterances can be similar, and the lip language commands that the human-computer interaction device can support are limited. In the embodiments of the present application, a noise sequence may therefore be defined for lip image sequences that do not convey a supported command.
In this way the lip language can be located more reliably and the efficiency of lip language recognition improved. In the embodiment of the application, after the lip image sequence is obtained, noise screening can further be performed so that lip image sequences belonging to noise sequences are filtered out.
In some embodiments, noise sequence screening may be implemented with classification recognition techniques. For example, a binary classification model can be trained to classify the sequences, the classification result indicating whether the input lip image sequence is a noise sequence. If it is not a noise sequence, it is handed to the subsequent lip language recognition model for lip language recognition; otherwise, the lip image sequence can be discarded without lip language recognition.
In training the binary classification model, each training sample has an associated label indicating whether the sample is a noise sequence. In the model training stage, a training sample is input into the binary classification model to be trained so that the model predicts whether the sample is a noise sequence; the prediction is compared with the label to obtain a loss, and the parameters of the model are then adjusted based on the loss until training converges.
In one embodiment, image features of the lip images can be extracted using HOG and LBP (Local Binary Pattern) and fed into a CNN model for recognition to determine whether the lip image sequence is a noise sequence, the CNN model being trained until it converges.
In another embodiment, the binary classification can be based on the proportion of pronunciation frames in the lip image sequence: the sequence is classified as a non-noise sequence if the proportion of pronunciation frames is greater than a threshold, and as a noise sequence otherwise.
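For illustration, the proportion-based screening could be as simple as the following; the 0.5 ratio is an assumed threshold, not a value from this application.

```python
def is_noise_sequence(labels, min_ratio=0.5):
    """Treat a lip image sequence as a noise sequence when too few of its
    frames are pronunciation frames (threshold is illustrative)."""
    voiced = sum(1 for label in labels if label == "pronunciation")
    return voiced / max(len(labels), 1) < min_ratio
```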
After the lip image sequence has been screened, lip language recognition is performed to identify the user's intention accurately and determine the operation to be executed. In the embodiment of the present application, the lip language recognition method is illustrated through the following embodiments:
In one embodiment, lip language recognition may be implemented with a 3D convolutional neural network model (3D CNN). Illustratively, the number of lip images in the lip image sequence is taken as one dimension and the matrix of pixels in each lip image as the other two dimensions. For example, if the lip image sequence comprises 30 lip images of 64 × 64 pixels, processing along these dimensions yields a 64 × 64 × 30 × 3 tensor corresponding to the lip image sequence, where 64 × 64 × 30 are the dimension values and 3 is the number of channels. The tensor is input into a pre-trained 3D convolutional neural network model, which outputs an 11 × 1 vector; the first 10 values correspond to different command words, and the last value represents noise data. In some embodiments, several convolutional layers may be set in the 3D convolutional neural network model, pooling layers may be placed between the convolutional layers, and a discarding layer (i.e., dropout) may be set in the fully connected layer.
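A minimal 3D CNN sketch consistent with the description above (input of 30 frames of 64 × 64 × 3 lip images, 11 output scores); the layer sizes are assumptions and PyTorch is used purely for illustration.

```python
import torch
import torch.nn as nn

class Lip3DCNN(nn.Module):
    """Input shape (batch, 3, 30, 64, 64); output 11 logits (10 command words + noise)."""
    def __init__(self, num_classes=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # halve time and space
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                      # "discarding layer" before the fully connected layer
            nn.Linear(64 * 7 * 16 * 16, num_classes),
        )

    def forward(self, clip):
        return self.classifier(self.features(clip))

# Example: logits = Lip3DCNN()(torch.randn(1, 3, 30, 64, 64))
```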
In training the 3D convolutional neural network model, each training sample has an associated label indicating the class of command word the sample contains. In the model training stage, a training sample is input into the 3D convolutional neural network model to be trained so that the model predicts the class of command word contained in the sample; the prediction is compared with the label to obtain a loss, and the parameters of the model are then adjusted based on the loss until training converges.
In another embodiment, because recognizing lip language with a 3D convolutional neural network model involves a large amount of computation, high resource requirements and slow response (stalling may even occur in the stress-testing stage of practical applications), the lip language may instead be recognized with a 2D convolutional neural network model (2D Convolutional Neural Networks, 2D CNN), whose computation is much lighter, for example a lightweight 2D convolutional neural network such as SqueezeNet, MobileNet or ShuffleNet. However, the temporal correlation between lip image frames must be considered, and a 2D convolutional neural network model alone cannot capture it. As shown in fig. 14a, the input of a 2D convolutional neural network model is three-dimensional data, [width, height, number of input channels], and the required convolution kernel size correspondingly is [kernel length 1, kernel length 2, number of input channels]; one convolution kernel yields one feature map, and several kernels can be defined if several feature maps are needed. The 3D convolutional neural network model instead takes four-dimensional input; as shown in fig. 14b, the four-dimensional data corresponding to the images is [depth, width, height, number of input channels], i.e., one extra (time) dimension is added to the three-dimensional input of the 2D model, and the required convolution kernel size correspondingly is [kernel length 3, kernel length 4, kernel length 5, number of input channels]. The 3D convolutional neural network model can therefore process the input images along the time dimension, while the 2D convolutional neural network model cannot. For this reason, the 2D convolutional neural network model can be combined with a neural network model based on the self-attention mechanism (Transformer) to replace the 3D convolutional neural network model for lip language recognition.
In some embodiments, the combination of the 2D CNN and a self-attention-based neural network model may be replaced by a combination of the 2D CNN and the TSM, or of the 2D CNN and an LSTM (Long Short-Term Memory, a recurrent neural network, RNN); the specific combination is not limited here and may be adjusted according to the actual application.
In some embodiments, two-dimensional feature extraction is performed on each frame of lip image in the lip image sequence to obtain the two-dimensional lip features corresponding to each frame; the three-dimensional lip features of the lip image sequence are determined based on the relations between the two-dimensional lip features of the individual frames; and multi-class recognition is performed on the three-dimensional lip features to obtain the lip language recognition result.
For example, the lip features (i.e., the two-dimensional lip features) of the lip image sequence are extracted by a pre-trained 2D convolutional neural network model and then fed into a pre-trained self-attention-based neural network model, which outputs their time-ordered representation (i.e., the three-dimensional lip features); multi-class recognition is performed on the three-dimensional lip features to obtain the lip language recognition result, achieving a lip language recognition method that balances efficiency and accuracy.
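A hedged sketch of this 2D-CNN-plus-self-attention combination (the backbone, feature width and layer counts are assumptions; PyTorch with batch-first Transformer layers is used purely for illustration):

```python
import torch
import torch.nn as nn

class Lip2DTransformer(nn.Module):
    """Input (batch, frames, 3, 64, 64); output logits over command words + noise."""
    def __init__(self, num_classes=11, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(        # lightweight per-frame 2D feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clips):
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, t, -1)  # 2D features per frame
        feats = self.temporal(feats)                               # temporal self-attention
        return self.head(feats.mean(dim=1))                        # pool over time, classify
```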
In another embodiment, the lip image sequence may also be processed in blocks, each block containing three-dimensional information comprising the image width direction, the image height direction and the time sequence. Features are extracted from the planes formed by any two of these three dimensions to obtain LBP-TOP features (local binary patterns from three orthogonal planes), and multi-class recognition is performed based on the LBP-TOP features of the lip image sequence to obtain the lip language recognition result.
For example, a conventional machine learning algorithm may be used to classify the lip image sequence, such as LBP combined with an SVM (support vector machine). LBP is an operator describing the local texture features of an image; it has notable advantages such as rotation invariance and gray-scale invariance and can be used to extract local texture features. Since the lip image sequence contains both the temporal and the spatial information of lip movement, LBP is applied block-wise to the sequence, features are extracted, the features extracted in the spatial domain and in the temporal domain are compressed and concatenated into a one-dimensional feature of the lip image sequence, and this one-dimensional feature is finally classified to obtain the lip language recognition result.
In the embodiment of the application, the service device first captures video of a target object and then performs the following steps on each frame of target image from which lip information is to be extracted: extracting the lip image of the target object from the target image, classifying the lip image and dividing it into a pronunciation frame or a silent frame; if the classification results of continuous multi-frame lip images satisfy the change rule from silent frame to pronunciation frame and back to silent frame, locating the start and end positions of the lip language from the continuous multi-frame lip images based on the change rule; and, after obtaining the lip image sequence between the start and end positions, performing lip language recognition to obtain the lip language recognition result. In this way, in addition to voice interaction, a multi-modal signal based on lip language recognition results can be added, improving the applicability and stability of human-computer interaction.
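Returning to the conventional LBP-plus-SVM route described above, a simplified stand-in (concatenating per-frame uniform-LBP histograms rather than full LBP-TOP features) might look as follows; scikit-image and scikit-learn are assumed, and all parameters are illustrative.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_sequence_feature(gray_frames, p=8, r=1):
    """Concatenate per-frame uniform-LBP histograms into one 1-D feature vector."""
    hists = []
    for frame in gray_frames:                       # each frame: 2-D grayscale lip image
        lbp = local_binary_pattern(frame, p, r, method="uniform")
        hist, _ = np.histogram(lbp, bins=p + 2, range=(0, p + 2), density=True)
        hists.append(hist)
    return np.concatenate(hists)

# Assumption: training sequences and labels are prepared elsewhere, e.g.
# clf = SVC(kernel="rbf").fit([lbp_sequence_feature(s) for s in train_seqs], train_labels)
```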
To facilitate understanding, an application scenario of the service device provided in this embodiment is described below, as shown in fig. 14c. In step S1401, the video collector 232 in the service device 200 collects video data of the surrounding environment of the service device 200 in real time and sends the collected video data to the controller 250, and the controller 250 sends the video data to the server 400 through the communicator 220. In step S1402, the server 400 identifies the instruction contained in the video data and sends the identified instruction back to the service device 200; correspondingly, the service device 200 receives the instruction through the communicator 220 and passes it to the controller 250. Finally, in step S1403, the controller 250 can directly execute the received instruction.
An embodiment of the present invention further provides a computer storage medium in which computer program instructions are stored; when the instructions are run on a computer, they cause the computer to execute the steps of the lip language identification method described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A lip language identification method is characterized by comprising the following steps:
carrying out video acquisition on a target object;
extracting a lip image of the target object from the target image; classifying and identifying the lip images, and dividing the lip images into pronunciation frames or silence frames, wherein the pronunciation frames are used for indicating that the lips of the target object are in a pronunciation state, and the silence frames are used for indicating that the lips of the target object are in a silence state;
if the classification recognition result of the continuous multi-frame lip images meets the change rule from the silent frame to the pronunciation frame and then to the silent frame, positioning the start-stop position of the lip language from the continuous multi-frame lip images based on the change rule;
and acquiring a lip image sequence between the start and stop positions, and performing lip language identification to obtain a lip language identification result.
2. The method according to claim 1, wherein the classifying and identifying the lip images and dividing the lip images into pronunciation frames or silence frames comprises:
dividing the lip images into pronunciation frames or silence frames based on a pre-trained lip image classification model;
wherein the lip image classification model is obtained according to the following method:
obtaining a lip sample image, wherein the lip sample image is associated with a corresponding class label, and the class label is labeled according to a voice signal corresponding to the lip sample image;
inputting the lip sample image into a lip image classification model to be trained to obtain a predicted class label of the lip sample image output by the lip image classification model to be trained;
and determining the loss between the predicted class label and the class label according to a preset loss function, and training the parameters of the lip image classification model to be trained to obtain the lip image classification model.
3. The method according to claim 2, wherein labeling the lip sample image according to the voice signal corresponding to the lip sample image comprises:
carrying out voice activity detection on the voice signal corresponding to the lip sample image to obtain a voice detection result; the voice signal is a voice segment within a specified number of frames before the lip sample image and the lip sample image;
if the voice signal is determined not to be a pronunciation signal based on the voice detection result, marking the lip sample image as a silent frame;
if the voice signal is determined to be a pronunciation signal based on the voice detection result, and the normalization result of the energy value of the voice signal is smaller than or equal to a preset threshold value, marking the lip sample image as a silence frame;
and if the voice signal is determined to be a pronunciation signal based on the voice detection result and the normalization result of the energy value of the voice signal is greater than the preset threshold value, marking the lip sample image as a pronunciation frame.
4. The method of claim 1, wherein prior to said classifying identification of the lip images, the method further comprises:
the lip images of different frames are aligned.
5. The method according to claim 4, wherein the aligning the lip images of different frames comprises any one or a combination of the following:
adjusting the lip boundaries by adopting a translation transformation and/or rotation transformation processing mode so as to enable the lip boundaries of lip images of different frames to be parallel to a specified direction;
scaling the different lip images to a specified size;
and processing lip images of different frames by adopting an affine transformation method so as to enable the lip images of different frames to be a preset orientation relative to the lens orientation for acquiring the video data.
6. The method according to claim 1, wherein the locating the starting and ending positions of the lip language from the continuous multiframe lip images based on the change rule comprises:
determining the lip image which is firstly classified as the pronunciation frame in the continuous multi-frame lip images as the initial frame of the lip language; and
determining the lip image which is lastly classified as the pronunciation frame in the continuous multi-frame lip images as the end frame of the lip language.
7. The method of claim 6, further comprising:
determining the lip image lastly classified as a pronunciation frame according to the following method:
detecting a silent frame occurring for the first time after a pronunciation frame;
detecting whether a pronunciation frame exists within a preset number of frames after the first-appearing silent frame;
if no pronunciation frame exists, determining the frame preceding the first-appearing silent frame as the lip image lastly classified as a pronunciation frame;
and if a pronunciation frame exists, returning, from that pronunciation frame, to the step of detecting the silent frame occurring for the first time after the pronunciation frame.
8. The method of claim 1, wherein after the obtaining the sequence of lip images between the start-stop positions, the method further comprises:
according to a binary classification model, performing binary classification processing on the lip image sequence, and determining whether the lip image sequence is a noise sequence;
if the sequence is not a noise sequence, executing a step of lip language recognition on the lip image sequence;
and if the lip image sequence is a noise sequence, discarding the lip image sequence.
9. A service device, comprising: a memory and a controller;
the memory for storing a computer program;
the controller is connected with the memory and configured to perform the method of any of claims 1-8 based on the computer program.
10. A computer storage medium having computer program instructions stored therein, which when run on a computer, cause the computer to perform the method of any one of claims 1-8.
CN202011599830.2A 2020-12-30 2020-12-30 Lip language identification method, service equipment and storage medium Pending CN112633208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011599830.2A CN112633208A (en) 2020-12-30 2020-12-30 Lip language identification method, service equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011599830.2A CN112633208A (en) 2020-12-30 2020-12-30 Lip language identification method, service equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112633208A true CN112633208A (en) 2021-04-09

Family

ID=75287646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011599830.2A Pending CN112633208A (en) 2020-12-30 2020-12-30 Lip language identification method, service equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112633208A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016138838A1 (en) * 2015-03-02 2016-09-09 华为技术有限公司 Method and device for recognizing lip-reading based on projection extreme learning machine
CN107992813A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip condition detection method and device
CN107992812A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip reading recognition methods and device
KR101894422B1 (en) * 2018-02-14 2018-09-04 김성환 lip recognition mobile control terminal
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium
CN110545396A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Voice recognition method and device based on positioning and denoising
CN111553300A (en) * 2020-05-08 2020-08-18 北京工商大学 Multi-time-domain resolution lip language behavior detection method for three-dimensional point cloud video
CN111881726A (en) * 2020-06-15 2020-11-03 马上消费金融股份有限公司 Living body detection method and device and storage medium
CN111898420A (en) * 2020-06-17 2020-11-06 北方工业大学 Lip language recognition system
CN111914803A (en) * 2020-08-17 2020-11-10 华侨大学 Lip language keyword detection method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
魏友达: "基于深度学习的唇语识别技术研究", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094682A (en) * 2021-04-12 2021-07-09 中国工商银行股份有限公司 Anti-fraud identity recognition method and device
CN113239857A (en) * 2021-05-27 2021-08-10 京东科技控股股份有限公司 Video synthesis method and device
CN113239857B (en) * 2021-05-27 2023-11-03 京东科技控股股份有限公司 Video synthesis method and device
WO2023006033A1 (en) * 2021-07-29 2023-02-02 华为技术有限公司 Speech interaction method, electronic device, and medium
CN113743247A (en) * 2021-08-16 2021-12-03 电子科技大学 Gesture recognition method based on Reders model
CN113793366A (en) * 2021-08-31 2021-12-14 北京达佳互联信息技术有限公司 Image processing method, device, equipment and storage medium
CN114596517A (en) * 2022-01-12 2022-06-07 北京云辰信通科技有限公司 Visual language identification method and related equipment
CN114466240A (en) * 2022-01-27 2022-05-10 北京精鸿软件科技有限公司 Video processing method, device, medium and electronic equipment
TWI805233B (en) * 2022-02-18 2023-06-11 大陸商鴻富錦精密工業(深圳)有限公司 Method and system for controlling multi-party voice communication
US12028391B2 (en) 2022-02-18 2024-07-02 Foxconn Technology Group Co., Ltd. System and method for controlling multi-party communication
WO2024001539A1 (en) * 2022-06-30 2024-01-04 上海商汤智能科技有限公司 Speaking state recognition method and apparatus, model training method and apparatus, vehicle, medium, computer program and computer program product

Similar Documents

Publication Publication Date Title
CN112633208A (en) Lip language identification method, service equipment and storage medium
US10664060B2 (en) Multimodal input-based interaction method and device
US10134364B2 (en) Prioritized display of visual content in computer presentations
US10971188B2 (en) Apparatus and method for editing content
EP3791392A1 (en) Joint neural network for speaker recognition
US11556302B2 (en) Electronic apparatus, document displaying method thereof and non-transitory computer readable recording medium
CN113835522A (en) Sign language video generation, translation and customer service method, device and readable medium
EP3709667A1 (en) Video playback device and control method thereof
US20230068798A1 (en) Active speaker detection using image data
KR20120120858A (en) Service and method for video call, server and terminal thereof
CN112492390A (en) Display device and content recommendation method
CN114610158A (en) Data processing method and device, electronic equipment and storage medium
CN115082959A (en) Display device and image processing method
CN114299940A (en) Display device and voice interaction method
CN112633211A (en) Service equipment and man-machine interaction method
CN117809679A (en) Server, display equipment and digital human interaction method
WO2020234939A1 (en) Information processing device, information processing method, and program
CN113158757B (en) Display device and gesture control method
JP2024503957A (en) Video editing methods, equipment, electronic equipment, and media
CN112926420B (en) Display device and menu character recognition method
CN114007145A (en) Subtitle display method and display equipment
CN112053688A (en) Voice interaction method, interaction equipment and server
CN113762142A (en) Lip language identification method and display device
JP2018084761A (en) Information processor, information processing system, method, and program
CN117809026A (en) Server, display device and image processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210409