CN114822543A - Lip language identification method, sample labeling method, model training method, device, equipment and storage medium


Info

Publication number
CN114822543A
Authority
CN
China
Prior art keywords
lip
lip language
voice
video
language identification
Prior art date
Legal status
Pending
Application number
CN202210573455.7A
Other languages
Chinese (zh)
Inventor
刘恒
李志刚
石磊
刘腾
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN114822543A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/065 Adaptation
    • G10L 15/07 Adaptation to the speaker
    • G10L 15/26 Speech to text systems

Abstract

The application provides a lip language identification method, a sample labeling method, a model training method, a device, equipment and a storage medium, and relates to lip language identification technology in the field of artificial intelligence. In the embodiments of the application, lip language video clips of a user are labeled with the user's speech recognition text, so that lip language recognition samples of the user are obtained automatically; a lip language recognition model is trained with these samples, and lip language recognition is then performed for the user with the trained model. Automatic labeling of lip language recognition samples, self-supervised active learning, and personalization of the lip language recognition model are thereby achieved, effectively improving lip language recognition accuracy. The whole process requires no user participation, which improves the user experience.

Description

Lip language identification method, sample labeling method, model training method, device, equipment and storage medium
The present application claims priority to Chinese patent application No. 202110643378.3, entitled "lip language identification method, sample labeling method, model training method and apparatus, device, and storage medium", filed with the China National Intellectual Property Administration on 9/6/2021, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to a lip language identification technology in the field of artificial intelligence, and in particular, to a lip language identification method, a sample labeling method, a model training method, a device, an apparatus, and a storage medium.
Background
Lip language recognition combines computer vision with natural language processing: the spoken content is interpreted from the facial features of the speaker in a video alone. It is well suited to intelligent human-computer interaction, to scenarios where the audio is damaged or unavailable, and the like, and has important practical significance for hearing-impaired and speech-impaired people and for capturing and recognizing a target speaker's speech in strongly noisy environments.
With the vigorous development of deep learning and large-scale data sets, lip language recognition based on a lip language recognition model has greatly improved performance over traditional lip language recognition techniques on word- and sentence-level recognition tasks. However, hardware devices differ, and the static and dynamic lip characteristics, language habits and other aspects of different people when speaking also differ, so the recognition accuracy of a lip language recognition model varies considerably from person to person.
Disclosure of Invention
In view of the above problems in the prior art, the present application provides a lip language identification method, a sample labeling method, a model training method, a device, equipment, and a storage medium, which can automatically complete the labeling of lip language identification samples, improve the recognition accuracy of the lip language identification model, and thereby improve the accuracy of lip language identification for different people.
In order to achieve the above object, a first aspect of the present application provides a sample labeling method applied to an electronic device, where the sample labeling method includes:
acquiring video and audio of a user in the process that the user speaks towards the electronic equipment;
extracting lip movement video clips in the video and voice clips in the audio;
selecting a lip motion video clip matched with the voice clip;
and marking the lip movement video clip by taking the voice recognition text of the voice clip as a label to obtain a lip language recognition sample of the user.
By acquiring the video and the audio of the user simultaneously while the user speaks, the speech recognition result of a voice clip in the audio can be used as the label of the corresponding lip movement video clip in the video. Labeling of lip language recognition samples is therefore completed automatically, without user participation, which improves the efficiency and accuracy of sample labeling and yields personalized, user-specific samples.
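For illustration only, the four steps above can be pictured as the following minimal Python sketch; the helper functions and the asr_model interface are assumptions introduced here for readability and do not limit the application.

```python
def build_lip_samples(video, audio, asr_model):
    """Illustrative end-to-end flow of the sample labeling method."""
    lip_segments = extract_lip_motion_segments(video)      # hypothetical: video endpoint detection + lip cropping
    voice_segments = extract_voice_segments(audio)          # hypothetical: audio endpoint detection (VAD)
    pairs = match_segments(voice_segments, lip_segments)    # hypothetical: overlap in the time dimension
    samples = []
    for voice_seg, lip_seg in pairs:
        label = asr_model.transcribe(voice_seg)              # the ASR text of the voice clip becomes the label
        samples.append({"clip": lip_seg, "label": label})
    return samples
```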
As a possible implementation manner of the first aspect, the method further includes one or more of the following: detecting the volume of the environmental noise; acquiring the wake-up voice confidence of the user; detecting whether the field of view of the camera contains a human face or a human mouth; acquiring the speaker position in the video and the sound source localization direction of the audio. Acquiring the video and audio of the user then specifically includes: acquiring the video and audio of the user when the volume of the environmental noise is equal to or smaller than a preset noise threshold, the wake-up voice confidence is greater than or equal to a preset first confidence threshold, the field of view of the camera contains a human face or a human mouth, and/or the speaker position in the video matches the sound source localization direction of the audio.
Therefore, inaccurate sample labels caused by poor voice quality, environmental noise, poor video quality and/or unsynchronized audio and video can be avoided, which effectively improves the accuracy of sample labeling.
As a possible implementation manner of the first aspect, extracting a lip movement video segment in a video and a voice segment in an audio includes: carrying out endpoint detection and segmentation on the video in a lip movement human voice interval detection mode to obtain a lip movement video segment and a human voice interval of the lip movement video segment; and/or carrying out end point detection and segmentation on the audio frequency in a voice interval detection mode to obtain voice segments and voice intervals of the voice segments. Therefore, the segmentation of the video and the audio is realized through the endpoint detection, and the corresponding human voice interval can be obtained simultaneously so as to determine the matching relation between the voice segment and the lip movement video segment, in other words, to find the voice segment and the lip movement video segment corresponding to the same speaking content.
As a possible implementation manner of the first aspect, selecting a lip motion video segment matching the voice segment includes: determining the overlapping length of the voice segment and the lip movement video segment in the time dimension according to the voice interval of the voice segment and the voice interval of the lip movement video segment; when the overlapping length of the voice segment and the lip movement video segment in the time dimension is larger than or equal to a preset time threshold value, the voice segment and the lip movement video segment are matched. Therefore, the overlapping length of the voice segments and the lip motion video segments in the time dimension is determined through the voice interval, the lip motion video segments matched with the voice segments can be efficiently, quickly and accurately found, and the voice segments and the lip motion video segments corresponding to the same speaking content can be efficiently, quickly and accurately found.
As a possible implementation manner of the first aspect, the method further includes: selecting, from the lip motion video segments matched with the voice segments, lip motion video segments whose lip language recognition confidence is smaller than a preset second confidence threshold, where the lip language recognition confidence is obtained by performing lip language recognition on the lip motion video segment with a pre-obtained lip language recognition model. In this way, the lip motion video segments about which the model is least certain, i.e. those carrying the most information, can be selected as lip language recognition samples, providing supervised samples for the subsequent transfer learning and iterative optimization of the lip language recognition model.
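As one way to picture this selection step, the sketch below keeps only the matched clips whose lip language recognition confidence falls below the second confidence threshold; the lip_model.predict interface and the threshold value are assumptions for illustration and do not limit the application.

```python
def select_informative_clips(matched_pairs, lip_model, conf_threshold=0.8):
    """Keep the clips the current lip language recognition model is least certain about."""
    samples = []
    for voice_seg, lip_seg in matched_pairs:
        _, confidence = lip_model.predict(lip_seg["frames"])  # assumed: returns (text, confidence)
        if confidence < conf_threshold:                       # "least certain" carries the most information
            samples.append({"clip": lip_seg, "label": voice_seg["asr_text"]})
    return samples
```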
As a possible implementation manner of the first aspect, the lip motion video segment includes a lip motion image sequence, and an image frame in the lip motion image sequence is a lip region image.
A second aspect of the present application provides a model training method applied to an electronic device, including: and updating parameters of the lip language identification model by using the lip language identification sample obtained by the sample labeling method of the first aspect. Therefore, automatic labeling and model optimization are carried out by using the voice modality to assist the visual modality, self-supervision, individuation and active learning of the lip language recognition model are achieved, the recognition accuracy and the individuation degree of the lip language recognition model are improved, user participation is not needed in the whole process, and user experience is improved.
As a possible implementation manner of the second aspect, the lip language recognition model includes a generic feature layer and a trainable layer, and the parameters of the lip language recognition model include trainable layer parameters and generic feature layer parameters; updating parameters of the lip language identification model, specifically: and updating trainable layer parameters of the lip language recognition model. Therefore, by updating the trainable layer parameters of the lip language recognition model for each user, the training efficiency is higher, the data volume of the lip language recognition model parameters of each user is relatively less, and the hardware resources are saved.
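A minimal PyTorch sketch of this split is given below, assuming the model exposes a shared feature_layers submodule and a per-user trainable_head submodule and that the head is trained as a classifier; the attribute names and loss function are assumptions for illustration.

```python
import torch
from torch import nn, optim

def update_trainable_layers(model, loader, lr=1e-4, epochs=1, device="cpu"):
    """Fine-tune only the per-user head; the generic feature layers stay frozen."""
    for p in model.feature_layers.parameters():        # assumed attribute: shared (generic) feature layers
        p.requires_grad = False
    optimizer = optim.Adam(model.trainable_head.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()                  # assumed: word-level classification head
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(clips.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
    # Only the small head needs to be stored per user.
    return {k: v.cpu() for k, v in model.trainable_head.state_dict().items()}
```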
As a possible implementation manner of the second aspect, the model training method further includes: and associating the trainable layer parameters with preset information of the user and then storing. The trainable layer parameters of the lip language recognition model are associated with the preset information of the user, so that the lip language recognition model parameters of the user can be conveniently and quickly found through the preset information of the user.
As a possible implementation manner of the second aspect, the model training method further includes: storing preset information of a user in a registered information database; and storing the trainable layer parameters related to the preset information in a lip language model library. Therefore, whether the lip language identification model parameters of the user exist can be conveniently confirmed through the preset information of the user, and the lip language identification model parameters of the user can be quickly found through the preset information of the user.
As a possible implementation manner of the second aspect, before updating the parameters of the lip language recognition model, the method further includes: adjusting the parameter updating rate of the lip language identification model by comparing the lip language identification text of the lip language identification sample with the label of the lip language identification sample to obtain the parameter updating rate of the corresponding lip language identification sample; the lip language identification text is obtained by performing lip language identification on a lip language identification sample through a lip language identification model; updating parameters of the lip language identification model specifically comprises the following steps: and updating the parameters of the lip language identification model by using the lip language identification samples and the parameter updating rates of the corresponding lip language identification samples. Therefore, the optimization efficiency of the lip language recognition model can be improved, and the hardware resource consumption is reduced.
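One plausible way to realize such a per-sample update rate (an assumption for illustration, not a formula prescribed by the application) is to scale a base learning rate by the character error rate between the model's current lip language recognition text and the sample's label:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def per_sample_update_rate(pred_text, label_text, base_lr=1e-4, max_scale=5.0):
    """The worse the current prediction, the larger the update applied for that sample."""
    cer = edit_distance(pred_text, label_text) / max(len(label_text), 1)
    return base_lr * min(1.0 + cer * (max_scale - 1.0), max_scale)
```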
A third aspect of the present application provides a lip language identification method applied to an electronic device, including: when a user speaks towards the electronic equipment is detected, acquiring a video of the user; extracting lip movement video clips in the video; and operating the lip language recognition model based on the parameters of the lip language recognition model obtained by the model training method in the second aspect, and performing lip language recognition on the lip movement video clip to obtain a lip language recognition text. Therefore, the lip language identification of various people can be efficiently completed through the electronic equipment, and the lip language identification accuracy of various people is improved.
As a possible implementation manner of the third aspect, the lip language recognition model includes a generic feature layer and a trainable layer, and the parameters of the lip language recognition model include trainable layer parameters and generic feature layer parameters; the method for carrying out the lip language recognition on the lip motion video segment by operating the lip language recognition model based on the parameters of the lip language recognition model obtained by the model training method comprises the following steps: acquiring preset information of a user; acquiring trainable layer parameters associated with preset information; and loading trainable layer parameters and pre-configured general characteristic layer parameters to operate a lip language identification model to carry out lip language identification on the lip motion video segment. Therefore, trainable layer parameters of the speaker can be quickly found through preset information of the speaker, and a lip language recognition model is operated by utilizing the trainable layer parameters and general characteristic layer parameters shared by various people to recognize the lip language of the speaker, which is equivalent to the lip language recognition of a user by using a customized lip language recognition model of the user, so that the recognition accuracy rate of the lip language in a vertical domain aiming at individuals is effectively improved.
As a possible implementation manner of the third aspect, the preset information includes a face ID; acquiring preset information of a user, specifically comprising: and carrying out face recognition on the video to obtain face feature data of the user, and inquiring a face ID corresponding to the face feature data from a registered face database. Therefore, the preset information of the user can be obtained through the video, the processing efficiency of lip language identification is improved, and the accuracy of individual lip language identification is further improved.
As a possible implementation manner of the third aspect, the operating the lip language recognition model based on the parameters of the lip language recognition model obtained by the model training method to perform lip language recognition on the lip movement video clip further includes: and loading the trainable layer parameters and the universal characteristic layer parameters which are stored locally when the trainable layer parameters related to the preset information do not exist so as to operate the lip language identification model to carry out lip language identification on the lip motion video segments. Therefore, when trainable layer parameters of a user do not exist, the lip language recognition can be completed by using the universal lip language recognition model parameters, which is equivalent to performing the lip language recognition on the user by using a universal lip language recognition model, so that the lip language recognition of various people can be efficiently completed through the electronic equipment.
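The lookup-and-fallback logic of the third aspect can be pictured as follows; lip_model_db as a {face_id: state_dict} store and the attribute names follow the assumptions of the earlier training sketch.

```python
import torch

def load_user_parameters(model, face_id, lip_model_db, generic_head_path):
    """Load the head trained for this face ID if one exists, else fall back to the generic head."""
    head_state = lip_model_db.get(face_id)               # per-user trainable layer parameters, if any
    if head_state is None:
        head_state = torch.load(generic_head_path)       # locally stored, pre-configured generic parameters
    model.trainable_head.load_state_dict(head_state)
    return model
```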
The fourth aspect of the present application provides a lip language identification device, which is applied to an electronic device, and the lip language identification device includes:
the video acquisition unit is configured to acquire a video of a user in the process that the user speaks towards the electronic equipment;
the audio acquisition unit is configured to acquire the audio of a user in the process that the user speaks towards the electronic equipment;
the lip motion extraction unit is configured to extract a lip motion video segment in the video;
a voice extraction unit configured to extract a voice segment in the audio;
a selection unit configured to select a lip motion video segment matching the voice segment;
and the marking unit is configured to mark the lip movement video clip by taking the voice recognition text of the voice clip as a label to obtain a lip language recognition sample of the user.
As a possible implementation manner of the fourth aspect, the lip language recognition apparatus further includes one or more of the following:
a noise detection unit configured to detect a volume of the environmental noise;
the awakening voice confidence coefficient acquisition unit is configured to acquire the awakening voice confidence coefficient of the user;
the human face detection unit is configured to detect whether a human face or a human mouth is included in the visual field range of the camera;
the positioning unit is configured to acquire the position of a speaker in the video and the sound source positioning direction of the audio;
the video acquisition unit is specifically configured to: acquire the video of the user when the volume of the environmental noise is equal to or smaller than a preset noise threshold, the wake-up voice confidence is greater than or equal to a preset first confidence threshold, the field of view of the camera contains a human face or a human mouth, and/or the speaker position in the video matches the sound source localization direction of the audio; and/or,
the audio acquisition unit is specifically configured to: acquire the audio of the user when the volume of the environmental noise is equal to or smaller than a preset noise threshold, the wake-up voice confidence is greater than or equal to a preset first confidence threshold, the field of view of the camera contains a human face or a human mouth, and/or the speaker position in the video matches the sound source localization direction of the audio.
As a possible implementation manner of the fourth aspect, the lip motion extracting unit is specifically configured to: carrying out endpoint detection and segmentation on the video in a lip movement human voice interval detection mode to obtain a lip movement video segment and a human voice interval of the lip movement video segment; and/or the voice extraction unit is specifically configured to perform endpoint detection and segmentation on the audio frequency in a voice interval detection mode to obtain a voice segment and a voice interval of the voice segment.
As a possible implementation manner of the fourth aspect, the selecting unit is specifically configured to: determining the overlapping length of the voice segment and the lip movement video segment in the time dimension according to the voice interval of the voice segment and the voice interval of the lip movement video segment; when the overlapping length of the voice segment and the lip movement video segment in the time dimension is larger than or equal to a preset time threshold value, the voice segment and the lip movement video segment are matched.
As a possible implementation manner of the fourth aspect, the selecting unit is further configured to: and selecting a lip motion video segment with a lip language recognition confidence coefficient smaller than a preset second confidence coefficient threshold value from the lip motion video segments matched with the voice segments, wherein the lip language recognition confidence coefficient is obtained by performing lip language recognition on the lip motion video segment according to a pre-obtained lip language recognition model.
As a possible implementation manner of the fourth aspect, the lip motion video segment includes a lip motion image sequence, and the image frames in the lip motion image sequence are lip region images.
As a possible implementation manner of the fourth aspect, the lip language recognition apparatus further includes: and the parameter updating unit is configured to update the parameters of the lip language identification model by using the lip language identification sample obtained by the labeling unit.
As a possible implementation manner of the fourth aspect, the lip language identification model includes a generic feature layer and a trainable layer, and the parameters of the lip language identification model include trainable layer parameters and generic feature layer parameters; the parameter updating unit is specifically configured to: and updating trainable layer parameters of the lip language recognition model.
As a possible implementation manner of the fourth aspect, the lip language recognition apparatus further includes: and the storage unit is configured to store the trainable layer parameters after being associated with the preset information of the user.
As a possible implementation manner of the fourth aspect, the storage unit is specifically configured to: storing preset information of a user in a registered information database; and storing the trainable layer parameters related to the preset information in a lip language model library.
As a possible implementation manner of the fourth aspect, the parameter updating unit is specifically configured to adjust the parameter updating rate of the lip language identification model by comparing the lip language identification text of the lip language identification sample with the tag of the lip language identification sample, so as to obtain the parameter updating rate of the corresponding lip language identification sample; updating parameters of the lip language identification model by using the lip language identification samples and the parameter updating rates of the corresponding lip language identification samples; the lip language identification text is obtained by performing lip language identification on a lip language identification sample through a lip language identification model.
As a possible implementation manner of the fourth aspect, the video obtaining unit is further configured to obtain a video of the user when detecting that the user speaks towards the electronic device; the lip language recognition device further includes: and the lip language identification unit is configured to operate the lip language identification model according to the parameters of the lip language identification model obtained by updating the parameter updating unit so as to perform lip language identification on the lip movement video clip and obtain a lip language identification text.
As a possible implementation manner of the fourth aspect, the lip language identification model includes a generic feature layer and a trainable layer, and the parameters of the lip language identification model include trainable layer parameters and generic feature layer parameters; the lip language recognition device further includes: the preset information acquisition unit is configured to acquire preset information of a user; the lip language identification unit is specifically configured to: acquiring preset information of a user; and acquiring trainable layer parameters related to the preset information, and loading the trainable layer parameters and the pre-configured general feature layer parameters so as to operate a lip language identification model to carry out lip language identification on the lip motion video segment.
As a possible implementation manner of the fourth aspect, the preset information includes a face ID; the preset information acquisition unit is specifically configured to: the method comprises the steps of carrying out face recognition on image frames in a video to obtain face feature data of a user, and inquiring face ID corresponding to the face feature data from a registered face database.
A fifth aspect of the present application provides an electronic device, comprising: a processor; and a memory storing a computer program which, when executed by the processor, causes the processor to perform the sample annotation method of the first aspect, the model training method of the second aspect and/or the lip language recognition method of the third aspect.
A sixth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions, which, when executed by a computer, cause the computer to perform the sample labeling method of the first aspect, the model training method of the second aspect, and/or the lip language recognition method of the third aspect.
The embodiment of the application captures the video and the audio of the user during speaking at the same time, and the voice recognition result of the voice fragment in the audio is used as the label of the corresponding lip language video fragment in the video, so that the lip language recognition sample of the user is automatically obtained, the lip language recognition model is optimized or trained through the lip language recognition sample of the user, the customization of the lip language recognition model is realized, the lip language recognition is carried out on the user through the lip language recognition model, and finally, the accuracy rate of the lip language recognition of a specific user or a specific scene is effectively improved.
Drawings
The individual features and the connections between the individual features of the present application are further explained below with reference to the drawings. The figures are exemplary, some features are not shown to scale, and some of the figures may omit features that are conventional in the art to which the application relates and are not essential to the application, or show additional features that are not essential to the application, and the combination of features shown in the figures is not intended to limit the application. In addition, the same reference numerals are used throughout the specification to designate the same components. The specific drawings are illustrated as follows:
fig. 1 is a schematic flow chart of a sample annotation method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of an embodiment of the present application in which a lip motion video segment and a voice segment overlap in a time dimension.
Fig. 3 is a schematic flowchart of a model training method according to an embodiment of the present application.
Fig. 4 is a schematic flowchart of a lip language identification method according to an embodiment of the present application.
Fig. 5 is a schematic flowchart of a lip language identification apparatus according to an embodiment of the present application.
Fig. 6 is a schematic diagram of an implementation process of sample labeling and model training in an exemplary application scenario according to an embodiment of the present application.
FIG. 7 is a flowchart illustrating exemplary embodiments of lip language recognition, model training, and sample labeling according to an embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Fig. 9 is an exemplary structural schematic diagram of an electronic device provided in an embodiment of the present application.
Fig. 10 is a schematic diagram of an exemplary software architecture of an electronic device provided in an embodiment of the present application.
Detailed Description
The terms "first, second, third and the like" or "module a, module B, module C and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, it being understood that specific orders or sequences may be interchanged where permissible to effect embodiments of the present application in other than those illustrated or described herein.
In the following description, reference numerals indicating steps, such as S110, S120, etc., do not necessarily mean that the steps are performed in that order; where permitted, the order of the steps may be interchanged, or the steps may be performed simultaneously.
The term "comprising" as used in the specification and claims should not be construed as being limited to the contents listed thereafter; it does not exclude other elements or steps. It should therefore be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, and groups thereof. Thus, the expression "an apparatus comprising the devices a and B" should not be limited to an apparatus consisting of only the components a and B.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may be. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments, as would be apparent to one of ordinary skill in the art from this disclosure.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. In the case of inconsistency, the meaning described in the present specification or the meaning derived from the content described in the present specification shall control. In addition, the terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
To accurately describe the technical contents in the present application and to accurately understand the present application, the terms used in the present specification are given the following explanations or definitions before the description of the specific embodiments.
Automatic Speech Recognition (ASR) model: a machine learning model capable of recognizing speech as text.
Voice Activity Detection (VAD) model: a model that identifies the starting point and end point of human speech using an endpoint detection algorithm and then divides the audio or video into several continuous segments, so that human-voice segments can be distinguished from non-human-voice segments. Put simply, endpoint detection accurately locates the starting point and end point of speech in noisy audio or video and finds the audio or video segments that actually contain spoken content.
Sound source localization technique: a technique that uses a sound source localization algorithm to determine the direction or position of an object (e.g., a speaker). In the embodiments of the application, the sound source localization technique uses a microphone-array-based algorithm, which may be based on beamforming, on high-resolution spectral estimation, or on the time delay of arrival of the sound.
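As a concrete illustration of the time-delay-of-arrival family mentioned above (a standard GCC-PHAT sketch, not necessarily the algorithm used in the application), the direction of a speaker can be estimated from the delay between two microphones:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Time delay of arrival between two microphone signals via GCC-PHAT."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)       # PHAT weighting
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(fs)

def doa_degrees(tau, mic_distance, c=343.0):
    """Azimuth of the source relative to the microphone pair's broadside."""
    return float(np.degrees(np.arcsin(np.clip(tau * c / mic_distance, -1.0, 1.0))))
```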
At present, lip language recognition models are generally implemented with supervised deep learning networks. To adapt a lip language recognition model to the needs of a specific scenario or the personal habits of a specific user, labeled samples often have to be collected for that scenario or user to fine-tune the model, and only with a large number of labeled samples can the model's performance be effectively improved and the problem of low lip language recognition accuracy be addressed. However, samples for lip language recognition models are usually collected and labeled manually; for a specific scenario or user, sample collection is difficult, labeling is time-consuming, and labeling accuracy is low, so the model is hard to optimize, its performance is hard to improve, and the problem of low lip language recognition accuracy for a specific scenario or user is hard to solve. In addition, because labeled samples usually carry information related to user privacy, collecting a large number of labeled samples in advance is not conducive to protecting user privacy.
In view of this, the present application provides a lip language identification method, a sample labeling method, a model training method, a device, an electronic device, and a computer-readable storage medium. In the present application, the video and audio of a speaking user are captured simultaneously, and the speech recognition result of a voice clip in the audio is used as the label of the corresponding lip movement video clip in the video, so that lip language recognition samples of the user are obtained automatically. The lip language recognition model is then optimized or trained with these samples, realizing user-specific customization of the model, and lip language recognition is performed for the user with the customized model, which ultimately and effectively improves the lip language recognition accuracy for a specific user or scenario.
According to the embodiments of the application, the labeling of lip language recognition samples is completed automatically, with high efficiency and accuracy and without user participation; specific lip language recognition samples can be generated for different scenarios and different users, realizing personalized sample labeling. Moreover, the whole process is imperceptible to the user: no prior calibration of the user, no prompting with questions, and no expert labeling are required, which makes the approach user-friendly and effectively improves the user experience. Through automatic sample labeling, self-supervised active learning and personalization of the lip language recognition model are realized, which effectively improves the customization of the model and its recognition accuracy for a specific user or scenario, and in turn the lip language recognition accuracy across scenarios and across people.
The embodiment of the application is suitable for various scenes needing lip language identification. Specifically, the embodiment of the application can be applied to various scenes in which people speak towards the electronic equipment. For example, waking up the electronic device by speaking, talking on the electronic device, playing an interactive game using the electronic device, doing housework or other daily activities using the electronic device, controlling the electronic device by speaking to the electronic device (e.g., playing media content, etc.), or other similar scenarios.
The embodiments of the present application are applicable to various electronic devices, and specific details of the electronic devices may be referred to in the following description, which is not repeated herein.
The method can be independently realized through the electronic equipment, and can also be realized through a system comprising the electronic equipment and the cloud server. In some embodiments, the sample marking, the model training and the lip language recognition can be completed through the electronic equipment, so that the lip language recognition model training and the lip language recognition based on the lip language recognition model can be performed without uploading the lip language recognition sample of the user, the data can be ensured not to be outgoing, and the user privacy can be protected. In some embodiments, the sample marking and the lip language recognition can be completed through the electronic device, the electronic device can provide the locally obtained lip language recognition sample for the cloud server, and the cloud server updates parameters of the lip language recognition model and issues the parameters of the lip language recognition model to the electronic device. Therefore, processing that the computation complexity is possibly relatively high and the data volume is possibly large, such as model training, can be realized by the cloud server with high computing capacity, so that the processing efficiency is improved, and the resource consumption of the electronic equipment is reduced.
The following describes in detail specific embodiments of examples of the present application.
Fig. 1 shows a schematic flow chart of a sample annotation method provided in an embodiment of the present application. The sample annotation method of the embodiment of the present application can be performed by an electronic device, and for technical details of the electronic device, reference may be made to the following description. Referring to fig. 1, a sample labeling method according to an embodiment of the present application may include the following steps:
step S110, acquiring the video and audio of the user in the process that the user speaks towards the electronic equipment.
The video can be collected by a camera of the electronic device. To avoid missing key frames in the captured video, the user's mouth or entire face needs to be within the field of view of the camera. In some embodiments, in order to also acquire preset information of the user from the video, it is preferable that the entire face of the user be within the field of view of the camera. When the user speaks towards the electronic device and the electronic device detects that the user is speaking, it starts to collect video of the user; the video includes the user's mouth movements while speaking, and if no lip movement is detected, the collection of video and audio can be stopped. Here, whether the user is speaking can be detected by performing lip movement feature detection on images of the user's mouth or face collected by the camera.
The audio may be collected by a microphone of the electronic device. When a user speaks into the electronic device, the electronic device detects that the user speaks, namely, the microphone is controlled to collect audio of the user, wherein the audio at least comprises voice of the user speaking. The audio capture may be stopped if no lip motion is detected.
The specific format, length, etc. of the video and audio, the embodiments of the present application are not limited. The audio may be short audio or long audio, for example, the audio may be 30s audio, audio containing a sentence, ten minutes audio, audio containing a paragraph of a word, or audio containing an article. The video may be a short video or a long video, for example, the video may be a 30s video, a video containing all mouth movements of a sentence, a dozen minutes video, a video containing a paragraph, or a video containing an article.
For example, for a scenario of "wake up by speaking" or "control of an electronic device by speaking", the audio may be an audio comprising a wake up word or a voice command, and the video may comprise a sequence of mouth moving images during the speaking of the wake up word or a sequence of mouth moving images of the speaking voice command. For the "talk using electronic device" scenario, the audio may comprise the complete voice of a call and the video may comprise video of the mouth movement during a call.
In some embodiments, the video and audio collected in this step may be of a fixed length. Here, the fixed length of the audio and video may be flexibly set according to various factors such as a specific application scenario, a processing capability of the electronic device, a limitation of the VAD model, a limitation of the ASR model, and a user requirement. Therefore, after the video with a certain length and the audio with a corresponding length are collected, the subsequent steps are skipped to process, and thus, the lip language identification sample can be generated in real time while the audio and the video are collected in real time in the speaking process of the user. In addition, the method is convenient for parallel processing of multiple sections of videos and multiple sections of audios, and is beneficial to improving the processing efficiency, saving the time and improving the hardware resource utilization rate of the electronic equipment.
In the case that the quality of the speech uttered by the user is poor (e.g., the voice is too quiet or slurred), the recognition accuracy of the voice segment is affected, the speech recognition result is not accurate enough, and the accuracy of the labels in the lip language recognition samples may decrease. To avoid this problem, before step S120 or step S110, the method may further include: acquiring the wake-up voice confidence of the user. The wake-up voice confidence can then be compared with a preset first confidence threshold, and when it is greater than or equal to the first confidence threshold, step S110 is executed to acquire the video and audio of the user. If the wake-up voice confidence is smaller than the first confidence threshold, indicating that the quality of the user's current speech is poor, the labeling process of the lip language recognition sample may be skipped: step S110 is not executed, or the related video and audio are discarded directly without performing step S120 and subsequent processing. In this way, low label accuracy caused by poor voice quality is avoided with a confidence value and a threshold, which effectively improves the accuracy of sample labeling, is easy to implement, and can be adjusted flexibly according to the actual situation.
The wake-up voice refers to a voice uttered by a user for waking up the electronic device. The wake speech confidence may be obtained by speech recognition of the wake speech. When the electronic device is awakened by the awakening voice, the electronic device needs to perform voice recognition on the awakening voice to obtain the content of the awakening voice and the confidence level of the awakening voice. Generally, when the confidence of the wake-up voice is greater than or equal to a preset wake-up threshold and the content of the wake-up voice includes a preset wake-up word or wake-up sentence, the electronic device is woken up.
Generally, the higher the confidence of the speech, the better the speech quality. In some embodiments, the first confidence threshold may be a value greater than the wake-up threshold, so that the sample labeling according to the embodiment of the present application may be performed when the voice quality of the user is better, so as to obtain a lip language identification sample with good quality and higher tag accuracy. For example, the wake confidence threshold is usually set to a value of 0.5 or lower, and the first confidence threshold of the embodiment of the present application may be set to a higher value such as 0.7, 0.8, 0.85, etc. In specific application, a specific value of the first confidence threshold can be flexibly set according to one or more factors such as a specific application scene, the accuracy of voice endpoint detection, the accuracy of voice recognition, user requirements and the like. The embodiment of the present application is not limited to a specific configuration manner and determination manner of the first confidence threshold.
The audio may include environmental noise, and in order to avoid the environmental noise from affecting the recognition accuracy of the speech segment, and further reduce the accuracy of the sample label, before step S120 or step S110, the method may further include: the volume of the ambient noise is detected. In this way, the volume of the environmental noise may be compared with a preset noise threshold, and when the environmental noise is less than or equal to the noise threshold, step S110 is executed to obtain the video and audio of the user. If the environmental noise is greater than the noise threshold, which indicates that the environmental noise may interfere with the speech recognition of the speech segment, the labeling process of the lip language recognition sample may not be performed, that is, step S110 may not be performed, or the video and the audio related thereto may be directly discarded, and the processes in step S120 and the subsequent steps may not be performed. Therefore, the influence of the environmental noise on the voice recognition result of the voice fragment can be avoided by measuring the volume of the environmental noise and setting a threshold value, so that the accuracy of the label in the lip language recognition sample is improved, the method is easy to realize, and flexible adjustment according to actual conditions is facilitated.
The environmental noise can be detected by a decibel meter built in the electronic equipment or an external volume detector. In some embodiments, the ambient noise may be detected a predetermined time period (e.g., 1 second, 2 seconds, 0.5 seconds) before step S110, before step S120, or after step S120, such that the detected ambient noise volume is closest to the ambient noise volume when the user speaks, with greater accuracy.
In practical applications, the noise threshold of the environmental noise may be flexibly set according to one or more factors such as a specific application scenario, the accuracy of the voice endpoint detection, the accuracy of the voice recognition in step S140, and user requirements. In some embodiments, the noise threshold of the environmental noise may be a value obtained by analyzing and counting the speech recognition result of the speech segment. In some embodiments, the noise threshold may be a dynamic value, or a fixed value set artificially. The embodiment of the present application is not limited to a specific configuration and determination manner of the noise threshold.
In order to avoid missing key frames (for example, image frames of a certain key lip motion during the user speaking) in the video and ensure that the video contains the user face or the user mouth, before step S110 or step S120, the method may further include: and detecting whether the visual field range of the camera contains a human face or a human mouth or not, wherein the camera is used for collecting the video in the step S110. If the face or the mouth is not in the visual field of the camera, the loss of the key frame is likely to be caused, at this time, the sample labeling in the embodiment of the present application may not be executed, and if the face or the mouth is in the visual field of the camera, the sample labeling in the embodiment of the present application may be continued, and step S110 and the subsequent steps thereof are executed.
Whether the field of view of the camera contains a human face or a human mouth can be detected in various applicable ways. In some embodiments, it may be detected whether the image acquired by the camera contains human facial features or human mouth features; if it contains facial features, the field of view of the camera contains a human face, and if it contains mouth features, the field of view of the camera contains a human mouth. Here, the detection of facial features and mouth features may be implemented with a general face feature recognition algorithm, a mouth detection algorithm, or a pre-trained neural network model.
Alternatively, in order to facilitate acquisition of preset information (for example, a face ID hereinafter) of the user through the video in step S110, a face may be included in the visual field of the camera.
To ensure that the speaker in the video and the sound source of the audio are the same in step S110, before step S110 or step S120, the method may further include: and acquiring the position of a speaker in the video and the sound source positioning direction of the audio. Here, the sound source localization direction may indicate a position of the speaker relative to the electronic device (relative to a position of a microphone in the electronic device), and the speaker position may indicate a position of the speaker relative to the electronic device (e.g., relative to a position of a camera in the electronic device). Therefore, the speaker position can be compared with the sound source localization direction, and if the speaker position in the video matches the sound source localization direction of the audio, it indicates that the speaker and the audio in the video are the same in sound source, at this time, the sample labeling in the embodiment of the present application can be continued, that is, step S110 can be executed to obtain the video and the audio of the user. If the speaker position in the video and the sound source localization direction of the audio are not matched, it is indicated that the speaker and the sound source of the audio in the video are different, at this time, the sample labeling in the embodiment of the present application may not be performed, that is, step S110 may not be performed, or the video and the audio related thereto may be directly discarded, and step S120 and the subsequent processing are not performed. In the embodiment of the application, the accuracy of lip language identification samples can be improved by labeling the samples on the premise that the speaker in the video is consistent with the audio sound source.
The sound source localization direction of audio can be obtained by a sound source localization technique. The speaker position in the video may be, but is not limited to, a face position or a mouth position in the video, and may be obtained by performing face recognition on the video, such as a human body feature detection algorithm, an image recognition algorithm, and the like.
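A simple way to compare the two, assuming for illustration that the camera and the microphone array are co-located and share the same reference axis, is to convert the face position in the image into a horizontal angle and check it against the audio DOA within a tolerance; the field-of-view and tolerance values below are assumptions.

```python
def face_azimuth_degrees(face_center_x, frame_width, horizontal_fov_degrees=78.0):
    """Approximate horizontal angle of the detected face relative to the camera axis."""
    offset = (face_center_x - frame_width / 2.0) / (frame_width / 2.0)
    return offset * (horizontal_fov_degrees / 2.0)

def speaker_matches_sound_source(face_center_x, frame_width, doa_deg, tolerance_deg=15.0):
    """True if the speaker position in the video agrees with the sound source localization direction."""
    return abs(face_azimuth_degrees(face_center_x, frame_width) - doa_deg) <= tolerance_deg
```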
And step S120, extracting lip movement video segments in the video and voice segments in the audio.
Before step S120, the method may further include: and cutting the video. Specifically, the cutting may include: detecting and extracting lip regions of image frames in the video within the mouth movement period to form one or more mouth movement videos. In practice, the preprocessing can be implemented by various applicable methods, such as clustering, neural network-based feature extraction model, and the like. Therefore, the size of the image frame in the video can be reduced, the calculation complexity and the data volume are reduced, the processing efficiency is improved, and the hardware resource is saved.
Specifically, a mouth motion video of each person can be formed by performing lip motion feature detection on each image frame in the video, extracting a lip region of each image frame in a mouth motion period in units of each face region position based on a result of the lip motion feature detection. For example, the image frame size of the originally captured video may be 1980 × 1024, and the image frame size of the mouth motion video obtained after the cutting may be 112 × 112, so that the image data is greatly reduced.
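One possible way to perform such cutting is sketched below using dlib's 68-point facial landmarks and OpenCV; the landmark model file, the margin, and the output size are assumptions for illustration and do not limit the application.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # model file assumed to be available

def crop_lip_region(frame, out_size=112, margin=0.4):
    """Return a square lip-region crop of the first detected face, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)], dtype=np.int32)  # mouth landmarks
    x, y, w, h = cv2.boundingRect(pts)
    pad = int(margin * max(w, h))
    x0, y0 = max(x - pad, 0), max(y - pad, 0)
    crop = frame[y0:y + h + pad, x0:x + w + pad]
    return cv2.resize(crop, (out_size, out_size))
```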
In step S120, endpoint detection and segmentation may be performed on the video (for example, the cut mouth motion video) in a lip motion vocal interval detection manner, so as to obtain a lip motion video segment and a vocal interval of the lip motion video segment. Specifically, the endpoint detection and segmentation of the mouth motion video obtained by the segmentation can be performed through a lip motion VAD model or VAD algorithm, so as to obtain the lip motion video segment and VAD values of the lip motion video segment (i.e. voice interval of the lip motion video segment). Here, the VAD value of the lip motion video segment may indicate a start point and an end point of the lip motion video segment, and the image frame in the lip motion video segment is a lip region image.
In step S120, the audio may be subjected to endpoint detection and segmentation by a voice interval detection method, so as to obtain a voice segment and a voice interval of the voice segment. Specifically, the audio may be end-point detected and segmented by a voice VAD model or VAD algorithm to obtain a voice segment and a VAD value of the voice segment (i.e., a voice interval of the voice segment). Here, the VAD value of the voice segment may indicate a start point and an end point of the voice segment.
In the embodiment of the present application, the lip motion VAD model or algorithm used for endpoint detection and segmentation of the video, and the voice VAD model or algorithm used for endpoint detection and segmentation of the audio, may adopt an endpoint detection method based on short-time energy and zero-crossing rate, a method of classifying voice and non-voice based on a neural network model, or any other applicable endpoint detection method. The present application is not limited thereto.
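For illustration, the following is a minimal sketch of the short-time energy and zero-crossing-rate style of endpoint detection mentioned above, assuming a mono waveform given as a NumPy array; the frame length and both thresholds are illustrative and would have to be tuned for a real microphone and environment.

```python
import numpy as np

def energy_zcr_vad(signal: np.ndarray, sr: int, frame_ms: int = 25,
                   energy_thresh: float = 1e-4, zcr_thresh: float = 0.25):
    """Mark a frame as voiced when its short-time energy is high enough and its
    zero-crossing rate is low enough (speech tends to have lower ZCR than noise)."""
    frame_len = int(sr * frame_ms / 1000)
    voiced = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        voiced.append(energy > energy_thresh and zcr < zcr_thresh)
    return voiced

def frames_to_intervals(voiced, frame_ms: int = 25):
    """Merge runs of voiced frames into (start_s, end_s) intervals, i.e. the
    VAD values (start point and end point) of each segment."""
    intervals, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            intervals.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        intervals.append((start * frame_ms / 1000, len(voiced) * frame_ms / 1000))
    return intervals
```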
In the embodiment of the present application, the lengths of the lip motion video segment and the voice segment can be determined by the VAD model or VAD algorithm used. For example, the lip movement video segment may include, but is not limited to, a lip movement image sequence of a sentence or a lip movement image sequence of a word, and the voice segment may include, but is not limited to, a voice segment of a sentence or a voice of a word. The lip motion video segment and the voice segment may be the same length or different lengths. Here, the image frame in the lip moving image sequence is a lip region image.
In step S120, the extraction of the lip movement video segment and the extraction of the voice segment may be performed synchronously, or may be performed according to a certain sequence. In step S120, using the VAD algorithm, a voice interval value (VAD value) may be obtained while implementing the segmentation of the video and the audio, so as to find out the lip movement video segments and the voice segments corresponding to the same utterance (e.g., the same sentence or the same word spoken by the user), thereby determining the matching relationship between each voice segment and each lip movement video segment.
Step S130, selecting the lip movement video segment matched with the voice segment.
In some embodiments, lip movement video segments matched with each voice segment can be determined according to the voice interval of the lip movement video segment and the voice interval of the voice segment, and the lip movement video segments matched with each voice segment can be used as candidate lip movement video segments.
Generally, as long as there is a certain degree of overlap in the time dimension, the lip movement video segment and the voice segment can be considered synchronous, that is, they correspond to the same sentence or word spoken by the same speaker. Therefore, the lip movement video segments matching the voice segments can be found through the degree of overlap in the time dimension.
In some embodiments, step S130 may include: determining, according to the voice interval (e.g., VAD value) of the lip movement video segment and the voice interval (e.g., VAD value) of the voice segment, the overlap length of the lip movement video segment and the voice segment in the time dimension; when the overlap length is greater than or equal to a preset duration threshold, the lip movement video segment matches the voice segment; when the overlap length is less than the duration threshold, the lip movement video segment does not match the voice segment. In other words, a lip movement video segment matching a voice segment satisfies: the overlap length in the time dimension is greater than or equal to the preset duration threshold. In this way, the lip movement video segment matching each voice segment, that is, each lip language video segment and its synchronous voice segment, can be found efficiently and accurately.
The specific value of the duration threshold can be set according to one or more factors such as the specific application scenario, the precision of the VAD model or algorithm, the length of the voice segment, and user requirements. In some embodiments, the duration threshold may be a predetermined proportion of the length of the voice segment. For example, if the length of the voice segment is 20 s and the predetermined proportion is 80%, the duration threshold is 16 s. In practical applications, the length of the voice segment changes dynamically, so determining the duration threshold in real time from the predetermined proportion and the voice segment length allows the matching lip movement video segment to be found more efficiently and accurately, and improves the accuracy of labeling lip language recognition samples.
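A minimal sketch of this matching rule, assuming each voice interval and lip-motion interval is given as a (start, end) pair in seconds, and using the 80% proportion from the example above as the assumed predetermined ratio:

```python
def overlap_length(voice_iv, lip_iv):
    """Length of the time-dimension overlap of a voice interval [x1, x2] and a
    lip movement interval [y1, y2]; 0.0 when they do not overlap."""
    (x1, x2), (y1, y2) = voice_iv, lip_iv
    return max(0.0, min(x2, y2) - max(x1, y1))

def segments_match(voice_iv, lip_iv, ratio=0.8):
    """The lip movement segment matches the voice segment when the overlap is at
    least `ratio` of the voice-segment length (the dynamic duration threshold)."""
    duration_threshold = ratio * (voice_iv[1] - voice_iv[0])
    return overlap_length(voice_iv, lip_iv) >= duration_threshold

# The 20 s voice segment from the example: threshold = 0.8 * 20 s = 16 s.
print(segments_match((0.0, 20.0), (1.0, 19.0)))  # overlap 18 s >= 16 s -> True
```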
For example, fig. 2 shows a schematic diagram of the overlap length of a voice segment and a lip motion video segment in the time dimension t. Referring to fig. 2, assume that the start point y1 and the end point y2 of the lip motion video segment in the time dimension t are obtained through the lip motion VAD model, so that the voice interval value S1 of the lip motion video segment can be represented as [y1, y2], and that the start point x1 and the end point x2 of the voice segment in the time dimension t are obtained through the voice VAD model, so that the voice interval value S2 of the voice segment can be represented as [x1, x2]. When the overlap length S12 = [x1, y2] of S1 and S2 in the time dimension is greater than or equal to the preset duration threshold, the lip motion video segment matches the voice segment, that is, the lip motion video segment can be used as a candidate lip motion video segment.
After step S130, the method may further include: selecting, from the lip motion video segments matched with the voice segments (i.e., from all candidate lip motion video segments obtained in step S130), the lip motion video segments whose lip language recognition confidence is smaller than a preset second confidence threshold, where the lip language recognition confidence is obtained by performing lip language recognition on the lip motion video segment with a pre-obtained lip language recognition model. In this way, the lip motion video segments about which the model is most uncertain, that is, the segments carrying the most information, can be selected as lip language recognition samples; in other words, negative samples can be selected to form a negative sample set, so that the finally obtained lip language recognition samples are the samples that are hardest for the model to distinguish or that contribute most to improving the model, thereby providing supervised samples for subsequent optimization of a specific lip language recognition model, or for transfer learning and iterative optimization of the lip language recognition model of a specific person.
Similarly, after step S130, the method may further include: selecting, from the lip motion video segments matched with the voice segments (i.e., from all the candidate lip motion video segments obtained in step S130), the lip motion video segments whose lip language recognition confidence is greater than a preset third confidence threshold. In this way, positive samples can be selected to form a positive sample set, so that the finally obtained lip language recognition samples are samples beneficial to training the lip language recognition model, thereby providing supervised samples for subsequent optimization of a specific lip language recognition model or training of a new lip language recognition model, or for transfer learning and iterative optimization of the lip language recognition model of a specific person.
Here, the confidence of the lip movement video segment can be obtained by performing lip language recognition on the lip movement video segment of the user through the lip language recognition model parameters associated with the preset information of the user. Or the confidence of the lip movement video segment can be obtained by performing lip language recognition on the lip movement video segment of the user through the general lip language recognition model parameters.
The specific value of the second confidence threshold and/or the third confidence threshold may be set according to one or more factors such as the specific application scenario, the accuracy of the VAD model or VAD algorithm, the length of the voice segment, and the user requirement. For example, the second confidence threshold may be 0.5, 0.4, 0.3, or any other value less than 1. The third confidence threshold may be 0.5, 0.6, 0.7, or any other value less than 1.
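As an illustration of the selection described above, the sketch below splits the candidate lip motion video segments into a negative set (low lip language recognition confidence, most informative) and a positive set (high confidence). Each candidate is assumed to be a dict already carrying the confidence produced by the pre-obtained lip language recognition model, and the 0.4 / 0.7 thresholds are just two of the example values mentioned.

```python
def split_candidates(candidates, second_threshold=0.4, third_threshold=0.7):
    """Return (positive_set, negative_set) of candidate lip motion video segments
    based on their lip language recognition confidence."""
    negatives = [c for c in candidates if c["confidence"] < second_threshold]
    positives = [c for c in candidates if c["confidence"] > third_threshold]
    return positives, negatives
```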
Step S140, labeling the lip movement video segment with the recognition text of the voice segment as a label, to obtain a lip language recognition sample of the user.
Still taking fig. 2 as an example, in step S140, voice recognition is performed on the voice segment whose voice interval value S2 is [x1, x2] to obtain a recognition text, and the recognition text is used as a label to label the lip movement video segment whose voice interval value S1 is [y1, y2], thereby forming a lip language recognition sample.
In or before step S140, speech recognition may be performed on the voice segment through a pre-obtained ASR model to obtain the recognition text of the voice segment, where the recognition text contains the content of the utterance in the voice segment. In step S140, the recognition text of the voice segment is used as a label to label the lip movement video segment matched with the voice segment, so as to obtain a lip language recognition sample of the user. Here, the ASR model may be a neural-network-based ASR model or any other type of ASR model.
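The labeling itself then reduces to attaching the recognition text of each voice segment to its matched lip movement video segment. The sketch below shows this under the assumption of a hypothetical on-device recognizer object exposing `transcribe(waveform) -> text`; the dict keys are illustrative.

```python
def label_samples(matched_pairs, asr):
    """Build lip language recognition samples from (voice_segment, lip_segment)
    pairs: the ASR text of the voice segment becomes the label of the lip segment."""
    samples = []
    for voice_seg, lip_seg in matched_pairs:
        text = asr.transcribe(voice_seg["waveform"])   # hypothetical ASR interface
        samples.append({"frames": lip_seg["frames"],   # lip-region image sequence
                        "label": text})                # speech recognition text as label
    return samples
```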
In practice, speech recognition may be performed by the electronic device itself. Alternatively, the electronic device may upload the voice segment to a cloud server, the cloud server performs voice recognition, semantic analysis and other processing on the voice segment and returns the obtained recognition text to the electronic device, and the electronic device then labels the lip movement video segment. To protect the personal privacy of the user and avoid the risk caused by sending data off the device, voice recognition of the voice segments is preferably completed on the electronic device.
In step S140, the lip movement video segments may be further screened according to the confidence of the voice segments. Specifically, the confidence of the voice segment is compared with a preset third confidence threshold, if the confidence of the voice segment is greater than or equal to the third confidence threshold, the lip movement video segment matched with the voice segment is retained, and if the confidence of the voice segment is less than the third confidence threshold, the lip movement video segment matched with the voice segment is discarded. Therefore, the lip motion video clip corresponding to the voice clip with better voice quality can be selected as the lip language identification sample, and the accuracy of the lip language identification sample can be improved.
In the embodiment of the present application, the electronic device can automatically acquire the video and audio of the speaker, automatically cut out the lip movement video segments of the speaker through technologies such as VAD (voice activity detection) and voice recognition, and label positive and negative samples; that is, the voice modality can be used to assist the visual modality in automatic labeling, providing lip language recognition samples for supervised transfer learning and iterative optimization for the speaker. In addition, the whole process is imperceptible to the user: there is no need to calibrate a specific user in advance, and no need to raise queries for an expert to determine the labels of the data, which is more user-friendly and can effectively improve user experience.
Fig. 3 shows a schematic flowchart of a model training method provided in an embodiment of the present application. The model training method according to the embodiment of the present application may be executed by an electronic device, and as shown in fig. 3, the model training method according to the embodiment of the present application may include the following steps S110 to S140, and step S150.
Step S150, updating the parameters of the lip language recognition model by using the lip language recognition samples obtained in step S140.
The lip language recognition model in the embodiment of the present application may be any type of model. In particular, the lip language recognition model may be, but is not limited to, a neural-network-based lip language recognition model, a sequence-to-sequence lip language recognition model, or a model based on the connectionist temporal classification (CTC) loss. It is understood that any other lip language recognition model applicable to the embodiments of the present application also falls within the scope of the present application and is not listed here.
In one implementation, in step S150, the lip language recognition model may be trained by using the lip language recognition sample obtained in step S140 to determine parameters of the lip language recognition model. In another implementation manner, in step S150, the lip recognition model may be optimized by using the lip recognition sample obtained in step S140, so as to update parameters of the lip recognition model (for example, parameters of a trainable layer thereof), implement user customization of the lip recognition model, and improve recognition accuracy of the lip recognition model for a specific user.
In step S150, the parameters of the lip language recognition model may be updated by a gradient descent method. For example, batch gradient descent, stochastic gradient descent, or mini-batch gradient descent may be adopted, and the parameters of the lip language recognition model are updated using the lip language recognition samples obtained in step S140. Here, the loss function employed may be a cross-entropy loss function or any other type of loss function.
In general, the lip language recognition model may include a generic feature layer and a trainable layer, and the parameters of the lip language recognition model may include trainable layer parameters and generic feature layer parameters. As the names imply, the trainable layer parameters are the parameters of the trainable layer in the lip language recognition model, and the generic feature layer parameters are the parameters of the generic feature layer. Typically, the generic feature layer parameters are fixed, while the trainable layer parameters can be optimized. Therefore, updating the parameters of the lip language recognition model in step S150 may specifically be: updating the trainable layer parameters of the lip language recognition model. In this way, only the trainable layer parameters need to be updated for each user, so parameter optimization is more efficient, the amount of per-user lip language recognition model parameters is relatively small, the size of the lip language recognition model for each user in the electronic device is reduced, and hardware resources are saved.
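A minimal PyTorch-style sketch of this split, assuming a toy model in which the generic feature layer is a frozen fully connected block and the trainable layer is a single classification head; the shapes, vocabulary size, learning rate and the use of PyTorch itself are illustrative assumptions rather than the actual model of this application.

```python
import torch
import torch.nn as nn

class LipReader(nn.Module):
    """Toy stand-in: a frozen generic feature layer plus a per-user trainable layer."""
    def __init__(self, feat_dim: int = 512, vocab: int = 1000):
        super().__init__()
        self.generic = nn.Sequential(nn.Linear(112 * 112, feat_dim), nn.ReLU())
        self.trainable = nn.Linear(feat_dim, vocab)

    def forward(self, x):                    # x: (batch, 112*112) flattened lip frames
        return self.trainable(self.generic(x))

model = LipReader()
for p in model.generic.parameters():         # generic feature layer parameters stay fixed
    p.requires_grad = False

optimizer = torch.optim.SGD(model.trainable.parameters(), lr=1e-3)  # trainable layer only
criterion = nn.CrossEntropyLoss()            # cross-entropy loss as mentioned above

def update_step(frames, labels):
    """One mini-batch gradient descent step on the trainable layer parameters."""
    optimizer.zero_grad()
    loss = criterion(model(frames), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```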
In step S150, the lip language recognition model parameters (e.g., trainable layer parameters) of the user may be obtained according to the preset information of the user, and these parameters may be updated using the lip language recognition samples of that user, so as to obtain lip language recognition model parameters with higher recognition precision for the user and improve the recognition accuracy of the lip language recognition model for the individual. That is, in step S150, the parameters of the lip language recognition model are the lip language recognition model parameters matched with or associated with the preset information of the user. In this way, the lip language recognition model parameters of the user are continuously updated and optimized with the lip language recognition samples of that specific user, and can be adjusted dynamically as the user changes, thereby realizing continuous learning and optimization of the lip language recognition model, continuously improving its recognition accuracy for the specific user, and solving the problem that the lip language recognition model cannot be continuously optimized along with the user.
The preset information of the user may be, but is not limited to, a voiceprint, a face ID, a lip print, or other information, and the preset information may be obtained through a corresponding module in the electronic device. In some embodiments, the preset information may be a face ID, a voiceprint, a lip print, and the like, so that the preset information of the user may be directly obtained from the video or audio in step S110, which not only may reduce the data amount, but also may ensure that the preset information is consistent with the user of the lip language identification sample, and the lip language identification sample is more beneficial to obtaining a lip language identification model with high accuracy for identifying a specific user.
Step S150 may further include: comparing the lip language recognition result of each lip language recognition sample with its label, and adjusting a hyper-parameter of the lip language recognition model to obtain a hyper-parameter for each lip language recognition sample, so that the lip language recognition samples and their corresponding hyper-parameters are used to update the parameters of the lip language recognition model. Here, the hyper-parameter may include, but is not limited to, the parameter update rate (i.e., the learning rate).
The parameter update rate (i.e., the learning rate) is an important hyper-parameter in the optimization of the lip language recognition model. In the gradient descent method, the value of the parameter update rate is critical: if it is too large, the optimization cannot converge; if it is too small, convergence is too slow. Before updating the parameters of the lip language recognition model in step S150, the method may further include: adjusting the parameter update rate of the lip language recognition model by comparing the lip language recognition text of each lip language recognition sample with the label of that sample, so as to obtain the parameter update rate corresponding to the sample; the lip language recognition text is obtained by performing lip language recognition on the lip language recognition sample with the lip language recognition model. In step S150, updating the parameters of the lip language recognition model then specifically includes: updating the parameters of the lip language recognition model by using the lip language recognition samples and the parameter update rates corresponding to the samples. Specifically, when the lip language recognition result of a sample is consistent with the label of the sample, the parameter update rate of the lip language recognition model is reduced to obtain the parameter update rate corresponding to that sample; when the lip language recognition result of a sample is inconsistent with the label of the sample, the parameter update rate of the lip language recognition model is increased to obtain the parameter update rate corresponding to that sample. In this way, the optimization efficiency of the lip language recognition model can be improved, hardware resource consumption can be reduced, and the accuracy of the lip language recognition model for the individual can be further improved.
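For illustration, a sketch of such a per-sample parameter update rate adjustment is given below; the scaling factors and bounds are assumptions, and only the rule (lower the rate when the prediction already agrees with the label, raise it otherwise) comes from the description above.

```python
def adjust_update_rate(predicted_text: str, label_text: str, base_lr: float,
                       down: float = 0.5, up: float = 2.0,
                       min_lr: float = 1e-5, max_lr: float = 1e-2) -> float:
    """Lower the parameter update rate when the lip language recognition text of
    the sample matches its label, raise it when they disagree."""
    lr = base_lr * (down if predicted_text == label_text else up)
    return min(max(lr, min_lr), max_lr)
```

In a PyTorch-style loop, the returned value could be written into `optimizer.param_groups[0]["lr"]` before the update step for that sample.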
In a specific application, the adjustment of the parameter update rate can be realized by learning rate decay, learning rate warm-up, cyclical learning rate adjustment, or an adaptive learning rate method.
In or after step S150, the method may further include: associating the trainable layer parameters of the lip language recognition model with preset information of the user (for example, a face ID hereinafter) and storing them. Specifically, the preset information of the user may be stored in a registered information database (e.g., the registered face database hereinafter), and the trainable layer parameters associated with the preset information may be stored in a lip language model library. In this way, the trainable layer parameters of the user can be found conveniently and quickly through the preset information of the user, so that lip language recognition of the user is performed with the user's own trainable layer parameters, which improves the accuracy of lip language recognition for the individual. Meanwhile, whether the lip language recognition model parameters of the user exist can be determined simply by querying the registered information database, which is faster and more efficient.
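A minimal sketch of the two databases described above, using plain in-memory dictionaries as stand-ins for the registered information database and the lip language model library; the data layout is an assumption for illustration only.

```python
registered_face_db = {}   # face_id -> preset information (e.g. face feature template)
lip_model_db = {}         # face_id -> trainable-layer parameters for that user

def save_user_params(face_id, face_features, trainable_params):
    """Associate the user's trainable layer parameters with the user's face ID."""
    registered_face_db[face_id] = face_features
    lip_model_db[face_id] = trainable_params

def lookup_user_params(face_id):
    """Return the user's trainable layer parameters, or None if the user is not
    registered (the caller then falls back to the generic parameters)."""
    return lip_model_db.get(face_id) if face_id in registered_face_db else None
```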
According to the model training method, the voice modality is used to assist the visual modality in automatic labeling and model optimization, realizing self-supervised, personalized, active learning of the lip language recognition model, so that the recognition accuracy and the degree of personalization of the model can be improved at the same time. The whole training process requires no user participation, is imperceptible to the user, and needs no advance calibration for a specific user, which improves user experience. In addition, the model training method can be completed automatically by the electronic device: the lip language recognition model can be trained online with the user's lip language recognition samples and its parameters updated in real time, which improves lip language recognition accuracy, realizes customization of the model, keeps data on the device, and effectively protects user privacy.
Fig. 4 shows a flowchart of a lip language recognition method provided in an embodiment of the present application. The lip language recognition method according to the embodiment of the present application may be executed by an electronic device, and as shown in fig. 4, may include the following steps:
Step S410, collecting video of the user when it is detected that the user is speaking towards the electronic device.
Step S420, extracting lip movement video segments from the video.
In some embodiments, step S420 may include: step a1, cutting the video through lip motion detection to obtain the lip motion video of the mouth movement period; step a2, performing lip motion VAD detection on the lip motion video to obtain lip motion video segments, where in the image sequence of each lip motion video segment, each image frame contains only the lip region of the user.
In one implementation, in step a1, the mouth movement period of the user may be determined according to the real-time lip motion detection result of step S410, and the video of the lip region within the mouth movement period is extracted from the video in units of face region positions, so as to obtain the lip motion video. Each lip motion video corresponds to one face, and a single frame in each lip motion video contains only the image data of the lip region of that face. In this way, the data size of the lip motion video segments obtained from the lip motion video is small, which reduces computational complexity and the interference caused by other faces, lips and other data in the video, thereby improving the processing efficiency and accuracy of lip language recognition for the user.
For example, if the video contains two persons but only one of them is speaking, the cut lip movement video will contain only images of the speaker's lip region. If there are N speakers (N being an integer greater than 1), the lip movement videos obtained by cutting are divided into N groups, each group corresponding to one speaker (for example, associated with the speaker's preset information), and each lip movement video contains only the lip region images of the corresponding speaker.
Step S430, running the lip language recognition model with the lip language recognition model parameters obtained by the above model training method, so as to perform lip language recognition on the lip movement video segments and obtain a lip language recognition text, where the lip language recognition text can indicate the content of the user's speech.
In some embodiments, step S430 may include: acquiring preset information of the user; acquiring the trainable layer parameters associated with the preset information; and loading the trainable layer parameters and the pre-configured generic feature layer parameters to run the lip language recognition model and perform lip language recognition on the lip movement video segment. In this way, the trainable layer parameters of the speaker are quickly found through the speaker's preset information, and the lip language recognition model is run with the speaker's trainable layer parameters and the generic feature layer parameters shared by all users, which is equivalent to performing lip language recognition on the speaker with the speaker's customized lip language recognition model, thereby effectively improving the accuracy of vertical-domain lip language recognition for the individual.
In a specific application, a lip language model library and a registered face database may be created in advance, the lip language recognition model parameters of each user (i.e., the trainable layer parameters of the lip language recognition model) are stored in the lip language model library, and the preset information (e.g., the face ID) associated with the lip language recognition model parameters is stored in the registered face database. In this way, when lip language recognition is performed on a specific user, whether the lip language recognition model parameters of the user exist can be determined by querying whether the preset information of the user exists in the registered face database. If it exists, the preset information of the user can be used to quickly obtain the lip language recognition model parameters of the user from the lip language model library, so that the lip language recognition model is run with the user's own parameters, which is equivalent to performing lip language recognition on the user with the user's customized lip language recognition model, thereby realizing customized lip language recognition for the user and further improving the accuracy of lip language recognition for various people.
In some embodiments, step S430 may further include: when no lip language recognition model parameters matching the preset information can be found (i.e., no trainable layer parameters associated with the preset information exist), it indicates that the user has no lip language recognition model parameters of their own; in this case, the pre-configured trainable layer parameters and generic feature layer parameters can be loaded directly, and the lip language recognition model is run with these parameters, which is equivalent to performing lip language recognition on the user's lip movement video segment with the generic lip language recognition model. In this way, even when no user-specific lip language recognition model parameters exist, lip language recognition of the user can still be completed efficiently through the generic lip language recognition model parameters.
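Continuing the toy `LipReader` sketch above, the following illustrates how the per-user trainable layer parameters could be loaded when they exist and how the generic trainable layer parameters could be used otherwise; the state-dict mechanism and the softmax-based confidence are assumptions of the sketch, not a description of the actual model.

```python
import torch

def run_lip_recognition(model, lip_clip, face_id, lip_model_db, generic_trainable_state):
    """Load per-user trainable layer parameters when available, otherwise the
    generic ones, then run lip language recognition on one lip movement clip."""
    user_state = lip_model_db.get(face_id)               # None -> user not registered
    model.trainable.load_state_dict(
        user_state if user_state is not None else generic_trainable_state)
    with torch.no_grad():
        probs = torch.softmax(model(lip_clip), dim=-1)   # (batch, vocab)
        confidence, token = probs.max(dim=-1)            # lip language recognition confidence
    return token, confidence
```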
Here, the preset information about the user may refer to the above related description, and is not described in detail. In some embodiments, the preset information in step S410 may be a face ID or a lip print, which is conveniently obtained by performing face recognition on the video in step S410.
The lip language identification method can efficiently finish lip language identification of various people through the electronic equipment, and has high lip language identification accuracy.
Fig. 5 shows an exemplary structure of a lip language recognition apparatus 50 provided in an embodiment of the present application. The lip language recognition apparatus 50 may be applied to an electronic device. Specifically, the lip language identification apparatus 50 may be disposed in the electronic device or directly implemented by software and/or hardware of the electronic device, and for specific details of the electronic device, reference may be made to the following description, which is not repeated herein. Referring to fig. 5, the lip language recognition apparatus 50 may include:
a video acquiring unit 51 configured to acquire a video of a user while the user speaks into the electronic device;
the audio acquiring unit 52 is configured to acquire the audio of the user during the process that the user speaks towards the electronic device;
a lip motion extraction unit 53 configured to extract a lip motion video segment in the video;
a voice extracting unit 54 configured to extract a voice segment in the audio;
a selection unit 55 configured to select a lip movement video segment matching the voice segment;
and a labeling unit 56 configured to label the lip movement video segment by using the voice recognition text of the voice segment as a label, so as to obtain a lip language recognition sample of the user.
In some embodiments, lip recognition device 50 may also include one or more of the following:
a noise detection unit 57 configured to detect a volume of the environmental noise;
a wake-up voice confidence level obtaining unit 58 configured to obtain a wake-up voice confidence level of the user;
a face detection unit 59 configured to detect whether a face or a mouth is included in a visual field range of the camera;
a positioning unit 510 configured to acquire a speaker position in the video and a sound source positioning direction of the audio.
The video acquiring unit 51 may be specifically configured to: acquire the video of the user when the volume of the environmental noise is equal to or smaller than a preset noise threshold, the confidence of the wake-up voice is greater than or equal to a preset first confidence threshold, a human face or a human mouth is contained in the visual field range of the camera, and/or the speaker position of the video matches the sound source localization direction of the audio; and/or,
the audio obtaining unit 52 may be specifically configured to: and acquiring the audio of the user when the volume of the environmental noise is equal to or less than a preset noise threshold, the confidence of the awakening voice is greater than or equal to a preset first confidence threshold, the visual field range of the camera comprises a human face or a human mouth, and/or the position of the speaker of the video is matched with the sound source positioning direction of the audio.
In some embodiments, the lip motion extraction unit 53 may be specifically configured to: carrying out endpoint detection and segmentation on the video in a lip movement human voice interval detection mode to obtain a lip movement video segment and a human voice interval of the lip movement video segment; and/or, the voice extracting unit 54 may be specifically configured to perform endpoint detection and segmentation on the audio through a voice interval detection manner, so as to obtain the voice segment and the voice interval of the voice segment.
In some embodiments, the selection unit 55 is specifically configured to: determining the overlapping length of the voice segment and the lip movement video segment in the time dimension according to the voice interval of the voice segment and the voice interval of the lip movement video segment; when the overlapping length of the voice segment and the lip motion video segment in the time dimension is larger than or equal to a preset time threshold value, the voice segment and the lip motion video segment are matched.
In some embodiments, the selection unit 55 may be further configured to: and selecting the lip motion video segment with the lip language recognition confidence coefficient smaller than a preset second confidence coefficient threshold value from the lip motion video segments matched with the voice segments, wherein the lip language recognition confidence coefficient is obtained by performing lip language recognition on the lip motion video segment according to a pre-obtained lip language recognition model.
In some embodiments, the lip motion video segment comprises a lip motion image sequence in which image frames are lip region images.
In some embodiments, the lip language recognition device 50 may further include: the parameter updating unit 511 is configured to update the parameters of the lip language identification model by using the lip language identification samples obtained by the labeling unit 56.
In some embodiments, the lip language recognition model comprises a generic feature layer and a trainable layer, and the parameters of the lip language recognition model comprise trainable layer parameters and generic feature layer parameters; the parameter updating unit 511 is specifically configured to: and updating trainable layer parameters of the lip language recognition model.
In some embodiments, the lip language recognition device 50 may further include: and the storage unit 512 is configured to associate the trainable layer parameters of the lip language recognition model with preset information of the user and store the associated parameters.
In some embodiments, the storage unit 512 may be specifically configured to: storing preset information of a user in a registered information database; and storing the trainable layer parameters related to the preset information in a lip language model library.
In some embodiments, the parameter updating unit 511 may be specifically configured to: adjusting the parameter update rate of a lip language identification model by comparing a lip language identification text of a lip language identification sample with a label of the lip language identification sample to obtain the parameter update rate corresponding to the lip language identification sample; updating parameters of the lip language identification model by using the lip language identification samples and the parameter updating rates of the corresponding lip language identification samples; the lip language identification text is obtained by performing lip language identification on a lip language identification sample through a lip language identification model.
In some embodiments, the video acquiring unit 51 may be further configured to acquire a video of the user when detecting that the user speaks into the electronic device; a lip motion extracting unit 53, which may be further configured to extract a lip motion video segment in the video; the lip language recognition device 50 may further include: and a lip language identification unit 513 configured to run the lip language identification model according to the parameters of the lip language identification model updated by the parameter updating unit 511 to perform lip language identification on the lip movement video segment, so as to obtain a lip language identification text.
In some embodiments, the lip language recognition model may include a generic feature layer and a trainable layer, and the parameters of the lip language recognition model include trainable layer parameters and generic feature layer parameters. The lip language recognition device 50 may further include: a preset information acquiring unit 514 configured to acquire preset information of a user; the lip language recognition unit 513 may be specifically configured to: acquiring trainable layer parameters associated with the preset information; and loading the trainable layer parameters and the pre-configured general feature layer parameters to operate a lip language identification model to carry out lip language identification on the lip motion video segment.
Specifically, the lip language identification unit 513 is specifically configured to perform lip language identification on the lip movement video segment obtained by the lip movement extraction unit 53 according to the lip language identification model parameter trained by the parameter updating unit 511, so as to obtain a lip language identification result, where the lip language identification result may include a lip language identification text and a lip language identification confidence. For the lip recognition text and the lip recognition confidence, reference may be made to the context-dependent description, which is not repeated here.
In the embodiment of the present application, other technical details of the lip language identification device 50 may refer to the above method and the related description of the following embodiments, and are not repeated.
Fig. 6 shows an exemplary application scenario and an exemplary processing procedure thereof according to an embodiment of the present application.
In the example of fig. 6, the user A controls the playback content of the robot by voice. The user A first speaks a first sentence "Xiaoyi", containing a specific command word, to the robot, then speaks a second sentence "don't speak" to the robot to make the robot stop the currently played content, and finally speaks a third sentence "play music" to the robot to make the robot play music.
When the user A communicates with the robot face to face, the robot captures video and audio through its camera and microphone respectively, and obtains, through the corresponding VAD processing, three lip movement video segments of the user A with their VAD values and three voice segments with their VAD values. The three lip movement video segments correspond to the three sentences "Xiaoyi", "don't speak" and "play music" respectively, and their VAD values each include the lip movement start point and end point of the corresponding sentence. Similarly, the three voice segments correspond to the three sentences respectively, and their VAD values each include the voice start point and end point of the corresponding sentence. Then, the robot finds the lip movement video segment corresponding to each voice segment according to the overlap length of the VAD values, performs voice recognition on the voice segments to obtain their voice recognition texts, which serve as the label data of the corresponding lip movement video segments, and labels the three lip movement video segments with this label data to obtain three lip language recognition samples class1, class2 and class3 of the user A.
As shown in fig. 6, the label of the lip language recognition sample class1 is a text whose content is "Xiaoyi", and its lip movement video segment includes the sequence of lip region images when the user A speaks that sentence; the label of sample class2 is a text whose content is "don't speak", and its lip movement video segment includes the sequence of lip region images when the user A speaks that sentence; the label of sample class3 is a text whose content is "play music", and its video segment includes the sequence of lip region images when the user A speaks that sentence.
The robot updates the trainable layer parameters of the user A (or the generic trainable layer parameters) by using the three lip language recognition samples class1, class2 and class3 of the user A, associates the updated trainable layer parameters with the face ID of the user A, and stores them in the lip language model library of the robot. In this way, online optimization and active learning of the lip language recognition model are realized, and the accuracy of the lip language recognition model for the user A is improved; that is, customized lip language recognition model parameters of the user A are obtained, realizing user customization of the lip language recognition model.
Specifically, the robot may perform lip motion recognition and face recognition on the video to obtain the face feature data of the user A, and query the registered face database with this face feature data to determine whether a corresponding face ID exists. If the face ID of the user A exists in the registered face database, the robot queries the lip language model library with the face ID of the user A to obtain the trainable layer parameters of the user A, updates these trainable layer parameters using the three lip language recognition samples class1, class2 and class3 of the user A, and stores the updated parameters back into the lip language model library. If the face ID of the user A does not exist in the registered face database, the robot can update the locally stored trainable layer parameters using the three lip language recognition samples class1, class2 and class3 of the user A, configure a face ID corresponding to the face feature data of the user A to obtain the face ID of the user A, store this face ID in the registered face database, and store the updated trainable layer parameters in the lip language model library after associating them with the face ID of the user A.
After the above process, when the user A speaks to the robot again, the robot can obtain the face ID of the user A by performing face recognition on the user A, obtain the trainable layer parameters of the user A through this face ID, and load the trainable layer parameters of the user A together with the locally pre-stored generic feature layer parameters to run the lip language recognition model and perform lip language recognition on the user A. This is equivalent to performing real-time lip language recognition on the user A with the customized lip language recognition model of the user A, thereby effectively improving the lip language recognition accuracy for the user A.
Fig. 7 shows an exemplary specific implementation flow of lip language recognition, sample labeling, and model training in the embodiment of the present application.
Referring to fig. 7, the process of lip language identification through the lip language identification model may include the following steps:
In step S711, lip movement detection is performed to detect in real time whether a speaker is speaking.
Step S712, video cutting to obtain the lip movement video segment of the speaker, that is, based on the result of the lip movement detection, extracting the video sequence of the lip region within the mouth movement period (for example, 60 frames of 112 × 112 image data), in units of face region positions, as the lip movement video;
in step S713, face recognition is performed, that is, face recognition is performed on the lip movement video segment of the speaker, and face feature data of the speaker (for example, the user a in fig. 6) is obtained.
Here, the face feature data may be, but is not limited to, face key point data, mouth key point data, or others. In practical application, the face recognition can be performed on the image frames in the video through the existing face key point detection algorithm, so that the face feature data can be obtained.
Step S714, querying a registered face database for the face ID of the speaker by using the face feature data of the speaker; if the face ID of the speaker is hit, it indicates that lip language recognition model parameters of the speaker exist, and the flow continues to step S715; if the face ID of the speaker is not hit, it indicates that no lip language recognition model parameters of the speaker exist, and the flow jumps to step S716;
the registered face database contains all face IDs that have been associated with the parameters of the lip language recognition model (or the trainable layer parameters of the lip language recognition model), and if the face IDs can be hit in the registered face database, it indicates that the parameters of the lip language recognition model of the speaker (for example, the trainable layer parameters of the lip language recognition model) already exist, that is, the customized lip language recognition model corresponding to the speaker already exists. If the face ID can not be hit in the registered face database, the lip language identification model parameters of the speaker do not exist, and the lip language identification model parameters are equivalent to that the customized lip language identification model of the speaker does not exist.
Step S715, loading the parameters of the lip language recognition model corresponding to the face ID of the speaker into a memory, and operating the lip language recognition model by using the parameters;
the lip language identification model parameters corresponding to the face ID of the speaker comprise trainable layer parameters corresponding to the face ID of the speaker and locally stored universal feature layer parameters. The method comprises the steps of obtaining trainable layer parameters of a speaker from a lip language model library by using a face ID of the speaker, loading the trainable layer parameters and locally stored parameters of a universal feature layer (namely, the universal feature layer parameters are suitable for various users, and can also be regarded as the universal feature layer parameters shared by various users) into a memory, and operating a lip language identification model by using the parameters by a processor of the electronic equipment to identify the lip language, namely, performing the lip language identification on a lip language video clip of the user by using a customized lip language identification model of the user, wherein the customized lip language identification model has higher accuracy and better effect when performing the lip language identification on a specific user. It should be noted that the customized lip language recognition model not only can perform lip language recognition on a specific user, but also can be used for realizing lip language recognition of other various users.
Step S716, loading the parameters of the general lip language recognition model into a memory, and operating the lip language recognition model by using the parameters;
the general lip language recognition model parameters include locally stored trainable layer parameters (i.e., trainable layer parameters general for various users) and locally stored general feature layer parameters (i.e., general feature layer parameters suitable for various users, which may also be regarded as general feature layer parameters common for various users). The method includes the steps that trainable layer parameters stored locally and general feature layer parameters stored locally are loaded into a memory, a processor in the electronic equipment uses the parameters to operate a lip language recognition model to perform lip language recognition, namely, a general lip language recognition model is used to perform lip language recognition on a lip language video clip of a user, the general lip language recognition model can be a lip language recognition model suitable for various users, the lip language recognition of various users can be achieved, and the accuracy of the general lip language recognition model for the lip language recognition of some users or a certain user (or a majority of users) may be low.
In step S717, performing lip recognition on the lip language video clip obtained in step S712, and outputting a result, where the result may include a lip language recognition text and a lip language recognition confidence level, and the lip language recognition confidence level may indicate a confidence level of the lip language recognition text obtained by performing lip language recognition on the lip language video clip.
It should be noted that the general lip language recognition model and the customized lip language recognition model of each user are substantially the same lip language recognition model, and the general feature layer parameters are the same, and the trainable layer parameters are different.
Referring to fig. 7, the sample labeling and model training may be performed in a quiet environment, and the specific implementation flow thereof may include the following steps:
step S721, voice wakeup is successful: the speaker speaks a wake-up voice containing a specific command word (e.g., a small art) to the electronic device to successfully wake up the electronic device.
Step S722, ambient noise perception: and sensing the noise of the surrounding environment, continuing to step S723 when the noise is lower than the noise threshold and the awakening confidence is higher than the first confidence threshold, otherwise, exiting the current process.
Step S723, sample validity judgment: detecting whether the face is in the visual field range and whether the sound source positioning direction is consistent with the face position, if the face is in the visual field range and the sound source positioning direction is consistent with the face position, indicating that the sample is valid, and continuing to step S724; and if the sound source positioning direction is not consistent with the face position and/or the face is not in the visual field range, the sample is invalid, and the current process is exited.
Step S724, the speaker speaks towards the electronic device, the camera collects the video of the speaker, and the microphone collects the audio of the speaker. Lip movement video segments are extracted from the video through the lip motion VAD model, and voice segments are extracted from the audio through the voice VAD model. The lip movement video segment corresponding to each voice segment is determined according to the VAD values of the lip movement video segments and the VAD values of the voice segments, and the voice recognition text of each voice segment is used as a label to label the corresponding lip movement video segment, thereby obtaining the lip language recognition samples of the speaker.
Specifically, endpoint detection and segmentation are performed on the video of the speaker through the lip motion VAD model to obtain the lip movement video segments of the speaker and their VAD values; endpoint detection and segmentation are performed on the audio of the speaker through the voice VAD model to obtain the voice segments of the speaker and their VAD values; the lip movement video segment whose overlap length with a voice segment in the time dimension is greater than the duration threshold is determined as the lip movement video segment corresponding to that voice segment and is used as a candidate lip movement video segment of a lip language recognition sample; voice recognition is performed on the voice segments to obtain their voice recognition texts; and the corresponding candidate lip movement video segments are labeled with the voice recognition texts of the voice segments, so as to obtain the lip language recognition samples of the speaker.
Step S725, performing lip language recognition on the lip language recognition samples (i.e., the lip movement video segments screened out in step S724) to obtain the recognition results of the lip language recognition samples, where the recognition results may include lip language recognition texts and lip language recognition confidence levels;
here, the lip recognition model may be operated in the manner of steps S713 to S715 to perform lip recognition on the lip recognition samples to obtain the recognition results of the lip recognition samples.
Step S726, setting a parameter update rate of the corresponding lip language identification sample according to the identification result obtained in step S725 and the label marked in step S724;
Specifically, the lip language recognition text of a lip movement video segment is compared with the voice recognition text of the corresponding voice segment. If they are consistent, the parameter update rate of the lip language recognition model can be reduced to obtain the parameter update rate corresponding to that lip language recognition sample; if they are inconsistent or differ greatly, the parameter update rate of the lip language recognition model can be increased to obtain the parameter update rate corresponding to that lip language recognition sample.
In step S727, the parameters of the lip language recognition model are updated (for example, the trainable layer parameters associated with the face ID of the speaker, or the generic trainable layer parameters, are updated) by using the lip language recognition samples obtained in step S724 and the parameter update rates of these samples. The updated parameters of the lip language recognition model (for example, the trainable layer parameters associated with the face ID of the speaker) are associated with the face ID of the speaker and updated in the lip language model library, the face ID of the speaker corresponds to the face feature data of the speaker, and the face ID of the speaker is stored in the registered face database, so that the lip language recognition model parameters of the speaker can be obtained subsequently.
As can be seen from the examples in fig. 6 and fig. 7, model training and optimization and lip language recognition can be completed by the electronic device, so that it is ensured that a legally compliant user updates the model locally, and data is not outgoing (i.e., the data does not need to be uploaded to the cloud), thereby effectively protecting user privacy. Meanwhile, when the speaker communicates with the electronic equipment (such as a robot) face to face, the lip language recognition accuracy of the speaker can be continuously and effectively improved for different speakers. In addition, under the noisy environment, the electronic equipment can also improve the awakening rate of the electronic equipment by carrying out high-accuracy lip language identification on the speaker.
Fig. 8 is a schematic structural diagram of an electronic device 80 provided in an embodiment of the present application. The electronic device 80 includes: a processor 81, and a memory 82. When the electronic device 80 is running, the processor 81 executes the computer-executable instructions in the memory 82 to perform the steps of the above-mentioned sample labeling method, model training method and/or lip recognition method.
The processor 81 may be connected to the memory 82. The memory 82 may be used to store program code and data. Specifically, the memory 82 may be a storage unit inside the processor 81, an external storage unit independent of the processor 81, or a component including both a storage unit inside the processor 81 and an external storage unit independent of the processor 81.
It should be understood that, in the embodiment of the present application, the processor 81 may be a Central Processing Unit (CPU). The processor may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. Alternatively, the processor 81 may be one or more integrated circuits configured to execute relevant programs, so as to implement the technical solutions provided in the embodiments of the present application.
The memory 82 may include a read-only memory and a random access memory, and provides instructions and data to the processor 81. A portion of the memory 82 may also include a non-volatile random access memory. For example, the memory 82 may also store device type information.
The electronic device 80 may also include a communication interface 83. It should be understood that the communication interface 83 in the electronic device 80 shown in fig. 8 may be used for communication with other devices.
Optionally, the electronic device 80 may also include a bus. The memory 82 and the communication interface 83 may be connected to the processor 81 via a bus. The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one line is shown in FIG. 8, but this does not represent only one bus or one type of bus.
It should be understood that the electronic device 80 according to the embodiment of the present application may correspond to the execution body of the methods according to the embodiments of the present application, and that the above and other operations and/or functions of the modules in the electronic device 80 are intended to implement the corresponding processes of the methods of the embodiments; for brevity, they are not described here again.
The electronic device 80 may be, but is not limited to, at least one of a mobile phone, a foldable electronic device, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a Personal Digital Assistant (PDA), an Augmented Reality (AR) device, a Virtual Reality (VR) device, an Artificial Intelligence (AI) device, a wearable device, an in-vehicle device, a smart home device, or a smart city device. The embodiment of the present application does not specifically limit the specific type of the electronic device 80.
Fig. 9 illustrates an exemplary implementation structure of the electronic device 80 according to the embodiment of the present application, that is, an electronic device 90.
The electronic device 90 may include a processor 910, an external memory interface 920, an internal memory 921, a Universal Serial Bus (USB) connector 930, a charging management module 940, a power management module 941, a battery 942, an antenna 1, an antenna 2, a mobile communication module 950, a wireless communication module 960, an audio module 970, a speaker 970A, a receiver 970B, a microphone 970C, an earphone interface 970D, a sensor module 980, a key 990, a motor 991, an indicator 992, a camera 993, a display 994, and a Subscriber Identification Module (SIM) card interface 995, etc. Wherein sensor module 980 may include a pressure sensor 980A, a gyroscope sensor 980B, an air pressure sensor 980C, a magnetic sensor 980D, an acceleration sensor 980E, a distance sensor 980F, a proximity light sensor 980G, a fingerprint sensor 980H, a temperature sensor 980J, a touch sensor 980K, an ambient light sensor 980L, a bone conduction sensor 980M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 90. In other embodiments of the present application, the electronic device 90 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 910 may include one or more processing units, such as: the processor 910 may include an Application Processor (AP), a modem processor, a Graphics Processor (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processor (NPU), among others. The different processing units may be separate devices or may be integrated into one or more processors.
The processor 910 may generate operation control signals according to the instruction operation code and the timing signals, so as to perform instruction fetching and instruction execution control.
A memory may also be provided in the processor 910 for storing instructions and data. In some embodiments, the memory in the processor 910 may be a cache. The cache may store instructions or data that the processor 910 has just used or uses repeatedly. If the processor 910 needs the instructions or data again, it can call them directly from the cache. This avoids repeated accesses, reduces the waiting time of the processor 910, and thereby improves system efficiency.
In some embodiments, processor 910 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose-input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, a bus or Universal Serial Bus (USB) interface, and the like. The processor 910 may be connected to modules such as a touch sensor, an audio module, a wireless communication module, a display, a camera, etc. through at least one of the above interfaces.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only an illustration, and does not limit the structure of the electronic device 90. In other embodiments of the present application, the electronic device 90 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The USB connector 930 is an interface conforming to the USB standard specification, and may be used to connect the electronic device 90 and a peripheral device. It may specifically be a Mini USB connector, a Micro USB connector, a USB Type-C connector, or the like. The USB connector 930 may be used to connect a charger to charge the electronic device 90, or to connect another electronic device to transmit data between the electronic device 90 and the other electronic device. It can also be used to connect a headset and output audio stored in the electronic device through the headset, or to connect other electronic devices such as VR devices. In some embodiments, the standard specification of the universal serial bus may be USB 1.x, USB 2.0, USB 3.x, or USB 4.
The charging management module 940 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 940 may receive the charging input of a wired charger through the USB connector 930. In some wireless charging embodiments, the charging management module 940 may receive wireless charging input through a wireless charging coil of the electronic device 90. While charging the battery 942, the charging management module 940 may also supply power to the electronic device through the power management module 941.
The power management module 941 is configured to connect the battery 942, the charging management module 940 and the processor 910. The power management module 941 receives input from the battery 942 and/or the charging management module 940 and provides power to the processor 910, the internal memory 921, the display 994, the camera 993, and the wireless communication module 960. The power management module 941 may also be used to monitor parameters such as battery capacity, battery cycle number, and battery health (leakage, impedance). In some other embodiments, a power management module 941 may also be disposed in the processor 910. In other embodiments, the power management module 941 and the charging management module 940 may be disposed in the same device.
The wireless communication function of the electronic device 90 may be implemented by the antenna 1, the antenna 2, the mobile communication module 950, the wireless communication module 960, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 90 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 950 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 90. The mobile communication module 950 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 950 can receive electromagnetic waves from the antenna 1, filter, amplify and transmit the received electromagnetic waves to the modem processor for demodulation. The mobile communication module 950 can also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of the mobile communication module 950 may be disposed in the processor 910. In some embodiments, at least some of the functional modules of the mobile communication module 950 may be disposed in the same device as at least some of the modules of the processor 910.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low frequency baseband signal to a baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 970A, the receiver 970B, etc.) or displays an image or video through the display screen 994. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be separate from the processor 910 and may be located in the same device as the mobile communication module 950 or other functional modules.
The wireless communication module 960 may provide solutions for wireless communication applied to the electronic device 90, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), Bluetooth (BT), Bluetooth Low Energy (BLE), Ultra Wide Band (UWB), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 960 may be one or more devices integrating at least one communication processing module. The wireless communication module 960 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 910. The wireless communication module 960 may also receive signals to be transmitted from the processor 910, frequency-modulate and amplify them, and convert them into electromagnetic waves via the antenna 2 for radiation.
In some embodiments, the antenna 1 of the electronic device 90 is coupled to the mobile communication module 950 and the antenna 2 is coupled to the wireless communication module 960, so that the electronic device 90 may communicate with networks and other electronic devices via wireless communication technologies. The wireless communication technologies may include Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Time-Division Code Division Multiple Access (TD-SCDMA), Long Term Evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a Global Positioning System (GPS), a Global Navigation Satellite System (GLONASS), a BeiDou navigation Satellite system (BDS), a Quasi-Zenith Satellite System (QZSS), and/or a Satellite Based Augmentation System (SBAS).
The electronic device 90 may implement display functions via the GPU, the display screen 994, and the application processor, among other things. The GPU is a microprocessor for image processing, and is connected to the display screen 994 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 910 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 994 is used to display images, videos, and the like. The display screen 994 includes a display panel. The display panel may be a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), an Active-Matrix Organic Light-Emitting Diode (AMOLED), a Flexible Light-Emitting Diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a Quantum dot Light-Emitting Diode (QLED), or the like. In some embodiments, the electronic device 90 may include 1 or more display screens 994.
The electronic device 90 may implement a camera function through the camera 993, the ISP, the video codec, the GPU, the display screen 994, the AP, the NPU, and the like.
The camera 993 may be used to acquire color image data and depth data of a subject. The ISP may be used to process color image data collected by the camera 993. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also perform algorithm optimization on noise, brightness, etc. of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 993.
In some embodiments, the camera 993 may include a color camera module and a 3D sensing module.
In some embodiments, the light sensing element of the color camera module may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to be converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats.
In some embodiments, the 3D sensing module may be a Time Of Flight (TOF) 3D sensing module or a structured light 3D sensing module. Structured light 3D sensing is an active depth sensing technology, and the basic components of a structured light 3D sensing module may include an infrared (IR) emitter, an IR camera module, and the like. TOF 3D sensing is also an active depth sensing technology, and the basic components of a TOF 3D sensing module may include an infrared (IR) emitter, an IR camera module, and the like. The structured light 3D sensing module can also be applied to fields such as face recognition, motion-sensing game consoles, and industrial machine vision detection. The TOF 3D sensing module can also be applied to fields such as game consoles and Augmented Reality (AR)/Virtual Reality (VR).
In other embodiments, the camera 993 may also be comprised of two or more cameras. The two or more cameras may include color cameras that may be used to collect color image data of the object being photographed.
In some embodiments, the electronic device 90 may include 1 or more cameras 993. Specifically, the electronic device 90 may include 1 front camera 993 and 1 rear camera 993. The front camera 993 is generally used to collect color image data and depth data of a photographer facing the display screen 994, and the rear camera module is used to collect color image data and depth data of a photographic subject (such as a person, a landscape, etc.) facing the photographer.
The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the electronic device 90 performs frequency point selection, the digital signal processor is used to perform a Fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The electronic device 90 may support one or more video codecs. In this way, the electronic device 90 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can implement applications such as intelligent recognition of the electronic device 90, for example: image recognition, face recognition, voice recognition, text understanding, lip language recognition, and the like.
The external memory interface 920 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 90. The external memory card communicates with the processor 910 through the external memory interface 920 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card. Or files such as music, video, etc. are transferred from the electronic device to the external memory card.
The internal memory 921 may be used to store computer-executable program code, which includes instructions. The internal memory 921 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The data storage area may store data created during use of the electronic device 90 (e.g., audio data, phone book, etc.), and the like. In addition, the internal memory 921 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like. The processor 910 performs various functional methods or data processing of the electronic device 90 by executing instructions stored in the internal memory 921 and/or instructions stored in a memory provided in the processor.
The electronic device 90 may implement audio functions through an audio module 970, a speaker 970A, a receiver 970B, a microphone 970C, an earphone interface 970D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 970 is used for converting digital audio information into an analog audio signal output and also for converting an analog audio input into a digital audio signal. The audio module 970 may also be used to encode and decode audio signals. In some embodiments, the audio module 970 may be disposed in the processor 910, or some functional modules of the audio module 970 may be disposed in the processor 910.
The speaker 970A, also called a "horn", is used to convert audio electrical signals into sound signals. The electronic apparatus 90 can listen to music through the speaker 970A or output an audio signal for a handsfree call.
Receiver 970B, also referred to as an "earpiece," is used to convert the electrical audio signal into an acoustic signal. When the electronic device 90 receives a call or voice information, it can receive voice by placing the receiver 970B close to the ear of the person.
The microphone 970C, also called a "mic" or "sound transducer", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can input a sound signal into the microphone 970C by speaking close to it. The electronic device 90 may be provided with at least one microphone 970C. In other embodiments, the electronic device 90 may be provided with two microphones 970C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 90 may further be provided with three, four, or more microphones 970C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and the like.
The earphone interface 970D is used to connect a wired earphone. The earphone interface 970D may be the USB connector 930, or may be a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The pressure sensor 980A is used to sense a pressure signal and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 980A may be disposed on the display screen 994. There are many types of pressure sensors 980A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may include at least two parallel plates made of conductive material. When a force acts on the pressure sensor 980A, the capacitance between the electrodes changes, and the electronic device 90 determines the strength of the pressure from the change in capacitance. When a touch operation acts on the display screen 994, the electronic device 90 detects the intensity of the touch operation based on the pressure sensor 980A. The electronic device 90 may also calculate the touch position based on the detection signal of the pressure sensor 980A. In some embodiments, touch operations that act on the same touch position but with different intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the short message application icon, an instruction to view the short message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the short message application icon, an instruction to create a new short message is executed.
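The threshold behavior described above can be summarized in the following illustrative sketch; the threshold value and the action names are invented for the example and are not part of the embodiment.

```python
# Sketch only: map touch intensity on the short message icon to an instruction.
FIRST_PRESSURE_THRESHOLD = 0.5   # assumed normalized pressure value

def on_sms_icon_touch(intensity: float) -> str:
    if intensity < FIRST_PRESSURE_THRESHOLD:
        return "view_short_message"      # light press: view the message
    return "create_short_message"        # firm press: create a new message
```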
The gyro sensor 980B may be used to determine the motion pose of the electronic device 90. In some embodiments, the angular velocity of the electronic device 90 about three axes (i.e., x, y, and z axes) may be determined by the gyro sensor 980B. The gyro sensor 980B may be used for photographing anti-shake. Illustratively, when the shutter is pressed, the gyro sensor 980B detects the shake angle of the electronic device 90, calculates the distance to be compensated for by the lens module according to the shake angle, controls the lens to move in the opposite direction to counteract the shake of the electronic device 90, and realizes anti-shake. The gyro sensor 980B may also be used for navigation, somatosensory gaming scenarios.
The barometric pressure sensor 980C is used to measure barometric pressure. In some embodiments, the electronic device 90 calculates the altitude based on the barometric pressure value measured by the barometric pressure sensor 980C, to assist in positioning and navigation.
The magnetic sensor 980D includes a Hall sensor. The electronic device 90 may use the magnetic sensor 980D to detect the opening and closing of a flip holster. When the electronic device is a foldable electronic device, the magnetic sensor 980D may be used to detect the folding or unfolding of the device, or its folding angle. In some embodiments, when the electronic device 90 is a flip phone, the electronic device 90 can detect the opening and closing of the flip cover according to the magnetic sensor 980D, and then set features such as automatic unlocking upon opening according to the detected open or closed state of the holster or flip cover.
The acceleration sensor 980E can detect the magnitude of the acceleration of the electronic device 90 in various directions (typically along three axes), and can detect the magnitude and direction of gravity when the electronic device 90 is stationary. It can also be used to recognize the posture of the electronic device, and is applied in scenarios such as landscape/portrait switching and pedometers.
The distance sensor 980F is used to measure distance. The electronic device 90 may measure distance by infrared or laser. In some embodiments, in a photographing scenario, the electronic device 90 may use the distance sensor 980F to measure distance to achieve fast focusing.
The proximity light sensor 980G may include, for example, a Light-Emitting Diode (LED) and a light detector such as a photodiode. The light-emitting diode may be an infrared light-emitting diode. The electronic device 90 emits infrared light outward through the light-emitting diode and uses the photodiode to detect infrared light reflected from nearby objects. When the intensity of the detected reflected light is greater than a threshold, it may be determined that there is an object near the electronic device 90; when the intensity is less than the threshold, the electronic device 90 may determine that there is no object nearby. The electronic device 90 can use the proximity light sensor 980G to detect that the user is holding the electronic device 90 close to the ear during a call, so as to automatically turn off the screen and save power. The proximity light sensor 980G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.
The ambient light sensor 980L may be used to sense ambient light levels. The electronic device 90 may adaptively adjust the brightness of the display 994 based on the perceived ambient light level. The ambient light sensor 980L can also be used to automatically adjust the white balance when taking a picture. Ambient light sensor 980L may also cooperate with proximity light sensor 980G to detect whether electronic device 90 is obscured, such as when the electronic device is in a pocket. When the electronic equipment is detected to be shielded or in a pocket, part of functions (such as a touch function) can be in a disabled state to prevent misoperation.
The fingerprint sensor 980H is used to capture a fingerprint. The electronic device 90 can utilize the collected fingerprint characteristics to unlock the fingerprint, access the application lock, photograph the fingerprint, answer an incoming call with the fingerprint, and the like.
The temperature sensor 980J is used to detect temperature. In some embodiments, the electronic device 90 executes a temperature handling strategy based on the temperature detected by the temperature sensor 980J. For example, when the temperature detected by the temperature sensor 980J exceeds a threshold, the electronic device 90 reduces the performance of the processor to lower power consumption and implement thermal protection. In other embodiments, when the temperature detected by the temperature sensor 980J is below another threshold, the electronic device 90 heats the battery 942. In still other embodiments, when the temperature is below a further threshold, the electronic device 90 may boost the output voltage of the battery 942.
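The temperature handling strategy can likewise be sketched as a set of threshold checks; all threshold values, the relative order of the two low-temperature thresholds, and the device calls are assumptions made for illustration.

```python
# Sketch only: threshold-based temperature handling strategy.
HIGH_TEMP, LOW_TEMP, VERY_LOW_TEMP = 45.0, 0.0, -10.0  # degrees Celsius, assumed values

def thermal_policy(temp_c: float, device) -> None:
    if temp_c > HIGH_TEMP:
        device.reduce_processor_performance()   # thermal protection, lower power draw
    elif temp_c < VERY_LOW_TEMP:
        device.boost_battery_output_voltage()   # assumed lowest threshold
    elif temp_c < LOW_TEMP:
        device.heat_battery()                   # warm the battery at low temperature
```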
The touch sensor 980K is also referred to as a "touch device". The touch sensor 980K may be disposed on the display screen 994; together, the touch sensor 980K and the display screen 994 form a touch screen, also called a "touchscreen". The touch sensor 980K is used to detect a touch operation acting on or near it, and may pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation may be provided through the display screen 994. In other embodiments, the touch sensor 980K may be disposed on a surface of the electronic device 90 at a position different from that of the display screen 994.
The bone conduction sensor 980M may acquire vibration signals. In some embodiments, the bone conduction sensor 980M can acquire the vibration signal of the bone block vibrated by the human vocal part. The bone conduction sensor 980M may also contact the human pulse to receive blood pressure pulse signals. In some embodiments, the bone conduction sensor 980M may also be provided in a headset to form a bone conduction headset. The audio module 970 may parse out a voice signal based on the vibration signal of the vocal-part bone block acquired by the bone conduction sensor 980M, thereby implementing a voice function. The application processor may parse out heart rate information based on the blood pressure pulse signal acquired by the bone conduction sensor 980M, thereby implementing a heart rate detection function.
The keys 990 may include a power key, a volume key, and the like. The keys 990 may be mechanical keys or touch keys. The electronic device 90 may receive key input and generate key signal input related to user settings and function control of the electronic device 90.
The motor 991 may generate a vibration prompt. The motor 991 may be used for incoming-call vibration prompts as well as touch vibration feedback. For example, touch operations acting on different applications (such as photographing and audio playback) may correspond to different vibration feedback effects, and touch operations acting on different areas of the display screen 994 may also correspond to different vibration feedback effects. Different application scenarios (such as time reminders, message reception, alarm clocks, and games) may also correspond to different vibration feedback effects. The touch vibration feedback effect may further support customization.
The indicator 992 may be an indicator light, and may be used to indicate a charging status, a change in power, or a message, a missed call, a notification, or the like.
The SIM card interface 995 is used to connect a SIM card. A SIM card can be brought into contact with or separated from the electronic device 90 by being inserted into or pulled out of the SIM card interface 995. The electronic device 90 may support 1 or more SIM card interfaces. The SIM card interface 995 may support a Nano SIM card, a Micro SIM card, a standard SIM card, and the like. Multiple cards, of the same or different types, may be inserted into the same SIM card interface 995 at the same time. The SIM card interface 995 may also be compatible with different types of SIM cards and with external memory cards. The electronic device 90 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device 90 employs an eSIM, namely an embedded SIM card. The eSIM card can be embedded in the electronic device 90 and cannot be separated from it.
The software system of the electronic device 90 may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application exemplifies a software structure of the electronic device 90 by taking an Android system with a layered architecture as an example.
Fig. 10 is a block diagram of an exemplary software structure of the electronic device 90 according to the embodiment of the present application.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into five layers, from top to bottom: an application layer, an application framework layer, the Android Runtime (ART) and native C/C++ libraries, a Hardware Abstraction Layer (HAL), and a kernel layer.
The application layer may include a series of application packages.
As shown in fig. 10, the application package may include camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, etc. applications.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 10, the application framework layers may include a window manager, content provider, view system, resource manager, notification manager, activity manager, input manager, and the like.
The window manager provides a Window Management Service (WMS), which may be used for window management, window animation management, and surface management, and serves as a relay station for the input system.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without requiring user interaction. For example, the notification manager is used to notify that a download is complete, to provide message alerts, and so on. The notification manager may also present notifications in the form of charts or scroll-bar text in the status bar at the top of the system, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window, for example prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, or flashing an indicator light.
The activity manager may provide an Activity Manager Service (AMS), which may be used for the startup, switching, and scheduling of system components (for example, activities, services, content providers, and broadcast receivers), and for the management and scheduling of application processes.
The Input Manager may provide an Input Manager Service (IMS) that may be used to manage inputs to the system, such as touch screen inputs, key inputs, sensor inputs, and the like. The IMS takes the event from the input device node and assigns the event to the appropriate window by interacting with the WMS.
The Android Runtime includes a core library and ART. ART is responsible for converting bytecode into machine code, and mainly adopts Ahead-Of-Time (AOT) compilation technology and Just-In-Time (JIT) compilation technology.
The core library is mainly used to provide the basic functions of the Java class library, such as basic data structures, mathematics, IO, tools, databases, and networking. The core library provides the APIs used for Android application development.
The native C/C++ libraries may include a plurality of functional modules, for example: the surface manager, the Media Framework, libc, OpenGL ES, SQLite, Webkit, and the like.
The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications. The media framework supports playback and recording of a variety of commonly used audio and video formats, as well as still image files, and supports a variety of audio and video encoding formats, such as MPEG-4, H.264, MP3, AAC, AMR, JPG, and PNG. OpenGL ES provides the drawing and manipulation of 2D and 3D graphics in applications. SQLite provides a lightweight relational database for the applications of the electronic device 90.
The hardware abstraction layer runs in a user space (user space), encapsulates the kernel layer driver, and provides a calling interface for an upper layer.
The kernel layer is the layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
The workflow of the software and hardware of the electronic device 90 is illustratively described below in connection with capturing a photo scene.
When the touch sensor 980K receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer. The kernel layer processes the touch operation into a raw input event (including information such as the touch coordinates and the timestamp of the touch operation). The raw input event is stored at the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking a touch click operation whose corresponding control is the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or a video through the camera 993.
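The layered hand-off described above can be summarized in the deliberately simplified sketch below; it does not use real Android interfaces, and every name in it is hypothetical.

```python
# Sketch only: layered flow from a touch interrupt to capturing an image.
def handle_touch(x: float, y: float, timestamp: float, kernel, framework, apps):
    raw_event = kernel.package_raw_input_event(x, y, timestamp)  # kernel layer
    control = framework.resolve_control(raw_event)               # framework layer
    if control == "camera_app_icon":
        camera_app = apps["camera"]
        camera_app.start()                  # application layer starts the camera app
        kernel.start_camera_driver()        # kernel layer starts the camera driver
        return camera_app.capture()         # still image or video via the camera
    return None
```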
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is used to execute the sample labeling method, the model training method, and/or the lip language recognition method described in the foregoing embodiments when executed by a processor.
The computer storage media of the embodiments of the present application may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application.

Claims (26)

1. A sample labeling method is applied to electronic equipment, and the sample labeling method comprises the following steps:
acquiring video and audio of a user in the process that the user speaks towards the electronic equipment;
extracting a lip movement video segment in the video and a voice segment in the audio;
selecting the lip movement video segment matched with the voice segment;
and marking the lip movement video clip by taking the voice recognition text of the voice clip as a label to obtain a lip language recognition sample of the user.
2. The method of claim 1,
the method further comprises one or more of:
detecting the volume of the environmental noise;
acquiring a awakening voice confidence coefficient of a user;
detecting whether a human face or a human mouth is included in the visual field range of the camera;
acquiring the position of a speaker in the video and the sound source positioning direction of the audio;
the acquiring of the video and the audio of the user specifically includes: when the volume of the environmental noise is equal to or less than a preset noise threshold, the confidence of the awakening voice is greater than or equal to a preset first confidence threshold, the visual field range of the camera includes a human face or a human mouth, and/or the position of the speaker of the video is matched with the sound source positioning direction of the audio, acquiring the video and the audio of the user.
3. The method of claim 1 or 2, wherein the extracting of the lip motion video segment in the video and the voice segment in the audio comprises:
carrying out endpoint detection and segmentation on the video in a lip movement human voice interval detection mode to obtain a lip movement video segment and a human voice interval of the lip movement video segment; and/or,
and carrying out endpoint detection and segmentation on the audio in a voice interval detection mode to obtain the voice segment and the voice interval of the voice segment.
4. The method according to any one of claims 1 to 3, wherein said selecting said lip motion video segment matching said speech segment comprises:
determining the overlapping length of the voice segment and the lip movement video segment in the time dimension according to the voice interval of the voice segment and the voice interval of the lip movement video segment;
when the overlapping length of the voice segment and the lip motion video segment in the time dimension is larger than or equal to a preset time threshold value, the voice segment and the lip motion video segment are matched.
5. The method according to any one of claims 1 to 4, further comprising: and selecting the lip motion video segment with the lip language recognition confidence coefficient smaller than a preset second confidence coefficient threshold value from the lip motion video segments matched with the voice segments, wherein the lip language recognition confidence coefficient is obtained by performing lip language recognition on the lip motion video segment according to a pre-obtained lip language recognition model.
6. A model training method is applied to electronic equipment and comprises the following steps: updating parameters of a lip language identification model by using the lip language identification sample obtained by the sample labeling method of any one of claims 1 to 5.
7. The method of claim 6,
the lip language identification model comprises a general feature layer and a trainable layer, and the parameters of the lip language identification model comprise trainable layer parameters and general feature layer parameters;
the parameters for updating the lip language identification model specifically include: updating the trainable layer parameters of the lip language recognition model.
8. The method of claim 6 or 7, further comprising:
storing preset information of a user in a registered information database;
and storing the trainable layer parameters related to the preset information in a lip language model library.
9. The method according to any one of claims 6 to 8,
before updating the parameters of the lip language identification model, the method further comprises the following steps: adjusting the parameter update rate of the lip language identification model by comparing the lip language identification text of the lip language identification sample with the label of the lip language identification sample to obtain the parameter update rate corresponding to the lip language identification sample; the lip language identification text is obtained by performing lip language identification on the lip language identification sample through the lip language identification model;
the updating of the parameters of the lip language identification model specifically includes: and updating the parameters of the lip language identification model by using the lip language identification samples and the parameter updating rate corresponding to the lip language identification samples.
10. A lip language identification method is applied to electronic equipment and comprises the following steps:
when a user speaks towards the electronic equipment is detected, acquiring a video of the user;
extracting lip movement video clips in the video;
operating the lip language recognition model based on the parameters of the lip language recognition model obtained by the model training method according to any one of claims 6 to 9 to perform lip language recognition on the lip movement video segment to obtain a lip language recognition text.
11. The method of claim 10,
the lip language identification model comprises a general feature layer and a trainable layer, and the parameters of the lip language identification model comprise trainable layer parameters and general feature layer parameters;
the operating the lip language recognition model based on the parameters of the lip language recognition model obtained by the model training method according to any one of claims 6 to 9 to perform lip language recognition on the lip movement video clip specifically includes:
acquiring preset information of a user;
acquiring trainable layer parameters associated with the preset information;
and loading the trainable layer parameters and the pre-configured general feature layer parameters to operate the lip language identification model to carry out lip language identification on the lip motion video clip.
12. The method of claim 11,
the preset information comprises a face ID;
the acquiring of the preset information of the user specifically includes: and carrying out face recognition on the video to obtain face feature data of the user, and inquiring a face ID corresponding to the face feature data from a registered face database.
13. A lip language recognition device is applied to an electronic device, and comprises:
the video acquisition unit is configured to acquire a video of a user in the process that the user speaks towards the electronic equipment;
the audio acquisition unit is configured to acquire the audio of a user in the process that the user speaks towards the electronic equipment;
a lip motion extraction unit configured to extract a lip motion video segment in the video;
a voice extraction unit configured to extract a voice segment in the audio;
a selection unit configured to select the lip movement video segment matching the voice segment;
and the marking unit is configured to mark the lip movement video clip by taking the voice recognition text of the voice clip as a label to obtain a lip language recognition sample of the user.
14. The apparatus of claim 13,
the lip language recognition device further comprises one or more of the following items:
a noise detection unit configured to detect a volume of the environmental noise;
the awakening voice confidence coefficient acquisition unit is configured to acquire the awakening voice confidence coefficient of the user;
the human face detection unit is configured to detect whether a human face or a human mouth is included in the visual field range of the camera;
the positioning unit is configured to acquire the position of a speaker in the video and the sound source positioning direction of the audio;
the video acquisition unit is specifically configured to: when the volume of the environmental noise is equal to or smaller than a preset noise threshold, the confidence coefficient of the awakening voice is larger than or equal to a preset first confidence coefficient threshold, the visual field range of the camera comprises a human face or a human mouth, and/or the position of a speaker of the video is matched with the sound source positioning direction of the audio, the video of a user is obtained; and/or,
the audio acquisition unit is specifically configured to: when the volume of the environmental noise is equal to or smaller than a preset noise threshold, the confidence coefficient of the awakening voice is larger than or equal to a preset first confidence coefficient threshold, the visual field range of the camera comprises a human face or a human mouth, and/or the position of the speaker of the video is matched with the sound source positioning direction of the audio, acquiring the audio of the user.
15. The apparatus of claim 13 or 14,
the lip movement extraction unit is specifically configured to: carrying out endpoint detection and segmentation on the video in a lip movement human voice interval detection mode to obtain a lip movement video segment and a human voice interval of the lip movement video segment;
and/or the voice extraction unit is specifically configured to perform endpoint detection and segmentation on the audio in a voice interval detection mode to obtain the voice segment and the voice interval of the voice segment.
16. The apparatus according to any one of claims 13 to 15, wherein the selection unit is specifically configured to:
determine the overlap length of the voice segment and the lip motion video segment in the time dimension according to the voice interval of the voice segment and the voice interval of the lip motion video segment; and
determine that the voice segment and the lip motion video segment match when their overlap length in the time dimension is greater than or equal to a preset time threshold.
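The overlap test in claim 16 reduces to simple interval arithmetic; a sketch follows, with the one-second time threshold as an illustrative stand-in for the preset time threshold.

```python
def segments_match(voice_interval, lip_interval, time_threshold=1.0):
    """voice_interval and lip_interval are (start, end) times in seconds; they match
    when their overlap in the time dimension is at least time_threshold seconds."""
    overlap = min(voice_interval[1], lip_interval[1]) - max(voice_interval[0], lip_interval[0])
    return overlap >= time_threshold
```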
17. The apparatus according to any one of claims 13 to 16, wherein the selection unit is further configured to: select, from the lip motion video segments matched with the voice segments, a lip motion video segment whose lip language recognition confidence is lower than a preset second confidence threshold, wherein the lip language recognition confidence is obtained by performing lip language recognition on the lip motion video segment with a pre-obtained lip language recognition model.
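A sketch of the confidence-based selection in claim 17; the `lip_model.predict` interface returning a (text, confidence) pair is an assumed stand-in for whatever pre-obtained lip language recognition model is used.

```python
def select_hard_samples(matched_segments, lip_model, second_conf_threshold=0.9):
    """Keep only matched lip motion segments on which the current model is not yet
    confident, so that labeling and retraining focus on hard cases."""
    hard = []
    for segment in matched_segments:
        _, confidence = lip_model.predict(segment)   # assumed (text, confidence) interface
        if confidence < second_conf_threshold:
            hard.append(segment)
    return hard
```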
18. The apparatus of any one of claims 13 to 17, further comprising: a parameter updating unit configured to update the parameters of the lip language recognition model by using the lip language recognition samples obtained by the labeling unit.
19. The apparatus of claim 18,
the lip language recognition model comprises a general feature layer and a trainable layer, and the parameters of the lip language recognition model comprise trainable layer parameters and general feature layer parameters;
the parameter updating unit is specifically configured to: update the trainable layer parameters of the lip language recognition model.
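A sketch of the general-feature-layer / trainable-layer split in claim 19, written with PyTorch; the layer shapes and the choice of a single linear trainable head are illustrative, since the claims do not fix an architecture. Only the trainable head is handed to the optimizer, so the general feature layer parameters stay fixed.

```python
import torch
import torch.nn as nn

class LipReadingModel(nn.Module):
    """Illustrative split: a shared general feature layer plus a per-user trainable head."""
    def __init__(self, feat_dim=512, vocab_size=4000):
        super().__init__()
        self.general_features = nn.Sequential(                 # pre-configured, shared across users
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.trainable_head = nn.Linear(feat_dim, vocab_size)  # updated per user

    def forward(self, clips):                                  # clips: (N, 3, T, H, W)
        return self.trainable_head(self.general_features(clips))

model = LipReadingModel()
for p in model.general_features.parameters():
    p.requires_grad = False                                    # general feature layer stays fixed
optimizer = torch.optim.Adam(model.trainable_head.parameters(), lr=1e-4)
```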
20. The apparatus of claim 18 or 19, further comprising:
a storage unit configured to store preset information of the user in a registered information database, and to store the trainable layer parameters associated with the preset information in a lip language model library.
21. The apparatus according to any one of claims 18 to 20, wherein the parameter updating unit is specifically configured to:
adjust the parameter update rate of the lip language recognition model by comparing the lip language recognition text of the lip language recognition sample with the label of the lip language recognition sample, to obtain the parameter update rate corresponding to the lip language recognition sample; and
update the parameters of the lip language recognition model by using the lip language recognition samples and the parameter update rates corresponding to the lip language recognition samples;
wherein the lip language recognition text is obtained by performing lip language recognition on the lip language recognition sample with the lip language recognition model.
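A sketch of one way to derive the per-sample parameter update rate in claim 21 by comparing the model's own recognition text with the pseudo label; the character-similarity measure and the rule that a larger discrepancy yields a larger update are assumptions, since the claim specifies the comparison but not the adjustment rule.

```python
import difflib

def update_rate_for_sample(recognition_text, label_text, base_rate=1e-4):
    """Compare the model's lip language recognition text with the pseudo label and
    map the discrepancy to a per-sample update rate (larger error -> larger update)."""
    similarity = difflib.SequenceMatcher(None, recognition_text, label_text).ratio()
    error = 1.0 - similarity          # 0.0 when the recognition text equals the label
    return base_rate * (0.1 + 0.9 * error)
```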
22. The apparatus of any one of claims 13 to 21,
the video acquisition unit is further configured to acquire a video of the user when the user speaks towards the electronic device;
the lip language recognition device further comprises: a lip language recognition unit configured to run the lip language recognition model according to the parameters updated by the parameter updating unit, so as to perform lip language recognition on the lip motion video segment and obtain a lip language recognition text.
23. The apparatus of claim 22,
the lip language recognition model comprises a general feature layer and a trainable layer, and the parameters of the lip language recognition model comprise trainable layer parameters and general feature layer parameters;
the lip language recognition device further comprises: a preset information acquisition unit configured to acquire preset information of the user;
the lip language recognition unit is specifically configured to: acquire the trainable layer parameters associated with the preset information, and load the trainable layer parameters and the pre-configured general feature layer parameters so as to run the lip language recognition model and perform lip language recognition on the lip motion video segment.
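A sketch of loading the personalized model in claim 23, assuming the pre-configured general feature layer parameters and each user's trainable layer parameters are stored as separate PyTorch state dicts; the file layout and naming are illustrative.

```python
import torch

def load_personalized_model(model, general_params_path, face_id, model_library_dir):
    """Load the shared general feature layer parameters, then overlay the trainable
    layer parameters stored in the lip language model library for this face ID."""
    model.load_state_dict(torch.load(general_params_path), strict=False)        # general feature layer
    user_params = torch.load(f"{model_library_dir}/{face_id}_trainable.pt")     # per-user trainable layer
    model.load_state_dict(user_params, strict=False)
    model.eval()
    return model
```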
24. The apparatus of claim 23,
the preset information comprises a face ID;
the preset information acquisition unit is specifically configured to: perform face recognition on image frames in the video to obtain face feature data of the user, and query a registered face database for the face ID corresponding to the face feature data.
25. An electronic device, comprising:
a processor; and
a memory storing a computer program that, when executed by the processor, causes the processor to perform the sample labeling method of any one of claims 1 to 5, the model training method of any one of claims 6 to 9, and/or the lip language identification method of any one of claims 10 to 12.
26. A computer-readable storage medium having stored thereon program instructions, which, when executed by a computer, cause the computer to perform the sample labeling method of any of claims 1 to 5, the model training method of any of claims 6 to 9, and/or the lip language identification method of any of claims 10 to 12.
CN202210573455.7A 2021-06-09 2022-05-24 Lip language identification method, sample labeling method, model training method, device, equipment and storage medium Pending CN114822543A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021106433783 2021-06-09
CN202110643378 2021-06-09

Publications (1)

Publication Number Publication Date
CN114822543A (en)

Family

ID=82517083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210573455.7A Pending CN114822543A (en) 2021-06-09 2022-05-24 Lip language identification method, sample labeling method, model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114822543A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115881126A (en) * 2023-02-22 2023-03-31 广东浩博特科技股份有限公司 Switch control method and device based on voice recognition and switch equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination