CN111914803B - Lip language keyword detection method, device, equipment and storage medium - Google Patents

Lip language keyword detection method, device, equipment and storage medium

Info

Publication number
CN111914803B
CN111914803B
Authority
CN
China
Prior art keywords
lip
speaking
video
similarity matrix
posterior probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010827853.8A
Other languages
Chinese (zh)
Other versions
CN111914803A (en)
Inventor
杜吉祥
陈雪娟
张洪博
翟传敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN202010827853.8A priority Critical patent/CN111914803B/en
Publication of CN111914803A publication Critical patent/CN111914803A/en
Application granted granted Critical
Publication of CN111914803B publication Critical patent/CN111914803B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a lip language keyword detection method, device, equipment and storage medium. The method comprises the following steps: training by a DNN method to obtain a two-class DNN model; judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating the speaking segments and non-speaking segments of the lip video; extracting the speaking segments, and extracting features of the query sample and of each frame of lip picture in the speaking segments through a lip language recognition model as posterior probability features; constructing a similarity matrix graph based on the posterior probability features; and performing binary classification on the similarity matrix graph through a convolutional neural network classification model to judge whether the keyword exists in the lip language video. Through endpoint detection, feature extraction by the lip language recognizer and construction of the similarity matrix graph, the invention reduces the influence of non-speaking segments on lip language keyword detection and improves keyword detection performance.

Description

Lip language keyword detection method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a lip language keyword detection method, device, equipment and storage medium.
Background
In recent years, economic construction has developed rapidly, information technology has advanced steadily, network speeds keep increasing, storage costs keep falling, and surveillance cameras are deployed everywhere. For such a large number of surveillance cameras, most systems, limited by cost or technology, cannot capture the audio of a speaker, so the spoken content cannot be recognized from speech; lip language recognition, however, can recognize the content from the mouth shape alone, and can therefore play a great role in the security field. At the level of surveillance-video utilization, complete lip language recognition of the entire spoken content is not needed; only a few keywords need to be recognized and detected, so lip language keyword detection can play an important role in security. However, lip recognition technology still faces many difficulties in practical application, and accurately reading lips in surveillance videos remains hard.
At present, research on lip keyword detection is limited. The data sets used for lip keyword detection contain some non-speaking segments, and if these segments are relatively long they degrade keyword detection. Keyword detection, however, is well developed in the field of speech recognition, where keyword detection methods fall mainly into three types: methods based on a filler model, example-based methods, and methods based on a large-vocabulary continuous speech recognition system. In example-based speech keyword detection, the input query examples are a small number of speech segments containing the keyword; their similarity to the test speech segments is computed, and if the similarity exceeds a certain threshold the test audio is considered to contain the keyword. A commonly used class of methods is based on dynamic time warping (DTW), which uses the DTW algorithm to compute the similarity between two sequences of audio features. Early work often used acoustic features as the audio features, but these are susceptible to external factors such as the environment, the channel and the speaker. Posterior probability features were later introduced, reducing the influence of the speaker and the environment on the keyword detection system. To compute posterior probability features, the keyword audio and test audio are typically converted into fixed-length embedding vectors by building a phoneme decoder. Early on, artificial neural networks were used; later, with the development of deep learning, phoneme recognizers are typically built using deep neural networks, LSTMs, etc.
Disclosure of Invention
The invention aims to provide a lip language keyword detection method, device, equipment and storage medium to solve the above problems.
The embodiment of the invention provides a lip language keyword detection method, which comprises the following steps:
training by a DNN method to obtain a classified DNN model;
judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking fragments and non-speaking fragments of the lip video;
extracting the speaking segment, and extracting the characteristics of the query sample and each frame of lip picture of the speaking segment through a lip language identification model to serve as posterior probability characteristics;
constructing a similarity matrix graph based on the posterior probability characteristics;
and performing binary classification on the similarity matrix graph through a convolutional neural network classification model, and judging whether the keyword exists in the lip language video.
Further, the whole lip language video is divided into 8 states:
unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state; wherein:
a segment of non-speaking frames is padded before the speaking start state and after the speaking end state to prevent misjudgment during the judging process.
Further, the lip language recognition model is specifically used for:
extracting visual features of the lip picture sequence through three-dimensional convolution and a two-dimensional DenseNet;
decoding the visual features by resBi-LSTM;
training the decoded visual features by a CTC loss function.
Further, constructing the similarity matrix graph based on the posterior probability features specifically includes:
performing a vector dot product operation between the posterior probability features of the query sample and the posterior probability features of the speaking segment, and taking the logarithm to obtain the similarity matrix graph; the vector dot product and logarithm operation is:
d(q_i, x_j) = log(q_i · x_j)
wherein 1 ≤ i ≤ m and 1 ≤ j ≤ n, m and n respectively represent the number of frames of the query sample and of the lip segment, and a higher value in the resulting distance matrix d indicates a higher similarity between the two vectors.
Further, the method further includes:
performing a normalization calculation on the values of the distance matrix d so that all values in the similarity matrix lie in the interval [-1, 1]; the calculation formula is given as an image in the original publication and is not reproduced here.
further, the convolutional neural network classification model is specifically used for:
constructed from 6 convolutional layers, 2 max-pooling layers, an adaptive average pooling layer and a fully connected layer;
trained by a negative log likelihood (Negative Log Likelihood, NLL) loss function.
Further, the negative log likelihood loss function computes the loss on the value obtained by taking the logarithm of the softmax output probability; the formula is as follows:
loss = -(1/N) · Σ_{n=1}^{N} Σ_i y_i q_i
wherein N represents the number of data samples, y_i is the one-hot code corresponding to the true label and indicates that the label is the i-th class, and q_i is the log of the softmax output.
The embodiment of the invention also provides a lip language keyword detection device, which comprises:
the training module is used for training by a DNN method to obtain a classified DNN model;
the separation module is used for judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking fragments and non-speaking fragments of the lip video;
the extraction module is used for extracting the speaking segment, and extracting the characteristic of each frame of lip picture of the query sample and the speaking segment through the lip recognition model to serve as a posterior probability characteristic;
the construction module is used for constructing a similarity matrix graph based on the posterior probability characteristics;
and the classification module is used for performing binary classification on the similarity matrix graph through a convolutional neural network classification model and judging whether the keyword exists in the lip language video.
The embodiment of the invention also provides lip keyword detection equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor is used for running the computer program to realize the lip keyword detection method.
The embodiment of the invention also provides a storage medium which is used for storing at least one program and at least one instruction, and the at least one program and the instruction are executed to realize the lip keyword detection method.
The embodiment of the invention has the following beneficial technical effects:
Performing lip language activity endpoint detection through DNN and analyzing the state of the lip language video removes the non-speaking parts of the lip language video and improves keyword detection performance; using the lip language recognizer to extract posterior probability features from the lip language video lets these features better express its semantic characteristics; a similarity matrix graph is constructed from the posterior probability features, and a CNN classifier then performs binary classification on the similarity graph to judge whether the keyword exists in the lip language video, which, compared with other DTW-based methods, improves keyword detection performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; other related drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a method for detecting a lip keyword according to a first embodiment of the present invention.
Fig. 2 is a schematic flow chart of another method for detecting a lip keyword according to the first embodiment of the present invention.
Fig. 3 is a schematic flow chart of a lip language recognition model according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a CNN classifier according to an embodiment of the present invention.
Fig. 5 is a schematic flow chart of a lip keyword detection apparatus according to a second embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Details of embodiments of the present invention are described below.
As shown in fig. 1-2, a first embodiment of the present invention provides a method for detecting a lip language keyword, including the steps of:
s11, training by a DNN method to obtain a classified DNN model;
In this embodiment, the network is trained on the data set with non-speaking frames and speaking frames given two different labels, and a two-class DNN model is obtained after training.
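As an illustration, a minimal sketch of such a frame-level two-class DNN is given below (PyTorch). The input dimension, hidden sizes and training setup are assumptions for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn

class SpeakingFrameDNN(nn.Module):
    """Two-class DNN: classifies one lip-frame feature vector as
    speaking (1) or non-speaking (0). Layer sizes are illustrative."""
    def __init__(self, in_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),            # logits for {non-speaking, speaking}
        )

    def forward(self, x):
        return self.net(x)

# Example training step on one mini-batch of labelled frames.
model = SpeakingFrameDNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

frames = torch.randn(32, 512)            # 32 frame feature vectors (placeholder data)
labels = torch.randint(0, 2, (32,))      # 0 = non-speaking, 1 = speaking
loss = criterion(model(frames), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```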
S12, judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking fragments and non-speaking fragments of the lip video;
In this embodiment, for the whole lip language video, the video state is judged according to whether each frame is a speaking frame, and the whole judging process is divided into 8 states:
unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state; wherein:
the processing rules for the video are: the lip language video starts in the unknown state; if the number of non-speaking frames in the window is greater than the number of speaking frames, the video is considered to be in the non-speaking start state; if 10 consecutive frames are judged to be speaking frames, the non-speaking end state and the speaking start state are set, and vice versa. To avoid the overly short lip-speech transition time caused by this rule, a segment of non-speaking frames is padded before the speaking start state and after the speaking end state, preventing misjudgment during the judging process.
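The rule above can be read as a small state machine over per-frame speaking/non-speaking decisions. The sketch below is one possible reading of it; the window size, the 10-frame run threshold and the padding length are assumptions built around the description, not a definitive implementation.

```python
def detect_speaking_segments(frame_is_speaking, window=10, run_len=10, pad=5):
    """Split a per-frame boolean sequence (True = speaking) into speaking
    segments [start, end). Illustrative reading of the rules in the text:
    start in an unknown state and decide via a majority window; 10 consecutive
    speaking frames trigger speaking-start (and vice versa for speaking-end);
    each segment is padded with `pad` extra frames on both sides."""
    segments = []
    n = len(frame_is_speaking)
    state, start, run = "unknown", 0, 0
    for i, speaking in enumerate(frame_is_speaking):
        if state == "unknown":
            win = frame_is_speaking[i:i + window]
            state = "speaking" if sum(win) > len(win) / 2 else "non_speaking"
            if state == "speaking":
                start = i
            continue
        if state == "non_speaking":
            run = run + 1 if speaking else 0
            if run >= run_len:                       # speaking-start detected
                start, state, run = i - run_len + 1, "speaking", 0
        else:  # state == "speaking"
            run = run + 1 if not speaking else 0
            if run >= run_len:                       # speaking-end detected
                end = i - run_len + 1
                segments.append((max(0, start - pad), min(n, end + pad)))
                state, run = "non_speaking", 0
    if state == "speaking":
        segments.append((max(0, start - pad), n))
    return segments
```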
S13, extracting the speaking segment, and extracting the characteristic of each frame of lip picture of the query sample and the speaking segment through a lip recognition model to serve as a posterior probability characteristic;
In this embodiment, as shown in fig. 3, the lip language recognition model uses three-dimensional convolution and a two-dimensional DenseNet to extract visual features of the lip picture sequence, then uses a resBi-LSTM to decode the visual features, and is trained with a CTC loss function. Using this model, posterior probability features of each frame of lip picture are calculated for the lip query sample and for the lip segment from which the non-speaking segments have been removed.
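A compact sketch of a front-end in this spirit is shown below (PyTorch): a 3D convolution over the frame sequence, a 2D DenseNet applied per frame, a bidirectional LSTM with a residual connection standing in for the resBi-LSTM, and CTC training. All layer sizes, the vocabulary size and the use of torchvision's DenseNet-121 are assumptions for illustration, not the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torchvision

class LipRecognizer(nn.Module):
    """3D conv -> per-frame 2D DenseNet -> residual Bi-LSTM -> per-frame
    posterior probabilities, trained with CTC. Sizes are illustrative."""
    def __init__(self, num_classes=40, hidden=256):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(),
        )
        densenet = torchvision.models.densenet121()
        # First DenseNet conv adapted to take the 64-channel 3D-conv output.
        densenet.features.conv0 = nn.Conv2d(64, 64, kernel_size=7, stride=2,
                                            padding=3, bias=False)
        self.frontend2d = densenet.features            # per-frame feature maps (1024 ch)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.bilstm = nn.LSTM(1024, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.proj = nn.Linear(1024, 2 * hidden)        # residual path matching Bi-LSTM width
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                              # x: (B, 1, T, H, W) grayscale lip frames
        x = self.conv3d(x)                             # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch
        x = self.pool(self.frontend2d(x)).flatten(1)   # (B*T, 1024)
        x = x.view(b, t, -1)
        out, _ = self.bilstm(x)
        out = out + self.proj(x)                       # residual connection ("resBi-LSTM"-style)
        return self.fc(out).log_softmax(-1)            # per-frame posteriors, (B, T, num_classes)

# CTC training step on dummy data.
model = LipRecognizer()
ctc = nn.CTCLoss(blank=0)
video = torch.randn(2, 1, 30, 112, 112)               # 2 clips of 30 lip frames
logp = model(video).transpose(0, 1)                    # CTC expects (T, B, C)
targets = torch.randint(1, 40, (2, 8))
in_lens = torch.full((2,), logp.size(0), dtype=torch.long)
tgt_lens = torch.full((2,), 8, dtype=torch.long)
loss = ctc(logp, targets, in_lens, tgt_lens)
loss.backward()
```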
S14, constructing a similarity matrix diagram based on the posterior probability characteristics;
In this embodiment, let the posterior probability feature of the lip query sample be Q = [q_1, q_2, ..., q_m] and the posterior probability feature of the lip segment be X = [x_1, x_2, ..., x_n], where m and n represent the number of frames of the query sample and of the lip segment, respectively. For any two posterior probability feature vectors q_i and x_j from the query sample and the lip segment, the distance between the two vectors is obtained by taking the logarithm of their dot product, with the following formula:
d(q_i, x_j) = log(q_i · x_j)
where 1 ≤ i ≤ m and 1 ≤ j ≤ n, and a higher value of d(q_i, x_j) indicates a higher similarity between the two vectors. To better handle the variation of similarity scores between different lip query examples and lip segments, the values in the distance matrix d are further normalized so that all values in the similarity matrix lie in the interval [-1, 1]; the normalization formula is given as an image in the original publication and is not reproduced here.
the images corresponding to the calculated similarity matrix can be divided into two types, wherein one type refers to a positive sample, namely a query sample appears in the lip test fragment; the other is a negative sample, i.e., the query sample does not appear in the lip test segment.
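As a concrete illustration, the sketch below builds such a similarity matrix with NumPy. The dot-product-then-log distance follows the formula above; the mapping to [-1, 1] is written here as simple min-max scaling, which is an assumption, since the patent's exact normalization formula is given only as an image.

```python
import numpy as np

def similarity_matrix(query_post, segment_post, eps=1e-8):
    """query_post: (m, C) per-frame posterior probabilities of the query sample.
    segment_post: (n, C) per-frame posterior probabilities of the speaking segment.
    Returns an (m, n) similarity matrix with values in [-1, 1]."""
    # d(q_i, x_j) = log(q_i . x_j); eps avoids log(0)
    d = np.log(query_post @ segment_post.T + eps)
    # Assumed normalization: min-max scaling of the whole matrix into [-1, 1].
    d_min, d_max = d.min(), d.max()
    return 2.0 * (d - d_min) / (d_max - d_min + eps) - 1.0

# Example with random posteriors over 40 classes.
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(40), size=25)    # 25-frame query sample
x = rng.dirichlet(np.ones(40), size=120)   # 120-frame speaking segment
sim = similarity_matrix(q, x)              # image fed to the CNN classifier in S15
```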
S15, performing binary classification on the similarity matrix graph through a convolutional neural network classification model, and judging whether the keyword exists in the lip language video.
In this embodiment, as shown in FIG. 4, a simple convolutional neural network is constructed, consisting of 6 convolutional layers, 2 max-pooling layers, an adaptive average pooling layer and a fully connected layer. During training, the network model is trained with a negative log likelihood (Negative Log Likelihood, NLL) loss function. The negative log likelihood loss computes the loss on the value obtained by taking the logarithm of the softmax output probability, with labels encoded as one-hot. Assuming N data samples, y_i is the one-hot code corresponding to the true label and indicates that the label is the i-th class, and q_i is the log of the softmax output; the mathematical formula of the negative log likelihood loss function is as follows:
loss = -(1/N) · Σ_{n=1}^{N} Σ_i y_i q_i
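A minimal sketch of such a classifier is given below (PyTorch). The channel counts, kernel sizes and the placement of the two max-pooling layers are assumptions; only the overall structure (6 convolutional layers, 2 max-pooling layers, adaptive average pooling, one fully connected layer, NLL loss on log-softmax outputs) follows the description.

```python
import torch
import torch.nn as nn

class SimilarityMapClassifier(nn.Module):
    """Binary classifier over a 1-channel similarity matrix image:
    6 conv layers, 2 max-pool layers, adaptive average pooling, one FC layer."""
    def __init__(self):
        super().__init__()
        def block(cin, cout):
            return [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU()]
        self.features = nn.Sequential(
            *block(1, 32), *block(32, 32), nn.MaxPool2d(2),     # conv 1-2 + pool 1
            *block(32, 64), *block(64, 64), nn.MaxPool2d(2),    # conv 3-4 + pool 2
            *block(64, 128), *block(128, 128),                  # conv 5-6
            nn.AdaptiveAvgPool2d(1),                            # handles variable m x n maps
        )
        self.fc = nn.Linear(128, 2)        # {keyword absent, keyword present}

    def forward(self, x):                  # x: (B, 1, m, n) similarity matrices
        x = self.features(x).flatten(1)
        return torch.log_softmax(self.fc(x), dim=-1)   # log-probabilities for NLLLoss

# Training step with the negative log likelihood loss.
model = SimilarityMapClassifier()
criterion = nn.NLLLoss()                   # expects log-probabilities
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

maps = torch.randn(8, 1, 25, 120)          # batch of similarity matrices (placeholder data)
labels = torch.randint(0, 2, (8,))         # 1 = query keyword present in the segment
loss = criterion(model(maps), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The adaptive average pooling layer is what lets one network handle similarity matrices of different sizes, since the query sample and the speaking segments vary in length.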
In this embodiment, by analyzing the state of the lip language video and removing its non-speaking parts, keyword detection performance is improved; the lip language recognizer is used to extract posterior probability features from the lip language video, and these features better express its semantic characteristics; a similarity matrix graph is constructed from the posterior probability features, and a CNN classifier then performs binary classification on the similarity graph to judge whether the keyword exists in the lip language video, which, compared with other DTW-based methods, improves keyword detection performance.
A second embodiment of the present invention provides a lip keyword detection apparatus, as shown in fig. 5, including:
the training module 110 is configured to train by a DNN method to obtain a classified DNN model;
In this embodiment, the network is trained on the data set with non-speaking frames and speaking frames given two different labels, and a two-class DNN model is obtained after training.
The separation module 120 is configured to determine a speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separate a speaking segment and a non-speaking segment of the lip video;
In this embodiment, for the whole lip language video, the video state is judged according to whether each frame is a speaking frame, and the whole judging process is divided into 8 states:
unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state; wherein:
the processing rules for the video are: the lip language video starts in the unknown state; if the number of non-speaking frames in the window is greater than the number of speaking frames, the video is considered to be in the non-speaking start state; if 10 consecutive frames are judged to be speaking frames, the non-speaking end state and the speaking start state are set, and vice versa. To avoid the overly short lip-speech transition time caused by this rule, a segment of non-speaking frames is padded before the speaking start state and after the speaking end state, preventing misjudgment during the judging process.
The extracting module 130 is configured to extract the speaking segment, and extract, through a lip recognition model, a query sample and features of each frame of lip picture of the speaking segment as posterior probability features;
In this embodiment, as shown in fig. 3, the lip language recognition model uses three-dimensional convolution and a two-dimensional DenseNet to extract visual features of the lip picture sequence, then uses a resBi-LSTM to decode the visual features, and is trained with a CTC loss function. Using this model, posterior probability features of each frame of lip picture are calculated for the lip query sample and for the lip segment from which the non-speaking segments have been removed.
A construction module 140, configured to construct a similarity matrix map based on the posterior probability features;
In this embodiment, let the posterior probability feature of the lip query sample be Q = [q_1, q_2, ..., q_m] and the posterior probability feature of the lip segment be X = [x_1, x_2, ..., x_n], where m and n represent the number of frames of the query sample and of the lip segment, respectively. For any two posterior probability feature vectors q_i and x_j from the query sample and the lip segment, the distance between the two vectors is obtained by taking the logarithm of their dot product, with the following formula:
d(q_i, x_j) = log(q_i · x_j)
where 1 ≤ i ≤ m and 1 ≤ j ≤ n, and a higher value of d(q_i, x_j) indicates a higher similarity between the two vectors. To better handle the variation of similarity scores between different lip query examples and lip segments, the values in the distance matrix d are further normalized so that all values in the similarity matrix lie in the interval [-1, 1]; the normalization formula is given as an image in the original publication and is not reproduced here.
the images corresponding to the calculated similarity matrix can be divided into two types, wherein one type refers to a positive sample, namely a query sample appears in the lip test fragment; the other is a negative sample, i.e., the query sample does not appear in the lip test segment.
The classification module 150 is used for performing binary classification on the similarity matrix graph through a convolutional neural network classification model and judging whether the keyword exists in the lip language video.
In this embodiment, as shown in FIG. 4, a simple convolutional neural network is constructed, consisting of 6 convolutional layers, 2 max-pooling layers, an adaptive average pooling layer and a fully connected layer. During training, the network model is trained with a negative log likelihood (Negative Log Likelihood, NLL) loss function. The negative log likelihood loss computes the loss on the value obtained by taking the logarithm of the softmax output probability, with labels encoded as one-hot. Assuming N data samples, y_i is the one-hot code corresponding to the true label and indicates that the label is the i-th class, and q_i is the log of the softmax output; the mathematical formula of the negative log likelihood loss function is as follows:
loss = -(1/N) · Σ_{n=1}^{N} Σ_i y_i q_i
In this embodiment, by analyzing the state of the lip language video and removing its non-speaking parts, keyword detection performance is improved; the lip language recognizer is used to extract posterior probability features from the lip language video, and these features better express its semantic characteristics; a similarity matrix graph is constructed from the posterior probability features, and a CNN classifier then performs binary classification on the similarity graph to judge whether the keyword exists in the lip language video, which, compared with other DTW-based methods, improves keyword detection performance.
The third embodiment of the invention provides a lip keyword detection device, which comprises a memory and a processor, wherein a computer program is stored in the memory, and the processor is used for running the computer program to realize the lip keyword detection method.
The fourth embodiment of the present invention further provides a storage medium storing a computer program, the computer program being executable by a processor of the device in which the storage medium is located to implement the lip language keyword detection method described above.
In the several embodiments provided by the embodiments of the present invention, it should be understood that the provided apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A lip language keyword detection method, characterized by comprising the following steps:
training by a DNN method to obtain a classified DNN model;
judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking segments and non-speaking segments of the lip video; the whole lip language video is divided into 8 states: unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state;
extracting the speaking segment, and extracting features of the query sample and of each frame of lip picture of the speaking segment through a lip language recognition model as posterior probability features; the lip language recognition model is specifically used for: extracting visual features of the lip picture sequence through three-dimensional convolution and a two-dimensional DenseNet; decoding the visual features by resBi-LSTM; training the decoded visual features through a CTC loss function;
constructing a similarity matrix graph based on the posterior probability features; specifically: performing a vector dot product operation between the posterior probability feature Q = [q_1, q_2, ..., q_m] of the query sample and the posterior probability feature X = [x_1, x_2, ..., x_n] of the speaking segment, and taking the logarithm to obtain the similarity matrix graph; the vector dot product and logarithm operation is:
d(q_i, x_j) = log(q_i · x_j)
wherein 1 ≤ i ≤ m and 1 ≤ j ≤ n, m and n respectively represent the number of frames of the query sample and of the lip segment, and a higher value in the resulting similarity matrix d indicates a higher similarity between the two vectors;
and performing binary classification on the similarity matrix graph through a convolutional neural network classification model, and judging whether the keyword exists in the lip language video.
2. The method for detecting a lip-language keyword according to claim 1, further comprising:
and supplementing a section of non-speaking frame before the speaking start state and after the speaking end state so as to prevent misjudgment in the judging process.
3. The method for detecting a lip-language keyword according to claim 1, further comprising:
for the similarity matrix
Figure QLYQS_9
Normalized calculation is performed on the values of (2) so that all values in the similarity matrix are at +.>
Figure QLYQS_10
In the interval, the calculation process is as follows:
Figure QLYQS_11
4. the method for detecting a lip language keyword according to claim 1, wherein the convolutional neural network classification model is specifically configured to:
constructed from 6 convolutional layers, 2 max-pooling layers, an adaptive average pooling layer and a fully connected layer;
trained by a negative log likelihood loss function.
5. The method for detecting a lip keyword according to claim 4, wherein the negative log likelihood loss function performs a loss function calculation on a value obtained by taking a logarithm of an output probability of softmax, and the formula is as follows:
loss = -(1/N) · Σ_{n=1}^{N} Σ_i y_i q_i
wherein N represents the number of data samples, y_i is the one-hot code corresponding to the true label and indicates that the label is the i-th class, and q_i is the log of the softmax output.
6. A lip language keyword detection device, comprising:
The training module is used for training by a DNN method to obtain a classified DNN model;
the separation module is used for judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking segments and non-speaking segments of the lip video; the whole lip language video is divided into 8 states: unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state;
the extraction module is used for extracting the speaking segment, and extracting features of the query sample and of each frame of lip picture of the speaking segment through the lip language recognition model as posterior probability features; the lip language recognition model is specifically used for: extracting visual features of the lip picture sequence through three-dimensional convolution and a two-dimensional DenseNet; decoding the visual features by resBi-LSTM; training the decoded visual features through a CTC loss function;
the construction module is used for constructing a similarity matrix graph based on the posterior probability features; specifically, performing a vector dot product operation between the posterior probability feature Q = [q_1, q_2, ..., q_m] of the query sample and the posterior probability feature X = [x_1, x_2, ..., x_n] of the speaking segment, and taking the logarithm to obtain the similarity matrix graph; the vector dot product and logarithm operation is:
d(q_i, x_j) = log(q_i · x_j)
wherein 1 ≤ i ≤ m and 1 ≤ j ≤ n, m and n respectively represent the number of frames of the query sample and of the lip segment, and a higher value in the resulting similarity matrix d indicates a higher similarity between the two vectors;
and the classification module is used for performing binary classification on the similarity matrix graph through a convolutional neural network classification model, and judging whether the keyword exists in the lip language video.
7. A lip keyword detection apparatus, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to implement a lip keyword detection method as claimed in any one of claims 1 to 5.
8. A storage medium storing a computer program executable by a processor of a device in which the storage medium is located to implement a method of lip keyword detection as claimed in any one of claims 1 to 5.
CN202010827853.8A 2020-08-17 2020-08-17 Lip language keyword detection method, device, equipment and storage medium Active CN111914803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010827853.8A CN111914803B (en) 2020-08-17 2020-08-17 Lip language keyword detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010827853.8A CN111914803B (en) 2020-08-17 2020-08-17 Lip language keyword detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111914803A CN111914803A (en) 2020-11-10
CN111914803B true CN111914803B (en) 2023-06-13

Family

ID=73279084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010827853.8A Active CN111914803B (en) 2020-08-17 2020-08-17 Lip language keyword detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111914803B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633208A (en) * 2020-12-30 2021-04-09 海信视像科技股份有限公司 Lip language identification method, service equipment and storage medium
CN113065444A (en) * 2021-03-26 2021-07-02 北京大米科技有限公司 Behavior detection method and device, readable storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992813A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip condition detection method and device
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110633683A (en) * 2019-09-19 2019-12-31 华侨大学 Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM
CN110767228A (en) * 2018-07-25 2020-02-07 杭州海康威视数字技术股份有限公司 Sound acquisition method, device, equipment and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107492379B (en) * 2017-06-30 2021-09-21 百度在线网络技术(北京)有限公司 Voiceprint creating and registering method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992813A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip condition detection method and device
CN110767228A (en) * 2018-07-25 2020-02-07 杭州海康威视数字技术股份有限公司 Sound acquisition method, device, equipment and system
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110633683A (en) * 2019-09-19 2019-12-31 华侨大学 Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lipreading with DenseNet and resBi-LSTM; Xuejuan Chen; Signal, Image and Video Processing; 1-9 *

Also Published As

Publication number Publication date
CN111914803A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
Natarajan et al. Multimodal feature fusion for robust event detection in web videos
CN108962227B (en) Voice starting point and end point detection method and device, computer equipment and storage medium
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
CN106297828B (en) Detection method and device for false sounding detection based on deep learning
CN112289323B (en) Voice data processing method and device, computer equipment and storage medium
CN112668559B (en) Multi-mode information fusion short video emotion judgment device and method
US20100057452A1 (en) Speech interfaces
CN111914803B (en) Lip language keyword detection method, device, equipment and storage medium
CN108877812B (en) Voiceprint recognition method and device and storage medium
CN112735385A (en) Voice endpoint detection method and device, computer equipment and storage medium
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN111477219A (en) Keyword distinguishing method and device, electronic equipment and readable storage medium
Ghaemmaghami et al. Complete-linkage clustering for voice activity detection in audio and visual speech
CN114627868A (en) Intention recognition method and device, model and electronic equipment
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN110569908B (en) Speaker counting method and system
KR20220047080A (en) A speaker embedding extraction method and system for automatic speech recognition based pooling method for speaker recognition, and recording medium therefor
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN112634869A (en) Command word recognition method, device and computer storage medium
Chaloupka A prototype of audio-visual broadcast transcription system
CN112035670A (en) Multi-modal rumor detection method based on image emotional tendency
CN113129874B (en) Voice awakening method and system
CN116229943B (en) Conversational data set generation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant