CN111914803B - Lip language keyword detection method, device, equipment and storage medium - Google Patents

Lip language keyword detection method, device, equipment and storage medium

Info

Publication number
CN111914803B
CN111914803B
Authority
CN
China
Prior art keywords
lip
speaking
video
similarity matrix
posterior probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010827853.8A
Other languages
Chinese (zh)
Other versions
CN111914803A (en)
Inventor
杜吉祥
陈雪娟
张洪博
翟传敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN202010827853.8A priority Critical patent/CN111914803B/en
Publication of CN111914803A publication Critical patent/CN111914803A/en
Application granted granted Critical
Publication of CN111914803B publication Critical patent/CN111914803B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a lip language keyword detection method, device, equipment and storage medium. The method comprises the following steps: training by a DNN method to obtain a two-class DNN model; judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating the speaking segments and non-speaking segments of the lip video; extracting the speaking segments, and extracting features of the query sample and of each frame of lip picture in the speaking segments through a lip language recognition model as posterior probability features; constructing a similarity matrix graph based on the posterior probability features; and performing binary classification on the similarity matrix graph through a convolutional neural network classification model to judge whether the keyword exists in the lip language video. Through endpoint detection, feature extraction by the lip language recognizer and construction of the similarity matrix graph, the invention reduces the influence of non-speaking segments on lip language keyword detection and improves keyword detection performance.

Description

Lip language keyword detection method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a lip language keyword detection method, device, equipment and storage medium.
Background
In recent years, economic construction has developed rapidly, information technology has advanced steadily, network speeds keep increasing, storage costs keep falling, and surveillance cameras are deployed everywhere. For such a large number of surveillance cameras, most systems, limited by cost or technology, cannot capture the audio of a speaker, so the spoken content cannot be recognized from speech; lip language recognition, however, can recognize the content from the mouth shape alone, and can therefore play a great role in the security field. At the level of surveillance-video utilization, complete lip language recognition of the entire spoken content is not needed; only a few keywords need to be recognized and detected, so lip language keyword detection can play an important role in security. However, lip recognition technology still faces many difficulties in practical application, and accurately reading lips in surveillance videos remains hard.
At present, research on lip keyword detection is limited. The data sets used for lip keyword detection contain some non-speaking segments, and if these segments are relatively long they degrade keyword detection. Keyword detection, however, is well developed in the field of speech recognition, where keyword detection methods fall mainly into three types: methods based on a filler model, example-based methods, and methods based on a large-vocabulary continuous speech recognition system. In example-based speech keyword detection, the input query examples are a small number of speech segments containing the keyword; their similarity to the test speech segments is computed, and if the similarity exceeds a certain threshold the test audio is considered to contain the keyword. A commonly used class of methods is based on dynamic time warping (DTW), which uses the DTW algorithm to compute the similarity between two sequences of audio features. Early work often used acoustic features as the audio features, but these are susceptible to external factors such as the environment, the channel and the speaker. Posterior probability features were later introduced, reducing the influence of the speaker and the environment on the keyword detection system. To compute posterior probability features, the keyword audio and test audio are typically converted into fixed-length embedding vectors by building a phoneme decoder. Early on, artificial neural networks were used; later, with the development of deep learning, phoneme recognizers are typically built using deep neural networks, LSTMs, etc.
Disclosure of Invention
The invention aims to provide a lip language keyword detection method, device, equipment and storage medium to solve the above problems.
The embodiment of the invention provides a lip language keyword detection method, which comprises the following steps:
training by a DNN method to obtain a classified DNN model;
judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking fragments and non-speaking fragments of the lip video;
extracting the speaking segment, and extracting the characteristics of the query sample and each frame of lip picture of the speaking segment through a lip language identification model to serve as posterior probability characteristics;
constructing a similarity matrix graph based on the posterior probability characteristics;
and performing binary classification on the similarity matrix graph through a convolutional neural network classification model, and judging whether the keyword exists in the lip language video.
Further, the whole lip language video is divided into 8 states:
unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state; wherein:
a segment of non-speaking frames is padded before the speaking start state and after the speaking end state to prevent misjudgment during the judging process.
Further, the lip language recognition model is specifically used for:
extracting visual features of the lip picture sequence through three-dimensional convolution and a two-dimensional DenseNet;
decoding the visual features by resBi-LSTM;
training the decoded visual features by a CTC loss function.
Further, constructing the similarity matrix graph based on the posterior probability features specifically includes:
performing a vector dot product operation between the posterior probability features of the query sample and the posterior probability features of the speaking segment, and taking the logarithm to obtain the similarity matrix graph; the vector dot product and logarithm operation is:
d(q_i, x_j) = log(q_i · x_j)
wherein 1 ≤ i ≤ m and 1 ≤ j ≤ n, m and n respectively represent the number of frames of the query sample and of the lip segment, and a higher value in the resulting distance matrix d indicates a higher similarity between the two vectors.
Further, the method further includes:
performing a normalization calculation on the values of the distance matrix d so that all values in the similarity matrix lie in the interval [-1, 1]; the calculation formula is given as an image in the original publication and is not reproduced here.
further, the convolutional neural network classification model is specifically used for:
constructed from 6 convolutional layers, 2 max-pooling layers, an adaptive average pooling layer and a fully connected layer;
trained by a negative log likelihood (Negative Log Likelihood, NLL) loss function.
Further, the negative log likelihood loss function computes the loss on the value obtained by taking the logarithm of the softmax output probability; the formula is as follows:
loss = -(1/N) · Σ_{n=1}^{N} Σ_i y_i q_i
wherein N represents the number of data samples, y_i is the one-hot code corresponding to the true label and indicates that the label is the i-th class, and q_i is the log of the softmax output.
The embodiment of the invention also provides a lip language keyword detection device, which comprises:
the training module is used for training by a DNN method to obtain a classified DNN model;
the separation module is used for judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking fragments and non-speaking fragments of the lip video;
the extraction module is used for extracting the speaking segment, and extracting the characteristic of each frame of lip picture of the query sample and the speaking segment through the lip recognition model to serve as a posterior probability characteristic;
the construction module is used for constructing a similarity matrix graph based on the posterior probability characteristics;
and the classification module is used for performing binary classification on the similarity matrix graph through a convolutional neural network classification model and judging whether the keyword exists in the lip language video.
The embodiment of the invention also provides lip keyword detection equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor is used for running the computer program to realize the lip keyword detection method.
The embodiment of the invention also provides a storage medium which is used for storing at least one program and at least one instruction, and the at least one program and the instruction are executed to realize the lip keyword detection method.
The embodiment of the invention has the following beneficial technical effects:
Performing lip language activity endpoint detection through DNN and analyzing the state of the lip language video removes the non-speaking parts of the lip language video and improves keyword detection performance; using the lip language recognizer to extract posterior probability features from the lip language video lets these features better express its semantic characteristics; a similarity matrix graph is constructed from the posterior probability features, and a CNN classifier then performs binary classification on the similarity graph to judge whether the keyword exists in the lip language video, which, compared with other DTW-based methods, improves keyword detection performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; other related drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a method for detecting a lip keyword according to a first embodiment of the present invention.
Fig. 2 is a schematic flow chart of another method for detecting a lip keyword according to the first embodiment of the present invention.
Fig. 3 is a schematic flow chart of a lip language recognition model according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a CNN classifier according to an embodiment of the present invention.
Fig. 5 is a schematic flow chart of a lip keyword detection apparatus according to a second embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Details of embodiments of the present invention are described below.
As shown in fig. 1-2, a first embodiment of the present invention provides a method for detecting a lip language keyword, including the steps of:
s11, training by a DNN method to obtain a classified DNN model;
In this embodiment, the network is trained on the data set with non-speaking frames and speaking frames given two different labels, and a two-class DNN model is obtained after training.
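As an illustration, a minimal sketch of such a frame-level two-class DNN is given below (PyTorch). The input dimension, hidden sizes and training setup are assumptions for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn

class SpeakingFrameDNN(nn.Module):
    """Two-class DNN: classifies one lip-frame feature vector as
    speaking (1) or non-speaking (0). Layer sizes are illustrative."""
    def __init__(self, in_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),            # logits for {non-speaking, speaking}
        )

    def forward(self, x):
        return self.net(x)

# Example training step on one mini-batch of labelled frames.
model = SpeakingFrameDNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

frames = torch.randn(32, 512)            # 32 frame feature vectors (placeholder data)
labels = torch.randint(0, 2, (32,))      # 0 = non-speaking, 1 = speaking
loss = criterion(model(frames), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```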
S12, judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking fragments and non-speaking fragments of the lip video;
In this embodiment, for the whole lip language video, the video state is judged according to whether each frame is a speaking frame, and the whole judging process is divided into 8 states:
unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state; wherein:
the processing rules for the video are: the lip language video starts in the unknown state; if the number of non-speaking frames in the window is greater than the number of speaking frames, the video is considered to be in the non-speaking start state; if 10 consecutive frames are judged to be speaking frames, the non-speaking end state and the speaking start state are set, and vice versa. To avoid the overly short lip-speech transition time caused by this rule, a segment of non-speaking frames is padded before the speaking start state and after the speaking end state, preventing misjudgment during the judging process.
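The rule above can be read as a small state machine over per-frame speaking/non-speaking decisions. The sketch below is one possible reading of it; the window size, the 10-frame run threshold and the padding length are assumptions built around the description, not a definitive implementation.

```python
def detect_speaking_segments(frame_is_speaking, window=10, run_len=10, pad=5):
    """Split a per-frame boolean sequence (True = speaking) into speaking
    segments [start, end). Illustrative reading of the rules in the text:
    start in an unknown state and decide via a majority window; 10 consecutive
    speaking frames trigger speaking-start (and vice versa for speaking-end);
    each segment is padded with `pad` extra frames on both sides."""
    segments = []
    n = len(frame_is_speaking)
    state, start, run = "unknown", 0, 0
    for i, speaking in enumerate(frame_is_speaking):
        if state == "unknown":
            win = frame_is_speaking[i:i + window]
            state = "speaking" if sum(win) > len(win) / 2 else "non_speaking"
            if state == "speaking":
                start = i
            continue
        if state == "non_speaking":
            run = run + 1 if speaking else 0
            if run >= run_len:                       # speaking-start detected
                start, state, run = i - run_len + 1, "speaking", 0
        else:  # state == "speaking"
            run = run + 1 if not speaking else 0
            if run >= run_len:                       # speaking-end detected
                end = i - run_len + 1
                segments.append((max(0, start - pad), min(n, end + pad)))
                state, run = "non_speaking", 0
    if state == "speaking":
        segments.append((max(0, start - pad), n))
    return segments
```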
S13, extracting the speaking segment, and extracting the characteristic of each frame of lip picture of the query sample and the speaking segment through a lip recognition model to serve as a posterior probability characteristic;
In this embodiment, as shown in fig. 3, the lip language recognition model uses three-dimensional convolution and a two-dimensional DenseNet to extract visual features of the lip picture sequence, then uses a resBi-LSTM to decode the visual features, and is trained with a CTC loss function. Using this model, posterior probability features of each frame of lip picture are calculated for the lip query sample and for the lip segment from which the non-speaking segments have been removed.
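A compact sketch of a front-end in this spirit is shown below (PyTorch): a 3D convolution over the frame sequence, a 2D DenseNet applied per frame, a bidirectional LSTM with a residual connection standing in for the resBi-LSTM, and CTC training. All layer sizes, the vocabulary size and the use of torchvision's DenseNet-121 are assumptions for illustration, not the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torchvision

class LipRecognizer(nn.Module):
    """3D conv -> per-frame 2D DenseNet -> residual Bi-LSTM -> per-frame
    posterior probabilities, trained with CTC. Sizes are illustrative."""
    def __init__(self, num_classes=40, hidden=256):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(),
        )
        densenet = torchvision.models.densenet121()
        # First DenseNet conv adapted to take the 64-channel 3D-conv output.
        densenet.features.conv0 = nn.Conv2d(64, 64, kernel_size=7, stride=2,
                                            padding=3, bias=False)
        self.frontend2d = densenet.features            # per-frame feature maps (1024 ch)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.bilstm = nn.LSTM(1024, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.proj = nn.Linear(1024, 2 * hidden)        # residual path matching Bi-LSTM width
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                              # x: (B, 1, T, H, W) grayscale lip frames
        x = self.conv3d(x)                             # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch
        x = self.pool(self.frontend2d(x)).flatten(1)   # (B*T, 1024)
        x = x.view(b, t, -1)
        out, _ = self.bilstm(x)
        out = out + self.proj(x)                       # residual connection ("resBi-LSTM"-style)
        return self.fc(out).log_softmax(-1)            # per-frame posteriors, (B, T, num_classes)

# CTC training step on dummy data.
model = LipRecognizer()
ctc = nn.CTCLoss(blank=0)
video = torch.randn(2, 1, 30, 112, 112)               # 2 clips of 30 lip frames
logp = model(video).transpose(0, 1)                    # CTC expects (T, B, C)
targets = torch.randint(1, 40, (2, 8))
in_lens = torch.full((2,), logp.size(0), dtype=torch.long)
tgt_lens = torch.full((2,), 8, dtype=torch.long)
loss = ctc(logp, targets, in_lens, tgt_lens)
loss.backward()
```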
S14, constructing a similarity matrix diagram based on the posterior probability characteristics;
In this embodiment, let the posterior probability feature of the lip query sample be Q = [q_1, q_2, ..., q_m] and the posterior probability feature of the lip segment be X = [x_1, x_2, ..., x_n], where m and n represent the number of frames of the query sample and of the lip segment, respectively. For any two posterior probability feature vectors q_i and x_j from the query sample and the lip segment, the distance between the two vectors is obtained by taking the logarithm of their dot product, with the following formula:
d(q_i, x_j) = log(q_i · x_j)
where 1 ≤ i ≤ m and 1 ≤ j ≤ n, and a higher value of d(q_i, x_j) indicates a higher similarity between the two vectors. To better handle the variation of similarity scores between different lip query examples and lip segments, the values in the distance matrix d are further normalized so that all values in the similarity matrix lie in the interval [-1, 1]; the normalization formula is given as an image in the original publication and is not reproduced here.
the images corresponding to the calculated similarity matrix can be divided into two types, wherein one type refers to a positive sample, namely a query sample appears in the lip test fragment; the other is a negative sample, i.e., the query sample does not appear in the lip test segment.
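As a concrete illustration, the sketch below builds such a similarity matrix with NumPy. The dot-product-then-log distance follows the formula above; the mapping to [-1, 1] is written here as simple min-max scaling, which is an assumption, since the patent's exact normalization formula is given only as an image.

```python
import numpy as np

def similarity_matrix(query_post, segment_post, eps=1e-8):
    """query_post: (m, C) per-frame posterior probabilities of the query sample.
    segment_post: (n, C) per-frame posterior probabilities of the speaking segment.
    Returns an (m, n) similarity matrix with values in [-1, 1]."""
    # d(q_i, x_j) = log(q_i . x_j); eps avoids log(0)
    d = np.log(query_post @ segment_post.T + eps)
    # Assumed normalization: min-max scaling of the whole matrix into [-1, 1].
    d_min, d_max = d.min(), d.max()
    return 2.0 * (d - d_min) / (d_max - d_min + eps) - 1.0

# Example with random posteriors over 40 classes.
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(40), size=25)    # 25-frame query sample
x = rng.dirichlet(np.ones(40), size=120)   # 120-frame speaking segment
sim = similarity_matrix(q, x)              # image fed to the CNN classifier in S15
```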
S15, performing binary classification on the similarity matrix graph through a convolutional neural network classification model, and judging whether the keyword exists in the lip language video.
In this embodiment, as shown in FIG. 4, a simple convolutional neural network is constructed, consisting of 6 convolutional layers, 2 max-pooling layers, an adaptive average pooling layer and a fully connected layer. During training, the network model is trained with a negative log likelihood (Negative Log Likelihood, NLL) loss function. The negative log likelihood loss computes the loss on the value obtained by taking the logarithm of the softmax output probability, with labels encoded as one-hot. Assuming N data samples, y_i is the one-hot code corresponding to the true label and indicates that the label is the i-th class, and q_i is the log of the softmax output; the mathematical formula of the negative log likelihood loss function is as follows:
loss = -(1/N) · Σ_{n=1}^{N} Σ_i y_i q_i
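A minimal sketch of such a classifier is given below (PyTorch). The channel counts, kernel sizes and the placement of the two max-pooling layers are assumptions; only the overall structure (6 convolutional layers, 2 max-pooling layers, adaptive average pooling, one fully connected layer, NLL loss on log-softmax outputs) follows the description.

```python
import torch
import torch.nn as nn

class SimilarityMapClassifier(nn.Module):
    """Binary classifier over a 1-channel similarity matrix image:
    6 conv layers, 2 max-pool layers, adaptive average pooling, one FC layer."""
    def __init__(self):
        super().__init__()
        def block(cin, cout):
            return [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU()]
        self.features = nn.Sequential(
            *block(1, 32), *block(32, 32), nn.MaxPool2d(2),     # conv 1-2 + pool 1
            *block(32, 64), *block(64, 64), nn.MaxPool2d(2),    # conv 3-4 + pool 2
            *block(64, 128), *block(128, 128),                  # conv 5-6
            nn.AdaptiveAvgPool2d(1),                            # handles variable m x n maps
        )
        self.fc = nn.Linear(128, 2)        # {keyword absent, keyword present}

    def forward(self, x):                  # x: (B, 1, m, n) similarity matrices
        x = self.features(x).flatten(1)
        return torch.log_softmax(self.fc(x), dim=-1)   # log-probabilities for NLLLoss

# Training step with the negative log likelihood loss.
model = SimilarityMapClassifier()
criterion = nn.NLLLoss()                   # expects log-probabilities
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

maps = torch.randn(8, 1, 25, 120)          # batch of similarity matrices (placeholder data)
labels = torch.randint(0, 2, (8,))         # 1 = query keyword present in the segment
loss = criterion(model(maps), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The adaptive average pooling layer is what lets one network handle similarity matrices of different sizes, since the query sample and the speaking segments vary in length.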
In this embodiment, by analyzing the state of the lip language video and removing its non-speaking parts, keyword detection performance is improved; the lip language recognizer is used to extract posterior probability features from the lip language video, and these features better express its semantic characteristics; a similarity matrix graph is constructed from the posterior probability features, and a CNN classifier then performs binary classification on the similarity graph to judge whether the keyword exists in the lip language video, which, compared with other DTW-based methods, improves keyword detection performance.
A second embodiment of the present invention provides a lip keyword detection apparatus, as shown in fig. 5, including:
the training module 110 is configured to train by a DNN method to obtain a classified DNN model;
In this embodiment, the network is trained on the data set with non-speaking frames and speaking frames given two different labels, and a two-class DNN model is obtained after training.
The separation module 120 is configured to determine a speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separate a speaking segment and a non-speaking segment of the lip video;
In this embodiment, for the whole lip language video, the video state is judged according to whether each frame is a speaking frame, and the whole judging process is divided into 8 states:
unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state; wherein:
the processing rules for the video are: the lip language video starts in the unknown state; if the number of non-speaking frames in the window is greater than the number of speaking frames, the video is considered to be in the non-speaking start state; if 10 consecutive frames are judged to be speaking frames, the non-speaking end state and the speaking start state are set, and vice versa. To avoid the overly short lip-speech transition time caused by this rule, a segment of non-speaking frames is padded before the speaking start state and after the speaking end state, preventing misjudgment during the judging process.
The extracting module 130 is configured to extract the speaking segment, and extract, through a lip recognition model, a query sample and features of each frame of lip picture of the speaking segment as posterior probability features;
In this embodiment, as shown in fig. 3, the lip language recognition model uses three-dimensional convolution and a two-dimensional DenseNet to extract visual features of the lip picture sequence, then uses a resBi-LSTM to decode the visual features, and is trained with a CTC loss function. Using this model, posterior probability features of each frame of lip picture are calculated for the lip query sample and for the lip segment from which the non-speaking segments have been removed.
A construction module 140, configured to construct a similarity matrix map based on the posterior probability features;
In this embodiment, let the posterior probability feature of the lip query sample be Q = [q_1, q_2, ..., q_m] and the posterior probability feature of the lip segment be X = [x_1, x_2, ..., x_n], where m and n represent the number of frames of the query sample and of the lip segment, respectively. For any two posterior probability feature vectors q_i and x_j from the query sample and the lip segment, the distance between the two vectors is obtained by taking the logarithm of their dot product, with the following formula:
d(q_i, x_j) = log(q_i · x_j)
where 1 ≤ i ≤ m and 1 ≤ j ≤ n, and a higher value of d(q_i, x_j) indicates a higher similarity between the two vectors. To better handle the variation of similarity scores between different lip query examples and lip segments, the values in the distance matrix d are further normalized so that all values in the similarity matrix lie in the interval [-1, 1]; the normalization formula is given as an image in the original publication and is not reproduced here.
the images corresponding to the calculated similarity matrix can be divided into two types, wherein one type refers to a positive sample, namely a query sample appears in the lip test fragment; the other is a negative sample, i.e., the query sample does not appear in the lip test segment.
The classification module 150 is used for performing binary classification on the similarity matrix graph through a convolutional neural network classification model and judging whether the keyword exists in the lip language video.
In this embodiment, as shown in FIG. 4, a simple convolutional neural network is constructed, consisting of 6 convolutional layers, 2 max-pooling layers, an adaptive average pooling layer and a fully connected layer. During training, the network model is trained with a negative log likelihood (Negative Log Likelihood, NLL) loss function. The negative log likelihood loss computes the loss on the value obtained by taking the logarithm of the softmax output probability, with labels encoded as one-hot. Assuming N data samples, y_i is the one-hot code corresponding to the true label and indicates that the label is the i-th class, and q_i is the log of the softmax output; the mathematical formula of the negative log likelihood loss function is as follows:
loss = -(1/N) · Σ_{n=1}^{N} Σ_i y_i q_i
In this embodiment, by analyzing the state of the lip language video and removing its non-speaking parts, keyword detection performance is improved; the lip language recognizer is used to extract posterior probability features from the lip language video, and these features better express its semantic characteristics; a similarity matrix graph is constructed from the posterior probability features, and a CNN classifier then performs binary classification on the similarity graph to judge whether the keyword exists in the lip language video, which, compared with other DTW-based methods, improves keyword detection performance.
The third embodiment of the invention provides a lip keyword detection device, which comprises a memory and a processor, wherein a computer program is stored in the memory, and the processor is used for running the computer program to realize the lip keyword detection method.
The fourth embodiment of the present invention further provides a storage medium storing a computer program, the computer program being executable by a processor of the device in which the storage medium is located to implement the lip language keyword detection method described above.
In the several embodiments provided by the embodiments of the present invention, it should be understood that the provided apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A lip language keyword detection method, characterized by comprising the following steps:
training by a DNN method to obtain a classified DNN model;
judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking segments and non-speaking segments of the lip video; the whole lip language video is divided into 8 states: unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state;
extracting the speaking segment, and extracting features of the query sample and of each frame of lip picture of the speaking segment through a lip language recognition model as posterior probability features; the lip language recognition model is specifically used for: extracting visual features of the lip picture sequence through three-dimensional convolution and a two-dimensional DenseNet; decoding the visual features by resBi-LSTM; training the decoded visual features through a CTC loss function;
constructing a similarity matrix graph based on the posterior probability features; specifically: performing a vector dot product operation between the posterior probability feature Q = [q_1, q_2, ..., q_m] of the query sample and the posterior probability feature X = [x_1, x_2, ..., x_n] of the speaking segment, and taking the logarithm to obtain the similarity matrix graph; the vector dot product and logarithm operation is:
d(q_i, x_j) = log(q_i · x_j)
wherein 1 ≤ i ≤ m and 1 ≤ j ≤ n, m and n respectively represent the number of frames of the query sample and of the lip segment, and a higher value in the resulting similarity matrix d indicates a higher similarity between the two vectors;
and performing binary classification on the similarity matrix graph through a convolutional neural network classification model, and judging whether the keyword exists in the lip language video.
2. The method for detecting a lip-language keyword according to claim 1, further comprising:
and supplementing a section of non-speaking frame before the speaking start state and after the speaking end state so as to prevent misjudgment in the judging process.
3. The method for detecting a lip-language keyword according to claim 1, further comprising:
for the similarity matrix
Figure QLYQS_9
Normalized calculation is performed on the values of (2) so that all values in the similarity matrix are at +.>
Figure QLYQS_10
In the interval, the calculation process is as follows:
Figure QLYQS_11
4. the method for detecting a lip language keyword according to claim 1, wherein the convolutional neural network classification model is specifically configured to:
constructed from 6 convolutional layers, 2 max-pooling layers, an adaptive average pooling layer and a fully connected layer;
trained by a negative log likelihood loss function.
5. The method for detecting a lip keyword according to claim 4, wherein the negative log likelihood loss function performs a loss function calculation on a value obtained by taking a logarithm of an output probability of softmax, and the formula is as follows:
loss = -(1/N) · Σ_{n=1}^{N} Σ_i y_i q_i
wherein N represents the number of data samples, y_i is the one-hot code corresponding to the true label and indicates that the label is the i-th class, and q_i is the log of the softmax output.
6. A lip language keyword detection device, comprising:
The training module is used for training by a DNN method to obtain a classified DNN model;
the separation module is used for judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking segments and non-speaking segments of the lip video; the whole lip language video is divided into 8 states: unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state;
the extraction module is used for extracting the speaking segment, and extracting features of the query sample and of each frame of lip picture of the speaking segment through the lip language recognition model as posterior probability features; the lip language recognition model is specifically used for: extracting visual features of the lip picture sequence through three-dimensional convolution and a two-dimensional DenseNet; decoding the visual features by resBi-LSTM; training the decoded visual features through a CTC loss function;
the construction module is used for constructing a similarity matrix graph based on the posterior probability features; specifically, performing a vector dot product operation between the posterior probability feature Q = [q_1, q_2, ..., q_m] of the query sample and the posterior probability feature X = [x_1, x_2, ..., x_n] of the speaking segment, and taking the logarithm to obtain the similarity matrix graph; the vector dot product and logarithm operation is:
d(q_i, x_j) = log(q_i · x_j)
wherein 1 ≤ i ≤ m and 1 ≤ j ≤ n, m and n respectively represent the number of frames of the query sample and of the lip segment, and a higher value in the resulting similarity matrix d indicates a higher similarity between the two vectors;
and the classification module is used for performing binary classification on the similarity matrix graph through a convolutional neural network classification model, and judging whether the keyword exists in the lip language video.
7. A lip keyword detection apparatus, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to implement a lip keyword detection method as claimed in any one of claims 1 to 5.
8. A storage medium storing a computer program executable by a processor of a device in which the storage medium is located to implement a method of lip keyword detection as claimed in any one of claims 1 to 5.
CN202010827853.8A 2020-08-17 2020-08-17 Lip language keyword detection method, device, equipment and storage medium Active CN111914803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010827853.8A CN111914803B (en) 2020-08-17 2020-08-17 Lip language keyword detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010827853.8A CN111914803B (en) 2020-08-17 2020-08-17 Lip language keyword detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111914803A CN111914803A (en) 2020-11-10
CN111914803B true CN111914803B (en) 2023-06-13

Family

ID=73279084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010827853.8A Active CN111914803B (en) 2020-08-17 2020-08-17 Lip language keyword detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111914803B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633208A (en) * 2020-12-30 2021-04-09 海信视像科技股份有限公司 Lip language identification method, service equipment and storage medium
CN113065444A (en) * 2021-03-26 2021-07-02 北京大米科技有限公司 Behavior detection method and device, readable storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992813A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip condition detection method and device
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110633683A (en) * 2019-09-19 2019-12-31 华侨大学 Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM
CN110767228A (en) * 2018-07-25 2020-02-07 杭州海康威视数字技术股份有限公司 Sound acquisition method, device, equipment and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107492379B (en) * 2017-06-30 2021-09-21 百度在线网络技术(北京)有限公司 Voiceprint creating and registering method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992813A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip condition detection method and device
CN110767228A (en) * 2018-07-25 2020-02-07 杭州海康威视数字技术股份有限公司 Sound acquisition method, device, equipment and system
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110633683A (en) * 2019-09-19 2019-12-31 华侨大学 Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lipreading with DenseNet and resBi-LSTM; Xuejuan Chen; Signal, Image and Video Processing; 1-9 *

Also Published As

Publication number Publication date
CN111914803A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
Natarajan et al. Multimodal feature fusion for robust event detection in web videos
CN108962227B (en) Voice starting point and end point detection method and device, computer equipment and storage medium
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
CN106297828B (en) Detection method and device for false sounding detection based on deep learning
CN112289323B (en) Voice data processing method and device, computer equipment and storage medium
CN112668559B (en) Multi-mode information fusion short video emotion judgment device and method
US20100057452A1 (en) Speech interfaces
CN111914803B (en) Lip language keyword detection method, device, equipment and storage medium
CN108877812B (en) Voiceprint recognition method and device and storage medium
CN112735385A (en) Voice endpoint detection method and device, computer equipment and storage medium
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN111477219A (en) Keyword distinguishing method and device, electronic equipment and readable storage medium
Ghaemmaghami et al. Complete-linkage clustering for voice activity detection in audio and visual speech
CN114627868A (en) Intention recognition method and device, model and electronic equipment
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN110569908B (en) Speaker counting method and system
KR20220047080A (en) A speaker embedding extraction method and system for automatic speech recognition based pooling method for speaker recognition, and recording medium therefor
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN112634869A (en) Command word recognition method, device and computer storage medium
Chaloupka A prototype of audio-visual broadcast transcription system
CN112035670A (en) Multi-modal rumor detection method based on image emotional tendency
CN113129874B (en) Voice awakening method and system
CN116229943B (en) Conversational data set generation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant