CN111914803B - Lip language keyword detection method, device, equipment and storage medium - Google Patents
- Publication number: CN111914803B (application number CN202010827853.8A)
- Authority
- CN
- China
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06V20/40 — Scenes; scene-specific elements in video content
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06F18/24 — Pattern recognition; classification techniques
- G06N3/045 — Neural network architectures; combinations of networks
- G06V20/49 — Segmenting video sequences, e.g. parsing or cutting the sequence into shots or scenes
- G06V40/20 — Recognition of human movements or behaviour, e.g. gesture recognition
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a lip language keyword detection method, device, equipment and storage medium, wherein the method comprises the following steps: training a two-class DNN model by a DNN method; judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating the speaking segments and non-speaking segments of the lip video; extracting the speaking segments, and extracting features of the query sample and of each frame of lip picture in the speaking segments through a lip language recognition model to serve as posterior probability features; constructing a similarity matrix map based on the posterior probability features; and performing binary classification on the similarity matrix map through a convolutional neural network classification model to judge whether the keyword exists in the lip video. Through endpoint detection, feature extraction by the lip language recognizer, and similarity matrix map construction, the invention reduces the influence of non-speaking segments on lip keyword detection performance and improves keyword detection performance.
Description
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a lip language keyword detection method, device, equipment and storage medium.
Background
In recent years, with rapid economic development, information technology has advanced steadily: network speeds keep increasing, storage costs keep falling, and surveillance cameras have spread everywhere. For most of these cameras, cost or technical constraints make it impossible to capture the audio of a speaker, so speech content cannot be recognized from sound. Lip language recognition, in contrast, can recognize content from the mouth shape alone, and can therefore play a significant role in the security field. Moreover, at the level of surveillance-video utilization, full lip language recognition of the complete speech content is unnecessary; only a few keywords need to be recognized and detected. Lip language keyword detection can thus play an important role in the security field. However, lip recognition technology still faces many difficulties in practical application, so accurately reading lips in surveillance video remains challenging.
At present, research on lip keyword detection is scarce. The data sets used for lip keyword detection contain non-speaking segments, and if these segments are relatively long they degrade keyword detection. Keyword detection, however, is well developed in the field of speech recognition, where methods fall into three main types: filler-model-based methods, example-based methods, and methods based on large-vocabulary continuous speech recognition systems. In example-based speech keyword detection, the input query examples are a small number of speech segments containing the keyword; similarity is computed between them and the test speech segments, and if the similarity exceeds a threshold the test audio is considered to contain the keyword. A commonly used family of methods is based on dynamic time warping (DTW), which uses the DTW algorithm to compute the similarity between two sequences of audio features. Early work often used acoustic features as audio features, but these are susceptible to external factors such as environment, channel, and speaker. Posterior probability features were later introduced, reducing the influence of speaker and environment on the keyword detection system. To compute posterior probability features, the keyword audio and test audio are typically converted into fixed-length embedding vectors by building a phoneme decoder. Early systems used shallow artificial neural networks; later, with the progress of deep learning, phoneme recognizers were typically built using deep neural networks, LSTMs, and the like.
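For context, the DTW similarity computation mentioned above can be illustrated with the classic dynamic program over two feature sequences (a generic textbook sketch, not the method claimed by this patent; the Euclidean frame distance is an illustrative choice):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two feature
    sequences a (m x d) and b (n x d), using Euclidean frame distance.
    A generic textbook sketch for background context only."""
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # best of match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```

A lower distance means the two sequences are more similar; example-based keyword detection then thresholds this score.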
Disclosure of Invention
The invention aims to provide a lip language keyword detection method, device, equipment and storage medium to solve the above problems.
The embodiment of the invention provides a lip language keyword detection method, which comprises the following steps:
training by a DNN method to obtain a classified DNN model;
judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking fragments and non-speaking fragments of the lip video;
extracting the speaking segment, and extracting the characteristics of the query sample and each frame of lip picture of the speaking segment through a lip language identification model to serve as posterior probability characteristics;
constructing a similarity matrix graph based on the posterior probability characteristics;
and performing binary classification on the similarity matrix map through a convolutional neural network classification model to judge whether the keyword exists in the lip video.
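The five steps above can be sketched end to end as follows; every function here is a hypothetical placeholder (dummy stand-ins for the trained DNN, the lip language recognizer, and the CNN classifier), so this illustrates only the data flow, not the actual models:

```python
import numpy as np

# Illustrative pipeline skeleton for the five claimed steps; every model
# function below is a hypothetical placeholder, not the patent's networks.

def dnn_is_speaking(frame):            # steps 1-2: per-frame two-class DNN
    return frame.mean() > 0.5          # dummy stand-in for the trained DNN

def lip_recognizer_posteriors(frames): # step 3: posterior probability features
    return np.stack([f.flatten()[:4] for f in frames])  # dummy features

def cnn_classify(sim_map):             # step 5: binary CNN on the map
    return sim_map.mean() > 0          # dummy stand-in for the CNN classifier

def detect_keyword(query_frames, video_frames):
    # step 2: keep only the speaking frames of the video
    speaking = [f for f in video_frames if dnn_is_speaking(f)]
    # step 3: posterior probability features of query and speaking segment
    q = lip_recognizer_posteriors(query_frames)
    x = lip_recognizer_posteriors(speaking)
    # step 4: similarity matrix map d(q_i, x_j) = log(q_i . x_j)
    sim = np.log(q @ x.T + 1e-8)
    # step 5: binary classification of the similarity map
    return cnn_classify(sim)
```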
Further, the whole lip video is divided into eight states, including:
an unknown state, a speaking start state, a speaking end state, a non-speaking start state, a non-speaking end state, and an end state; wherein:
a section of non-speaking frames is supplemented before the speaking start state and after the speaking end state, to prevent misjudgment during the judging process.
Further, the lip language recognition model is specifically used for:
extracting visual features of the lip picture sequence through three-dimensional convolution and a two-dimensional DenseNet;
decoding the visual features through a resBi-LSTM;
training the decoded visual features through a CTC loss function.
Further, the constructing the similarity matrix graph based on the posterior probability features specifically includes:
performing a vector dot-product operation on the posterior probability features of the query sample and the posterior probability features of the speaking segment, and taking the logarithm to obtain the similarity matrix map; the dot-product and logarithm operation is:

d(q_i, x_j) = log(q_i · x_j)

wherein 1 ≤ i ≤ m and 1 ≤ j ≤ n, m and n respectively denote the number of frames of the query sample and of the lip segment, and the higher the value of the resulting distance matrix d, the higher the similarity between the two vectors.
Still further, the method further comprises:
performing normalization calculation on the values of the distance matrix d so that all values in the similarity matrix lie in the interval [-1, 1]; the calculation process is as follows:
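The dot-product, logarithm, and normalization steps can be sketched in NumPy as follows. The source's normalization formula is rendered as an image and does not survive in this text, so the min-max mapping into [-1, 1] below (and the small epsilon guarding the logarithm) are assumptions, not necessarily the patent's exact formula:

```python
import numpy as np

def similarity_map(Q, X, eps=1e-8):
    """Q: (m, d) posterior features of the query sample; X: (n, d) posterior
    features of the speaking segment.  Returns the normalized similarity map.
    The epsilon and the min-max normalization are assumptions: the source
    states only that values are mapped into [-1, 1]."""
    d = np.log(Q @ X.T + eps)                 # d(q_i, x_j) = log(q_i . x_j)
    lo, hi = d.min(), d.max()
    if hi == lo:                              # degenerate map: all values equal
        return np.zeros_like(d)
    return 2.0 * (d - lo) / (hi - lo) - 1.0   # min-max scale into [-1, 1]
```

The resulting (m, n) map is what the CNN classifier later labels as positive or negative.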
further, the convolutional neural network classification model is specifically used for:
being constructed from 6 convolutional layers, 2 max-pooling layers, an adaptive mean-pooling layer, and a fully connected layer;
training is performed with a negative log likelihood (Negative Log Likelihood, NLL) loss function.
Further, the negative log likelihood loss function computes the loss on the logarithm of the softmax output probabilities, with the formula:
wherein N denotes the number of data samples, y_i is the one-hot code corresponding to the true label (indicating that the label is the i-th class), and q_i is the logarithm of the softmax output.
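With one-hot labels, the negative log likelihood loss reduces to the average negative log-probability assigned to the true class. A minimal NumPy sketch (the mean reduction over the N samples is an assumption; a sum reduction is also common):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # subtract max for stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll_loss(logits, labels):
    """logits: (N, C) raw network outputs; labels: (N,) integer class ids
    (the index set by the one-hot code).  Computes the negative log
    likelihood on log-softmax outputs: -1/N * sum_i log q_{i, y_i}.
    Averaging over N is an assumption."""
    q = log_softmax(logits)                       # q: log of the softmax output
    return -q[np.arange(len(labels)), labels].mean()
```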
The embodiment of the invention also provides a lip language keyword detection device, which comprises:
the training module is used for training by a DNN method to obtain a classified DNN model;
the separation module is used for judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking fragments and non-speaking fragments of the lip video;
the extraction module is used for extracting the speaking segment, and extracting the characteristic of each frame of lip picture of the query sample and the speaking segment through the lip recognition model to serve as a posterior probability characteristic;
the construction module is used for constructing a similarity matrix graph based on the posterior probability characteristics;
and the classification module is used for carrying out two classification on the similarity matrix diagram through a convolutional neural network classification model and judging whether keywords exist in the lip language video.
The embodiment of the invention also provides lip keyword detection equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor is used for running the computer program to realize the lip keyword detection method.
The embodiment of the invention also provides a storage medium storing at least one program and at least one instruction, which, when executed, implement the above lip keyword detection method.
The embodiment of the invention has the following beneficial technical effects:
Performing lip language activity endpoint detection through the DNN and analyzing the state of the lip video removes the non-speaking parts of the lip video and improves keyword detection performance; extracting posterior probability features from the lip video with the lip language recognizer yields features that better express the semantic characteristics of the lip video; constructing a similarity matrix map from the posterior probability features and then performing binary classification on the map with a CNN classifier to judge whether a keyword exists in the lip video improves keyword detection performance compared with other DTW-based methods.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a method for detecting a lip keyword according to a first embodiment of the present invention.
Fig. 2 is a schematic flow chart of another method for detecting a lip keyword according to the first embodiment of the present invention.
Fig. 3 is a schematic flow chart of a lip language recognition model according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a CNN classifier according to an embodiment of the present invention.
Fig. 5 is a schematic flow chart of a lip keyword detection apparatus according to a second embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Details of embodiments of the present invention are described below.
As shown in fig. 1-2, a first embodiment of the present invention provides a method for detecting a lip language keyword, including the steps of:
s11, training by a DNN method to obtain a classified DNN model;
In this embodiment, network training is performed using a data set in which each frame is labeled as either a speaking frame or a non-speaking frame; after training, a two-class DNN model is obtained.
S12, judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating speaking fragments and non-speaking fragments of the lip video;
In this embodiment, for the whole lip video, the video state is judged according to whether each frame is a speaking frame, and the whole judging process is divided into eight states, including:
an unknown state, a speaking start state, a speaking end state, a non-speaking start state, a non-speaking end state, and an end state; wherein:
the processing rules for the video are: the lip video starts in the unknown state, and if the number of non-speaking frames in the window is greater than the number of speaking frames, the video is considered to be in the non-speaking start state; if 10 consecutive frames are judged to be speaking frames, the non-speaking end state and the speaking start state are set, and vice versa. To solve the problem of too-short lip-speech transition time caused by this rule, a section of non-speaking frames is supplemented before the speaking start state and after the speaking end state, preventing misjudgment during the judging process.
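The windowed voting and 10-consecutive-frame rule described above can be sketched as a small state machine over per-frame speaking flags (a minimal sketch under stated assumptions: the padding length and the symmetric treatment of both transition directions are illustrative choices not fixed by the source):

```python
def segment_speaking(frames, run_len=10, pad=5):
    """Split per-frame speaking flags into speaking segments.
    frames: list of 0/1 flags (1 = speaking frame, as judged by the
    two-class DNN).  run_len = 10 follows the text's rule of 10
    consecutive frames; pad (the section of non-speaking frames kept
    before and after each segment) is an illustrative assumption."""
    segments = []
    speaking = False   # current state of the simple state machine
    run = 0            # consecutive frames contradicting the current state
    start = None
    for i, flag in enumerate(frames):
        if (flag == 1) != speaking:
            run += 1
            if run == run_len:                  # enough evidence: switch state
                first = i - run_len + 1         # first frame of the new state
                if speaking:                    # speaking -> non-speaking ends a segment
                    segments.append((start, min(len(frames), first + pad)))
                    start = None
                else:                           # non-speaking -> speaking starts one
                    start = max(0, first - pad)
                speaking = not speaking
                run = 0
        else:
            run = 0
    if speaking and start is not None:          # video ends while still speaking
        segments.append((start, len(frames)))
    return segments
```

Short bursts below run_len frames are ignored, which is the rule's intended smoothing effect.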
S13, extracting the speaking segment, and extracting the characteristic of each frame of lip picture of the query sample and the speaking segment through a lip recognition model to serve as a posterior probability characteristic;
In this embodiment, as shown in fig. 3, the lip language recognition model extracts visual features of the lip picture sequence with three-dimensional convolution and a two-dimensional DenseNet, then decodes the visual features with a resBi-LSTM, and is trained with a CTC loss function. Using this model, posterior probability features are computed for each frame of lip picture of the lip query sample (with non-speaking segments removed) and of the lip segment.
S14, constructing a similarity matrix diagram based on the posterior probability characteristics;
In this embodiment, let the posterior probability feature of the lip query sample be Q = [q_1, q_2, …, q_m] and the posterior probability feature of the lip segment be X = [x_1, x_2, …, x_n], where m and n respectively denote the number of frames of the query sample and of the lip segment. Given any posterior probability feature vectors q_i in the query sample and x_j in the lip segment, the distance between the two vectors is obtained by taking the logarithm after their dot-product operation, with the formula:

d(q_i, x_j) = log(q_i · x_j)

wherein 1 ≤ i ≤ m and 1 ≤ j ≤ n, and the higher the value of d(q_i, x_j), the higher the similarity between the two vectors. To better handle the variation of similarity scores between different lip query examples and lip segments, the values in the distance matrix d are further normalized so that all values in the similarity matrix lie in the interval [-1, 1]; the mathematical calculation formula is as follows:
The images corresponding to the calculated similarity matrices fall into two classes: positive samples, in which the query sample appears in the lip test segment, and negative samples, in which the query sample does not appear in the lip test segment.
And S15, performing two-class classification on the similarity matrix diagram through a convolutional neural network classification model, and judging whether keywords exist in the lip language video.
In this embodiment, as shown in FIG. 4, a simple convolutional neural network is constructed, consisting of 6 convolutional layers, 2 max-pooling layers, an adaptive mean-pooling layer, and a fully connected layer. During training, the network model is trained with a negative log likelihood (Negative Log Likelihood, NLL) loss function, which computes the loss on the logarithm of the softmax output probabilities; the labels are one-hot encoded. Assuming N data samples, y_i is the one-hot code corresponding to the true label (indicating that the label is the i-th class), and q_i is the logarithm of the softmax output; the mathematical formula of the negative log likelihood loss function is as follows:
In this embodiment, analyzing the state of the lip video removes the non-speaking parts of the lip video and improves keyword detection performance; extracting posterior probability features from the lip video with the lip language recognizer yields features that better express the semantic characteristics of the lip video; constructing a similarity matrix map from the posterior probability features and then performing binary classification on the map with a CNN classifier to judge whether a keyword exists in the lip video improves keyword detection performance compared with other DTW-based methods.
A second embodiment of the present invention provides a lip keyword detection apparatus, as shown in fig. 5, including:
the training module 110 is configured to train by a DNN method to obtain a classified DNN model;
In this embodiment, network training is performed using a data set in which each frame is labeled as either a speaking frame or a non-speaking frame; after training, a two-class DNN model is obtained.
The separation module 120 is configured to determine a speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separate a speaking segment and a non-speaking segment of the lip video;
In this embodiment, for the whole lip video, the video state is judged according to whether each frame is a speaking frame, and the whole judging process is divided into eight states, including:
an unknown state, a speaking start state, a speaking end state, a non-speaking start state, a non-speaking end state, and an end state; wherein:
the processing rules for the video are: the lip video starts in the unknown state, and if the number of non-speaking frames in the window is greater than the number of speaking frames, the video is considered to be in the non-speaking start state; if 10 consecutive frames are judged to be speaking frames, the non-speaking end state and the speaking start state are set, and vice versa. To solve the problem of too-short lip-speech transition time caused by this rule, a section of non-speaking frames is supplemented before the speaking start state and after the speaking end state, preventing misjudgment during the judging process.
The extracting module 130 is configured to extract the speaking segment, and extract, through a lip recognition model, a query sample and features of each frame of lip picture of the speaking segment as posterior probability features;
In this embodiment, as shown in fig. 3, the lip language recognition model extracts visual features of the lip picture sequence with three-dimensional convolution and a two-dimensional DenseNet, then decodes the visual features with a resBi-LSTM, and is trained with a CTC loss function. Using this model, posterior probability features are computed for each frame of lip picture of the lip query sample (with non-speaking segments removed) and of the lip segment.
A construction module 140, configured to construct a similarity matrix map based on the posterior probability features;
In this embodiment, let the posterior probability feature of the lip query sample be Q = [q_1, q_2, …, q_m] and the posterior probability feature of the lip segment be X = [x_1, x_2, …, x_n], where m and n respectively denote the number of frames of the query sample and of the lip segment. Given any posterior probability feature vectors q_i in the query sample and x_j in the lip segment, the distance between the two vectors is obtained by taking the logarithm after their dot-product operation, with the formula:

d(q_i, x_j) = log(q_i · x_j)

wherein 1 ≤ i ≤ m and 1 ≤ j ≤ n, and the higher the value of d(q_i, x_j), the higher the similarity between the two vectors. To better handle the variation of similarity scores between different lip query examples and lip segments, the values in the distance matrix d are further normalized so that all values in the similarity matrix lie in the interval [-1, 1]; the mathematical calculation formula is as follows:
The images corresponding to the calculated similarity matrices fall into two classes: positive samples, in which the query sample appears in the lip test segment, and negative samples, in which the query sample does not appear in the lip test segment.
And the classification module 150 is used for performing two-classification on the similarity matrix diagram through a convolutional neural network classification model and judging whether keywords exist in the lip language video.
In this embodiment, as shown in FIG. 4, a simple convolutional neural network is constructed, consisting of 6 convolutional layers, 2 max-pooling layers, an adaptive mean-pooling layer, and a fully connected layer. During training, the network model is trained with a negative log likelihood (Negative Log Likelihood, NLL) loss function, which computes the loss on the logarithm of the softmax output probabilities; the labels are one-hot encoded. Assuming N data samples, y_i is the one-hot code corresponding to the true label (indicating that the label is the i-th class), and q_i is the logarithm of the softmax output; the mathematical formula of the negative log likelihood loss function is as follows:
In this embodiment, analyzing the state of the lip video removes the non-speaking parts of the lip video and improves keyword detection performance; extracting posterior probability features from the lip video with the lip language recognizer yields features that better express the semantic characteristics of the lip video; constructing a similarity matrix map from the posterior probability features and then performing binary classification on the map with a CNN classifier to judge whether a keyword exists in the lip video improves keyword detection performance compared with other DTW-based methods.
The third embodiment of the invention provides a lip keyword detection device, which comprises a memory and a processor, wherein a computer program is stored in the memory, and the processor is used for running the computer program to realize the lip keyword detection method.
The fourth embodiment of the present invention further provides a storage medium, where the storage medium stores a computer program, where the computer program can be executed by a processor of a device where the storage medium is located, so as to implement the method for detecting a lip language keyword.
In the several embodiments provided by the embodiments of the present invention, it should be understood that the provided apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. A lip language keyword detection method, characterized by comprising the following steps:
training by a DNN method to obtain a classification DNN model;
judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating the speaking segments and non-speaking segments of the lip video; the whole lip language video is divided into 8 states: unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state;
extracting the speaking segment, and extracting features of the query sample and of each frame of lip picture of the speaking segment through a lip language recognition model, to serve as posterior probability features; the lip language recognition model is specifically used for: extracting visual features of the lip picture sequence through a three-dimensional convolution and a two-dimensional DenseNet; decoding the visual features through a resBi-LSTM; and training the decoded visual features through a CTC loss function;
constructing a similarity matrix diagram based on the posterior probability features, specifically:
performing a vector dot-product operation between the posterior probability features Q of the query sample and the posterior probability features S of the speaking segment, and taking the logarithm to obtain the similarity matrix diagram; the vector dot-product and logarithm operation is:

M = log(Q · S^T)

where Q ∈ R^(n×d) and S ∈ R^(m×d), with d the dimension of the posterior probability features, and n and m the numbers of frames of the query sample and the lip segment, respectively; the higher the value of an entry of the computed similarity matrix M, the higher the similarity between the corresponding two vectors;
and performing binary classification on the similarity matrix diagram through a convolutional neural network classification model, and judging whether a keyword exists in the lip language video.
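Outside the claim language, the similarity-matrix step above can be sketched in a few lines of NumPy. The 40-class posterior dimension and the frame counts (5 query frames, 12 segment frames) are hypothetical stand-ins, not values taken from the patent:

```python
import numpy as np

def similarity_matrix(Q, S, eps=1e-12):
    """Vector dot product between query and segment posterior features,
    followed by a logarithm, giving an (n, m) similarity matrix.
    Larger entries indicate higher similarity between the two frames."""
    return np.log(Q @ S.T + eps)  # eps guards against log(0)

rng = np.random.default_rng(0)
# Rows are per-frame posterior probability distributions (each sums to 1).
Q = rng.dirichlet(np.ones(40), size=5)    # 5 query-sample frames
S = rng.dirichlet(np.ones(40), size=12)   # 12 speaking-segment frames
M = similarity_matrix(Q, S)
print(M.shape)  # (5, 12)
```

Because each row of Q and S is a probability distribution, every dot product lies in (0, 1], so the log-similarity entries are non-positive, with values nearer 0 marking more similar frame pairs.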
2. The method for detecting a lip-language keyword according to claim 1, further comprising:
and supplementing a section of non-speaking frames before the speaking start state and after the speaking end state, so as to prevent misjudgment in the judging process.
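As an illustrative sketch of this padding step, the pad length of 5 frames below is a hypothetical choice, not a value specified by the claim:

```python
def pad_speaking_segment(frames, start, end, pad=5):
    """Extend a detected speaking segment [start, end) by `pad` frames of
    surrounding non-speaking video on each side, clamped to the video
    boundaries, so frames misjudged near the edges are not cut off."""
    padded_start = max(0, start - pad)
    padded_end = min(len(frames), end + pad)
    return frames[padded_start:padded_end]

video = list(range(100))  # stand-in for 100 lip-picture frames
segment = pad_speaking_segment(video, 30, 60, pad=5)
print(len(segment))  # 40: 30 speaking frames plus 5 on each side
```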
4. The method for detecting a lip language keyword according to claim 1, wherein the convolutional neural network classification model is specifically configured to:
be constructed from 6 convolution layers, 2 maxpool layers, an adaptive mean pooling layer, and a fully connected layer; and
be trained through a negative log-likelihood loss function.
5. The method for detecting a lip language keyword according to claim 4, wherein the negative log-likelihood loss function performs the loss calculation on the value obtained by taking the logarithm of the softmax output probability, the formula being as follows:

loss = -(1/N) Σ_i log(softmax(x_i)[y_i])

where N is the number of samples, x_i is the model output for the i-th sample, and y_i is its true class.
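A minimal NumPy sketch of this loss, computing the negative log-likelihood on the logarithm of the softmax output; the logits and targets below are hypothetical:

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax along the class axis.
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll_loss(logits, targets):
    """loss = -(1/N) * sum_i log(softmax(logits_i)[targets_i])"""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5], [0.1, 1.5]])  # hypothetical 2-way outputs
targets = np.array([0, 1])                   # true classes
loss = nll_loss(logits, targets)
print(round(float(loss), 4))  # 0.2109
```

This mirrors the usual pairing of a log-softmax layer with an NLL criterion, which together form the cross-entropy loss used for binary classification.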
6. A lip language keyword detection device, characterized by comprising:
a training module, used for training by a DNN method to obtain a classification DNN model;
a separation module, used for judging the speaking state of each frame of lip picture in the lip video to be detected based on the DNN model, and separating the speaking segments and non-speaking segments of the lip video; the whole lip language video is divided into 8 states: unknown state, speaking start state, speaking end state, non-speaking start state, non-speaking end state, end state;
an extraction module, used for extracting the speaking segment, and extracting features of the query sample and of each frame of lip picture of the speaking segment through the lip language recognition model, to serve as posterior probability features; the lip language recognition model is specifically used for: extracting visual features of the lip picture sequence through a three-dimensional convolution and a two-dimensional DenseNet; decoding the visual features through a resBi-LSTM; and training the decoded visual features through a CTC loss function;
the construction module is used for constructing a similarity matrix graph based on the posterior probability characteristics; posterior probability characterization of query samplesPosterior probability characteristics of speaking segment->Performing vector dot product operation, and taking the logarithm to obtain a similarity matrix diagram; the vector dot product operation and logarithm taking process comprises the following steps:
wherein ,and->,/> and />Representing the number of frames of the query sample and the lip segment, respectively, and calculating the obtained similarity matrix +.>The higher the value of (c), the higher the similarity between the two vectors;
and a classification module, used for performing binary classification on the similarity matrix diagram through a convolutional neural network classification model, and judging whether a keyword exists in the lip language video.
7. A lip keyword detection apparatus, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to implement a lip keyword detection method as claimed in any one of claims 1 to 5.
8. A storage medium storing a computer program executable by a processor of a device in which the storage medium is located to implement a method of lip keyword detection as claimed in any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010827853.8A CN111914803B (en) | 2020-08-17 | 2020-08-17 | Lip language keyword detection method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111914803A CN111914803A (en) | 2020-11-10 |
CN111914803B true CN111914803B (en) | 2023-06-13 |
Family
ID=73279084
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010827853.8A Active CN111914803B (en) | 2020-08-17 | 2020-08-17 | Lip language keyword detection method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111914803B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112633208A (en) * | 2020-12-30 | 2021-04-09 | 海信视像科技股份有限公司 | Lip language identification method, service equipment and storage medium |
CN113065444A (en) * | 2021-03-26 | 2021-07-02 | 北京大米科技有限公司 | Behavior detection method and device, readable storage medium and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107992813A (en) * | 2017-11-27 | 2018-05-04 | 北京搜狗科技发展有限公司 | A kind of lip condition detection method and device |
CN109409195A (en) * | 2018-08-30 | 2019-03-01 | 华侨大学 | A kind of lip reading recognition methods neural network based and system |
CN110633683A (en) * | 2019-09-19 | 2019-12-31 | 华侨大学 | Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM |
CN110767228A (en) * | 2018-07-25 | 2020-02-07 | 杭州海康威视数字技术股份有限公司 | Sound acquisition method, device, equipment and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107492379B (en) * | 2017-06-30 | 2021-09-21 | 百度在线网络技术(北京)有限公司 | Voiceprint creating and registering method and device |
Non-Patent Citations (1)
Title |
---|
Lipreading with DenseNet and resBi-LSTM; Xuejuan Chen; Signal, Image and Video Processing; 1-9 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |