WO2016201679A1 - Feature extraction method, lip-reading classification method, device and apparatus - Google Patents


Info

Publication number: WO2016201679A1
Authority: WO (WIPO PCT)
Prior art keywords: video, sub-block, blocks, lip
Application number: PCT/CN2015/081824
Other languages: French (fr), Chinese (zh)
Inventors: 左坤隆, 张新曼, 路龙宾
Original Assignee: 华为技术有限公司
Application filed by 华为技术有限公司
Priority to PCT/CN2015/081824
Publication of WO2016201679A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/24 - Speech recognition using non-acoustical features

Definitions

  • the present invention relates to the field of feature recognition, and in particular to a feature extraction method, a lip language classification method, a device, and an apparatus.
  • as a new human-computer interaction method, biometric technology has been widely used in many fields such as identity verification.
  • among these technologies, lip-reading recognition has emerged as a new technique.
  • Feature extraction is an important step in the process of lip language recognition.
  • an existing feature extraction method usually extracts the lip contour from each video frame, expresses the lip contour with several parameters, and linearly combines some of these parameters to obtain the lip language feature of the video.
  • alternatively, the multiple frames of the video are treated as a two-dimensional signal, and an image transform is applied to this signal to obtain the lip language feature of the video.
  • however, the dimension of the lip language feature extracted by the above methods is not fixed.
  • because most classifiers require a fixed feature dimension, the dimension of the lip language feature of a video must be dynamically adjusted when sample videos are used to train the classifier or when the classifier is used to classify a video, which is cumbersome and makes both training and classification very time-consuming.
  • the embodiment of the invention provides a feature extraction method, a lip language classification method, a device, and an apparatus.
  • the technical solution is as follows:
  • a feature extraction method comprising:
  • dividing a video into M time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block,
  • so that the video yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks;
  • where M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • calculating the lip language feature vector of each video sub-block includes:
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
  • a lip language classification method comprising:
  • for each of the pre-selected D sample videos, the sample video is divided into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • the sample video thus yields M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
  • the classification accuracy of a multi-layer classifier is trained according to a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of a video;
  • where M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • training the classification accuracy of the multi-layer classifier according to the lip language feature vectors of the video sub-blocks in the D sample videos and a preset training algorithm includes:
  • Step 1 According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, where each flag value includes flag bits for indicating the number of hidden layer nodes and flag bits for indicating whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
  • Step 4 According to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with that flag value, obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm, and update the flag value of each specified identifier;
  • training the classification accuracy of the multi-layer classifier corresponding to a flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos includes:
  • the X-dimensional lip language feature vectors of the selected H video sub-blocks in the sample video are combined to obtain an H×X-dimensional lip language feature vector of the sample video;
  • a projection matrix is used to represent the weight of the input layer node.
  • obtaining the global optimal flag value of the L specified identifiers according to a preset search algorithm based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with each flag value, and updating the flag value of each specified identifier, includes:
  • the flag value of each specified identifier is updated according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value.
  • obtaining the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier includes:
  • using the classification accuracy of the multi-layer classifier trained with the flag value of a specified identifier as the fitness value of that specified identifier;
  • a feature extraction device comprising:
  • a dividing module configured to divide the video into M time sub-blocks according to a chronological order of video frames in a video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region ;
  • the dividing module is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to form the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total and each time sub-block includes N video sub-blocks;
  • a feature calculation module configured to calculate the lip language feature vector of each video sub-block obtained by the dividing module;
  • the lip language feature vector is used to describe the texture information of the lip region in the corresponding video sub-block, and the lip feature vector of each video sub-block is an X-dimensional vector;
  • a combination module configured to combine the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video obtained by the feature calculation module, to obtain an X×M×N-dimensional lip language feature vector of the video;
  • where M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • the feature calculation module includes:
  • an extracting unit configured to extract an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in a video sub-block, where the video sub-block includes Y spatial sub-blocks;
  • a combining unit configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extracting unit to obtain an X*Y-order local texture feature matrix
  • a decomposition unit configured to perform singular value decomposition on the local texture feature matrix obtained by the combining unit to obtain a Y*Y order first right singular matrix
  • a projection unit configured to extract a first column vector of the first right singular matrix obtained by the decomposition unit, as a projection vector
  • a calculation unit configured to calculate a product of the local texture feature matrix obtained by the combining unit and the projection vector obtained by the projection unit, to obtain an X-dimensional lip language feature vector of the video sub-block;
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
  • a lip language classification device comprising:
  • a dividing module configured to, for each of the pre-selected D sample videos, divide the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • the dividing module is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to form the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the sample video yields M×N video sub-blocks in total and each time sub-block includes N video sub-blocks;
  • a feature calculation module configured to calculate a lip language feature vector of each video sub-block obtained by the dividing module, where the lip language feature vector is used to describe texture information of a lip region in a corresponding video sub-block;
  • a training module configured to train the classification accuracy of a multi-layer classifier according to a preset training algorithm and the lip language feature vectors of the video sub-blocks in the D sample videos obtained by the feature calculation module, until the classification accuracy of the multi-layer classifier meets the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of a video;
  • where M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • the training module is configured to perform the following steps:
  • Step 1 According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, where each flag value includes flag bits for indicating the number of hidden layer nodes and flag bits for indicating whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
  • Step 4 According to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with that flag value, obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm, and update the flag value of each specified identifier;
  • the training module is further configured to determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the H selected video sub-blocks in the corresponding D sample videos, where H is a positive integer; for each of the D sample videos, combine the X-dimensional lip language feature vectors of the selected H video sub-blocks in the sample video to obtain an H×X-dimensional lip language feature vector of the sample video; combine the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of D*(H×X) order; perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to a preset retention dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes,
  • where the multi-layer classifier includes at least an input layer node and at least one hidden layer node, and the projection matrix is used to represent the weight of the input layer nodes.
  • the training module is further configured to calculate the average optimal flag value of the L specified identifiers and the global optimal flag value according to the flag value of each specified identifier, and to calculate a local attractor of each specified identifier; and to update the flag value of each specified identifier according to a preset update algorithm based on the local attractor of each specified identifier and the average optimal flag value.
  • the training module is further configured to, for each specified identifier, use the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as the fitness value of the specified identifier; to update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier; and to update the global optimal flag value, to obtain the updated global optimal flag value.
  • a feature extraction device comprising a memory and a processor, where the memory is coupled to the processor and stores program code, and the processor is configured to invoke the program code to perform the following operations:
  • dividing a video into M time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block,
  • so that the video yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks;
  • where M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • a lip language classification device comprising a memory and a processor, where the memory is coupled to the processor and stores program code, and the processor is configured to invoke the program code to perform the following operations:
  • for each of the pre-selected D sample videos, dividing the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • the sample video thus yields M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
  • training the classification accuracy of a multi-layer classifier according to a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies a preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of a video;
  • where M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • the method, the device, and the apparatus provided by the embodiments of the present invention divide a video into M time sub-blocks according to the time dimension, divide the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and form the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks in the video are combined to obtain the lip language feature vector of the video.
  • because the number of video sub-blocks obtained from different videos is the same, the finally extracted lip language feature vectors of different videos have the same dimension, which fixes the feature dimension.
  • as a result, the classification accuracy of the multi-layer classifier can be trained without dynamically adjusting the feature dimension, which simplifies the operation and saves training time; and when the trained multi-layer classifier is used to classify videos, classification time is saved and classification accuracy is improved.
  • FIG. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a lip language classification method according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a feature extraction method according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a lip region of a video frame according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of video blocking according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a pixel neighborhood provided by an embodiment of the present invention.
  • FIG. 7 is a flowchart of a lip language classification method according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a multi-layer classifier according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a lip language classification apparatus according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a feature extraction device according to an embodiment of the present invention.
  • FIG. 12 is a schematic structural diagram of a lip language classification device according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention. Referring to Figure 1, the method includes:
  • Step 101 Divide the video into M time sub-blocks according to a time sequence of video frames in the video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region.
  • Step 102 Divide a lip region of each video frame in each time sub-block into N spatial sub-blocks, and form spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block.
  • the video thus yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks.
  • Step 103 Calculate a lip language feature vector of each video sub-block, where the lip language feature vector is used to describe the texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector.
  • Step 104 Combine the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip language feature vector of the video.
  • M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • the video includes a plurality of video frames, each of which includes a human lip region; by classifying the lip regions of the video frames and determining the category to which the lip regions belong, the content of the person's speech can be determined.
  • M and N are preset values, so the number of video sub-blocks into which each video is divided, M×N, is a fixed value.
  • the lip language feature vectors of different video sub-blocks have the same dimension; if the lip language feature vector of each video sub-block is an X-dimensional vector, then when the lip language feature vectors of the multiple video sub-blocks are combined, the resulting lip language feature vector of the video has a dimension of X×M×N, which is also a fixed value.
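  • for example, using illustrative values that are not taken from the specification: if M = 4, N = 6, and each video sub-block yields an X = 256-dimensional feature vector, then every video, regardless of how many frames it contains, produces a 256 × 4 × 6 = 6144-dimensional lip language feature vector.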
  • the embodiment of the present invention presets, for all videos, the number M of time sub-blocks into which a video is divided and the number N of spatial sub-blocks into which the lip region of each video frame in a time sub-block is divided; therefore, for different videos, the number of video sub-blocks obtained from each video is a fixed value, and the dimension of the lip language feature vector extracted from each video is also a fixed value.
  • the method provided by the embodiment of the present invention divides a video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated.
  • because the number of video sub-blocks obtained from different videos is the same, the finally extracted lip language feature vectors of different videos have the same dimension, which fixes the feature dimension; there is no need to dynamically adjust the feature dimension, which simplifies the operation and saves time.
  • calculating the lip language feature vector of each video sub-block includes:
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
  • FIG. 2 is a flowchart of a lip language classification method according to an embodiment of the present invention. Referring to Figure 2, the method includes:
  • Step 201 For each of the pre-selected D sample videos, divide the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region.
  • Step 202 Divide a lip region of each video frame in each time sub-block into N spatial sub-blocks, and form spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block.
  • the sample video thus yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks.
  • Step 203 Calculate a lip language feature vector of each video sub-block, where the lip language feature vector is used to describe texture information of a lip region in the corresponding video sub-block.
  • Step 204 According to the lip language feature vectors of the video sub-blocks in the D sample videos, train the classification accuracy of a multi-layer classifier using a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of a video.
  • M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • the method provided by the embodiment of the present invention divides a video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; the X-dimensional lip language feature vector of each video sub-block is then calculated.
  • because the number of video sub-blocks obtained from different videos is the same, the dimension of the finally extracted lip language feature vector is the same for all videos, which fixes the feature dimension.
  • training the classification accuracy of the multi-layer classifier according to the lip language feature vectors of the video sub-blocks in the sample videos therefore requires no dynamic adjustment of the feature dimension, which simplifies the operation and saves training time; and when the trained multi-layer classifier is applied to classify videos, no dynamic adjustment of the feature dimension is needed either, which simplifies the operation, saves classification time, and improves classification accuracy.
  • the classification accuracy of the multi-layer classifier is trained according to a preset training algorithm, including:
  • Step 1 According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, where each flag value includes flag bits for indicating the number of hidden layer nodes and flag bits for indicating whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
  • Step 4 According to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with that flag value, obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm, and update the flag value of each specified identifier;
  • training the classification accuracy of the multi-layer classifier corresponding to a flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos includes:
  • combining the X-dimensional lip language feature vectors of the selected H video sub-blocks in the sample video to obtain an H×X-dimensional lip language feature vector of the sample video;
  • the multi-layer classifier includes at least an input layer node and at least one hidden layer node, and the projection matrix is used to represent the weight of the input layer nodes.
  • obtaining the global optimal flag value of the L specified identifiers according to a preset search algorithm based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with each flag value, and updating the flag value of each specified identifier, includes:
  • the flag value of each specified identifier is updated according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value.
  • obtaining the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier includes:
  • using the classification accuracy of the multi-layer classifier trained with the flag value of a specified identifier as the fitness value of that specified identifier, where the fitness value is used to update the global optimal flag value of the L specified identifiers to obtain an updated global optimal flag value.
  • FIG. 3 is a flowchart of a feature extraction method according to an embodiment of the present invention.
  • the execution body of the embodiment of the present invention is a feature extraction device. Referring to FIG. 3, the method includes:
  • Step 301 For each video, the feature extraction device divides the video into M×N video sub-blocks according to the chronological order of the video frames in the video.
  • the feature extraction device may be a device such as a computer or a server, which is not limited in this embodiment of the present invention.
  • the video includes a plurality of video frames, and the plurality of video frames are arranged in chronological order.
  • the feature extraction device may divide the video into M time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame of the video includes a lip region. The feature extraction device then divides the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forms the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that each time sub-block includes N video sub-blocks and the video yields M×N video sub-blocks in total, where M and N are positive integers.
  • the feature extraction device may first divide the video into M time sub-blocks and then locate and segment each video frame to obtain its lip region, dividing the lip region of each video frame into N spatial sub-blocks; alternatively, it may first locate and segment each video frame to obtain its lip region, then divide the obtained lip regions into M time sub-blocks and divide the lip regions in each time sub-block into N spatial sub-blocks. The timing of the locating and segmentation is not limited in the embodiment of the present invention.
  • t represents the time of the video frame, and the coordinate system formed by x and y represents the space in which the video frame is located.
  • the video includes multiple video frames, and each video frame is located and segmented to obtain the lip region of that video frame.
  • the feature extraction device may first divide along the time dimension, grouping the video frames according to their chronological order in the video with at least two video frames per group, so that the video is divided into M time sub-blocks and each time sub-block includes at least two consecutive video frames; it then divides along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks, so that each time sub-block contains multiple spatial sub-blocks. The feature extraction device combines the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, thereby obtaining M×N video sub-blocks.
  • alternatively, the feature extraction device may first divide along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks, and then divide along the time dimension, taking the spatial sub-blocks of at least two video frames as one time sub-block so that the video is divided into M time sub-blocks, and finally combine the spatial sub-blocks located at the same position in the video frames of each time sub-block into one video sub-block, thereby obtaining M×N video sub-blocks.
  • the embodiment of the present invention does not limit this.
  • for example, the feature extraction device may divide the video along the time dimension to obtain M time sub-blocks, where each time sub-block includes a plurality of video frames, then divide along the spatial dimension so that the lip region of each video frame is divided into N spatial sub-blocks, and form the spatial sub-blocks located at the same position in the video frames into one video sub-block; each time sub-block then yields N video sub-blocks, and the video yields M×N video sub-blocks in total.
  • as an illustration, suppose each time sub-block includes the lip regions of three video frames, and the lip region of each video frame is divided into four spatial sub-blocks: the upper-left lip region, the upper-right lip region, the lower-left lip region, and the lower-right lip region. The upper-left lip regions of the three video frames form video sub-block (1), the upper-right lip regions form video sub-block (2), the lower-left lip regions form video sub-block (3), and the lower-right lip regions form video sub-block (4), giving four video sub-blocks.
  • in addition, the feature extraction device may filter or repeat video frames according to the number of frames in the video, so that for a given video the number of video frames included in each time sub-block can be the same. For example, if the video includes ten video frames numbered 1-10 and M is 4, the feature extraction device may group video frames 1-4 into one time sub-block, frames 3-6 into a second time sub-block, frames 5-8 into a third time sub-block, and frames 7-10 into a fourth time sub-block, so that the video is divided into 4 time sub-blocks and each time sub-block includes 4 video frames.
  • the M is referred to as a first preset number
  • N is referred to as a second preset number.
  • the feature extraction device predetermines the first preset number M and the second preset number N; the first preset number M is used to specify the number of time sub-blocks into which each video is divided, and the second preset number N is used to specify the number of spatial sub-blocks into which each video frame in a time sub-block is divided.
  • the first preset number M and the second preset number N may be determined in advance by the feature extraction device according to the accuracy requirement for the video sub-blocks; M may be 4, 5, or another value, and N may be 5, 6, or another value, which is not limited in the embodiment of the present invention.
  • in addition, the feature extraction device may preset the number of video frames included in each time sub-block, for example a third preset number Q of video frames per time sub-block. When performing the division, if the number of frames in the video is greater than the product M×Q of the first preset number and the third preset number, the feature extraction device may filter the video frames and discard a first specified number of video frames, where the first specified number equals the difference between the number of video frames and M×Q, so that the filtered video has exactly M×Q frames; the video frames are then divided into blocks, so that when the video is divided into M time sub-blocks, each time sub-block can contain Q video frames.
  • if the number of frames in the video is smaller than M×Q, the feature extraction device may select a second specified number of video frames, where the second specified number equals the difference between M×Q and the number of video frames, and place each of the selected video frames in two time sub-blocks, so that when the video is divided into M time sub-blocks, each time sub-block can contain Q video frames. Q is a positive integer, and Q may be determined in advance by the feature extraction device according to the accuracy requirement for the video sub-blocks, which is not limited in the embodiment of the present invention.
  • through the above process, the feature extraction device blocks the video to obtain M×N video sub-blocks. By dividing the video along both the time dimension and the spatial dimension, the original video is divided into a set of video sub-blocks, which increases the temporal and spatial information contained in the subsequently extracted features and allows the lip language features to be expressed better.
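  • as a concrete illustration of the blocking in step 301, the following Python/NumPy sketch divides a sequence of already-cropped lip-region images into M×N video sub-blocks; the function name, the fixed-size crops, and the overlapping-window grouping are illustrative assumptions rather than requirements of the specification.

```python
import numpy as np

def divide_into_video_subblocks(lip_frames, M, Q, N_rows, N_cols):
    """Divide a sequence of lip-region images into M x N video sub-blocks.

    lip_frames: array of shape (T, H, W), the cropped lip region of every frame (T >= Q).
    M: number of time sub-blocks; Q: frames kept per time sub-block.
    N_rows, N_cols: spatial grid, so N = N_rows * N_cols spatial sub-blocks per frame.
    Returns a list of M * N video sub-blocks, each of shape (Q, H // N_rows, W // N_cols).
    """
    T, H, W = lip_frames.shape
    # Overlapping start positions so every time sub-block holds exactly Q frames,
    # mirroring the 1-4 / 3-6 / 5-8 / 7-10 grouping in the example above.
    starts = np.linspace(0, T - Q, M).astype(int)

    h, w = H // N_rows, W // N_cols
    sub_blocks = []
    for s in starts:                              # time dimension first
        time_block = lip_frames[s:s + Q]
        for r in range(N_rows):                   # then the spatial dimension
            for c in range(N_cols):
                sub_blocks.append(
                    time_block[:, r * h:(r + 1) * h, c * w:(c + 1) * w])
    return sub_blocks                             # M * N video sub-blocks in total

# Example: 10 frames, M = 4 time sub-blocks of Q = 4 frames, 2 x 2 spatial grid (N = 4)
video = np.random.randint(0, 256, size=(10, 32, 48), dtype=np.uint8)
blocks = divide_into_video_subblocks(video, M=4, Q=4, N_rows=2, N_cols=2)
print(len(blocks), blocks[0].shape)   # 16 (4, 16, 24) -- a fixed count for any video length
```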
  • Step 302 The feature extraction device calculates an X-dimensional lip language feature vector of each video sub-block, and the lip language feature vector is used to describe texture information of a lip region in the corresponding video sub-block.
  • each video sub-block includes at least one spatial sub-block, and each spatial sub-block is actually a part of the lip area in the video frame.
  • this step 302 can include the following steps 302a-302c:
  • Step 302a The feature extraction device extracts an X-dimensional LBP feature vector of each spatial sub-block in the video sub-block; if the video sub-block includes Y spatial sub-blocks, the LBP feature vectors of the Y spatial sub-blocks are combined to obtain an X*Y-order local texture feature matrix.
  • specifically, the feature extraction device may extract the X-dimensional LBP feature vector of each spatial sub-block in the video sub-block, where the LBP feature vector describes the texture information of the lip region in the corresponding spatial sub-block; taking the LBP feature vector of each spatial sub-block as one column, the LBP feature vectors of the Y spatial sub-blocks are combined to obtain an X*Y-order matrix, which is the local texture feature matrix of the video sub-block.
  • Y is a positive integer, and an X*Y-order matrix represents a matrix of X rows and Y columns.
  • to compute the LBP feature value of a pixel, the feature extraction device takes the pixel as the central pixel and acquires each specified pixel adjacent to the central pixel. The pixel value of each specified pixel is compared with the pixel value of the central pixel: if the pixel value of the specified pixel is greater than the pixel value of the central pixel, the feature value of the specified pixel is set to 1; otherwise, the feature value of the specified pixel is set to 0. A binary feature value is thus set for each specified pixel; the feature values of all specified pixels around the central pixel are combined, and the resulting binary number is converted into a decimal number to obtain the LBP feature value of the central pixel.
  • the feature extraction device may acquire the LBP feature value of every pixel in the spatial sub-block, calculate a statistical histogram of the spatial sub-block from these LBP feature values, and normalize the statistical histogram to obtain the LBP feature vector of the spatial sub-block.
  • for example, a pixel is taken as the central pixel and its neighborhood, a 3*3 pixel region, is acquired; the pixel value of each pixel in the neighborhood is as shown in FIG. 6. Comparing the pixel value of each specified pixel around the central pixel with the pixel value of the central pixel gives the feature value of each specified pixel, also shown in FIG. 6.
  • taking the specified pixel in the upper-left corner as the rightmost bit and combining the feature values of the specified pixels in clockwise order, the LBP feature value of the neighborhood is obtained as the binary number 11110001.
  • accordingly, the specified pixels around the central pixel have weights of 1, 2, 4, 8, 16, 32, 64, and 128, respectively.
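  • the LBP computation described above can be sketched as follows in Python/NumPy; the clockwise bit ordering starting from the upper-left neighbor, the 256-bin histogram (X = 256), and the helper name lbp_histogram are illustrative assumptions, not mandated by the specification.

```python
import numpy as np

def lbp_histogram(block):
    """Return the normalized 256-bin LBP histogram of a grayscale spatial sub-block."""
    # Offsets of the 8 specified pixels around the central pixel, visited clockwise;
    # the upper-left neighbor maps to the lowest-order bit (weight 1), as in FIG. 6.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = block.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = block[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = block[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Feature value is 1 where the specified pixel exceeds the central pixel.
        codes |= ((neighbor > center).astype(np.uint8) << bit)
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()          # normalized statistical histogram (X = 256)

# Example: one 16 x 24 spatial sub-block of a lip region
sub_block = np.random.randint(0, 256, size=(16, 24), dtype=np.uint8)
print(lbp_histogram(sub_block).shape)   # (256,)
```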
  • LBP is essentially used to describe the texture information of each spatial sub-block in the video.
  • the extracted LBP features have significant advantages such as rotation invariance and gray-scale invariance, which enhance the robustness of the lip language feature vector to factors such as illumination conditions, rotation, and translation and thus improve classification accuracy; moreover, the LBP feature vector has strong discriminative power and is simple to compute.
  • Step 302b The feature extraction device performs singular value decomposition on the local texture feature matrix to obtain a Y*Y-order first right singular matrix and extracts the first column vector of the first right singular matrix as a projection vector, where a Y*Y-order matrix represents a matrix of Y rows and Y columns.
  • specifically, the feature extraction device may perform singular value decomposition on the local texture feature matrix, take the right singular matrix obtained by the decomposition as the first right singular matrix, and extract the first column vector of the first right singular matrix as the projection vector.
  • the feature extraction device may apply the following formula to perform singular value decomposition on the local texture feature matrix:
  • X = u · s · v^T
  • where X represents the local texture feature matrix, which is an X*Y-order matrix; the matrix u is the left singular matrix obtained by decomposing the local texture feature matrix X; the matrix s is the singular value matrix obtained by decomposing the local texture feature matrix X; and the matrix v is the right singular matrix obtained by decomposing the local texture feature matrix X.
  • the first right singular matrix obtained by performing singular value decomposition on the local texture feature matrix is a Y*Y-order matrix, and the projection vector is a Y-dimensional vector.
  • Step 302c The feature extraction device calculates a product of the local texture feature matrix and the projection vector to obtain an X-dimensional lip language feature vector of the video sub-block.
  • the feature extraction device may calculate the lip language feature vector of the video sub-block by applying the following formula:
  • f_PLBP = X · pVect
  • where X represents the local texture feature matrix, pVect represents the projection vector, i.e. the first column vector of the matrix v, and f_PLBP represents the lip language feature vector.
  • since the local texture feature matrix is an X*Y-order matrix and the projection vector is a Y-dimensional vector, the calculated lip language feature vector is an X-dimensional vector.
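  • putting steps 302a-302c together, the PLBP feature of one video sub-block might be computed as in the following sketch; numpy.linalg.svd and the lbp_histogram helper from the previous sketch are illustrative choices and not part of the specification.

```python
import numpy as np

def plbp_feature(video_sub_block):
    """Compute the X-dimensional PLBP feature vector of one video sub-block.

    video_sub_block: array of shape (Y, h, w) -- Y spatial sub-blocks, one per frame
    of the time sub-block at the same spatial position.
    """
    # Local texture feature matrix: one X-dimensional LBP histogram per column.
    lbp_columns = [lbp_histogram(frame_patch) for frame_patch in video_sub_block]
    local_texture = np.stack(lbp_columns, axis=1)        # X*Y-order matrix

    # Singular value decomposition: local_texture = u @ np.diag(s) @ vt
    u, s, vt = np.linalg.svd(local_texture, full_matrices=False)
    p_vect = vt[0, :]                                     # first column of v (first row of v^T)

    # Project the local texture feature matrix onto the projection vector.
    return local_texture @ p_vect                         # X-dimensional PLBP vector

# Example: a video sub-block made of 4 spatial sub-blocks of size 16 x 24
vsb = np.random.randint(0, 256, size=(4, 16, 24), dtype=np.uint8)
print(plbp_feature(vsb).shape)    # (256,)
```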
  • after the lip language feature vector of each video sub-block has been calculated, the lip language feature vector of the video can be obtained as follows:
  • F = [f_PLBP1, f_PLBP2, ..., f_PLBPm]
  • where f_PLBPi represents the PLBP feature of the i-th video sub-block in the video, m represents the number of video sub-blocks (M×N) into which the video is divided, and F represents the lip language feature vector of the video. Since the number of video sub-blocks into which different videos are divided is fixed, the dimension of the lip language feature vector F of the video is also fixed, and the lip language feature vector can be used to classify the video.
  • in other words, after the local texture feature matrix of the video sub-block has been extracted, singular value decomposition is performed on it, and the first column vector of the resulting right singular matrix is used as the optimal projection vector.
  • the local texture feature matrix is then projected onto this vector, which yields the PLBP (Projection Local Binary Patterns) feature of the video sub-block.
  • the PLBP feature is a method for extracting features of image sequences based on LBP features.
  • its basic principle is to extract the LBP feature vector corresponding to each frame of the image sequence and to combine the feature vectors of all frames into a feature matrix, in which each column corresponds to the feature vector of one frame, so that a feature of fixed dimension can be extracted from image sequences with different numbers of frames.
  • the optimal projection vector is then found, and the feature matrix is projected based on this optimal projection vector.
  • the embodiment of the present invention adopts the idea of extracting features by block, divides video into several video sub-blocks in time and space, extracts PLBP features of video sub-blocks, and finally combines PLBP features of video sub-blocks. Become a new feature and output as the final feature.
  • the application of the blocking technique increases the spatial and temporal information contained in the finally extracted feature vector, which better describes the movement of the lips in the image sequence and greatly facilitates the optimization of the subsequent classifier training algorithm.
  • in the lip language classification stage this significantly improves the lip-reading recognition rate, and it is of great reference value for related video recognition technologies.
  • Step 303 The feature extraction device combines the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip language feature vector of the video.
  • the feature extraction device may acquire the X-dimensional lip language feature vector of each video sub-block in the video and combine the X-dimensional lip language feature vectors of the M×N video sub-blocks,
  • arranging the X-dimensional lip language feature vector of each video sub-block after that of the previous video sub-block, thereby obtaining an X×M×N-dimensional lip language feature vector.
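  • a minimal sketch of step 303, reusing the illustrative helpers introduced in the earlier sketches (divide_into_video_subblocks and plbp_feature are assumed names, not taken from the specification):

```python
import numpy as np

def video_lip_feature(lip_frames, M=4, Q=4, N_rows=2, N_cols=2):
    """Fixed-dimension lip language feature of a whole video (X * M * N values)."""
    blocks = divide_into_video_subblocks(lip_frames, M, Q, N_rows, N_cols)
    # One X-dimensional PLBP vector per video sub-block, concatenated in order.
    return np.concatenate([plbp_feature(b) for b in blocks])

video = np.random.randint(0, 256, size=(10, 32, 48), dtype=np.uint8)
print(video_lip_feature(video).shape)   # (256 * 4 * 4,) = (4096,)
```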
  • a lip language feature vector of any video can be extracted, and the lip language feature vector can be used to classify the semantic information of the video.
  • the feature extraction device may obtain the lip language feature vectors of a plurality of sample videos using the above steps 301-303 and train a classifier according to these lip language feature vectors; whenever the semantic information of a video is to be classified, steps 301-303 are used to obtain the lip language feature vector of that video, and the lip language feature vector is input to the trained classifier to obtain the classification result.
  • the method provided by the embodiment of the present invention divides a video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated.
  • the feature extraction method provided by the embodiment of the present invention can extract a lip-feature feature vector with a fixed dimension, and the lip-feature feature vector can be used to classify the semantic information of the video.
  • the classification process is detailed in the next embodiment.
  • FIG. 7 is a flowchart of a lip language classification method according to an embodiment of the present invention.
  • the execution subject of the embodiment of the present invention is a classification device. Referring to FIG. 7, the method includes:
  • Step 701 The classification device extracts a lip feature vector of the video sub-block in the pre-selected D sample videos.
  • specifically, the classification device acquires D sample videos in advance, divides each sample video into M×N video sub-blocks, and extracts the lip language feature vector of each video sub-block; the specific process is similar to steps 301-302 above and is not described again here.
  • Step 702 The classification device trains the classification accuracy of a multi-layer classifier according to a preset training algorithm and the lip language feature vectors of the video sub-blocks in the D sample videos, until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; D is a positive integer, D>1.
  • specifically, the classification device may train the classification accuracy of the multi-layer classifier according to the preset training algorithm and the lip language feature vectors of the video sub-blocks in the D sample videos until the classification accuracy of the multi-layer classifier satisfies the first preset condition.
  • the multi-layer classifier obtained by the training can then be used to classify the semantic information of a video to be classified, thereby implementing lip language recognition.
  • the preset training algorithm is determined in advance by the classification device and may be an SVM (Support Vector Machine) classification algorithm, an artificial neural network algorithm, or the like; the first preset condition is used to determine the training target of the multi-layer classifier and may be set according to the requirement on classification accuracy.
  • for example, the first preset condition may be that the current classification accuracy of the multi-layer classifier reaches a preset classification accuracy, that the difference between the current classification accuracy and the previous classification accuracy is less than a preset difference, or that the number of iterations of the training process reaches the maximum number of iterations, which is not limited in the embodiment of the present invention.
  • the classification device may combine the lip language feature vectors of all the video sub-blocks in each sample video into one lip language feature vector; the specific process is similar to step 303 above, in which the feature extraction device combines the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain the X×M×N-dimensional lip language feature vector of the video, and is not described again here.
  • the classification device can combine the lip language feature vectors of the sample videos into a feature matrix and use the feature matrix as the input of the multi-layer classifier, training the classification accuracy of the multi-layer classifier with an ELM (Extreme Learning Machine) algorithm.
  • the multi-layer classifier includes input layer nodes and at least one hidden layer node; the input weights are used to represent the weights of the input layer nodes, and only the number of hidden layer nodes needs to be determined during training.
  • the multi-layer classifier can be trained based on the input weights, the offset term, and the excitation function; when the training is completed, the multi-layer classifier can be determined according to the input weights and output weights of the current training.
  • however, the input weights and offsets of the hidden layer nodes used by the ELM algorithm are obtained by random assignment; random assignment makes the multi-layer classifier unstable on high-dimensional, small-sample problems, and it is difficult to obtain optimal parameter values.
  • therefore, the classification device may determine the input weights of the multi-layer classifier by means of projection and train the classification accuracy of the multi-layer classifier based on the determined input weights; such a classifier is referred to as a PELM (Projection Extreme Learning Machine) multi-layer classifier.
  • FIG. 8 is a structural diagram of the PELM multi-layer classifier. The PELM multi-layer classifier includes input nodes, hidden layer nodes, and output nodes, where D represents the dimension of the input feature vector, N is the number of hidden layer nodes, and m is the dimension of the output vector (i.e., the number of categories of spoken content to be distinguished).
  • each row of the input corresponds to one input sample (the feature vector of a video sub-block);
  • each row of the output represents the class vector of a sample (the position corresponding to the class that the sample belongs to is 1, and the remaining positions are 0);
  • w_DN represents the input weight from the D-th input node to the N-th hidden layer node, and β_Nm represents the output weight from the N-th hidden layer node to the m-th output node.
  • the training process of the PELM multi-layer classifier may be as follows: the classification device acquires the lip language feature vectors of a plurality of sample videos and combines them into a feature matrix; performs singular value decomposition on the feature matrix and takes the right singular matrix obtained by the decomposition; extracts from the right singular matrix the column vectors corresponding to a preset retention dimension as a projection matrix; uses the projection matrix as the input weights of the multi-layer classifier, i.e. the projection matrix represents the weights of the input layer nodes and the input weights are no longer randomly assigned; and trains the classification accuracy of the multi-layer classifier based on the projection matrix, the current number of hidden layer nodes, and the excitation function.
  • when training is completed, the multi-layer classifier is determined according to the trained input weights and output weights, so that it can be used to classify the semantic information of videos.
  • the preset retention dimension is used to specify the number of columns of the projection matrix; it is less than the number of columns of the right singular matrix and may be 1, 2, or another value, which is not limited in the embodiment of the present invention.
  • for example, the classification device acquires the lip language feature vectors of D sample videos; if the lip language feature vector of each sample video is an R-dimensional vector, the R-dimensional lip language feature vectors of the D sample videos are combined to obtain a feature matrix, based on which the PELM multi-layer classifier is trained.
  • the category corresponding to the maximum value among the outputs t_1, t_2, ..., t_m is the classification result of the video.
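  • the following sketch shows one way the PELM training and classification described above could look in Python/NumPy; the sigmoid excitation function, the Moore-Penrose pseudo-inverse for the output weights, and tying the number of hidden layer nodes to the retention dimension are illustrative simplifications consistent with standard ELM practice, not an exact reproduction of the patented algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_pelm(features, labels, retain_dim):
    """Train a minimal PELM-style single-hidden-layer classifier.

    features: (D, R) feature matrix, one combined lip language feature vector per sample video.
    labels:   (D,) integer class labels in 0..m-1.
    For simplicity this sketch uses one hidden layer node per retained dimension.
    """
    m = labels.max() + 1
    targets = np.eye(m)[labels]                      # one-hot class vectors, shape (D, m)

    # Projection matrix taken from the SVD of the feature matrix replaces
    # the randomly assigned input weights of a plain ELM.
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    w_in = vt[:retain_dim, :].T                      # (R, retain_dim) input weights

    hidden = sigmoid(features @ w_in)                # excitation function on the hidden layer
    beta = np.linalg.pinv(hidden) @ targets          # output weights by least squares
    return w_in, beta

def classify_pelm(w_in, beta, feature):
    t = sigmoid(feature @ w_in) @ beta               # outputs t_1 .. t_m
    return int(np.argmax(t))                         # category with the maximum output

# Example with random stand-in data: 40 sample videos, 4096-dim features, 5 classes
X = np.random.randn(40, 4096)
y = np.random.randint(0, 5, size=40)
w_in, beta = train_pelm(X, y, retain_dim=20)
print(classify_pelm(w_in, beta, X[0]))
```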
  • PELM is an easy-to-use and effective single-hidden-layer neural network learning algorithm. Traditional neural network learning algorithms require a large number of network training parameters to be set manually and easily produce local optimal solutions; in contrast, PELM only needs the projection matrix of the feature matrix composed of the lip language feature vectors of multiple sample videos, used as the input weights of the network, together with the number of hidden layer nodes. The input weights of the network and the biases of the hidden layer nodes do not need to be adjusted during the execution of the algorithm, and a unique optimal solution can be generated.
  • PELM therefore has the advantages of fast learning speed and good generalization performance, so that the trained multi-layer classifier can obtain a stable recognition rate.
  • The classification device may also adopt a preset selection policy to select, from the video sub-blocks of the multiple sample videos, the video sub-blocks that describe the lip-speech information well, recombine the selected video sub-blocks, and take only the selected video sub-blocks as samples; these samples can then be used to train the classification accuracy of the multi-layer classifier, which reduces redundant features and improves the calculation speed.
  • the number of selected video sub-blocks in different sample videos is the same to ensure that the dimension of the lip-feature vector of different sample videos is fixed.
  • the preset selection policy is used to determine the selection mode of the video sub-block, which may be determined in advance by the classification device, which is not limited by the embodiment of the present invention.
  • the classifying device may select a video sub-block in the sample video by using the training method provided in the following steps 702a-702c, and train a multi-layer classifier based on the lip feature vector of the selected video sub-block:
  • Step 702a Constructing L designated identifiers according to a predetermined rule, and acquiring a flag value of each specified identifier, where the L specified identifiers are used to determine the corresponding number of hidden layer nodes and the selected video sub-blocks in the D sample videos.
  • The classification device can construct L designated identifiers according to a predetermined rule, where L is a positive integer and L>1. Then, according to different numbers of hidden layer nodes and different selection modes of the video sub-blocks, each specified identifier is initialized and assigned a flag value; a flag value may be randomly assigned to each specified identifier, or a preset flag value, such as 0000 or 1111, may be assigned to each specified identifier, which is not limited by the embodiment of the present invention.
  • Each flag value includes flag bits for indicating the number of hidden layer nodes and flag bits for indicating whether each video sub-block is selected. Different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks; that is, each flag value corresponds to a number of hidden layer nodes and to the selected video sub-blocks in the D sample videos, so the corresponding number of hidden layer nodes and the video sub-blocks to be selected in each sample video can be determined from each flag value.
  • the classification device may use 10 flag bits as the flag bits indicating the number of hidden layer nodes, and the decimal value corresponding to the binary number formed by the 10 flag bits is the number of hidden layer nodes.
  • The classification device may also use m flag bits as the flag bits indicating whether each video sub-block is selected, where m is the number of video sub-blocks in each sample video. When one of the m flag bits is 1, it indicates that the video sub-block corresponding to that flag bit is selected; when one of the m flag bits is 0, it indicates that the video sub-block corresponding to that flag bit is not selected. For example, when the m flag bits are 1000, the first video sub-block in each video is selected.
  • the classification device takes the binary number composed of the 10+m flag bits as a flag value, 10+m represents the number of bits of the flag value, and assigns the flag value to a specified identifier.
  • Step 702b For each flag value of the specified identifier, according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vector of the selected video sub-block in the corresponding D sample videos, corresponding to the flag value The classification accuracy of the multi-layer classifier is trained.
  • Each flag value is used to train a corresponding multi-layer classifier. For each specified identifier, when the flag value of the specified identifier is obtained, the number of hidden layer nodes corresponding to the flag value and the selected video sub-blocks in the corresponding D sample videos can be determined, and the multi-layer classifier corresponding to the flag value can be trained; the flag values of the L specified identifiers can thus train L multi-layer classifiers.
  • the step 702b may specifically include the following steps 702b-1 to 702b-4:
  • Step 702b-1 Determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the selected H video sub-blocks in the corresponding D sample videos, where H is a positive integer.
  • the classifying device may determine the number of hidden layer nodes corresponding to the flag value and the selected H video sub-blocks of the corresponding D sample videos according to values on the plurality of flag bits in the flag value.
  • For example, if a flag value includes 10+m flag bits, the classification device calculates the decimal value corresponding to the first 10 flag bits of the flag value, which is the number of hidden layer nodes corresponding to the flag value, and obtains the flag bits having a value of 1 among the last m flag bits; the video sub-blocks corresponding to the flag bits having a value of 1 are the selected video sub-blocks in each sample video.
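  • A minimal sketch of decoding such a flag value, with assumed helper names, could look as follows:

```python
# Illustrative decoding of a 10+m-bit flag value (names are assumptions).
def decode_flag(flag_bits: str, m: int):
    """flag_bits: a binary string of length 10 + m; m: number of video sub-blocks."""
    hidden_nodes = int(flag_bits[:10], 2)       # first 10 bits as a decimal value
    selection = flag_bits[10:10 + m]
    selected = [i for i, bit in enumerate(selection) if bit == '1']
    return hidden_nodes, selected               # node count and selected sub-block indices
```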
  • Step 702b-2: For each sample video in the D sample videos, combine the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video to obtain the H×X-dimensional lip feature vector of the sample video, and combine the H×X-dimensional lip feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X).
  • In this case, the classification device trains the multi-layer classifier only according to the selected video sub-blocks. Therefore, for each sample video in the D sample videos, the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video are combined to obtain the H×X-dimensional lip feature vector of the sample video, and the lip feature vectors of all the video sub-blocks in the sample video are no longer combined. An H×X-dimensional lip feature vector is thus obtained for each sample video. Taking the H×X-dimensional lip feature vector of each sample video as one row, the H×X-dimensional lip feature vectors of the D sample videos are combined to obtain a feature matrix of order D*(H×X).
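  • Assuming NumPy and illustrative names, assembling this feature matrix can be sketched as:

```python
# Illustrative assembly of the D x (H*X) feature matrix from selected sub-blocks.
import numpy as np

def build_feature_matrix(videos, selected):
    """videos: list of D arrays, each of shape (num_sub_blocks, X), one X-dimensional
    lip feature vector per video sub-block; selected: indices of the H chosen sub-blocks."""
    rows = [video[selected, :].reshape(-1) for video in videos]  # each row is H*X-dimensional
    return np.vstack(rows)                                       # D x (H*X) feature matrix
```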
  • Step 702b-3 performing singular value decomposition on the feature matrix to obtain a second right singular matrix, and extracting, from the second right singular matrix, a column vector corresponding to the preset retention dimension as a projection matrix.
  • After acquiring the feature matrix, the classification device performs singular value decomposition on the feature matrix to obtain the right singular matrix, taken as the second right singular matrix, and extracts, from the second right singular matrix, the column vectors corresponding to the preset retention dimension as the projection matrix.
  • The projection process is similar to the above step 302b, and details are not described herein again.
  • Step 702b-4: Train the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes; the multi-layer classifier includes at least input layer nodes and at least one hidden layer node, and the projection matrix is used to represent the weights of the input layer nodes.
  • After calculating the projection matrix, the classification device represents the weights of the input layer nodes of the multi-layer classifier by the projection matrix, and trains the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes, until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops.
  • For example, the classification device may use the D sample videos as training sample videos, obtain the projection matrix corresponding to the D sample videos, and train a multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes. It then acquires D' test sample videos, obtains the feature matrix corresponding to the D' test sample videos, inputs the feature matrix into the multi-layer classifier to obtain the classification result of each test sample video, compares the classification result of each test sample video with the category to which the test sample video actually belongs, and calculates the classification accuracy of the multi-layer classifier.
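  • A minimal sketch of this accuracy estimate, reusing the illustrative classify_pelm helper sketched earlier, could be:

```python
# Illustrative classification-accuracy estimate on D' test sample videos.
# classify_pelm is the assumed helper from the PELM sketch above.
import numpy as np

def classification_accuracy(X_test, labels, W, beta):
    """X_test: one test lip feature vector per row; labels: the true categories."""
    predictions = np.array([classify_pelm(x, W, beta) for x in X_test])
    return float(np.mean(predictions == np.array(labels)))
```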
  • Step 702c: According to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with that flag value, obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm, and update the flag value of each specified identifier.
  • The flag value of each specified identifier trains one multi-layer classifier. The classification device can search the flag values of the L specified identifiers according to a preset search algorithm to find the global optimal flag value of the L specified identifiers, thereby obtaining the multi-layer classifier corresponding to the global optimal flag value.
  • the preset search algorithm may be determined in advance by the classification device, which is not limited by the embodiment of the present invention.
  • the classification device may obtain the global optimal flag value by using the search methods provided in the following steps 702c-1 to 702c-4:
  • Step 702c-1: Calculate the average optimal flag value of the L specified identifiers according to the flag value of each specified identifier.
  • The classification device may calculate the average optimal flag value of the L specified identifiers according to the flag value of each specified identifier by applying the following formula, evaluated bit by bit over the n flag bits:
  • m_best(t) = (1/L) * Σ_{i=1}^{L} P_i(t)
  • where n is the number of dimensions of the flag value of a specified identifier and P_i,n(t) represents the value of the n-th flag bit in the flag value of the i-th specified identifier.
  • Step 702c-2 Obtain a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier, and obtain an optimal flag value of each specified identifier and the global optimal flag value.
  • For each specified identifier, the classification device may use the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as the fitness value of the specified identifier, and use the fitness values to adjust and evolve the flag values of the L specified identifiers: according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, the optimal flag value of the specified identifier is updated to obtain the updated optimal flag value of the specified identifier; according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, the global optimal flag value of the L specified identifiers is updated to obtain the updated global optimal flag value.
  • The optimal fitness value of a specified identifier refers to the maximum fitness value among the fitness values corresponding to the plurality of flag values taken by the specified identifier, and the optimal flag value of a specified identifier refers to, over multiple iterations, the flag value with the largest fitness value among the plurality of flag values taken by the specified identifier.
  • Specifically, the classification device obtains the current fitness value and the historical optimal fitness value of the specified identifier; if the current fitness value of the specified identifier is greater than the historical optimal fitness value of the specified identifier, the current flag value of the specified identifier is taken as the optimal flag value of the specified identifier; if the current fitness value of the specified identifier is not greater than the historical optimal fitness value of the specified identifier, the optimal flag value of the specified identifier remains unchanged, that is, it remains the flag value corresponding to the historical optimal fitness value of the specified identifier.
  • The global optimal fitness value of the L specified identifiers refers to the maximum among the optimal fitness values of the L specified identifiers, and the global optimal flag value of the L specified identifiers refers to, over multiple iterations, the flag value with the highest fitness value among all the flag values taken by the specified identifiers.
  • Similarly, the classification device obtains the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers; if the current fitness value of the specified identifier is greater than the historical global optimal fitness value of the L specified identifiers, the current flag value of the specified identifier is taken as the global optimal flag value of the L specified identifiers; if the current fitness value of the specified identifier is not greater than the historical global optimal fitness value of the L specified identifiers, the global optimal flag value of the L specified identifiers remains unchanged, that is, it remains the flag value corresponding to the historical global optimal fitness value of the L specified identifiers.
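  • As a sketch only (assuming NumPy arrays and illustrative names), the update of the optimal and global optimal flag values from the fitness values could be written as:

```python
# Illustrative update of per-identifier optimal flag values and the global optimum.
import numpy as np

def update_bests(flags, fitness, best_flags, best_fitness, g_flag, g_fitness):
    """flags, best_flags: L x n 0/1 arrays; fitness[i]: classification accuracy
    of the classifier trained with flags[i] (the fitness value)."""
    improved = fitness > best_fitness
    best_flags[improved] = flags[improved]      # update each identifier's optimal flag value
    best_fitness[improved] = fitness[improved]
    i = int(np.argmax(fitness))
    if fitness[i] > g_fitness:                  # update the global optimal flag value
        g_flag, g_fitness = flags[i].copy(), float(fitness[i])
    return best_flags, best_fitness, g_flag, g_fitness
```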
  • Step 702c-3 Calculate a local attractor of each specified identifier according to an optimal flag value of each specified identifier and the global optimal flag value.
  • The classification device may calculate the local attractor of the specified identifier according to the optimal flag value of the specified identifier and the global optimal flag value, for example as a random convex combination of the two, p(t) = φ * P_i(t) + (1 - φ) * P_g(t) with φ taken randomly from (0, 1), where
  • p(t) represents the local attractor of the specified identifier,
  • P_i(t) represents the optimal flag value of the specified identifier, and
  • P_g(t) represents the global optimal flag value of the L specified identifiers.
  • Step 702c-4 Update the flag value of each specified identifier according to a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
  • The classification device may apply the following formula to update the flag value of the specified identifier according to the local attractor of the specified identifier and the average optimal flag value:
  • x(t+1) = p(t) ± α * abs(m_best - x(t)) * ln(1/u);
  • where x(t+1) represents the updated flag value of the specified identifier, p(t) represents the local attractor of the specified identifier, α represents the contraction-expansion coefficient, m_best represents the average optimal flag value, x(t) represents the flag value of the specified identifier before the update, and u is a random number uniformly distributed on (0, 1), u ~ U(0, 1).
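  • A minimal sketch of one flag-value update step is given below, assuming NumPy; the random convex combination used for the local attractor, the default contraction-expansion coefficient, and the thresholding back to 0/1 bits are illustrative assumptions rather than details stated in the text.

```python
# Illustrative BQPSO-style update of the flag values (binary arrays).
import numpy as np

def update_flags(flags, personal_best, global_best, alpha=0.75):
    """flags, personal_best: L x n 0/1 arrays; global_best: length-n 0/1 array.
    alpha: contraction-expansion coefficient (value is an assumption)."""
    m_best = flags.mean(axis=0)                           # average optimal flag value (step 702c-1)
    phi = np.random.rand(*flags.shape)
    p = phi * personal_best + (1.0 - phi) * global_best   # local attractor (assumed form)
    u = np.random.rand(*flags.shape)
    sign = np.where(np.random.rand(*flags.shape) < 0.5, 1.0, -1.0)
    x_new = p + sign * alpha * np.abs(m_best - flags) * np.log(1.0 / u)
    return (x_new > 0.5).astype(int)                      # map back to binary flag bits (assumption)
```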
  • After the flag value of each specified identifier is updated, the flag value of each specified identifier changes, and the classification device repeatedly performs the above steps 702b to 702c based on the updated flag value of each specified identifier; when the calculated global optimal flag value satisfies the second preset condition, the iteration stops and the multi-layer classifier trained with the global optimal flag value is acquired.
  • The second preset condition may include at least one of the following: the global optimal flag value reaches a preset optimal flag value; the global optimal fitness value corresponding to the global optimal flag value reaches a maximum fitness value; the difference between the global optimal fitness value corresponding to the global optimal flag value and the global optimal fitness value obtained in the previous iteration is smaller than a preset difference; the current number of iterations reaches the maximum number of iterations; or the classification accuracy of the multi-layer classifier trained with the global optimal flag value reaches a preset accuracy rate. This is not limited by the embodiment of the present invention.
  • Step 703 The classification device classifies the semantic information of the video to be classified based on the multi-layer classifier.
  • After finding the global optimal flag value, the classification device acquires the multi-layer classifier trained with the global optimal flag value and obtains the input weights and output weights of the multi-layer classifier; the multi-layer classifier can then be used to classify the semantic information of videos.
  • The classification device acquires the video to be classified, extracts the lip feature vector of each video sub-block in the video by using the method shown in the above steps 301-303, inputs the lip feature vectors of the video sub-blocks into the multi-layer classifier, and calculates the classification result, thereby realizing the lip recognition process for the video and obtaining the semantic information of the video.
  • The method provided by the embodiment of the present invention divides a video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; it then calculates the lip feature vector of each video sub-block and combines the lip feature vectors of the M×N video sub-blocks in the video to obtain the lip feature vector of the video. The number of video sub-blocks obtained from different videos is the same, so the dimension of the finally extracted lip feature vector is the same for all videos and the feature dimension is fixed; the feature dimension does not need to be adjusted dynamically during classifier training, which simplifies the operation and saves training time, and when the trained multi-layer classifier is applied, the feature dimension likewise does not need to be adjusted dynamically, which simplifies the operation, saves classification time, and improves the classification accuracy.
  • In addition, the local texture feature matrix is projected to obtain the lip feature vector, which enhances the robustness of the lip feature.
  • Furthermore, the video sub-blocks are selected by using a preset selection strategy, and the video sub-blocks capable of describing the lip-speech information well are selected, which overcomes the drawback that the video sub-blocks of the block PLBP feature carry different amounts of lip-speech information, reduces redundant features, and improves the computational speed and the classification accuracy.
  • The present invention innovatively proposes a video segmentation method and a description operator, PLBP, which can ensure the integrity of information, effectively capture the spatial and temporal information in the video, and describe the video with a feature of fixed dimension that represents the spatio-temporal characteristics well; this greatly facilitates the optimization of the subsequent recognition algorithm, so that the subsequent lip-language recognition stage achieves a significantly improved lip-reading recognition rate.
  • The embodiment of the invention adopts a combination of the BQPSO (Binary Quantum Particle Swarm Optimization) algorithm and the PELM algorithm: the constructed specified identifiers are used as particles, BQPSO is used as the search algorithm for selecting the feature combination, and the classification accuracy of the PELM multi-layer classifier on the sample videos is used as the fitness value; the flag values of the L specified identifiers are continuously adjusted to search for the flag value that optimizes the fitness value, thereby determining the feature combination that optimizes the fitness value. The video sub-blocks are selected by BQPSO and the PELM multi-layer classifier is trained and used for classification, which can significantly improve the speed of sample training in the lip-speech recognition process and achieve a higher recognition rate; this enhances the applicability of lip recognition in mobile terminals and provides a reference for the application of other biometric technologies on mobile terminals.
  • The HMM (Hidden Markov Model) algorithm of the prior art and the method provided by the embodiment of the present invention were used for experiments. A total of 20 commands were used in the experiments; for each experimental command, 5 samples were used as training samples and 5 samples were used as test samples, so a total of 100 training samples and 100 test samples were obtained.
  • The training time used when training with the HMM algorithm of the prior art on the 100 training samples and the training time used when applying the method provided by the embodiment of the present invention to the 100 training samples are shown in Table 1 below.
  • The classification accuracy obtained when classifying the 100 test samples with the classifier trained by the HMM algorithm of the prior art and the classification accuracy obtained when classifying the 100 test samples with the multi-layer classifier trained by the method provided by the embodiment of the present invention are shown in Table 2 below.
  • As shown in Table 1 below, the training time of the method provided by the embodiment of the present invention is less than 0.1 s, while the training time of the traditional HMM algorithm is as long as 4.538 s. It can be seen from Table 2 below that the average classification accuracy of the method provided by the embodiment of the present invention is as high as 97.2%, while the average classification accuracy of the traditional HMM algorithm is only 84.5%.
  • FIG. 9 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present invention.
  • the apparatus includes:
  • the dividing module 901 is configured to divide the video into a first preset number of time sub-blocks according to a chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames, each video frame Including the lip area;
  • The dividing module 901 is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to form the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block; the video obtains a total of M×N video sub-blocks, and each time sub-block includes N video sub-blocks.
  • The feature calculation module 902 is configured to calculate the lip feature vector of each video sub-block obtained by the dividing module 901, where the lip feature vector is used to describe the texture information of the lip region in the corresponding video sub-block, and the lip feature vector of each video sub-block is an X-dimensional vector.
  • the combination module 903 is configured to combine the lip feature vectors of the plurality of video sub-blocks in the video obtained by the feature calculation module 902 to obtain a lip feature vector of the video.
  • The device provided by the embodiment of the present invention divides the video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that the video obtains a total of M×N video sub-blocks; it then calculates the lip feature vector of each video sub-block. The number of video sub-blocks obtained from different videos is the same, so the dimension of the lip feature vector finally extracted from the video is the same, which fixes the feature dimension; the feature dimension does not need to be adjusted dynamically, which simplifies the operation and saves time.
  • the feature calculation module 902 includes:
  • An extracting unit configured to extract an X-dimensional local binary mode LBP feature vector of each spatial sub-block in a video sub-block, where the one video sub-block includes Y spatial sub-blocks;
  • a combining unit configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extracting unit to obtain an X*Y-order local texture feature matrix
  • a decomposition unit configured to perform singular value decomposition on the local texture feature matrix obtained by the combining unit to obtain a Y*Y order first right singular matrix
  • a projection unit configured to extract a first column vector of the first right singular matrix obtained by the decomposition unit, as a projection vector
  • a calculation unit configured to calculate a product of the local texture feature matrix obtained by the combining unit and the projection vector obtained by the projection unit, to obtain an X-dimensional lip language feature vector of the video sub-block;
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
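  • A minimal sketch of the pipeline implemented by the units listed above is given below, assuming NumPy; lbp_histogram is an assumed helper standing in for a standard LBP extractor that returns the X-dimensional LBP feature vector of one spatial sub-block.

```python
# Illustrative per-video-sub-block feature computation (PLBP-style), assuming NumPy.
import numpy as np

def sub_block_feature(spatial_sub_blocks):
    """spatial_sub_blocks: the Y spatial sub-blocks of one video sub-block."""
    # X*Y local texture feature matrix: one X-dimensional LBP vector per column.
    F = np.column_stack([lbp_histogram(block) for block in spatial_sub_blocks])
    _, _, vt = np.linalg.svd(F, full_matrices=False)
    projection = vt.T[:, 0]          # first column vector of the first right singular matrix
    return F @ projection            # X-dimensional lip feature vector of the video sub-block
```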
  • FIG. 10 is a schematic structural diagram of a lip language sorting apparatus according to an embodiment of the present invention.
  • the apparatus includes:
  • The dividing module 1001 is configured to, for each sample video of the preselected D sample videos, divide the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region.
  • The dividing module 1001 is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to form the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block; the sample video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks.
  • a feature calculation module 1002 configured to calculate a lip language feature vector of each video sub-block obtained by the dividing module 1001, where the lip language feature vector is used to describe texture information of a lip region in a corresponding video sub-block;
  • The training module 1003 is configured to train, according to the lip feature vectors of the video sub-blocks in the D sample videos obtained by the feature calculation module 1002, the classification accuracy of the multi-layer classifier according to a preset training algorithm, until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of videos.
  • M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • The device provided by the embodiment of the present invention divides the video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block; the video obtains a total of M×N video sub-blocks, and the X-dimensional lip feature vector of each video sub-block is then calculated. The number of video sub-blocks obtained from different videos is the same, so the dimension of the finally extracted lip feature vector of the video is the same and the feature dimension is fixed. The classification accuracy of the multi-layer classifier is trained according to the lip feature vectors of the video sub-blocks in the sample videos without dynamic adjustment of the feature dimension, which simplifies the operation and saves training time; when the trained multi-layer classifier is applied to classify videos, the feature dimension likewise does not need to be adjusted dynamically, which simplifies the operation, saves classification time, and improves the classification accuracy.
  • the training module 1003 is configured to perform the following steps:
  • Step 1 According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, and each flag value includes a flag bit for indicating the number of hidden layer nodes and a flag bit for indicating whether each video sub-block is selected, and a different flag value corresponds to the hidden The number of layer nodes is different or the corresponding selected video sub-blocks are different, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of the specified identifier, according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vector of the selected video sub-block in the corresponding D sample videos, the flag value is Training the classification accuracy of the corresponding multi-layer classifier;
  • Step 4 Obtain a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier and the flag value of each specified identifier, and obtain the global optimal flag value of the L designated identifiers according to a preset search algorithm. And update the flag value of each specified identifier;
  • The training module 1003 is further configured to: determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the selected H video sub-blocks in the corresponding D sample videos, where H is a positive integer; for each sample video in the D sample videos, combine the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video to obtain the H×X-dimensional lip feature vector of the sample video; combine the H×X-dimensional lip feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X); perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to the preset retention dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes, where the multi-layer classifier includes at least input layer nodes and at least one hidden layer node and the projection matrix is used to represent the weights of the input layer nodes.
  • The training module 1003 is further configured to: calculate, according to the flag value of each specified identifier, the average optimal flag value of the L specified identifiers; obtain, according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier, the optimal flag value of each specified identifier and the global optimal flag value; calculate the local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and update the flag value of each specified identifier according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value.
  • The training module 1003 is further configured to: for each specified identifier, use the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as the fitness value of the specified identifier; update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, to obtain the updated optimal flag value of the specified identifier; and update the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, to obtain the updated global optimal flag value.
  • FIG. 11 is a schematic structural diagram of a feature extraction device according to an embodiment of the present invention.
  • the device includes: a memory 1101 and a processor 1102.
  • The memory 1101 is connected to the processor 1102, and the memory 1101 stores program code; the processor 1102 is configured to invoke the program code and perform the following operations:
  • each time sub-block includes at least two consecutive video frames, each of which includes a lip region;
  • each time sub-block Dividing a lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block,
  • the video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
  • M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • The device provided by the embodiment of the present invention divides the video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that the video obtains a total of M×N video sub-blocks; it then calculates the lip feature vector of each video sub-block. The number of video sub-blocks obtained from different videos is the same, so the dimension of the lip feature vector finally extracted from the video is the same, which fixes the feature dimension; the feature dimension does not need to be adjusted dynamically, which simplifies the operation and saves time.
  • processor 1102 is further configured to invoke the program code, and perform the following operations:
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
  • FIG. 12 is a schematic structural diagram of a lip language sorting apparatus according to an embodiment of the present invention.
  • the apparatus includes: a memory 1201 and a processor 1202.
  • The memory 1201 is connected to the processor 1202, and the memory 1201 stores program code; the processor 1202 is configured to invoke the program code and perform the following operations:
  • For each of the preselected D sample videos, the sample video is divided into M time sub-blocks according to the chronological order of the video frames in the sample video, and each time sub-block includes at least two consecutive video frames, each of which includes a lip region;
  • the sample video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
  • the classification accuracy of the multi-layer classifier is trained according to a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of the video;
  • M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • The device provided by the embodiment of the present invention divides the video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block; the video obtains a total of M×N video sub-blocks, and the X-dimensional lip feature vector of each video sub-block is then calculated. The number of video sub-blocks obtained from different videos is the same, so the dimension of the finally extracted lip feature vector of the video is the same and the feature dimension is fixed. The classification accuracy of the multi-layer classifier is trained according to the lip feature vectors of the video sub-blocks in the sample videos without dynamic adjustment of the feature dimension, which simplifies the operation and saves training time; when the trained multi-layer classifier is applied to classify videos, the feature dimension likewise does not need to be adjusted dynamically, which simplifies the operation, saves classification time, and improves the classification accuracy.
  • processor 1202 is further configured to invoke the program code, and perform the following operations:
  • Step 1 According to a predetermined rule, construct L designated identifiers; the L designated identifiers are used to determine the corresponding number of hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, and each flag value includes a flag bit for indicating the number of hidden layer nodes and a flag bit for indicating whether each video sub-block is selected, and a different flag value corresponds to the hidden The number of layer nodes is different or the corresponding selected video sub-blocks are different, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of the specified identifier, according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vector of the selected video sub-block in the corresponding D sample videos, the flag value is Training the classification accuracy of the corresponding multi-layer classifier;
  • Step 4 Obtain a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier and the flag value of each specified identifier, and obtain the global optimal flag value of the L designated identifiers according to a preset search algorithm. And update the flag value of each specified identifier;
  • processor 1202 is further configured to invoke the program code, and perform the following operations:
  • the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video are combined to obtain an H×X-dimensional lip feature vector of the sample video;
  • the projection matrix is used to represent the weight of the input layer node.
  • processor 1202 is further configured to invoke the program code, and perform the following operations:
  • the flag value of each specified identifier is updated according to a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
  • processor 1202 is further configured to invoke the program code, and perform the following operations:
  • the classification accuracy rate of the multi-layer classifier trained by the flag value of the specified identifier is used as the fitness value of the specified identifier
  • A person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned may be a read-only memory, a magnetic disk, an optical disk, or the like.

Abstract

An embodiment of the present invention provides a feature extraction method and a lip-reading classification method, device and apparatus, relating to the field of feature recognition, the method comprising: according to a time sequence of frames in a video, dividing the video into M time sub-blocks; dividing a lip portion region of each frame of each time sub-block into N space sub-blocks; assembling a video sub-block from space sub-frames at corresponding identical positions in each frame of the same time sub-block, the video comprising in total M×N video sub-blocks; calculating a lip-reading feature vector of each video sub-block, each video sub-block lip-reading feature vector being an X-dimensional vector; and combining the X-dimensional lip-reading feature vectors of the M×N video sub-blocks of the video to obtain an X×M×N-dimensional feature vector of the video. The present invention fixes the number of feature vector dimensions and does not require dynamic adjustment of the number of feature vector dimensions, simplifying operations and saving training time and classification time.

Description

特征提取方法、唇语分类方法、装置及设备Feature extraction method, lip language classification method, device and device 技术领域Technical field
本发明涉及特征识别领域,特别涉及一种特征提取方法、唇语分类方法、装置及设备。The present invention relates to the field of feature recognition, and in particular, to a feature extraction method, a lip language classification method, device and device.
背景技术Background technique
随着计算机技术的智能化发展,生物特征识别技术作为新的人机交互方式,已广泛应用于身份验证等多种领域,而唇语识别技术作为一种新兴的生物特征识别技术,更是成为了人机交互领域的研究热点。With the intelligent development of computer technology, biometrics technology has been widely used in many fields such as identity verification as a new human-computer interaction method. As a new biometric technology, lip recognition technology has become a new technology. Research hotspots in the field of human-computer interaction.
特征提取是唇语识别过程中的重要步骤,在对视频进行唇语识别时,需要提取视频的唇语特征。现有的特征提取方法通常会提取视频帧的唇部轮廓,以若干参数表示该唇部轮廓,并对部分参数进行线性组合,得到视频的唇语特征。或者,将视频中的多帧图像作为二维信号,对该二维信号进行图像变换,得到视频的唇语特征。Feature extraction is an important step in the process of lip language recognition. In the lip language recognition of video, it is necessary to extract the lip language features of the video. The existing feature extraction method usually extracts the lip contour of the video frame, expresses the lip contour with several parameters, and linearly combines some parameters to obtain the lip language feature of the video. Alternatively, the multi-frame image in the video is used as a two-dimensional signal, and the two-dimensional signal is image-converted to obtain a lip-speech feature of the video.
由于不同视频中的帧数不固定,会导致采用上述特征提取方法提取唇语特征时,唇语特征的维数也不固定。然而,大部分分类器要求固定的特征维数,这就会导致在应用视频对分类器进行训练或者应用分类器对视频进行分类时,需要对视频的唇语特征的维数进行动态调整,操作繁琐,训练时间和分类时间都很长。Since the number of frames in different videos is not fixed, the dimension of the lip features is not fixed when the lip feature is extracted by the above feature extraction method. However, most classifiers require a fixed feature dimension, which results in the need to dynamically adjust the dimension of the lip feature of the video when the application video trains the classifier or when the classifier is used to classify the video. It is cumbersome, and the training time and classification time are very long.
发明内容Summary of the invention
为了固定唇语特征的维数,本发明实施例提供了一种特征提取方法、唇语分类方法、装置及设备。所述技术方案如下:In order to fix the dimension of the lip feature, the embodiment of the invention provides a feature extraction method, a lip language classification method, a device and a device. The technical solution is as follows:
第一方面,提供了一种特征提取方法,所述方法包括: In a first aspect, a feature extraction method is provided, the method comprising:
根据视频中视频帧的时间顺序,将所述视频划分为M个时间子块,每个时间子块中包括至少两个连续的视频帧,每个视频帧中包括唇部区域;Dividing the video into M time sub-blocks according to a chronological order of video frames in the video, each time sub-block includes at least two consecutive video frames, each of which includes a lip region;
将每个时间子块中每个视频帧的唇部区域划分为N个空间子块,并将同一时间子块中的各个视频帧中对应相同位置的空间子块组成一个视频子块,所述视频共得到M×N个视频子块,每个时间子块中包括N个视频子块;Dividing a lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block, The video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
计算每个视频子块的唇语特征向量,所述唇语特征向量用于描述对应视频子块中唇部区域的纹理信息,每个视频子块的唇语特征向量为X维的向量;Calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector is used to describe texture information of a lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector;
将所述视频中的M×N个视频子块的X维唇语特征向量进行组合,得到所述视频的X×M×N维唇语特征向量;Combining the X-dimensional lip feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip feature vector of the video;
其中,M、N、X均为正整数,×用于表示数值的乘积运算。Where M, N, and X are positive integers, and × is used to represent the product of numerical values.
结合第一方面,在第一方面的第一种可能实现方式中,所述计算每个视频子块的唇语特征向量,包括:With reference to the first aspect, in a first possible implementation manner of the first aspect, the calculating a lip language feature vector of each video sub-block includes:
提取一个视频子块中每个空间子块的X维局部二值模式LBP特征向量,所述一个视频子块中包括Y个空间子块;Extracting an X-dimensional local binary pattern LBP feature vector of each spatial sub-block in a video sub-block, wherein the one video sub-block includes Y spatial sub-blocks;
将Y个空间子块的LBP特征向量进行组合,得到X*Y阶局部纹理特征矩阵;Combining the LBP feature vectors of Y spatial sub-blocks to obtain an X*Y-order local texture feature matrix;
对所述局部纹理特征矩阵进行奇异值分解,得到Y*Y阶第一右奇异矩阵;Performing singular value decomposition on the local texture feature matrix to obtain a Y*Y order first right singular matrix;
提取所述第一右奇异矩阵的第一个列向量,作为投影向量;Extracting a first column vector of the first right singular matrix as a projection vector;
计算所述局部纹理特征矩阵与所述投影向量的乘积,得到所述视频子块的X维唇语特征向量;Calculating a product of the local texture feature matrix and the projection vector to obtain an X-dimensional lip language feature vector of the video sub-block;
其中,Y为正整数,X*Y阶矩阵表示X行Y列的矩阵,Y*Y阶矩阵表示Y行Y列的矩阵。Where Y is a positive integer, the X*Y order matrix represents a matrix of X rows and Y columns, and the Y*Y order matrix represents a matrix of Y rows and Y columns.
第二方面,提供了一种唇语分类方法,所述方法包括:In a second aspect, a lip language classification method is provided, the method comprising:
对于预先选择的D个样本视频中的每个样本视频,根据样本视频中视频帧的时间顺序,将所述样本视频划分为M个时间子块,每个时间子块中包括至 少两个连续的视频帧,每个视频帧中包括唇部区域;For each of the pre-selected D sample videos, the sample video is divided into M time sub-blocks according to the chronological order of the video frames in the sample video, and each time sub-block includes Two consecutive video frames, each of which includes a lip region;
将每个时间子块中每个视频帧的唇部区域划分为N个空间子块,并将同一时间子块中的各个视频帧中对应相同位置的空间子块组成一个视频子块,所述样本视频共得到M×N个视频子块,每个时间子块中包括N个视频子块;Dividing a lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block, The sample video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
计算每个视频子块的唇语特征向量,所述唇语特征向量用于描述对应视频子块中唇部区域的纹理信息;Calculating a lip language feature vector of each video sub-block, the lip language feature vector being used to describe texture information of a lip region in the corresponding video sub-block;
根据所述D个样本视频中视频子块的唇语特征向量,对多层分类器的分类准确率按照预设训练算法进行训练,直至所述多层分类器的分类准确率满足第一预设条件时停止,得到训练完成的所述多层分类器,所述多层分类器用于对视频的语义信息进行分类;According to the lip feature vector of the video sub-block in the D sample videos, the classification accuracy of the multi-layer classifier is trained according to a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies the first preset When the condition is stopped, the multi-layer classifier that is trained to be completed is obtained, and the multi-layer classifier is used to classify the semantic information of the video;
其中,M、N、D均为正整数,D>1,×用于表示数值的乘积运算。Among them, M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
结合第二方面,在第二方面的第一种可能实现方式中,所述根据所述D个样本视频中视频子块的唇语特征向量,对多层分类器的分类准确率按照预设训练算法进行训练,包括:With reference to the second aspect, in a first possible implementation manner of the second aspect, the classification accuracy of the multi-layer classifier according to the lip language feature vector of the video sub-blocks in the D sample videos is according to preset training The algorithm is trained to include:
步骤1:按照预定规则,构造L个指定标识,L个指定标识用于确定对应的隐层节点数目和D个样本视频中被选择的视频子块,L为正整数,L>1;Step 1: According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
步骤2:获取每个指定标识的标志值,每个标志值中包括用于表示隐层节点数目的标志位和用于表示每个视频子块是否被选择的标志位,不同标志值对应的隐层节点数目不同或者对应的被选择的视频子块不同,每个标志值用于训练出一个对应的多层分类器;Step 2: Obtain a flag value for each specified identifier, and each flag value includes a flag bit for indicating the number of hidden layer nodes and a flag bit for indicating whether each video sub-block is selected, and a different flag value corresponds to the hidden The number of layer nodes is different or the corresponding selected video sub-blocks are different, and each flag value is used to train a corresponding multi-layer classifier;
步骤3:对于每个指定标识的标志值,根据指定标识的标志值对应的隐层节点数目以及对应的D个样本视频中被选择的视频子块的唇语特征向量,对与所述标志值对应的多层分类器的分类准确率进行训练;Step 3: For each flag value of the specified identifier, according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vector of the selected video sub-block in the corresponding D sample videos, the flag value is Training the classification accuracy of the corresponding multi-layer classifier;
步骤4:根据每个指定标识的标志值以及每个指定标识的标志值训练出的多层分类器的分类准确率,按照预设搜索算法,获取所述L个指定标识的全局最优标志值,并对每个指定标识的标志值进行更新; Step 4: Obtain a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier and the flag value of each specified identifier, and obtain the global optimal flag value of the L designated identifiers according to a preset search algorithm. And update the flag value of each specified identifier;
重复执行上述步骤3至4,直至所述全局最优标志值满足第二预设条件时停止。The above steps 3 to 4 are repeatedly executed until the global optimum flag value satisfies the second preset condition.
结合第二方面的第一种可能实现方式,在第二方面的第二种可能实现方式中,所述根据指定标识的标志值对应的隐层节点数目以及对应的D个样本视频中被选择的视频子块的唇语特征向量,对与所述标志值对应的多层分类器的分类准确率进行训练,包括:With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the number of the hidden layer nodes corresponding to the flag value of the specified identifier and the selected one of the D sample videos A lip language feature vector of the video sub-block, training the classification accuracy of the multi-layer classifier corresponding to the flag value, including:
确定所述指定标识的标志值对应的隐层节点数目以及对应的D个样本视频中被选择的H个视频子块,H为正整数;Determining, by the number of hidden layer nodes corresponding to the flag value of the specified identifier, and the selected H video sub-blocks in the corresponding D sample videos, where H is a positive integer;
对于D个样本视频中的每个样本视频,将样本视频中被选择的H个视频子块的X维唇语特征向量进行组合,得到所述样本视频的H×X维唇语特征向量;For each of the D sample videos, the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video are combined to obtain an H×X-dimensional lip feature vector of the sample video;
对D个样本视频的H×X维唇语特征向量进行组合,得到D*(H×X)阶的特征矩阵;Combining H×X-dimensional lip feature vectors of D sample videos to obtain a feature matrix of D*(H×X) order;
对所述特征矩阵进行奇异值分解,得到第二右奇异矩阵;Performing singular value decomposition on the feature matrix to obtain a second right singular matrix;
从所述第二右奇异矩阵中,提取与预设保留维数对应的列向量,作为投影矩阵;Extracting, from the second right singular matrix, a column vector corresponding to the preset retention dimension as a projection matrix;
基于所述投影矩阵、激励函数和所述隐层节点数目,对所述多层分类器的分类准确率进行训练,所述多层分类器至少包括输入层节点和至少一个隐层节点,所述投影矩阵用于表示所述输入层节点的权重。Training the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of the hidden layer nodes, the multi-layer classifier including at least an input layer node and at least one hidden layer node, A projection matrix is used to represent the weight of the input layer node.
结合第二方面的第一种可能实现方式,在第二方面的第三种可能实现方式中,所述根据每个指定标识的标志值以及每个指定标识的标志值训练出的多层分类器的分类准确率,按照预设搜索算法,获取所述L个指定标识的全局最优标志值,并对每个指定标识的标志值进行更新,包括:In conjunction with the first possible implementation of the second aspect, in a third possible implementation of the second aspect, the multi-layer classifier trained according to the flag value of each specified identifier and the flag value of each specified identifier The classification accuracy rate is obtained according to a preset search algorithm, and the global optimal flag values of the L specified identifiers are obtained, and the flag values of each specified identifier are updated, including:
根据每个指定标识的标志值,计算所述L个指定标识的平均最优标志值;Calculating an average optimal flag value of the L designated identifiers according to a flag value of each specified identifier;
根据每个指定标识的标志值训练出的多层分类器的分类准确率,获取每个指定标识的最优标志值以及所述全局最优标志值; Obtaining a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier, obtaining an optimal flag value of each specified identifier and the global optimal flag value;
根据每个指定标识的最优标志值和所述全局最优标志值,计算每个指定标识的局部吸引子;Calculating a local attractor for each specified identifier according to an optimal flag value of each specified identifier and the global optimal flag value;
根据每个指定标识的局部吸引子和所述平均最优标志值,按照预设更新算法对每个指定标识的标志值进行更新。The flag value of each specified identifier is updated according to a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
结合第二方面的第三种可能实现方式,在第二方面的第四种可能实现方式中,In conjunction with the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect,
所述根据每个指定标识的标志值训练出的多层分类器的分类准确率,获取每个指定标识的最优标志值以及所述全局最优标志值,包括:The classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier, obtaining an optimal flag value of each specified identifier and the global optimal flag value, including:
对于每个指定标识,将指定标识的标志值训练出的多层分类器的分类准确率作为所述指定标识的适应度值;For each specified identifier, the classification accuracy rate of the multi-layer classifier trained by the flag value of the specified identifier is used as the fitness value of the specified identifier;
根据所述指定标识当前的适应度值与所述指定标识的历史最优适应度值,对所述指定标识的最优标志值进行更新,得到所述指定标识更新后的最优标志值;And updating, according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, an optimal flag value of the specified identifier, to obtain an optimal flag value after the specified identifier is updated;
根据所述指定标识当前的适应度值与所述L个指定标识的历史全局最优适应度值,对所述L个指定标识的全局最优标志值进行更新,得到更新后的全局最优标志值。And updating, according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, the global optimal flag values of the L specified identifiers to obtain an updated global optimal identifier. value.
In a third aspect, a feature extraction device is provided, the device including:
a dividing module, configured to divide a video into M time sub-blocks according to the chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
the dividing module being further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to group the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the video in total, each time sub-block including N video sub-blocks;
a feature calculation module, configured to calculate a lip-language feature vector of each video sub-block obtained by the dividing module, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block, and the lip-language feature vector of each video sub-block is an X-dimensional vector; and
a combination module, configured to combine the X-dimensional lip-language feature vectors of the M×N video sub-blocks in the video obtained by the feature calculation module, to obtain an X×M×N-dimensional lip-language feature vector of the video;
where M, N, and X are all positive integers, and × denotes numeric multiplication.
In conjunction with the third aspect, in a first possible implementation of the third aspect, the feature calculation module includes:
an extraction unit, configured to extract an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, where the video sub-block includes Y spatial sub-blocks;
a combination unit, configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extraction unit, to obtain an X*Y local texture feature matrix;
a decomposition unit, configured to perform singular value decomposition on the local texture feature matrix obtained by the combination unit, to obtain a Y*Y first right singular matrix;
a projection unit, configured to extract the first column vector of the first right singular matrix obtained by the decomposition unit as a projection vector; and
a calculation unit, configured to calculate the product of the local texture feature matrix obtained by the combination unit and the projection vector obtained by the projection unit, to obtain the X-dimensional lip-language feature vector of the video sub-block;
where Y is a positive integer, an X*Y matrix denotes a matrix with X rows and Y columns, and a Y*Y matrix denotes a matrix with Y rows and Y columns.
In a fourth aspect, a lip-reading classification device is provided, the device including:
a dividing module, configured to: for each of D pre-selected sample videos, divide the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
the dividing module being further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to group the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the sample video in total, each time sub-block including N video sub-blocks;
a feature calculation module, configured to calculate a lip-language feature vector of each video sub-block obtained by the dividing module, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block; and
a training module, configured to train the classification accuracy of a multi-layer classifier according to a preset training algorithm, based on the lip-language feature vectors of the video sub-blocks of the D sample videos obtained by the feature calculation module, and to stop when the classification accuracy of the multi-layer classifier satisfies a first preset condition, thereby obtaining a trained multi-layer classifier, where the multi-layer classifier is used to classify the semantic information of a video;
where M, N, and D are all positive integers, D>1, and × denotes numeric multiplication.
In conjunction with the fourth aspect, in a first possible implementation of the fourth aspect, the training module is configured to perform the following steps:
Step 1: constructing L specified identifiers according to a predetermined rule, where the L specified identifiers are used to determine a corresponding number of hidden-layer nodes and the selected video sub-blocks in the D sample videos, L being a positive integer and L>1;
Step 2: obtaining a flag value of each specified identifier, where each flag value includes flag bits indicating the number of hidden-layer nodes and flag bits indicating whether each video sub-block is selected, different flag values correspond to different numbers of hidden-layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier;
Step 3: for the flag value of each specified identifier, training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the lip-language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
Step 4: obtaining the global optimal flag value of the L specified identifiers according to a preset search algorithm, based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier, and updating the flag value of each specified identifier; and
repeating Steps 3 and 4 until the global optimal flag value satisfies a second preset condition.
In conjunction with the first possible implementation of the fourth aspect, in a second possible implementation of the fourth aspect, the training module is further configured to: determine the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the selected H video sub-blocks in the corresponding D sample videos, H being a positive integer; for each of the D sample videos, combine the X-dimensional lip-language feature vectors of the selected H video sub-blocks in the sample video, to obtain an H×X-dimensional lip-language feature vector of the sample video; combine the H×X-dimensional lip-language feature vectors of the D sample videos, to obtain a D*(H×X) feature matrix; perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden-layer nodes, where the multi-layer classifier includes at least input-layer nodes and at least one hidden-layer node, and the projection matrix represents the weights of the input-layer nodes.
In conjunction with the first possible implementation of the fourth aspect, in a third possible implementation of the fourth aspect, the training module is further configured to: calculate an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier; obtain the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier; calculate a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and update the flag value of each specified identifier according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value.
In conjunction with the third possible implementation of the fourth aspect, in a fourth possible implementation of the fourth aspect, the training module is further configured to: for each specified identifier, use the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as a fitness value of the specified identifier; update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and update the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value.
In a fifth aspect, a feature extraction apparatus is provided, the apparatus including a memory and a processor, where the memory is connected to the processor, the memory stores program code, and the processor is configured to invoke the program code to perform the following operations:
dividing a video into M time sub-blocks according to the chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and grouping the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the video in total, each time sub-block including N video sub-blocks;
calculating a lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block, and the lip-language feature vector of each video sub-block is an X-dimensional vector; and
combining the X-dimensional lip-language feature vectors of the M×N video sub-blocks in the video, to obtain an X×M×N-dimensional lip-language feature vector of the video;
where M, N, and X are all positive integers, and × denotes numeric multiplication.
In a sixth aspect, a lip-reading classification apparatus is provided, the apparatus including a memory and a processor, where the memory is connected to the processor, the memory stores program code, and the processor is configured to invoke the program code to perform the following operations:
for each of D pre-selected sample videos, dividing the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and grouping the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the sample video in total, each time sub-block including N video sub-blocks;
calculating a lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block; and
training the classification accuracy of a multi-layer classifier according to a preset training algorithm, based on the lip-language feature vectors of the video sub-blocks of the D sample videos, until the classification accuracy of the multi-layer classifier satisfies a first preset condition, thereby obtaining a trained multi-layer classifier, where the multi-layer classifier is used to classify the semantic information of a video;
where M, N, and D are all positive integers, D>1, and × denotes numeric multiplication.
The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows:
In the methods, devices and apparatuses provided by the embodiments of the present invention, a video is divided into M time sub-blocks along the time dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks along the spatial dimension, and the spatial sub-blocks at the same position in the video frames of the same time sub-block are grouped into one video sub-block, so that M×N video sub-blocks are obtained for the video in total. The lip-language feature vector of each video sub-block is then calculated, and the lip-language feature vectors of the M×N video sub-blocks in the video are combined into the lip-language feature vector of the video. Because different videos yield the same number of video sub-blocks, the lip-language feature vectors finally extracted for different videos have the same dimension, so the feature dimension is fixed. The classification accuracy of the multi-layer classifier is trained based on the lip-language feature vectors of the video sub-blocks of the sample videos without dynamically adjusting the feature dimension, which simplifies the operation and saves training time; when the trained multi-layer classifier is applied to classify a video, classification time is also saved and classification accuracy is improved.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
FIG. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a lip-reading classification method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a feature extraction method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the lip region of a video frame according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of video blocking according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a pixel neighborhood according to an embodiment of the present invention;
FIG. 7 is a flowchart of a lip-reading classification method according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a multi-layer classifier according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a feature extraction device according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a lip-reading classification device according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present invention; and
FIG. 12 is a schematic structural diagram of a lip-reading classification apparatus according to an embodiment of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS
To make the objectives, technical solutions and advantages of the present invention clearer, the following further describes the embodiments of the present invention in detail with reference to the accompanying drawings.
FIG. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention. Referring to FIG. 1, the method includes:
Step 101: Divide a video into M time sub-blocks according to the chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region.
Step 102: Divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and group the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the video in total, each time sub-block including N video sub-blocks.
Step 103: Calculate a lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block, and the lip-language feature vector of each video sub-block is an X-dimensional vector.
Step 104: Combine the X-dimensional lip-language feature vectors of the M×N video sub-blocks in the video, to obtain an X×M×N-dimensional lip-language feature vector of the video.
Here, M, N, and X are all positive integers, and × denotes numeric multiplication.
In this embodiment of the present invention, a video includes multiple video frames, and each video frame includes a person's lip region. By classifying the lip regions of the video frames and determining the category to which the lip regions belong, the speech content of the person can be determined.
M and N are preset values, so the number of video sub-blocks obtained by dividing each video is M×N, which is a fixed value. When the lip-language feature vectors of the video sub-blocks are calculated, the lip-language feature vectors of different video sub-blocks have the same dimension; assuming that the lip-language feature vector of each video sub-block is an X-dimensional vector, the lip-language feature vector obtained by combining the lip-language feature vectors of the video sub-blocks of the video has a dimension of X×M×N, which is also a fixed value.
That is, in this embodiment of the present invention, the number M of time sub-blocks into which a video is divided and the number N of spatial sub-blocks into which the lip region of each video frame in a time sub-block is divided are preset for different videos, so that the number of video sub-blocks obtained from each video is a fixed value, and the dimension of the lip-language feature vector extracted from each video is also a fixed value.
In the method provided by this embodiment of the present invention, a video is divided into M time sub-blocks along the time dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks along the spatial dimension, and the spatial sub-blocks at the same position in the video frames of the same time sub-block are grouped into one video sub-block, so that M×N video sub-blocks are obtained for the video in total. The lip-language feature vector of each video sub-block is then calculated, and the lip-language feature vectors of the M×N video sub-blocks in the video are combined into the lip-language feature vector of the video. Because different videos yield the same number of video sub-blocks, the lip-language feature vectors finally extracted for different videos have the same dimension, so the feature dimension is fixed and does not need to be dynamically adjusted, which simplifies the operation and saves time.
Optionally, calculating the lip-language feature vector of each video sub-block includes:
extracting an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, where the video sub-block includes Y spatial sub-blocks;
combining the LBP feature vectors of the Y spatial sub-blocks to obtain an X*Y local texture feature matrix;
performing singular value decomposition on the local texture feature matrix to obtain a Y*Y first right singular matrix;
extracting the first column vector of the first right singular matrix as a projection vector; and
calculating the product of the local texture feature matrix and the projection vector to obtain the X-dimensional lip-language feature vector of the video sub-block;
where Y is a positive integer, an X*Y matrix denotes a matrix with X rows and Y columns, and a Y*Y matrix denotes a matrix with Y rows and Y columns.
FIG. 2 is a flowchart of a lip-reading classification method according to an embodiment of the present invention. Referring to FIG. 2, the method includes:
Step 201: For each of D pre-selected sample videos, divide the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region.
Step 202: Divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and group the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the sample video in total, each time sub-block including N video sub-blocks.
Step 203: Calculate a lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block.
Step 204: Train the classification accuracy of a multi-layer classifier according to a preset training algorithm, based on the lip-language feature vectors of the video sub-blocks of the D sample videos, and stop when the classification accuracy of the multi-layer classifier satisfies a first preset condition, thereby obtaining a trained multi-layer classifier, where the multi-layer classifier is used to classify the semantic information of a video.
Here, M, N, and D are all positive integers, D>1, and × denotes numeric multiplication.
In the method provided by this embodiment of the present invention, a video is divided into M time sub-blocks along the time dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks along the spatial dimension, and the spatial sub-blocks at the same position in the video frames of the same time sub-block are grouped into one video sub-block, so that M×N video sub-blocks are obtained for the video in total; the X-dimensional lip-language feature vector of each video sub-block is then calculated. Because different videos yield the same number of video sub-blocks, the lip-language feature vectors finally extracted for different videos have the same dimension, so the feature dimension is fixed. The classification accuracy of the multi-layer classifier is trained based on the lip-language feature vectors of the video sub-blocks of the sample videos without dynamically adjusting the feature dimension, which simplifies the operation and saves training time; when the trained multi-layer classifier is applied to classify a video, the feature dimension likewise does not need to be dynamically adjusted, which simplifies the operation, saves classification time, and improves classification accuracy.
Optionally, training the classification accuracy of the multi-layer classifier according to the preset training algorithm, based on the lip-language feature vectors of the video sub-blocks of the D sample videos, includes:
Step 1: constructing L specified identifiers according to a predetermined rule, where the L specified identifiers are used to determine a corresponding number of hidden-layer nodes and the selected video sub-blocks in the D sample videos, L being a positive integer and L>1;
Step 2: obtaining a flag value of each specified identifier, where each flag value includes flag bits indicating the number of hidden-layer nodes and flag bits indicating whether each video sub-block is selected, different flag values correspond to different numbers of hidden-layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier;
Step 3: for the flag value of each specified identifier, training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the lip-language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
Step 4: obtaining the global optimal flag value of the L specified identifiers according to a preset search algorithm, based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier, and updating the flag value of each specified identifier; and
repeating Steps 3 and 4 until the global optimal flag value satisfies a second preset condition. An illustrative sketch of this loop is given below.
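The text leaves the predetermined rule, the preset search algorithm and the two preset conditions open. The following is a minimal sketch of how Steps 1 to 4 could be organized, assuming binary flag bits and generic `evaluate`/`update` callables; all names, the bit-field width and the stopping form are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def train_loop(evaluate, update, num_blocks, L=20, hidden_bits=6,
               max_iter=50, target_acc=0.95, rng=np.random.default_rng(0)):
    """Sketch of the optional training loop (Steps 1 to 4).

    evaluate(n_hidden, selected) -> classification accuracy of one candidate classifier
    update(flags, accuracies, global_best) -> updated flag values (the preset search algorithm)
    """
    # Steps 1-2: L specified identifiers, each holding a binary flag value
    flags = rng.integers(0, 2, size=(L, hidden_bits + num_blocks))
    global_best, global_acc = None, -1.0

    for _ in range(max_iter):
        accuracies = []
        for flag in flags:
            # Step 3: decode the flag value and train the corresponding classifier
            n_hidden = int("".join(map(str, flag[:hidden_bits])), 2) + 1
            selected = np.flatnonzero(flag[hidden_bits:])
            accuracies.append(evaluate(n_hidden, selected))

        # Step 4: record the global optimal flag value and update every flag value
        best = int(np.argmax(accuracies))
        if accuracies[best] > global_acc:
            global_acc, global_best = accuracies[best], flags[best].copy()
        flags = update(flags, np.asarray(accuracies), global_best)

        if global_acc >= target_acc:          # second preset condition (assumed form)
            break
    return global_best, global_acc
```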
Optionally, training the classification accuracy of the multi-layer classifier corresponding to the flag value, according to the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the lip-language feature vectors of the selected video sub-blocks in the corresponding D sample videos, includes:
determining the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the selected H video sub-blocks in the corresponding D sample videos, H being a positive integer;
for each of the D sample videos, combining the X-dimensional lip-language feature vectors of the selected H video sub-blocks in the sample video, to obtain an H×X-dimensional lip-language feature vector of the sample video;
combining the H×X-dimensional lip-language feature vectors of the D sample videos, to obtain a D*(H×X) feature matrix;
performing singular value decomposition on the feature matrix to obtain a second right singular matrix;
extracting, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and
training the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden-layer nodes, where the multi-layer classifier includes at least input-layer nodes and at least one hidden-layer node, and the projection matrix represents the weights of the input-layer nodes. A sketch of these steps is shown below.
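These steps build one feature matrix for the D samples, reduce it with SVD and use the retained right singular vectors as fixed input-layer weights. The sketch below follows that outline; the sigmoid activation, the random hidden-layer weights and all function names are assumptions for illustration only.

```python
import numpy as np

def build_projection(sample_features, selected, keep_dims):
    """Assemble the D x (H*X) feature matrix and derive the input-layer projection."""
    # One row per sample video: concatenation of the H selected X-dim sub-block vectors
    A = np.stack([np.concatenate([f[i] for i in selected]) for f in sample_features])
    # SVD of the D x (H*X) matrix; rows of vt are right singular vectors
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    # Keep the first keep_dims right singular vectors as the projection matrix
    projection = vt[:keep_dims].T            # shape (H*X, keep_dims)
    return A, projection

def hidden_activations(A, projection, n_hidden, rng=np.random.default_rng(0)):
    """Project the inputs and apply one hidden layer with an assumed sigmoid activation."""
    Z = A @ projection                        # fixed input-layer weights from the SVD
    W = rng.standard_normal((Z.shape[1], n_hidden))
    return 1.0 / (1.0 + np.exp(-(Z @ W)))
```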
Optionally, obtaining the global optimal flag value of the L specified identifiers according to the preset search algorithm, based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier, and updating the flag value of each specified identifier, includes:
calculating an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier;
obtaining the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier;
calculating a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and
updating the flag value of each specified identifier according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value; an illustrative form of such an update is sketched below.
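The preset update algorithm is not named in the text. The combination of local attractors, an average optimal value and a global optimum resembles a quantum-behaved particle-swarm update, so the sketch below uses that form purely as an illustrative assumption; the parameter `beta`, the sigmoid re-binarization and all names are not taken from the patent.

```python
import numpy as np

def attractor_update(flags, personal_best, global_best, beta=0.75,
                     rng=np.random.default_rng(1)):
    """Illustrative attractor-based flag update (assumed QPSO-like form)."""
    mean_best = personal_best.mean(axis=0)                        # average optimal flag value
    phi = rng.random(flags.shape)
    attractors = phi * personal_best + (1 - phi) * global_best    # local attractor per identifier
    u = rng.random(flags.shape)
    step = beta * np.abs(mean_best - flags) * np.log(1.0 / u)
    sign = np.where(rng.random(flags.shape) < 0.5, 1.0, -1.0)
    new_flags = attractors + sign * step
    # Flag bits are binary in this scheme, so map the continuous values back to {0, 1}
    return (1.0 / (1.0 + np.exp(-new_flags)) > rng.random(flags.shape)).astype(int)
```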
Optionally, obtaining the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier includes:
for each specified identifier, using the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as a fitness value of the specified identifier;
updating the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and
updating the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value. This bookkeeping is sketched below.
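In other words, the fitness of an identifier is simply the accuracy of the classifier its flag value produces, and the per-identifier and global optima are the usual "best so far" records. A small bookkeeping sketch, assuming "better" means strictly greater accuracy (the comparison rule is not stated in the text):

```python
def update_bests(flags, fitness, best_flags, best_fitness, global_flag, global_fitness):
    """Keep, for each identifier, its best flag value so far and the global best."""
    for i, acc in enumerate(fitness):
        if acc > best_fitness[i]:                 # better than this identifier's history
            best_fitness[i], best_flags[i] = acc, flags[i].copy()
        if acc > global_fitness:                  # better than the global history
            global_fitness, global_flag = acc, flags[i].copy()
    return best_flags, best_fitness, global_flag, global_fitness
```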
Any combination of the foregoing optional technical solutions may be used to form optional embodiments of the present invention, and details are not described herein again.
FIG. 3 is a flowchart of a feature extraction method according to an embodiment of the present invention. The method of this embodiment is performed by a feature extraction device. Referring to FIG. 3, the method includes:
Step 301: For each video, the feature extraction device divides the video into M×N video sub-blocks according to the chronological order of video frames in the video.
The feature extraction device may be a computer, a server, or another device, which is not limited in this embodiment of the present invention.
Specifically, the video includes multiple video frames arranged in chronological order. The feature extraction device may divide the video into M time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame of the video includes a lip region. The feature extraction device then divides the lip region of each video frame in each time sub-block into N spatial sub-blocks, and groups the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that one time sub-block includes N video sub-blocks and the video yields M×N video sub-blocks in total, where M and N are positive integers.
The feature extraction device may first divide the video into M time sub-blocks and then locate and segment each video frame to obtain the lip region of each video frame and divide the lip region of each video frame into N spatial sub-blocks; alternatively, the feature extraction device may first locate and segment each video frame to obtain the lip region of each video frame, then divide the obtained lip regions into M time sub-blocks and divide the lip regions in each time sub-block into N spatial sub-blocks. The timing of the locating and segmentation is not limited in this embodiment of the present invention.
Referring to FIG. 4, t denotes the time of a video frame, and the coordinate system formed by x and y denotes the space in which the video frame lies. The video includes multiple video frames, and by locating and segmenting each video frame, the lip region of each video frame can be obtained.
The feature extraction device may first divide along the time dimension: it groups the video frames according to their chronological order in the video, takes at least two video frames as one time sub-block, and divides the video into M time sub-blocks, each including at least two consecutive video frames; it then divides along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks, so that the same time sub-block includes multiple spatial sub-blocks, and grouping the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, thereby obtaining M×N video sub-blocks. Of course, the feature extraction device may also first divide along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks, and then divide along the time dimension, taking the spatial sub-blocks of at least two video frames as one time sub-block, dividing the video into M time sub-blocks, and grouping the spatial sub-blocks at the same position in the video frames of each time sub-block into one video sub-block, thereby obtaining M×N video sub-blocks. This is not limited in this embodiment of the present invention.
Referring to FIG. 5, the feature extraction device may divide the video along the time dimension to obtain M time sub-blocks. Each time sub-block includes multiple video frames; the device then divides along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks and grouping the spatial sub-blocks at the same position in the video frames into one video sub-block, so that the time sub-block yields N video sub-blocks and the video yields M×N video sub-blocks in total. In FIG. 5, taking N=4 as an example, each time sub-block includes the lip regions of three video frames, and the lip region of each video frame is divided into four spatial sub-blocks: the upper-left, upper-right, lower-left and lower-right lip regions. The upper-left lip regions of the three video frames form video sub-block (1), the upper-right lip regions form video sub-block (2), the lower-left lip regions form video sub-block (3), and the lower-right lip regions form video sub-block (4), giving four video sub-blocks.
In addition, different videos have different numbers of video frames. To divide a video into M time sub-blocks, the feature extraction device may, according to the number of frames of the video, filter video frames or assign frames to more than one block. Further, for a given video, the number of video frames included in each time sub-block may be the same. For example, if the video includes ten video frames numbered 1-10 and M is 4, the feature extraction device may take frames 1-4 as one group forming a time sub-block, frames 3-6 as one group forming a time sub-block, frames 5-8 as one group forming a time sub-block, and frames 7-10 as one group forming a time sub-block, finally obtaining 4 time sub-blocks, each including 4 video frames.
M is referred to as a first preset number and N as a second preset number. The feature extraction device predetermines the first preset number M and the second preset number N, where the first preset number M specifies the number of time sub-blocks into which each video is divided and the second preset number N specifies the number of spatial sub-blocks into which each video frame in a time sub-block is divided. The feature extraction device may divide different videos using the same first preset number M and second preset number N. M and N may be determined by the feature extraction device according to the required precision of the video sub-blocks; M may be 4, 5 or another value, and N may be 5, 6 or another value, which is not limited in this embodiment of the present invention.
The feature extraction device may also preset the number of video frames included in each time sub-block, for example a third preset number Q of video frames per time sub-block. During the division, when the number of frames of the video is greater than the product M×Q of the first preset number and the third preset number, the feature extraction device may filter the video frames, filtering out a first specified number of video frames equal to the difference between the number of frames of the video and M×Q, so that after filtering the number of frames equals M×Q and the video frames can then be divided into M time sub-blocks each including Q video frames. When the number of frames of the video is less than M×Q, the feature extraction device may select a second specified number of video frames, equal to the difference between M×Q and the number of frames of the video, and assign each of these video frames to two time sub-blocks, so that the video can still be divided into M time sub-blocks each including Q video frames. Here Q is a positive integer that may be determined in advance by the feature extraction device according to the required precision of the video sub-blocks, which is not limited in this embodiment of the present invention.
In this embodiment of the present invention, in order to fix the feature dimension, the feature extraction device may block the video into M×N video sub-blocks before extracting features from the video. Moreover, the feature extraction device blocks the video along both the time dimension and the spatial dimension, dividing the original video into a set of video sub-blocks, which increases the temporal and spatial information contained in the subsequently extracted features and allows the lip-language features to be expressed better.
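As an illustration of Step 301, the following sketch partitions a sequence of same-sized lip-region images into M×N video sub-blocks, using M time sub-blocks of Q frames each and an N1×N2 spatial grid with N=N1×N2. The grid layout, the frame-index interpolation used to drop or reuse frames, and all names are assumptions for illustration, not the only division the embodiment allows.

```python
import numpy as np

def partition_video(lip_frames, M=4, Q=4, N1=2, N2=2):
    """Split a list of lip-region images (equal-sized 2-D arrays) into M*N video sub-blocks."""
    T = len(lip_frames)
    # Choose M*Q frame indices: drop frames if T > M*Q, reuse frames if T < M*Q
    idx = np.linspace(0, T - 1, M * Q).round().astype(int)
    frames = [lip_frames[i] for i in idx]

    sub_blocks = []                      # final list of M*N video sub-blocks
    for m in range(M):                   # time sub-block m holds Q consecutive frames
        chunk = frames[m * Q:(m + 1) * Q]
        for r in range(N1):              # N = N1 * N2 spatial sub-blocks per frame
            for c in range(N2):
                block = []
                for img in chunk:
                    h, w = img.shape[0] // N1, img.shape[1] // N2
                    block.append(img[r * h:(r + 1) * h, c * w:(c + 1) * w])
                sub_blocks.append(np.stack(block))   # same spatial position across the Q frames
    return sub_blocks                    # length M * N1 * N2 = M * N
```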
Step 302: The feature extraction device calculates an X-dimensional lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block.
In this embodiment of the present invention, after the video is divided, each video sub-block includes at least one spatial sub-block, and each spatial sub-block is in fact a part of the lip region of a video frame. For each video sub-block in the video, Step 302 may include the following Steps 302a to 302c:
Step 302a: The feature extraction device extracts an X-dimensional LBP feature vector of each spatial sub-block in the video sub-block; if the video sub-block includes Y spatial sub-blocks, the LBP feature vectors of the Y spatial sub-blocks are combined to obtain an X*Y local texture feature matrix.
The feature extraction device may extract the X-dimensional LBP feature vector of each spatial sub-block in the video sub-block, using the LBP feature vector to describe the texture information of the lip region in the corresponding spatial sub-block, take the LBP feature vector of each spatial sub-block as one column, and combine the LBP feature vectors of the Y spatial sub-blocks to obtain an X*Y matrix, namely the local texture feature matrix of the video sub-block, where Y is a positive integer and an X*Y matrix denotes a matrix with X rows and Y columns.
When obtaining the LBP feature vector of a spatial sub-block, the feature extraction device takes each pixel in the spatial sub-block as a center pixel and obtains each specified pixel adjacent to the center pixel. For each specified pixel of the center pixel, the pixel value of the specified pixel is compared with the pixel value of the center pixel: if the pixel value of the specified pixel is greater than that of the center pixel, the feature value of the specified pixel is set to 1; otherwise, the feature value of the specified pixel is set to 0. A binary feature value is thus set for each specified pixel; the feature values of all specified pixels of the center pixel are combined, and the resulting binary number is converted into a decimal number to obtain the LBP feature value of the center pixel. The feature extraction device may obtain the LBP feature value of each pixel in the spatial sub-block, compute a statistical histogram of the spatial sub-block from the LBP feature values of the pixels, and normalize the statistical histogram to obtain the LBP feature vector of the spatial sub-block.
Referring to FIG. 6, each pixel in a spatial sub-block is taken as the center pixel, and its neighborhood, a 3*3 pixel region, is obtained. With the pixel values of this neighborhood as shown in part (a) of FIG. 6, comparing the pixel value of each specified pixel around the center pixel with the pixel value of the center pixel yields the feature value of each specified pixel, as shown in part (b) of FIG. 6. Taking the specified pixel at the upper-left corner as the least significant bit and combining the feature values of the specified pixels in clockwise order, the LBP pattern of this neighborhood is 11110001. Its decimal value, i.e. the LBP feature value of the center pixel, can also be calculated: starting from the specified pixel at the upper-left corner and proceeding clockwise, the weights of the specified pixels around the center pixel are 1, 2, 4, 8, 16, 32, 64 and 128, so the LBP feature value is 1+16+32+64+128=241. After the LBP feature value of each pixel in the spatial sub-block is calculated, the statistical histogram of the spatial sub-block is computed from the LBP feature values of the pixels and normalized to obtain the LBP feature vector of the spatial sub-block.
In essence, LBP describes the texture information of each spatial sub-block in the video. The extracted LBP features have notable advantages such as rotation invariance and gray-scale invariance, which enhance the robustness of the lip-language feature vector to factors such as illumination conditions, rotation and translation, and improve classification accuracy. Moreover, LBP feature vectors are highly discriminative and simple to compute.
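A minimal sketch of Step 302a, computing the basic 3×3 LBP value of each interior pixel and the normalized 256-bin histogram of one spatial sub-block (so X would be 256 here); the clockwise bit order starting at the upper-left neighbor follows the FIG. 6 example, while the border handling is an assumption.

```python
import numpy as np

def lbp_histogram(patch):
    """Basic 3x3 LBP histogram of one spatial sub-block (a 2-D gray-scale array)."""
    p = patch.astype(np.int32)
    center = p[1:-1, 1:-1]
    # Clockwise neighbors starting from the upper-left pixel (least significant bit)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = p[1 + dy:p.shape[0] - 1 + dy, 1 + dx:p.shape[1] - 1 + dx]
        codes += (neighbor > center).astype(np.int32) << bit   # 1 if neighbor is larger
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()                                    # normalized 256-dim LBP vector
```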
Step 302b: The feature extraction device performs singular value decomposition on the local texture feature matrix to obtain a Y*Y first right singular matrix, and extracts the first column vector of the first right singular matrix as a projection vector, where a Y*Y matrix denotes a matrix with Y rows and Y columns.
After extracting the local texture feature matrix of the video sub-block, the feature extraction device may perform singular value decomposition on the local texture feature matrix, take the right singular matrix obtained from the decomposition as the first right singular matrix, and extract the first column vector of the first right singular matrix as the projection vector.
The feature extraction device may perform the singular value decomposition of the local texture feature matrix with the following formula:
[u,s,v]=svd(X);
where X denotes the local texture feature matrix, the matrix u is the left singular matrix obtained by decomposing the local texture feature matrix X, the matrix s is the singular value matrix obtained by decomposing the local texture feature matrix X, and the matrix v is the right singular matrix obtained by decomposing the local texture feature matrix X. Since the local texture feature matrix is an X*Y matrix, the first right singular matrix obtained by its singular value decomposition is a Y*Y matrix, and the projection vector is a Y-dimensional vector.
Step 302c: The feature extraction device calculates the product of the local texture feature matrix and the projection vector to obtain the X-dimensional lip-language feature vector of the video sub-block.
The feature extraction device may calculate the lip-language feature vector of the video sub-block with the following formula:
fPLBP=X*pVect;
where * denotes matrix multiplication, pVect denotes the first column vector of the matrix v, and fPLBP denotes the lip-language feature vector. Since the local texture feature matrix is an X*Y matrix and the projection vector is a Y-dimensional vector, the calculated lip-language feature vector is an X-dimensional vector.
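Steps 302b and 302c expressed with NumPy, for illustration: note that the symbol X is overloaded in the text (it denotes both the matrix and its number of rows), and that NumPy's `svd` returns V transposed, so the first column of v is the first row of `vt`.

```python
import numpy as np

def plbp_feature(lbp_columns):
    """PLBP feature of one video sub-block.

    lbp_columns: X x Y local texture feature matrix, one LBP histogram per column.
    """
    u, s, vt = np.linalg.svd(lbp_columns, full_matrices=True)  # [u, s, v] = svd(X)
    p_vect = vt[0]                       # first column of v, i.e. the Y-dim projection vector
    return lbp_columns @ p_vect          # f_PLBP = X * pVect, an X-dimensional vector
```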
The lip-language feature vector of the video can then be written as:
F = [fPLBP1, fPLBP2, …, fPLBPm];
where fPLBPi denotes the PLBP feature of the i-th video sub-block in the video, m denotes the number of video sub-blocks into which the video is divided (M×N), and F denotes the lip-language feature vector of the video. Since the number of video sub-blocks obtained from different videos is fixed, the dimension of the lip-language feature vector F of a video is also fixed, and this lip-language feature vector can be used to classify the video.
In this embodiment of the present invention, once the local texture feature matrix of a video sub-block has been extracted, singular value decomposition is performed on it, the first column vector of the resulting right singular matrix is used as the optimal projection vector, the local texture feature matrix is projected onto this vector, and the PLBP (Projection Local Binary Patterns) feature of the video sub-block is thereby extracted.
The PLBP feature is a picture-sequence feature selection method based on the LBP feature. Its basic principle is to extract the LBP feature vector of each frame in a picture sequence, combine the feature vectors of all frames into a feature matrix in which each column corresponds to the feature vector of one frame, and thus extract a feature of fixed dimension from picture sequences with different numbers of frames: the optimal projection vector is found by searching for the optimal projection direction, and the feature matrix is projected onto this optimal projection vector.
Further, this embodiment of the present invention adopts the idea of block-wise feature extraction: the video is divided into several video sub-blocks in time and space, the PLBP feature of each video sub-block is extracted, and the PLBP features of the video sub-blocks are finally combined into a new feature, which is output as the final feature. The use of the blocking technique increases the spatial and temporal information contained in the finally extracted feature vector, describes the motion of the lips in the picture sequence better, and greatly facilitates the optimization of the training algorithm of the subsequent classifier; in the later lip-reading classification stage it significantly improves the lip-reading recognition rate, and it is also of great reference value for related video recognition technologies.
Step 303: The feature extraction apparatus combines the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip language feature vector of the video.
The feature extraction apparatus may obtain the X-dimensional lip language feature vector of each video sub-block in the video and combine the X-dimensional lip language feature vectors of the M×N video sub-blocks, arranging the X-dimensional lip language feature vector of each video sub-block after that of the previous video sub-block, thereby obtaining an X×M×N-dimensional lip language feature vector.
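Putting steps 302 and 303 together, a minimal sketch of assembling the fixed-dimension video feature (assuming the `plbp_feature` function from the sketch above and a hypothetical list holding the M×N local texture feature matrices of the video sub-blocks; names are illustrative):
```python
import numpy as np

def video_lip_feature(sub_block_matrices: list[np.ndarray]) -> np.ndarray:
    """Compute the X*M*N-dimensional lip language feature vector of one video.

    sub_block_matrices: the M*N local texture feature matrices (one per video
    sub-block), each of shape (X, Y).
    """
    # One X-dimensional PLBP vector per sub-block, concatenated in sub-block order.
    return np.concatenate([plbp_feature(m) for m in sub_block_matrices])
```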
Using the foregoing steps 301-303, the lip language feature vector of any video can be extracted, and this lip language feature vector can be used to classify the semantic information of the video. The feature extraction apparatus may use the foregoing steps 301-303 to obtain the lip language feature vectors of multiple sample videos and train a classifier according to the lip language feature vectors of the multiple sample videos; whenever the semantic information of a video is to be classified, the foregoing steps 301-303 are used to obtain the lip language feature vector of the video, and the lip language feature vector is input into the trained classifier to obtain the classification result.
In the method provided by this embodiment of the present invention, the video is divided into M time sub-blocks according to the time dimension; according to the spatial dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks, and the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block form one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks in the video are combined to obtain the lip language feature vector of the video. The number of video sub-blocks obtained from different videos is the same, so the dimensions of the finally extracted lip language feature vectors of the videos are the same, which fixes the feature dimension; the feature dimension does not need to be dynamically adjusted during classification, which simplifies the operation and saves classification time. Moreover, after the local texture feature matrix of a video sub-block is extracted, the local texture feature matrix is projected to obtain the lip language feature vector, which enhances the robustness of the lip language feature.
The feature extraction method provided by this embodiment of the present invention can extract a lip language feature vector of fixed dimension, and this lip language feature vector can be used to classify the semantic information of a video; the classification process is described in detail in the next embodiment.
FIG. 7 is a flowchart of a lip language classification method according to an embodiment of the present invention. The execution entity of this embodiment of the present invention is a classification apparatus. Referring to FIG. 7, the method includes:
Step 701: The classification apparatus extracts the lip language feature vectors of the video sub-blocks in D pre-selected sample videos.
To train the multi-layer classifier, the classification apparatus obtains D sample videos in advance, divides each sample video into M×N video sub-blocks, and then extracts the lip language feature vector of each video sub-block; the specific process is similar to the foregoing steps 301-302 and is not described again here.
Step 702: The classification apparatus trains the classification accuracy of a multi-layer classifier according to a preset training algorithm based on the lip language feature vectors of the video sub-blocks in the D sample videos, and stops when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain a trained multi-layer classifier, where D is a positive integer and D > 1.
In this embodiment of the present invention, the classification apparatus may train the classification accuracy of the multi-layer classifier according to the preset training algorithm based on the lip language feature vectors of the video sub-blocks in the D sample videos, stopping when the classification accuracy of the multi-layer classifier satisfies the first preset condition; the trained multi-layer classifier can then be used to classify the semantic information of a video to be classified, thereby implementing lip language recognition. The preset training algorithm is determined in advance by the classification apparatus and may be an SVM (Support Vector Machine) classification algorithm, an artificial neural network algorithm, or the like. The first preset condition is used to determine the training target of the multi-layer classifier and may be determined according to the required classification accuracy; the first preset condition may be that the current classification accuracy of the multi-layer classifier reaches a preset classification accuracy, or that the difference between the current classification accuracy and the previous classification accuracy of the multi-layer classifier is smaller than a preset difference, or that the number of iterations of the training process reaches a maximum number of iterations, or the like, which is not limited in this embodiment of the present invention.
In actual application, the classification apparatus may combine the lip language feature vectors of all video sub-blocks in each sample video into one lip language feature vector; the specific process is similar to the foregoing step 303 "The feature extraction apparatus combines the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip language feature vector of the video" and is not described again here. The classification apparatus may then combine the lip language feature vectors of the sample videos to obtain a feature matrix, use the feature matrix as the input of the multi-layer classifier, and train the classification accuracy of the multi-layer classifier using the ELM (Extreme Learning Machine) algorithm. The multi-layer classifier includes input layer nodes and at least one hidden layer node, and the input weights represent the weights of the input layer nodes; during training, only the number of hidden layer nodes, the input weights, the bias terms, and the excitation function need to be determined to train the multi-layer classifier, and when training is completed, the multi-layer classifier is determined according to the input weights and output weights of the current training. However, the input weights and bias terms of the hidden layer nodes used by the ELM algorithm are obtained by random assignment; random assignment causes the performance of the multi-layer classifier to be unstable on high-dimensional small samples, and it is difficult to obtain optimal parameter values.
For this reason, the classification apparatus may determine the input weights of the multi-layer classifier by projection and train the classification accuracy of the multi-layer classifier based on the determined input weights.
This embodiment of the present invention may use a PELM (Projection Extreme Learning Machine) as the multi-layer classifier for distinguishing different semantic information (speech content). FIG. 8 is a structural diagram of the PELM multi-layer classifier, which includes input nodes, hidden layer nodes, and output nodes, where D denotes the dimension of the input feature vector, N denotes the number of hidden layer nodes, and m denotes the dimension of the output vector (that is, the number of classes of speech content to be distinguished).
Given training samples {X, T}, where X is the input sample matrix, in which each row corresponds to one input sample (the feature vector of a video sub-block); T is the class matrix corresponding to X, in which each row is the class vector of one sample (the position corresponding to the class to which the sample belongs is 1, and the remaining positions are 0); wDN denotes the input weight from the D-th input node to the N-th hidden layer node, and βNm denotes the output weight from the N-th hidden layer node to the m-th output node. Training the model of the PELM multi-layer classifier means finding, from {X, T}, appropriate input weights W and output weights β such that T = g(XW)β, where g(x) is the hidden layer node excitation function.
The training process of the PELM multi-layer classifier may be as follows: the classification apparatus obtains the lip language feature vectors of multiple sample videos and combines them to obtain a feature matrix; performs singular value decomposition on the feature matrix and obtains the right singular matrix resulting from the decomposition; extracts, from the right singular matrix according to a preset retained dimension, the column vectors corresponding to the preset retained dimension as a projection matrix; uses the projection matrix as the input weights of the multi-layer classifier, so that the projection matrix represents the weights of the input layer nodes of the multi-layer classifier instead of the input weights being assigned randomly; trains the classification accuracy of the multi-layer classifier based on the projection matrix, the current number of hidden layer nodes, and the excitation function; and, when training is completed, determines the multi-layer classifier according to the trained input weights and output weights, so that the multi-layer classifier can be used to classify the semantic information of a video. The preset retained dimension specifies the number of columns of the projection matrix and is smaller than the number of columns of the right singular matrix; it may be 1, 2, or another value, which is not limited in this embodiment of the present invention.
Specifically, the classification apparatus obtains the lip language feature vectors of the D sample videos; assuming that the lip language feature vector of each sample video is an R-dimensional vector, the R-dimensional lip language feature vectors of the D sample videos are combined to obtain a D*R order feature matrix. The formula [P, S, QT] = svd(X) is applied to perform singular value decomposition on the feature matrix, the right singular matrix QT obtained from the decomposition is acquired, and the column vectors corresponding to the preset retained dimension K are extracted from the right singular matrix as the projection matrix QK. The projection matrix QK is used as the input weight matrix W of the multi-layer classifier; let H = g(XW), and apply the formula β = H⁻¹T to calculate β. At this point, the PELM multi-layer classifier is trained. When the feature matrix xnew = [x1, x2, …, xD] of a new video is obtained, the formula t = g(xnewW)β is applied, where t = [t1, t2, …, tm]; the class corresponding to the maximum value among t1, t2, …, tm is the classification result of the video.
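A minimal numpy sketch of this training and prediction procedure (assuming a sigmoid excitation function, using the Moore-Penrose pseudo-inverse for H⁻¹T since H is generally not square, and letting the retained dimension k also serve as the number of hidden layer nodes in this simplified sketch; names are illustrative):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pelm(X, T, k):
    """Train a PELM classifier.

    X: (num_samples, D) feature matrix, one row per sample video.
    T: (num_samples, m) class matrix with one-hot rows.
    k: preset retained dimension (number of right singular vectors kept).
    """
    # SVD of the feature matrix; rows of vt are the right singular vectors.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    W = vt[:k, :].T                 # projection matrix used as input weights, shape (D, k)
    H = sigmoid(X @ W)              # hidden layer output, shape (num_samples, k)
    beta = np.linalg.pinv(H) @ T    # output weights, shape (k, m)
    return W, beta

def classify_pelm(x_new, W, beta):
    """Return the predicted class index for a new video's feature vector."""
    t = sigmoid(x_new @ W) @ beta   # (m,) output scores
    return int(np.argmax(t))
```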
PELM is a simple, easy-to-use, and effective single-hidden-layer neural network learning algorithm. Compared with traditional neural network learning algorithms, which require a large number of network training parameters to be set manually and easily produce locally optimal solutions, PELM only needs to use the projection matrix of the feature matrix formed by combining the lip language feature vectors of multiple sample videos as the input weights of the network and to set the number of hidden layer nodes of the network; during execution of the algorithm, the input weights of the network and the bias terms of the hidden layer nodes do not need to be adjusted, and a unique optimal solution can be produced. PELM therefore has the advantages of fast learning speed and good generalization performance, so that the trained multi-layer classifier can achieve a stable recognition rate.
It should be noted that this embodiment of the present invention is described only by using an example in which the classification apparatus uses all video sub-blocks of the D videos as samples. In actual application, the classification apparatus may also select among the video sub-blocks of the multiple sample videos according to a preset selection policy, pick out the video sub-blocks that describe the lip language information well, recombine the selected video sub-blocks, and use only the selected video sub-blocks as samples; these samples can be used to train the classification accuracy of the multi-layer classifier, so as to reduce redundant features and improve the calculation speed. Moreover, the number of selected video sub-blocks is the same in different sample videos, so as to ensure that the dimension of the lip language feature vectors of different sample videos is fixed. The preset selection policy is used to determine how the video sub-blocks are selected and may be determined in advance by the classification apparatus, which is not limited in this embodiment of the present invention.
Optionally, the classification apparatus may use the training method provided in the following steps 702a-702c to select video sub-blocks in the sample videos and train a multi-layer classifier based on the lip language feature vectors of the selected video sub-blocks:
Step 702a: Construct L specified identifiers according to a predetermined rule, and obtain the flag value of each specified identifier, where the L specified identifiers are used to determine the corresponding number of hidden layer nodes and the selected video sub-blocks in the D sample videos.
The classification apparatus may construct L specified identifiers according to the predetermined rule, where L is a positive integer and L > 1. Each specified identifier is then initialized according to different numbers of hidden layer nodes and different selections of video sub-blocks, and a flag value is assigned to each specified identifier; a flag value may be assigned to each specified identifier randomly, or a preset flag value, such as 0000 or 1111, may be assigned to each specified identifier, which is not limited in this embodiment of the present invention. Each flag value includes flag bits indicating the number of hidden layer nodes and flag bits indicating whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, that is, each flag value corresponds to one number of hidden layer nodes and to the selected video sub-blocks in the D sample videos, and the corresponding number of hidden layer nodes and the video sub-blocks to be selected in each sample video can be determined from each flag value.
For example, the classification apparatus may use 10 flag bits as the flag bits indicating the number of hidden layer nodes, and the decimal value corresponding to the binary number formed by these 10 flag bits is the number of hidden layer nodes. The classification apparatus may also use m flag bits as the flag bits indicating whether each video sub-block is selected, where m is the number of video sub-blocks in each sample video; when a flag bit among the m flag bits is 1, the video sub-block corresponding to that flag bit is selected, and when a flag bit among the m flag bits is 0, the video sub-block corresponding to that flag bit is not selected. For example, when the m flag bits are 1000, the first video sub-block in each video is selected. The classification apparatus takes the binary number formed by these 10+m flag bits as one flag value, where 10+m is the number of bits of the flag value, and assigns the flag value to one specified identifier.
Step 702b: For the flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to that flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos.
In this embodiment of the present invention, each flag value is used to train one corresponding multi-layer classifier. For each specified identifier, when the flag value of the specified identifier is obtained, the number of hidden layer nodes corresponding to the flag value and the selected video sub-blocks in the corresponding D sample videos can be determined, and the lip language feature vectors of the selected video sub-blocks in the D sample videos corresponding to the flag value are obtained; the classification accuracy of the multi-layer classifier corresponding to the flag value is then trained according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos, and training stops when the classification accuracy of the multi-layer classifier satisfies the first preset condition, yielding a trained multi-layer classifier; the flag values of the L specified identifiers can thus train L multi-layer classifiers.
Step 702b may specifically include the following steps 702b-1 to 702b-4:
Step 702b-1: Determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the H selected video sub-blocks in the corresponding D sample videos, where H is a positive integer.
According to the values of the flag bits in the flag value, the classification apparatus can determine the number of hidden layer nodes corresponding to the flag value and the H selected video sub-blocks in the corresponding D sample videos.
Based on the example in step 702a, the flag value includes 10+m flag bits; the classification apparatus calculates the decimal value corresponding to the first 10 flag bits of the flag value, which is the number of hidden layer nodes corresponding to the flag value, and obtains the flag bits with value 1 among the last m flag bits, where the video sub-blocks corresponding to the flag bits with value 1 are the selected video sub-blocks in each sample video.
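As a small sketch of this decoding step (assuming the flag value is held as a string of '0'/'1' characters with the 10 hidden-node bits first, as in the example above; names are illustrative):
```python
def decode_flag_value(flag_value: str, m: int) -> tuple[int, list[int]]:
    """Decode a (10+m)-bit flag value.

    Returns the number of hidden layer nodes (from the first 10 bits) and the
    indices of the selected video sub-blocks ('1' positions in the last m bits).
    """
    hidden_nodes = int(flag_value[:10], 2)
    selected = [i for i, bit in enumerate(flag_value[10:10 + m]) if bit == "1"]
    return hidden_nodes, selected

# Example: 5 hidden layer nodes, sub-blocks 0 and 3 selected out of m = 4.
print(decode_flag_value("0000000101" + "1001", m=4))  # (5, [0, 3])
```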
Step 702b-2: For each of the D sample videos, combine the X-dimensional lip language feature vectors of the H selected video sub-blocks in the sample video to obtain an H×X-dimensional lip language feature vector of the sample video, and combine the H×X-dimensional lip language feature vectors of the D sample videos to obtain a D*(H×X) order feature matrix.
After the video sub-blocks are selected according to the flag value, the classification apparatus trains the multi-layer classifier based only on the selected video sub-blocks. Therefore, for each of the D sample videos, the X-dimensional lip language feature vectors of the H selected video sub-blocks in the sample video are combined to obtain an H×X-dimensional lip language feature vector of the sample video, and the lip language feature vectors of all video sub-blocks in the sample video are no longer combined. One H×X-dimensional lip language feature vector is obtained for each sample video; taking the H×X-dimensional lip language feature vector of each sample video as one row, the H×X-dimensional lip language feature vectors of the D sample videos are combined to obtain a D*(H×X) order feature matrix.
Step 702b-3: Perform singular value decomposition on the feature matrix to obtain a second right singular matrix, and extract, from the second right singular matrix, the column vectors corresponding to the preset retained dimension as the projection matrix.
After obtaining the feature matrix, the classification apparatus performs singular value decomposition on the feature matrix to obtain a right singular matrix, which serves as the second right singular matrix, and extracts, from the second right singular matrix, the column vectors corresponding to the preset retained dimension as the projection matrix. The projection process is similar to the foregoing step 302b and is not described again here.
Step 702b-4: Train the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes, where the multi-layer classifier includes at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes.
After calculating the projection matrix, the classification apparatus uses the projection matrix to represent the weights of the input layer nodes of the multi-layer classifier and trains the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes, stopping when the classification accuracy of the multi-layer classifier satisfies the first preset condition.
During training, the classification apparatus may use the D sample videos as training sample videos, obtain the projection matrix corresponding to the D sample videos, and train a multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes; it then obtains D' test sample videos, obtains the feature matrix corresponding to the D' sample videos, inputs the feature matrix into the multi-layer classifier to obtain the classification result of each test sample video, compares the classification result of each test sample video with the class into which the test sample video is actually divided, and calculates the classification accuracy of the multi-layer classifier.
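A brief sketch of this accuracy evaluation (assuming `classify_pelm`, `W`, and `beta` from the PELM sketch above, `X_test` holding one test sample feature vector per row, and `y_test` holding the true class indices; all names are illustrative):
```python
import numpy as np

def classification_accuracy(X_test, y_test, W, beta):
    """Fraction of test sample videos whose predicted class matches the true class."""
    predictions = [classify_pelm(x, W, beta) for x in X_test]
    return float(np.mean(np.array(predictions) == np.array(y_test)))
```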
Step 702c: Obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, and update the flag value of each specified identifier.
Among the L specified identifiers, the flag value of each specified identifier trains one multi-layer classifier; the classification apparatus may therefore search the flag values of the L specified identifiers according to the preset search algorithm to find the global optimal flag value of the L specified identifiers, thereby obtaining the multi-layer classifier corresponding to the global optimal flag value. The preset search algorithm may be determined in advance by the classification apparatus, which is not limited in this embodiment of the present invention.
Optionally, the classification apparatus may obtain the global optimal flag value by using the search method provided in the following steps 702c-1 to 702c-4:
Step 702c-1: Calculate the average optimal flag value of the L specified identifiers according to the flag value of each specified identifier.
The classification apparatus may calculate the average optimal flag value of the L specified identifiers according to the flag value of each specified identifier by applying the following formula:
mbest(t) = (1/L) * (P1(t) + P2(t) + … + PL(t)), whose n-th component is (1/L) * (P1,n(t) + P2,n(t) + … + PL,n(t));
where mbest denotes the average optimal flag value of the L specified identifiers, L denotes the number of specified identifiers, t denotes the current iteration number, n denotes the number of dimensions of the flag value of a specified identifier, and Pi,n(t) denotes the value of the n-th flag bit in the flag value of the i-th specified identifier.
Step 702c-2: Obtain the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier.
For each specified identifier, the classification apparatus may use the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as the fitness value of the specified identifier, and use the fitness values to adjust and evolve the flag values of the L specified identifiers; according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, the optimal flag value of the specified identifier is updated to obtain the updated optimal flag value of the specified identifier, and according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, the global optimal flag value of the L specified identifiers is updated to obtain the updated global optimal flag value.
The optimal fitness value of a specified identifier refers to the largest fitness value among the fitness values corresponding to the multiple flag values of the specified identifier over multiple iterations, and the optimal flag value of a specified identifier refers to the flag value with the largest fitness value among the multiple flag values of the specified identifier over multiple iterations.
When updating the optimal flag value of a specified identifier, the classification apparatus obtains the current fitness value and the historical optimal fitness value of the specified identifier; if the current fitness value of the specified identifier is greater than the historical optimal fitness value of the specified identifier, the current flag value of the specified identifier is used as the optimal flag value of the specified identifier, and if the current fitness value of the specified identifier is not greater than the historical optimal fitness value of the specified identifier, the optimal flag value of the specified identifier remains unchanged and is still the historical optimal flag value of the specified identifier.
The global optimal fitness value of the L specified identifiers refers to the largest fitness value among the optimal fitness values of the L specified identifiers, and the global optimal flag value of the L specified identifiers refers to the flag value with the largest fitness value among the multiple flag values of all specified identifiers over multiple iterations.
When updating the global optimal flag value of the L specified identifiers, the classification apparatus obtains the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers; if the current fitness value of the specified identifier is greater than the historical global optimal fitness value of the L specified identifiers, the current flag value of the specified identifier is used as the global optimal flag value of the L specified identifiers, and if the current fitness value of the specified identifier is not greater than the historical global optimal fitness value of the L specified identifiers, the global optimal flag value of the L specified identifiers remains unchanged and is still the historical global optimal flag value of the L specified identifiers.
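A small sketch of these two update rules (assuming per-identifier records of the current flag value and fitness plus the historical bests; names are illustrative):
```python
def update_bests(flag_value, fitness, best_flag, best_fitness, global_flag, global_fitness):
    """Update one identifier's optimal flag value and the swarm's global optimal flag value."""
    if fitness > best_fitness:                 # personal (per-identifier) best update
        best_flag, best_fitness = flag_value, fitness
    if fitness > global_fitness:               # global best update over all L identifiers
        global_flag, global_fitness = flag_value, fitness
    return best_flag, best_fitness, global_flag, global_fitness
```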
Step 702c-3: Calculate the local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value.
Specifically, the classification apparatus may calculate the local attractor of the specified identifier according to the optimal flag value of the specified identifier and the global optimal flag value by applying the following formula:
p(t) = φ*Pi(t) + (1-φ)*Pg(t);
where φ is a random number uniformly distributed on (0, 1), p(t) denotes the local attractor of the specified identifier, Pi(t) denotes the optimal flag value of the specified identifier, and Pg(t) denotes the global optimal flag value of the L specified identifiers.
Step 702c-4: Update the flag value of each specified identifier according to a preset update algorithm based on the local attractor of each specified identifier and the average optimal flag value.
Specifically, the classification apparatus may update the flag value of the specified identifier according to the local attractor of the specified identifier and the average optimal flag value by applying the following formula:
x(t+1) = p(t) ± β*abs(mbest - x(t))*ln(1/u);
where x(t+1) denotes the updated flag value of the specified identifier, p(t) denotes the local attractor of the specified identifier, β denotes the contraction-expansion coefficient, mbest denotes the average optimal flag value, x(t) denotes the flag value of the specified identifier before the update, and u is a random number uniformly distributed on (0, 1), u ~ U(0, 1).
Further, β = 0.5*(Maxiter - count)/Maxiter + 0.5, where Maxiter denotes the preset maximum number of iterations and count denotes the current iteration number.
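A minimal numpy sketch of one such update, following the formulas above literally (the local attractor is computed as in step 702c-3; how the continuous result is mapped back to binary flag bits is not specified in this passage, so the sketch stops at the continuous update, and the ± sign is chosen with equal probability as an assumption; names are illustrative):
```python
import numpy as np

def qpso_update(x, p_best, g_best, m_best, count, max_iter, rng=np.random.default_rng()):
    """One position update of a specified identifier per the formulas above.

    x, p_best, g_best, m_best: current position, personal optimal, global optimal,
    and average optimal flag values, given as numeric vectors over the flag-bit dimensions.
    """
    phi = rng.uniform(0.0, 1.0, size=x.shape)            # phi ~ U(0, 1)
    p = phi * p_best + (1.0 - phi) * g_best              # local attractor p(t)
    beta = 0.5 * (max_iter - count) / max_iter + 0.5     # contraction-expansion coefficient
    u = rng.uniform(0.0, 1.0, size=x.shape)              # u ~ U(0, 1)
    sign = np.where(rng.uniform(size=x.shape) < 0.5, 1.0, -1.0)  # the ± choice (assumed 50/50)
    return p + sign * beta * np.abs(m_best - x) * np.log(1.0 / u)
```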
After the flag value of each specified identifier is updated, the flag value of each specified identifier changes; the classification apparatus then repeats the foregoing steps 702b to 702c based on the updated flag value of each specified identifier, stopping when the calculated global optimal flag value satisfies a second preset condition, and obtains the multi-layer classifier trained from the global optimal flag value.
The second preset condition may include at least one of the following: the global optimal flag value reaches a preset optimal flag value; the global optimal fitness value corresponding to the global optimal flag value reaches a maximum fitness value; the difference between the global optimal fitness value corresponding to the global optimal flag value and the global optimal fitness value corresponding to the previously obtained global optimal flag value is smaller than a preset global difference; the current number of iterations reaches the maximum number of iterations; or the classification accuracy of the multi-layer classifier trained from the global optimal flag value reaches a preset accuracy. This is not limited in this embodiment of the present invention.
Step 703: The classification apparatus classifies the semantic information of a video to be classified based on the multi-layer classifier.
After finding the global optimal flag value, the classification apparatus obtains the multi-layer classifier trained from the global optimal flag value and obtains the input weights and output weights of the multi-layer classifier; the multi-layer classifier can then be used to classify the semantic information of a video. When the classification apparatus obtains a video to be classified, it uses the method shown in the foregoing steps 301-303 to extract the lip language feature vector of each video sub-block in the video, inputs the lip language feature vector of each video sub-block into the multi-layer classifier, and calculates the classification result according to the input weights and output weights trained for the multi-layer classifier, thereby implementing the lip language recognition process for the video and obtaining the semantic information of the video.
In the method provided by this embodiment of the present invention, the video is divided into M time sub-blocks according to the time dimension; according to the spatial dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks, and the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block form one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks in the video are combined to obtain the lip language feature vector of the video. The number of video sub-blocks obtained from different videos is the same, so the dimensions of the finally extracted lip language feature vectors of the videos are the same, which fixes the feature dimension; the feature dimension does not need to be dynamically adjusted, which simplifies the operation and saves training time, and when the trained multi-layer classifier is applied to classify videos, the feature dimension likewise does not need to be dynamically adjusted, which simplifies the operation, saves classification time, and improves classification accuracy. Moreover, after the local texture feature matrix of a video sub-block is extracted, the local texture feature matrix is projected to obtain the lip language feature vector, which enhances the robustness of the lip language feature. In addition, the video sub-blocks are selected by applying the preset selection policy, and video sub-blocks that describe the lip language information well are picked out, which overcomes the drawback that the lip language information differs among the video sub-blocks of the block-based PLBP feature, reduces redundant features, and improves the calculation speed and classification accuracy.
Addressing the drawbacks that current lip language features have non-fixed dimensions and contain a large amount of redundant information, this embodiment of the present invention innovatively proposes a video division method and a descriptive operator, PLBP, which can effectively increase the spatial and temporal information in the video while ensuring the integrity of the information, represent videos of non-fixed length with features of fixed dimension, and describe spatio-temporal features well, greatly facilitating the optimization of the later recognition algorithm and thereby significantly improving the lip-reading recognition rate in the subsequent lip language recognition stage.
This embodiment of the present invention combines the BQPSO (Binary Quantum Particle Swarm Optimization) algorithm with the PELM algorithm: the constructed specified identifiers are used as particles, BQPSO is used as the search algorithm for selecting feature combinations, the classification accuracy of the PELM multi-layer classifier on the sample videos is used as the fitness value, the flag values of the L specified identifiers are continuously adjusted, and the flag value that optimizes the fitness value is searched for, thereby determining the feature combination that optimizes the fitness value. Selecting video sub-blocks with BQPSO and training and classifying with the PELM multi-layer classifier can significantly improve the speed of sample training in the lip language recognition process and achieve a higher recognition rate, which enhances applicability on mobile terminals and provides a reference for the application of other biometric recognition technologies on mobile terminals.
To demonstrate the effect of the method provided by this embodiment of the present invention, experiments were conducted with the HMM (Hidden Markov Model) algorithm of the prior art and with the method provided by this embodiment of the present invention. A total of 20 experimental commands were used; for each experimental command, 5 samples were used as training samples and 5 samples as test samples, so that 100 training samples and 100 test samples were obtained in total.
The training time required when training with the prior-art HMM algorithm and the 100 training samples, and the training time required when training with the method provided by this embodiment of the present invention and the 100 training samples, are shown in Table 1 below. The classification accuracy achieved when the classifier trained with the prior-art HMM algorithm classifies the 100 test samples, and the classification accuracy achieved when the multi-layer classifier trained with the method provided by this embodiment of the present invention classifies the 100 test samples, are shown in Table 2 below.
As can be seen from Table 1 below, the training time of the method provided by this embodiment of the present invention is below 0.1 s for almost all volunteers, while the average training time of the traditional HMM algorithm is as long as 4.538 s. As can be seen from Table 2 below, the average classification accuracy of the method provided by this embodiment of the present invention is as high as 97.2%, while the average classification accuracy of the traditional HMM algorithm is only 84.5%.
Table 1
Volunteer | Training time of the HMM algorithm (s) | Training time of the present invention (s)
1 | 8.7517 | 0.0780
2 | 3.7284 | 0.0156
3 | 5.3352 | 0.0156
4 | 1.9968 | 0.0780
5 | 2.4180 | 0.1248
6 | 7.1136 | 0.0624
7 | 8.5021 | 0.0156
8 | 3.8220 | 0.0156
9 | 1.7472 | 0.0780
10 | 1.9656 | 0.0312
Table 2
Volunteer | Classification accuracy of the HMM algorithm | Classification accuracy of the present invention
1 | 93% | 89%
2 | 87% | 95%
3 | 96% | 97%
4 | 87% | 100%
5 | 81% | 100%
6 | 84% | 98%
7 | 83% | 100%
8 | 86% | 99%
9 | 81% | 98%
10 | 67% | 96%
FIG. 9 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present invention. Referring to FIG. 9, the apparatus includes:
a dividing module 901, configured to divide a video into a first preset number of time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region;
the dividing module 901 being further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks and form the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block into one video sub-block, where the video yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks;
a feature calculation module 902, configured to calculate the lip language feature vector of each video sub-block obtained by the dividing module 901, where the lip language feature vector describes the texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
a combination module 903, configured to combine the lip language feature vectors of the multiple video sub-blocks in the video obtained by the feature calculation module 902 to obtain the lip language feature vector of the video.
In the apparatus provided by this embodiment of the present invention, the video is divided into M time sub-blocks according to the time dimension; according to the spatial dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks, and the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block form one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks in the video are combined to obtain the lip language feature vector of the video. The number of video sub-blocks obtained from different videos is the same, so the dimensions of the finally extracted lip language feature vectors of the videos are the same, which fixes the feature dimension; the feature dimension does not need to be dynamically adjusted, which simplifies the operation and saves time.
Optionally, the feature calculation module 902 includes:
an extraction unit, configured to extract an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, where the one video sub-block includes Y spatial sub-blocks;
a combination unit, configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extraction unit to obtain an X*Y order local texture feature matrix;
a decomposition unit, configured to perform singular value decomposition on the local texture feature matrix obtained by the combination unit to obtain a Y*Y order first right singular matrix;
a projection unit, configured to extract the first column vector of the first right singular matrix obtained by the decomposition unit as a projection vector; and
a calculation unit, configured to calculate the product of the local texture feature matrix obtained by the combination unit and the projection vector obtained by the projection unit to obtain the X-dimensional lip language feature vector of the video sub-block;
where Y is a positive integer, an X*Y order matrix denotes a matrix with X rows and Y columns, and a Y*Y order matrix denotes a matrix with Y rows and Y columns.
All of the foregoing optional technical solutions may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.
FIG. 10 is a schematic structural diagram of a lip language classification apparatus according to an embodiment of the present invention. Referring to FIG. 10, the apparatus includes:
a dividing module 1001, configured to: for each of D pre-selected sample videos, divide the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region;
the dividing module 1001 being further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks and form the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block into one video sub-block, where the sample video yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks;
a feature calculation module 1002, configured to calculate the lip language feature vector of each video sub-block obtained by the dividing module 1001, where the lip language feature vector describes the texture information of the lip region in the corresponding video sub-block; and
a training module 1003, configured to train the classification accuracy of a multi-layer classifier according to a preset training algorithm based on the lip language feature vectors of the video sub-blocks in the D sample videos obtained by the feature calculation module 1002, stopping when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain a trained multi-layer classifier, where the multi-layer classifier is used to classify the semantic information of a video;
where M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
In the apparatus provided by this embodiment of the present invention, the video is divided into M time sub-blocks according to the time dimension; according to the spatial dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks, and the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block form one video sub-block, so that the video yields M×N video sub-blocks in total; the X-dimensional lip language feature vector of each video sub-block is then calculated. The number of video sub-blocks obtained from different videos is the same, so the dimensions of the finally extracted lip language feature vectors of the videos are the same, which fixes the feature dimension; the classification accuracy of the multi-layer classifier is trained according to the lip language feature vectors of the video sub-blocks in the sample videos, so the feature dimension does not need to be dynamically adjusted, which simplifies the operation and saves training time, and when the trained multi-layer classifier is applied to classify videos, the feature dimension likewise does not need to be dynamically adjusted, which simplifies the operation, saves classification time, and improves classification accuracy.
Optionally, the training module 1003 is configured to perform the following steps:
Step 1: construct L specified identifiers according to a predetermined rule, where the L specified identifiers are used to determine a corresponding number of hidden layer nodes and the selected video sub-blocks of the D sample videos; L is a positive integer and L > 1.
Step 2: obtain a flag value of each specified identifier, where each flag value includes a flag bit used to indicate the number of hidden layer nodes and a flag bit used to indicate whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier.
Step 3: for the flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos.
Step 4: according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, obtain a global optimal flag value of the L specified identifiers by using a preset search algorithm, and update the flag value of each specified identifier.
Steps 3 and 4 are repeated until the global optimal flag value satisfies a second preset condition (one possible encoding of these flag values is sketched below).
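The flag values in steps 1 to 4 act as search particles: each one encodes a candidate number of hidden layer nodes together with a candidate subset of the M×N video sub-blocks. The embodiment does not fix a concrete encoding, so the sketch below shows only one plausible layout, a fixed-width binary field for the hidden-node count followed by one selection bit per video sub-block; the bit width, the random initialization, and the function names are assumptions introduced purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def init_flag_values(L, n_sub_blocks, hidden_bits=8):
        """One flag value per specified identifier: hidden_bits bits encoding the
        hidden-node count, followed by one selection bit per video sub-block."""
        return rng.integers(0, 2, size=(L, hidden_bits + n_sub_blocks))

    def decode(flag_value, hidden_bits=8, min_hidden=1):
        """Recover (number of hidden layer nodes, boolean mask of selected sub-blocks)."""
        hidden = int("".join(map(str, flag_value[:hidden_bits])), 2) + min_hidden
        selected = flag_value[hidden_bits:].astype(bool)
        return hidden, selected

Decoding a flag value yields exactly the two quantities that steps 3 and 4 consume: a hidden-node count and a mask over the video sub-blocks of the sample videos.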
Optionally, the training module 1003 is further configured to: determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the H selected video sub-blocks of the corresponding D sample videos, where H is a positive integer; for each of the D sample videos, combine the X-dimensional lip language feature vectors of the H selected video sub-blocks of the sample video to obtain an H×X-dimensional lip language feature vector of the sample video; combine the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X); perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden layer nodes, where the multi-layer classifier includes at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes.
Optionally, the training module 1003 is further configured to: calculate an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier; obtain an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier; calculate a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and update the flag value of each specified identifier by using a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
Optionally, the training module 1003 is further configured to: for each specified identifier, use the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as a fitness value of the specified identifier; update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and update the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value.
All of the foregoing optional technical solutions may be combined in any manner to form optional embodiments of the present invention, and details are not described herein again.
FIG. 11 is a schematic structural diagram of a feature extraction device according to an embodiment of the present invention. Referring to FIG. 11, the device includes a memory 1101 and a processor 1102. The memory 1101 is connected to the processor 1102 and stores program code, and the processor 1102 is configured to invoke the program code to perform the following operations:
dividing a video into M time sub-blocks according to the chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region;
dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, so that M×N video sub-blocks are obtained for the video in total, and each time sub-block includes N video sub-blocks;
calculating a lip language feature vector of each video sub-block, where the lip language feature vector describes texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
combining the X-dimensional lip language feature vectors of the M×N video sub-blocks of the video to obtain an X×M×N-dimensional lip language feature vector of the video,
where M, N, and X are all positive integers, and × denotes a product of numerical values.
The device provided by this embodiment of the present invention divides a video into M time sub-blocks along the time dimension and divides the lip region of each video frame in each time sub-block into N spatial sub-blocks along the spatial dimension; the spatial sub-blocks at the same position in the video frames of the same time sub-block form one video sub-block, so the video yields M×N video sub-blocks in total. The lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks are combined into the lip language feature vector of the video. Because different videos yield the same number of video sub-blocks, the lip language feature vectors finally extracted from different videos have the same dimensionality; the feature dimension is therefore fixed and needs no dynamic adjustment, which simplifies operation and saves time.
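The fixed feature dimensionality comes from the partition itself: however many frames a video contains, it is always reduced to M×N video sub-blocks, each contributing an X-dimensional vector. The following sketch shows one way to perform that spatio-temporal partition with NumPy; splitting the N spatial sub-blocks as an N_rows × N_cols grid and the `lip_feature` callable are assumptions used only for illustration, not details fixed by this embodiment.

    import numpy as np

    def split_video(frames, M, N_rows, N_cols):
        """Partition a lip-region video into M x (N_rows*N_cols) spatio-temporal sub-blocks.

        frames: sequence of T grayscale lip-region images, each of shape (H, W).
        Returns a list of M*N sub-blocks, each an array of shape (t, h, w).
        """
        frames = np.asarray(frames)
        T, H, W = frames.shape
        time_groups = np.array_split(np.arange(T), M)     # M runs of consecutive frames
        row_groups = np.array_split(np.arange(H), N_rows) # spatial grid, N = N_rows * N_cols
        col_groups = np.array_split(np.arange(W), N_cols)
        sub_blocks = []
        for t_idx in time_groups:
            for r in row_groups:
                for c in col_groups:
                    # same spatial cell across one time sub-block = one video sub-block
                    sub_blocks.append(frames[np.ix_(t_idx, r, c)])
        return sub_blocks

    def video_feature(frames, M, N_rows, N_cols, lip_feature):
        """Concatenate the X-dim vector of every sub-block into one X*M*N-dim video feature."""
        blocks = split_video(frames, M, N_rows, N_cols)
        return np.concatenate([lip_feature(b) for b in blocks])

With M, N_rows, and N_cols fixed, `video_feature` always returns a vector of length X × M × N_rows × N_cols, which is the property the embodiment relies on.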
Optionally, the processor 1102 is further configured to invoke the program code to perform the following operations:
extracting an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, where the video sub-block includes Y spatial sub-blocks;
combining the LBP feature vectors of the Y spatial sub-blocks to obtain a local texture feature matrix of order X*Y;
performing singular value decomposition on the local texture feature matrix to obtain a first right singular matrix of order Y*Y;
extracting the first column vector of the first right singular matrix as a projection vector; and
calculating the product of the local texture feature matrix and the projection vector to obtain the X-dimensional lip language feature vector of the video sub-block,
where Y is a positive integer, a matrix of order X*Y is a matrix with X rows and Y columns, and a matrix of order Y*Y is a matrix with Y rows and Y columns (a sketch of these operations follows).
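The operations above compress the Y per-frame LBP vectors of one video sub-block into a single X-dimensional vector through a rank-one projection. A minimal sketch follows; the use of scikit-image's `local_binary_pattern`, the uniform-pattern variant, and the parameters P = 8, R = 1 (giving X = P + 2 = 10 histogram bins) are assumptions for illustration, since the embodiment does not prescribe a particular LBP configuration.

    import numpy as np
    from skimage.feature import local_binary_pattern  # assumed dependency

    def lbp_histogram(patch, P=8, R=1):
        """X-dimensional LBP histogram of one spatial sub-block (one frame patch)."""
        codes = local_binary_pattern(patch, P, R, method="uniform")
        n_bins = P + 2                          # 'uniform' LBP yields P+2 distinct codes
        hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
        return hist.astype(float)

    def sub_block_feature(video_sub_block):
        """Fuse the Y per-frame LBP vectors of one video sub-block into one X-dim vector.

        video_sub_block: array of shape (Y, h, w) -- the Y spatial sub-blocks that share
        the same spatial position inside one time sub-block.
        """
        F = np.stack([lbp_histogram(p) for p in video_sub_block], axis=1)  # X x Y matrix
        _, _, Vt = np.linalg.svd(F, full_matrices=True)  # F = U S V^T, V^T has order Y x Y
        projection = Vt.T[:, 0]                          # first column of the right singular matrix
        return F @ projection                            # X-dimensional lip language feature vector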
FIG. 12 is a schematic structural diagram of a lip language classification device according to an embodiment of the present invention. Referring to FIG. 12, the device includes a memory 1201 and a processor 1202. The memory 1201 is connected to the processor 1202 and stores program code, and the processor 1202 is configured to invoke the program code to perform the following operations:
for each of D preselected sample videos, dividing the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region;
dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, so that M×N video sub-blocks are obtained for the sample video in total, and each time sub-block includes N video sub-blocks;
calculating a lip language feature vector of each video sub-block, where the lip language feature vector describes texture information of the lip region in the corresponding video sub-block; and
training the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos, and stopping when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain the trained multi-layer classifier, where the multi-layer classifier is used to classify semantic information of a video,
where M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
The device provided by this embodiment of the present invention divides a video into M time sub-blocks along the time dimension and divides the lip region of each video frame in each time sub-block into N spatial sub-blocks along the spatial dimension; the spatial sub-blocks at the same position in the video frames of the same time sub-block form one video sub-block, so the video yields M×N video sub-blocks in total, and the X-dimensional lip language feature vector of each video sub-block is then calculated. Different videos therefore yield the same number of video sub-blocks, so the lip language feature vectors finally extracted from different videos have the same dimensionality, which fixes the feature dimension. Because the classification accuracy of the multi-layer classifier is trained from the lip language feature vectors of the video sub-blocks of the sample videos, the feature dimension needs no dynamic adjustment, which simplifies operation and saves training time; when the trained multi-layer classifier is applied to classify a video, no dynamic adjustment of the feature dimension is needed either, which simplifies operation, saves classification time, and improves classification accuracy.
Optionally, the processor 1202 is further configured to invoke the program code to perform the following operations:
Step 1: construct L specified identifiers according to a predetermined rule, where the L specified identifiers are used to determine a corresponding number of hidden layer nodes and the selected video sub-blocks of the D sample videos; L is a positive integer and L > 1.
Step 2: obtain a flag value of each specified identifier, where each flag value includes a flag bit used to indicate the number of hidden layer nodes and a flag bit used to indicate whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier.
Step 3: for the flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos.
Step 4: according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, obtain a global optimal flag value of the L specified identifiers by using a preset search algorithm, and update the flag value of each specified identifier.
Steps 3 and 4 are repeated until the global optimal flag value satisfies a second preset condition.
Optionally, the processor 1202 is further configured to invoke the program code to perform the following operations:
determining the number of hidden layer nodes corresponding to the flag value of the specified identifier and the H selected video sub-blocks of the corresponding D sample videos, where H is a positive integer;
for each of the D sample videos, combining the X-dimensional lip language feature vectors of the H selected video sub-blocks of the sample video to obtain an H×X-dimensional lip language feature vector of the sample video;
combining the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X);
performing singular value decomposition on the feature matrix to obtain a second right singular matrix;
extracting, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and
training the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden layer nodes, where the multi-layer classifier includes at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes (a sketch of this training step follows these operations).
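Read together, the operations above resemble a projection-based extreme-learning-machine step: the right singular vectors of the D×(H×X) sample matrix supply the input layer weights, and only the remaining layer is fitted. The sketch below follows that reading; the sigmoid activation, the pseudo-inverse (least-squares) solution for the output weights, and the choice to let the hidden layer width equal the retained dimension are assumptions, since this passage names only an activation function and a number of hidden layer nodes without fixing how they relate to the retained dimension.

    import numpy as np

    def train_multilayer(features, targets, keep_dims):
        """Hedged sketch: SVD projection as input-layer weights, output weights by least squares.

        features: D x (H*X) matrix, one row per sample video (selected sub-block features).
        targets:  D x C one-hot matrix of semantic classes.
        """
        _, _, Vt = np.linalg.svd(features, full_matrices=False)
        W_in = Vt.T[:, :keep_dims]                    # projection matrix = input-layer weights
        H = 1.0 / (1.0 + np.exp(-(features @ W_in)))  # hidden-layer response (sigmoid assumed)
        W_out = np.linalg.pinv(H) @ targets           # output-layer weights, least squares
        return W_in, W_out

    def accuracy(features, labels, W_in, W_out):
        """Classification accuracy used as the training/fitness criterion."""
        H = 1.0 / (1.0 + np.exp(-(features @ W_in)))
        pred = np.argmax(H @ W_out, axis=1)
        return float(np.mean(pred == labels))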
Optionally, the processor 1202 is further configured to invoke the program code to perform the following operations:
calculating an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier;
obtaining an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier;
calculating a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and
updating the flag value of each specified identifier by using a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value (one possible form of this update is sketched below).
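The combination of an average optimal flag value, per-identifier optimal values, a global optimal value, and local attractors matches the update rule of quantum-behaved particle swarm optimization (QPSO). The "preset update algorithm" is not given explicitly here, so the sketch below shows a generic QPSO-style step adapted to binary flag values; the contraction–expansion coefficient `alpha` and the sigmoid re-binarization are assumptions for illustration.

    import numpy as np

    def qpso_step(flags, personal_best, global_best, alpha=0.75, rng=np.random.default_rng(2)):
        """One update of all L flag values (assumed QPSO-style rule on binary flags).

        flags, personal_best: L x B binary matrices; global_best: length-B binary vector.
        """
        L, B = flags.shape
        mean_best = personal_best.mean(axis=0)                     # average optimal flag value
        phi = rng.random((L, B))
        attractor = phi * personal_best + (1 - phi) * global_best  # local attractor per identifier
        u = rng.random((L, B))
        sign = np.where(rng.random((L, B)) < 0.5, 1.0, -1.0)
        new_pos = attractor + sign * alpha * np.abs(mean_best - flags) * np.log(1.0 / u)
        # map the real-valued position back to a binary flag value
        return (1.0 / (1.0 + np.exp(-new_pos)) > rng.random((L, B))).astype(int)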
Optionally, the processor 1202 is further configured to invoke the program code to perform the following operations:
for each specified identifier, using the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as a fitness value of the specified identifier;
updating the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and
updating the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value (see the sketch after these operations).
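The fitness bookkeeping described above is the standard personal-best / global-best update driven by classification accuracy. A short sketch follows, assuming the flag values and fitness scores are held in NumPy arrays as in the earlier sketches; the array layout and function name are illustrative assumptions.

    import numpy as np

    def update_bests(flags, fitness, pbest, pbest_fit, gbest, gbest_fit):
        """fitness[i] = accuracy of the classifier trained from flags[i] (the identifier's fitness value)."""
        improved = fitness > pbest_fit            # identifiers whose current fitness beats their history
        pbest[improved] = flags[improved]
        pbest_fit[improved] = fitness[improved]
        best = int(np.argmax(pbest_fit))          # best identifier found so far
        if pbest_fit[best] > gbest_fit:
            gbest, gbest_fit = pbest[best].copy(), pbest_fit[best]
        return pbest, pbest_fit, gbest, gbest_fit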
All of the foregoing optional technical solutions may be combined in any manner to form optional embodiments of the present invention, and details are not described herein again.
A person of ordinary skill in the art may understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (16)

  1. A feature extraction method, wherein the method comprises:
    dividing a video into M time sub-blocks according to the chronological order of video frames in the video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region;
    dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the video in total, and each time sub-block comprises N video sub-blocks;
    calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
    combining the X-dimensional lip language feature vectors of the M×N video sub-blocks of the video to obtain an X×M×N-dimensional lip language feature vector of the video,
    wherein M, N, and X are all positive integers, and × denotes a product of numerical values.
  2. The method according to claim 1, wherein the calculating a lip language feature vector of each video sub-block comprises:
    extracting an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, wherein the video sub-block comprises Y spatial sub-blocks;
    combining the LBP feature vectors of the Y spatial sub-blocks to obtain a local texture feature matrix of order X*Y;
    performing singular value decomposition on the local texture feature matrix to obtain a first right singular matrix of order Y*Y;
    extracting the first column vector of the first right singular matrix as a projection vector; and
    calculating the product of the local texture feature matrix and the projection vector to obtain the X-dimensional lip language feature vector of the video sub-block,
    wherein Y is a positive integer, a matrix of order X*Y is a matrix with X rows and Y columns, and a matrix of order Y*Y is a matrix with Y rows and Y columns.
  3. A lip language classification method, wherein the method comprises:
    for each of D preselected sample videos, dividing the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region;
    dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the sample video in total, and each time sub-block comprises N video sub-blocks;
    calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block; and
    training the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos, and stopping when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain the trained multi-layer classifier, wherein the multi-layer classifier is used to classify semantic information of a video,
    wherein M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
  4. The method according to claim 3, wherein the training the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos comprises:
    step 1: constructing L specified identifiers according to a predetermined rule, wherein the L specified identifiers are used to determine a corresponding number of hidden layer nodes and selected video sub-blocks of the D sample videos, L is a positive integer, and L > 1;
    step 2: obtaining a flag value of each specified identifier, wherein each flag value comprises a flag bit used to indicate the number of hidden layer nodes and a flag bit used to indicate whether each video sub-block is selected, different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier;
    step 3: for the flag value of each specified identifier, training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos;
    step 4: obtaining a global optimal flag value of the L specified identifiers by using a preset search algorithm according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, and updating the flag value of each specified identifier; and
    repeating steps 3 and 4 until the global optimal flag value satisfies a second preset condition.
  5. The method according to claim 4, wherein the training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos comprises:
    determining the number of hidden layer nodes corresponding to the flag value of the specified identifier and H selected video sub-blocks of the corresponding D sample videos, wherein H is a positive integer;
    for each of the D sample videos, combining the X-dimensional lip language feature vectors of the H selected video sub-blocks of the sample video to obtain an H×X-dimensional lip language feature vector of the sample video;
    combining the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X);
    performing singular value decomposition on the feature matrix to obtain a second right singular matrix;
    extracting, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and
    training the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden layer nodes, wherein the multi-layer classifier comprises at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes.
  6. The method according to claim 4, wherein the obtaining a global optimal flag value of the L specified identifiers by using a preset search algorithm according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, and updating the flag value of each specified identifier comprises:
    calculating an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier;
    obtaining an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier;
    calculating a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and
    updating the flag value of each specified identifier by using a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
  7. The method according to claim 6, wherein the obtaining an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier comprises:
    for each specified identifier, using the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as a fitness value of the specified identifier;
    updating the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and
    updating the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value.
  8. A feature extraction apparatus, wherein the apparatus comprises:
    a dividing module, configured to divide a video into M time sub-blocks according to the chronological order of video frames in the video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region,
    wherein the dividing module is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and form one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the video in total, and each time sub-block comprises N video sub-blocks;
    a feature calculation module, configured to calculate a lip language feature vector of each video sub-block obtained by the dividing module, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
    a combination module, configured to combine the X-dimensional lip language feature vectors of the M×N video sub-blocks of the video obtained by the feature calculation module, to obtain an X×M×N-dimensional lip language feature vector of the video,
    wherein M, N, and X are all positive integers, and × denotes a product of numerical values.
  9. The apparatus according to claim 8, wherein the feature calculation module comprises:
    an extraction unit, configured to extract an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, wherein the video sub-block comprises Y spatial sub-blocks;
    a combination unit, configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extraction unit, to obtain a local texture feature matrix of order X*Y;
    a decomposition unit, configured to perform singular value decomposition on the local texture feature matrix obtained by the combination unit, to obtain a first right singular matrix of order Y*Y;
    a projection unit, configured to extract the first column vector of the first right singular matrix obtained by the decomposition unit as a projection vector; and
    a calculation unit, configured to calculate the product of the local texture feature matrix obtained by the combination unit and the projection vector obtained by the projection unit, to obtain the X-dimensional lip language feature vector of the video sub-block,
    wherein Y is a positive integer, a matrix of order X*Y is a matrix with X rows and Y columns, and a matrix of order Y*Y is a matrix with Y rows and Y columns.
  10. A lip language classification apparatus, wherein the apparatus comprises:
    a dividing module, configured to: for each of D preselected sample videos, divide the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region,
    wherein the dividing module is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and form one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the sample video in total, and each time sub-block comprises N video sub-blocks;
    a feature calculation module, configured to calculate a lip language feature vector of each video sub-block obtained by the dividing module, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block; and
    a training module, configured to train the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos obtained by the feature calculation module, and stop when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain the trained multi-layer classifier, wherein the multi-layer classifier is used to classify semantic information of a video,
    wherein M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
  11. The apparatus according to claim 10, wherein the training module is configured to perform the following steps:
    step 1: constructing L specified identifiers according to a predetermined rule, wherein the L specified identifiers are used to determine a corresponding number of hidden layer nodes and selected video sub-blocks of the D sample videos, L is a positive integer, and L > 1;
    step 2: obtaining a flag value of each specified identifier, wherein each flag value comprises a flag bit used to indicate the number of hidden layer nodes and a flag bit used to indicate whether each video sub-block is selected, different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier;
    step 3: for the flag value of each specified identifier, training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos;
    step 4: obtaining a global optimal flag value of the L specified identifiers by using a preset search algorithm according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, and updating the flag value of each specified identifier; and
    repeating steps 3 and 4 until the global optimal flag value satisfies a second preset condition.
  12. The apparatus according to claim 11, wherein the training module is further configured to: determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and H selected video sub-blocks of the corresponding D sample videos, wherein H is a positive integer; for each of the D sample videos, combine the X-dimensional lip language feature vectors of the H selected video sub-blocks of the sample video to obtain an H×X-dimensional lip language feature vector of the sample video; combine the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X); perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden layer nodes, wherein the multi-layer classifier comprises at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes.
  13. The apparatus according to claim 11, wherein the training module is further configured to: calculate an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier; obtain an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier; calculate a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and update the flag value of each specified identifier by using a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
  14. The apparatus according to claim 13, wherein the training module is further configured to: for each specified identifier, use the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as a fitness value of the specified identifier; update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and update the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value.
  15. A feature extraction device, wherein the device comprises a memory and a processor, the memory is connected to the processor, the memory stores program code, and the processor is configured to invoke the program code to perform the following operations:
    dividing a video into M time sub-blocks according to the chronological order of video frames in the video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region;
    dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the video in total, and each time sub-block comprises N video sub-blocks;
    calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
    combining the X-dimensional lip language feature vectors of the M×N video sub-blocks of the video to obtain an X×M×N-dimensional lip language feature vector of the video,
    wherein M, N, and X are all positive integers, and × denotes a product of numerical values.
  16. A lip language classification device, wherein the device comprises a memory and a processor, the memory is connected to the processor, the memory stores program code, and the processor is configured to invoke the program code to perform the following operations:
    for each of D preselected sample videos, dividing the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region;
    dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the sample video in total, and each time sub-block comprises N video sub-blocks;
    calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block; and
    training the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos, and stopping when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain the trained multi-layer classifier, wherein the multi-layer classifier is used to classify semantic information of a video,
    wherein M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
PCT/CN2015/081824 2015-06-18 2015-06-18 Feature extraction method, lip-reading classification method, device and apparatus WO2016201679A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/081824 WO2016201679A1 (en) 2015-06-18 2015-06-18 Feature extraction method, lip-reading classification method, device and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/081824 WO2016201679A1 (en) 2015-06-18 2015-06-18 Feature extraction method, lip-reading classification method, device and apparatus

Publications (1)

Publication Number Publication Date
WO2016201679A1 true WO2016201679A1 (en) 2016-12-22

Family

ID=57544670

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/081824 WO2016201679A1 (en) 2015-06-18 2015-06-18 Feature extraction method, lip-reading classification method, device and apparatus

Country Status (1)

Country Link
WO (1) WO2016201679A1 (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020161582A1 (en) * 2001-04-27 2002-10-31 International Business Machines Corporation Method and apparatus for presenting images representative of an utterance with corresponding decoded speech
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation
CN103092329A (en) * 2011-10-31 2013-05-08 南开大学 Lip reading technology based lip language input method
CN104680144A (en) * 2015-03-02 2015-06-03 华为技术有限公司 Lip language recognition method and device based on projection extreme learning machine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ONG, E.J. ET AL.: "Learning Temporal Signatures for Lip Reading", 2011 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, 31 December 2011 (2011-12-31), pages 958 - 965, XP032095352 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598189A (en) * 2018-10-17 2019-04-09 天津大学 A kind of video classification methods based on Feature Dimension Reduction
CN109598189B (en) * 2018-10-17 2023-04-28 天津大学 Feature dimension reduction-based video classification method
CN111460880A (en) * 2019-02-28 2020-07-28 杭州芯影科技有限公司 Multimodal biometric fusion method and system
CN111460880B (en) * 2019-02-28 2024-03-05 杭州芯影科技有限公司 Multimode biological feature fusion method and system
CN110113319A (en) * 2019-04-16 2019-08-09 深圳壹账通智能科技有限公司 Identity identifying method, device, computer equipment and storage medium
CN111988652A (en) * 2019-05-23 2020-11-24 北京地平线机器人技术研发有限公司 Method and device for extracting lip language training data
CN110929239B (en) * 2019-10-30 2021-11-19 中科南京人工智能创新研究院 Terminal unlocking method based on lip language instruction
CN110929239A (en) * 2019-10-30 2020-03-27 中国科学院自动化研究所南京人工智能芯片创新研究院 Terminal unlocking method based on lip language instruction
CN111062277A (en) * 2019-12-03 2020-04-24 东华大学 Sign language-lip language conversion method based on monocular vision
CN111062277B (en) * 2019-12-03 2023-07-11 东华大学 Sign language-lip language conversion method based on monocular vision
CN113076942A (en) * 2020-01-03 2021-07-06 上海依图网络科技有限公司 Method, device, chip and computer readable storage medium for detecting preset mark
CN111582195B (en) * 2020-05-12 2024-01-26 中国矿业大学(北京) Construction method of Chinese lip language monosyllabic recognition classifier
CN111582195A (en) * 2020-05-12 2020-08-25 中国矿业大学(北京) Method for constructing Chinese lip language monosyllabic recognition classifier
CN111612056A (en) * 2020-05-16 2020-09-01 青岛鼎信通讯股份有限公司 Low-pressure customer variation relation identification method based on fuzzy clustering and zero-crossing offset
CN111612056B (en) * 2020-05-16 2023-06-02 青岛鼎信通讯股份有限公司 Low-voltage user variable relation recognition method based on fuzzy clustering and zero crossing offset


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15895252

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15895252

Country of ref document: EP

Kind code of ref document: A1