WO2016201679A1 - Feature extraction method, lip-reading classification method, device and apparatus - Google Patents


Info

Publication number: WO2016201679A1
Authority: WO (WIPO PCT)
Prior art keywords: video, sub-block, blocks, lip
Application number: PCT/CN2015/081824
Other languages: French (fr), Chinese (zh)
Inventors: 左坤隆, 张新曼, 路龙宾
Original Assignee: 华为技术有限公司
Application filed by 华为技术有限公司
Priority to PCT/CN2015/081824
Publication of WO2016201679A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/24 - Speech recognition using non-acoustical features

Definitions

  • the present invention relates to the field of feature recognition, and in particular to a feature extraction method, a lip language classification method, a device, and an apparatus.
  • as a new human-computer interaction method, biometric technology has been widely used in many fields such as identity verification.
  • among these technologies, lip-reading recognition has emerged as a new technique.
  • Feature extraction is an important step in the process of lip language recognition.
  • an existing feature extraction method usually extracts the lip contour from each video frame, expresses the lip contour with several parameters, and linearly combines some of these parameters to obtain the lip language feature of the video.
  • alternatively, the multiple frames of the video are treated as a two-dimensional signal, and an image transform is applied to this signal to obtain the lip language feature of the video.
  • however, the dimension of the lip language feature extracted by the above methods is not fixed.
  • because most classifiers require a fixed feature dimension, the dimension of the lip language feature of a video must be dynamically adjusted when sample videos are used to train the classifier or when the classifier is used to classify a video, which is cumbersome and makes both training and classification very time-consuming.
  • the embodiment of the invention provides a feature extraction method, a lip language classification method, a device, and an apparatus.
  • the technical solution is as follows:
  • a feature extraction method comprising:
  • dividing a video into M time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block,
  • so that the video yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks;
  • where M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • calculating the lip language feature vector of each video sub-block includes:
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
  • a lip language classification method comprising:
  • for each of the pre-selected D sample videos, the sample video is divided into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • the sample video thus yields M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
  • the classification accuracy of a multi-layer classifier is trained according to a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of a video;
  • where M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • training the classification accuracy of the multi-layer classifier according to the lip language feature vectors of the video sub-blocks in the D sample videos and a preset training algorithm includes:
  • Step 1 According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, where each flag value includes flag bits for indicating the number of hidden layer nodes and flag bits for indicating whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
  • Step 4 According to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with that flag value, obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm, and update the flag value of each specified identifier;
  • training the classification accuracy of the multi-layer classifier corresponding to a flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos includes:
  • the X-dimensional lip language feature vectors of the selected H video sub-blocks in the sample video are combined to obtain an H×X-dimensional lip language feature vector of the sample video;
  • a projection matrix is used to represent the weight of the input layer node.
  • obtaining the global optimal flag value of the L specified identifiers according to a preset search algorithm based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with each flag value, and updating the flag value of each specified identifier, includes:
  • the flag value of each specified identifier is updated according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value.
  • obtaining the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier includes:
  • using the classification accuracy of the multi-layer classifier trained with the flag value of a specified identifier as the fitness value of that specified identifier;
  • a feature extraction device comprising:
  • a dividing module configured to divide the video into M time sub-blocks according to a chronological order of video frames in a video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region ;
  • the dividing module is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to form the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total and each time sub-block includes N video sub-blocks;
  • a feature calculation module configured to calculate the lip language feature vector of each video sub-block obtained by the dividing module;
  • the lip language feature vector is used to describe the texture information of the lip region in the corresponding video sub-block, and the lip feature vector of each video sub-block is an X-dimensional vector;
  • a combination module configured to combine the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video obtained by the feature calculation module, to obtain an X×M×N-dimensional lip language feature vector of the video;
  • where M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • the feature calculation module includes:
  • an extracting unit configured to extract an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in a video sub-block, where the video sub-block includes Y spatial sub-blocks;
  • a combining unit configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extracting unit to obtain an X*Y-order local texture feature matrix
  • a decomposition unit configured to perform singular value decomposition on the local texture feature matrix obtained by the combining unit to obtain a Y*Y order first right singular matrix
  • a projection unit configured to extract a first column vector of the first right singular matrix obtained by the decomposition unit, as a projection vector
  • a calculation unit configured to calculate a product of the local texture feature matrix obtained by the combining unit and the projection vector obtained by the projection unit, to obtain an X-dimensional lip language feature vector of the video sub-block;
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
  • a lip language classification device comprising:
  • a dividing module configured to, for each of the pre-selected D sample videos, divide the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • the dividing module is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to form the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the sample video yields M×N video sub-blocks in total and each time sub-block includes N video sub-blocks;
  • a feature calculation module configured to calculate a lip language feature vector of each video sub-block obtained by the dividing module, where the lip language feature vector is used to describe texture information of a lip region in a corresponding video sub-block;
  • a training module configured to train the classification accuracy of a multi-layer classifier according to a preset training algorithm and the lip language feature vectors of the video sub-blocks in the D sample videos obtained by the feature calculation module, until the classification accuracy of the multi-layer classifier meets the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of a video;
  • where M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • the training module is configured to perform the following steps:
  • Step 1 According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, where each flag value includes flag bits for indicating the number of hidden layer nodes and flag bits for indicating whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
  • Step 4 According to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with that flag value, obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm, and update the flag value of each specified identifier;
  • the training module is further configured to determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the H selected video sub-blocks in the corresponding D sample videos, where H is a positive integer; for each of the D sample videos, combine the X-dimensional lip language feature vectors of the selected H video sub-blocks in the sample video to obtain an H×X-dimensional lip language feature vector of the sample video; combine the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of D*(H×X) order; perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to a preset retention dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes,
  • where the multi-layer classifier includes at least an input layer node and at least one hidden layer node, and the projection matrix is used to represent the weight of the input layer nodes.
  • the training module is further configured to calculate the average optimal flag value of the L specified identifiers and the global optimal flag value according to the flag value of each specified identifier, and to calculate a local attractor of each specified identifier; and to update the flag value of each specified identifier according to a preset update algorithm based on the local attractor of each specified identifier and the average optimal flag value.
  • the training module is further configured to, for each specified identifier, use the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as the fitness value of the specified identifier; to update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier; and to update the global optimal flag value, to obtain the updated global optimal flag value.
  • a feature extraction device comprising a memory and a processor, where the memory is coupled to the processor and stores program code, and the processor is configured to invoke the program code to perform the following operations:
  • dividing a video into M time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block,
  • so that the video yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks;
  • where M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • a lip language classification device comprising a memory and a processor, where the memory is coupled to the processor and stores program code, and the processor is configured to invoke the program code to perform the following operations:
  • for each of the pre-selected D sample videos, dividing the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
  • the sample video thus yields M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
  • training the classification accuracy of a multi-layer classifier according to a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies a preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of a video;
  • where M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • the method, the device, and the apparatus provided by the embodiments of the present invention divide a video into M time sub-blocks according to the time dimension, divide the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and form the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks in the video are combined to obtain the lip language feature vector of the video.
  • because the number of video sub-blocks obtained from different videos is the same, the finally extracted lip language feature vectors of different videos have the same dimension, which fixes the feature dimension.
  • as a result, the classification accuracy of the multi-layer classifier can be trained without dynamically adjusting the feature dimension, which simplifies the operation and saves training time; and when the trained multi-layer classifier is used to classify videos, classification time is saved and classification accuracy is improved.
  • FIG. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a lip language classification method according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a feature extraction method according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a lip region of a video frame according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of video blocking according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a pixel neighborhood provided by an embodiment of the present invention.
  • FIG. 7 is a flowchart of a lip language classification method according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a multi-layer classifier according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a lip language classification apparatus according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a feature extraction device according to an embodiment of the present invention.
  • FIG. 12 is a schematic structural diagram of a lip language classification device according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention. Referring to Figure 1, the method includes:
  • Step 101 Divide the video into M time sub-blocks according to a time sequence of video frames in the video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region.
  • Step 102 Divide a lip region of each video frame in each time sub-block into N spatial sub-blocks, and form spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block.
  • the video thus yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks.
  • Step 103 Calculate a lip language feature vector of each video sub-block, where the lip language feature vector is used to describe the texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector.
  • Step 104 Combine the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip language feature vector of the video.
  • M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • the video includes a plurality of video frames, each of which includes a human lip region; by classifying the lip regions of the video frames and determining the category to which the lip regions belong, the content of the person's speech can be determined.
  • M and N are preset values, so the number of video sub-blocks into which each video is divided, M×N, is a fixed value.
  • the lip language feature vectors of different video sub-blocks have the same dimension; if the lip language feature vector of each video sub-block is an X-dimensional vector, then when the lip language feature vectors of the multiple video sub-blocks are combined, the resulting lip language feature vector of the video has a dimension of X×M×N, which is also a fixed value.
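  • for example, using illustrative values that are not taken from the specification: if M = 4, N = 6, and each video sub-block yields an X = 256-dimensional feature vector, then every video, regardless of how many frames it contains, produces a 256 × 4 × 6 = 6144-dimensional lip language feature vector.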
  • the embodiment of the present invention presets, for all videos, the number M of time sub-blocks into which a video is divided and the number N of spatial sub-blocks into which the lip region of each video frame in a time sub-block is divided; therefore, for different videos, the number of video sub-blocks obtained from each video is a fixed value, and the dimension of the lip language feature vector extracted from each video is also a fixed value.
  • the method provided by the embodiment of the present invention divides a video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated.
  • because the number of video sub-blocks obtained from different videos is the same, the finally extracted lip language feature vectors of different videos have the same dimension, which fixes the feature dimension; there is no need to dynamically adjust the feature dimension, which simplifies the operation and saves time.
  • calculating the lip language feature vector of each video sub-block includes:
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
  • FIG. 2 is a flowchart of a lip language classification method according to an embodiment of the present invention. Referring to Figure 2, the method includes:
  • Step 201 For each of the pre-selected D sample videos, divide the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region.
  • Step 202 Divide a lip region of each video frame in each time sub-block into N spatial sub-blocks, and form spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block.
  • the sample video thus yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks.
  • Step 203 Calculate a lip language feature vector of each video sub-block, where the lip language feature vector is used to describe texture information of a lip region in the corresponding video sub-block.
  • Step 204 According to the lip language feature vectors of the video sub-blocks in the D sample videos, train the classification accuracy of a multi-layer classifier using a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of a video.
  • M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • the method provided by the embodiment of the present invention divides a video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; the X-dimensional lip language feature vector of each video sub-block is then calculated.
  • because the number of video sub-blocks obtained from different videos is the same, the dimension of the finally extracted lip language feature vector is the same for all videos, which fixes the feature dimension.
  • training the classification accuracy of the multi-layer classifier according to the lip language feature vectors of the video sub-blocks in the sample videos therefore requires no dynamic adjustment of the feature dimension, which simplifies the operation and saves training time; and when the trained multi-layer classifier is applied to classify videos, no dynamic adjustment of the feature dimension is needed either, which simplifies the operation, saves classification time, and improves classification accuracy.
  • the classification accuracy of the multi-layer classifier is trained according to a preset training algorithm, including:
  • Step 1 According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, where each flag value includes flag bits for indicating the number of hidden layer nodes and flag bits for indicating whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
  • Step 4 According to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with that flag value, obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm, and update the flag value of each specified identifier;
  • training the classification accuracy of the multi-layer classifier corresponding to a flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos includes:
  • combining the X-dimensional lip language feature vectors of the selected H video sub-blocks in the sample video to obtain an H×X-dimensional lip language feature vector of the sample video;
  • the multi-layer classifier includes at least an input layer node and at least one hidden layer node, and the projection matrix is used to represent the weight of the input layer nodes.
  • obtaining the global optimal flag value of the L specified identifiers according to a preset search algorithm based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with each flag value, and updating the flag value of each specified identifier, includes:
  • the flag value of each specified identifier is updated according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value.
  • obtaining the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier includes:
  • using the classification accuracy of the multi-layer classifier trained with the flag value of a specified identifier as the fitness value of that specified identifier, where the fitness value is used to update the global optimal flag value of the L specified identifiers to obtain an updated global optimal flag value.
  • FIG. 3 is a flowchart of a feature extraction method according to an embodiment of the present invention.
  • the execution body of the embodiment of the present invention is a feature extraction device. Referring to FIG. 3, the method includes:
  • Step 301 For each video, the feature extraction device divides the video into M×N video sub-blocks according to the chronological order of the video frames in the video.
  • the feature extraction device may be a device such as a computer or a server, which is not limited in this embodiment of the present invention.
  • the video includes a plurality of video frames, and the plurality of video frames are arranged in chronological order.
  • the feature extraction device may divide the video into M time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame of the video includes a lip region. The feature extraction device then divides the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forms the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that each time sub-block includes N video sub-blocks and the video yields M×N video sub-blocks in total, where M and N are positive integers.
  • the feature extraction device may first divide the video into M time sub-blocks and then locate and segment each video frame to obtain its lip region, dividing the lip region of each video frame into N spatial sub-blocks; alternatively, it may first locate and segment each video frame to obtain its lip region, then divide the obtained lip regions into M time sub-blocks and divide the lip regions in each time sub-block into N spatial sub-blocks. The timing of the locating and segmentation is not limited in the embodiment of the present invention.
  • t represents the time of the video frame, and the coordinate system formed by x and y represents the space in which the video frame is located.
  • the video includes multiple video frames, and each video frame is located and segmented to obtain the lip region of that video frame.
  • the feature extraction device may first divide along the time dimension, grouping the video frames according to their chronological order in the video with at least two video frames per group, so that the video is divided into M time sub-blocks and each time sub-block includes at least two consecutive video frames; it then divides along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks, so that each time sub-block contains multiple spatial sub-blocks. The feature extraction device combines the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, thereby obtaining M×N video sub-blocks.
  • alternatively, the feature extraction device may first divide along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks, and then divide along the time dimension, taking the spatial sub-blocks of at least two video frames as one time sub-block so that the video is divided into M time sub-blocks, and finally combine the spatial sub-blocks located at the same position in the video frames of each time sub-block into one video sub-block, thereby obtaining M×N video sub-blocks.
  • the embodiment of the present invention does not limit this.
  • for example, the feature extraction device may divide the video along the time dimension to obtain M time sub-blocks, where each time sub-block includes a plurality of video frames, then divide along the spatial dimension so that the lip region of each video frame is divided into N spatial sub-blocks, and form the spatial sub-blocks located at the same position in the video frames into one video sub-block; each time sub-block then yields N video sub-blocks, and the video yields M×N video sub-blocks in total.
  • as an illustration, suppose each time sub-block includes the lip regions of three video frames, and the lip region of each video frame is divided into four spatial sub-blocks: the upper-left lip region, the upper-right lip region, the lower-left lip region, and the lower-right lip region. The upper-left lip regions of the three video frames form video sub-block (1), the upper-right lip regions form video sub-block (2), the lower-left lip regions form video sub-block (3), and the lower-right lip regions form video sub-block (4), giving four video sub-blocks.
  • in addition, the feature extraction device may filter or repeat video frames according to the number of frames in the video, so that for a given video the number of video frames included in each time sub-block can be the same. For example, if the video includes ten video frames numbered 1-10 and M is 4, the feature extraction device may group video frames 1-4 into one time sub-block, frames 3-6 into a second time sub-block, frames 5-8 into a third time sub-block, and frames 7-10 into a fourth time sub-block, so that the video is divided into 4 time sub-blocks and each time sub-block includes 4 video frames.
  • the M is referred to as a first preset number
  • N is referred to as a second preset number.
  • the feature extraction device predetermines the first preset number M and the second preset number N; the first preset number M is used to specify the number of time sub-blocks into which each video is divided, and the second preset number N is used to specify the number of spatial sub-blocks into which each video frame in a time sub-block is divided.
  • the first preset number M and the second preset number N may be determined in advance by the feature extraction device according to the accuracy requirement for the video sub-blocks; M may be 4, 5, or another value, and N may be 5, 6, or another value, which is not limited in the embodiment of the present invention.
  • in addition, the feature extraction device may preset the number of video frames included in each time sub-block, for example a third preset number Q of video frames per time sub-block. When performing the division, if the number of frames in the video is greater than the product M×Q of the first preset number and the third preset number, the feature extraction device may filter the video frames and discard a first specified number of video frames, where the first specified number equals the difference between the number of video frames and M×Q, so that the filtered video has exactly M×Q frames; the video frames are then divided into blocks, so that when the video is divided into M time sub-blocks, each time sub-block can contain Q video frames.
  • if the number of frames in the video is smaller than M×Q, the feature extraction device may select a second specified number of video frames, where the second specified number equals the difference between M×Q and the number of video frames, and place each of the selected video frames in two time sub-blocks, so that when the video is divided into M time sub-blocks, each time sub-block can contain Q video frames. Q is a positive integer, and Q may be determined in advance by the feature extraction device according to the accuracy requirement for the video sub-blocks, which is not limited in the embodiment of the present invention.
  • through the above process, the feature extraction device blocks the video to obtain M×N video sub-blocks. By dividing the video along both the time dimension and the spatial dimension, the original video is divided into a set of video sub-blocks, which increases the temporal and spatial information contained in the subsequently extracted features and allows the lip language features to be expressed better.
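  • as a concrete illustration of the blocking in step 301, the following Python/NumPy sketch divides a sequence of already-cropped lip-region images into M×N video sub-blocks; the function name, the fixed-size crops, and the overlapping-window grouping are illustrative assumptions rather than requirements of the specification.

```python
import numpy as np

def divide_into_video_subblocks(lip_frames, M, Q, N_rows, N_cols):
    """Divide a sequence of lip-region images into M x N video sub-blocks.

    lip_frames: array of shape (T, H, W), the cropped lip region of every frame (T >= Q).
    M: number of time sub-blocks; Q: frames kept per time sub-block.
    N_rows, N_cols: spatial grid, so N = N_rows * N_cols spatial sub-blocks per frame.
    Returns a list of M * N video sub-blocks, each of shape (Q, H // N_rows, W // N_cols).
    """
    T, H, W = lip_frames.shape
    # Overlapping start positions so every time sub-block holds exactly Q frames,
    # mirroring the 1-4 / 3-6 / 5-8 / 7-10 grouping in the example above.
    starts = np.linspace(0, T - Q, M).astype(int)

    h, w = H // N_rows, W // N_cols
    sub_blocks = []
    for s in starts:                              # time dimension first
        time_block = lip_frames[s:s + Q]
        for r in range(N_rows):                   # then the spatial dimension
            for c in range(N_cols):
                sub_blocks.append(
                    time_block[:, r * h:(r + 1) * h, c * w:(c + 1) * w])
    return sub_blocks                             # M * N video sub-blocks in total

# Example: 10 frames, M = 4 time sub-blocks of Q = 4 frames, 2 x 2 spatial grid (N = 4)
video = np.random.randint(0, 256, size=(10, 32, 48), dtype=np.uint8)
blocks = divide_into_video_subblocks(video, M=4, Q=4, N_rows=2, N_cols=2)
print(len(blocks), blocks[0].shape)   # 16 (4, 16, 24) -- a fixed count for any video length
```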
  • Step 302 The feature extraction device calculates an X-dimensional lip language feature vector of each video sub-block, and the lip language feature vector is used to describe texture information of a lip region in the corresponding video sub-block.
  • each video sub-block includes at least one spatial sub-block, and each spatial sub-block is actually a part of the lip area in the video frame.
  • this step 302 can include the following steps 302a-302c:
  • Step 302a The feature extraction device extracts an X-dimensional LBP feature vector of each spatial sub-block in the video sub-block; if the video sub-block includes Y spatial sub-blocks, the LBP feature vectors of the Y spatial sub-blocks are combined to obtain an X*Y-order local texture feature matrix.
  • specifically, the feature extraction device may extract the X-dimensional LBP feature vector of each spatial sub-block in the video sub-block, where the LBP feature vector describes the texture information of the lip region in the corresponding spatial sub-block; taking the LBP feature vector of each spatial sub-block as one column, the LBP feature vectors of the Y spatial sub-blocks are combined to obtain an X*Y-order matrix, which is the local texture feature matrix of the video sub-block.
  • Y is a positive integer, and an X*Y-order matrix represents a matrix of X rows and Y columns.
  • to compute the LBP feature value of a pixel, the feature extraction device takes the pixel as the central pixel and acquires each specified pixel adjacent to the central pixel. The pixel value of each specified pixel is compared with the pixel value of the central pixel: if the pixel value of the specified pixel is greater than the pixel value of the central pixel, the feature value of the specified pixel is set to 1; otherwise, the feature value of the specified pixel is set to 0. A binary feature value is thus set for each specified pixel; the feature values of all specified pixels around the central pixel are combined, and the resulting binary number is converted into a decimal number to obtain the LBP feature value of the central pixel.
  • the feature extraction device may acquire the LBP feature value of every pixel in the spatial sub-block, calculate a statistical histogram of the spatial sub-block from these LBP feature values, and normalize the statistical histogram to obtain the LBP feature vector of the spatial sub-block.
  • for example, a pixel is taken as the central pixel and its neighborhood, a 3*3 pixel region, is acquired; the pixel value of each pixel in the neighborhood is as shown in FIG. 6. Comparing the pixel value of each specified pixel around the central pixel with the pixel value of the central pixel gives the feature value of each specified pixel, also shown in FIG. 6.
  • taking the specified pixel in the upper-left corner as the rightmost bit and combining the feature values of the specified pixels in clockwise order, the LBP feature value of the neighborhood is obtained as the binary number 11110001.
  • accordingly, the specified pixels around the central pixel have weights of 1, 2, 4, 8, 16, 32, 64, and 128, respectively.
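  • the LBP computation described above can be sketched as follows in Python/NumPy; the clockwise bit ordering starting from the upper-left neighbor, the 256-bin histogram (X = 256), and the helper name lbp_histogram are illustrative assumptions, not mandated by the specification.

```python
import numpy as np

def lbp_histogram(block):
    """Return the normalized 256-bin LBP histogram of a grayscale spatial sub-block."""
    # Offsets of the 8 specified pixels around the central pixel, visited clockwise;
    # the upper-left neighbor maps to the lowest-order bit (weight 1), as in FIG. 6.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = block.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = block[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = block[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Feature value is 1 where the specified pixel exceeds the central pixel.
        codes |= ((neighbor > center).astype(np.uint8) << bit)
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()          # normalized statistical histogram (X = 256)

# Example: one 16 x 24 spatial sub-block of a lip region
sub_block = np.random.randint(0, 256, size=(16, 24), dtype=np.uint8)
print(lbp_histogram(sub_block).shape)   # (256,)
```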
  • LBP is essentially used to describe the texture information of each spatial sub-block in the video.
  • the extracted LBP features have significant advantages such as rotation invariance and gray-scale invariance, which enhance the robustness of the lip language feature vector to factors such as illumination conditions, rotation, and translation and thus improve classification accuracy; moreover, the LBP feature vector has strong discriminative power and is simple to compute.
  • Step 302b The feature extraction device performs singular value decomposition on the local texture feature matrix to obtain a Y*Y-order first right singular matrix and extracts the first column vector of the first right singular matrix as a projection vector, where a Y*Y-order matrix represents a matrix of Y rows and Y columns.
  • specifically, the feature extraction device may perform singular value decomposition on the local texture feature matrix, take the right singular matrix obtained by the decomposition as the first right singular matrix, and extract the first column vector of the first right singular matrix as the projection vector.
  • the feature extraction device may apply the following formula to perform singular value decomposition on the local texture feature matrix:
  • X = u · s · v^T
  • where X represents the local texture feature matrix, which is an X*Y-order matrix; the matrix u is the left singular matrix obtained by decomposing the local texture feature matrix X; the matrix s is the singular value matrix obtained by decomposing the local texture feature matrix X; and the matrix v is the right singular matrix obtained by decomposing the local texture feature matrix X.
  • the first right singular matrix obtained by performing singular value decomposition on the local texture feature matrix is a Y*Y-order matrix, and the projection vector is a Y-dimensional vector.
  • Step 302c The feature extraction device calculates a product of the local texture feature matrix and the projection vector to obtain an X-dimensional lip language feature vector of the video sub-block.
  • the feature extraction device may calculate the lip language feature vector of the video sub-block by applying the following formula:
  • f_PLBP = X · pVect
  • where X represents the local texture feature matrix, pVect represents the projection vector, i.e. the first column vector of the matrix v, and f_PLBP represents the lip language feature vector.
  • since the local texture feature matrix is an X*Y-order matrix and the projection vector is a Y-dimensional vector, the calculated lip language feature vector is an X-dimensional vector.
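  • putting steps 302a-302c together, the PLBP feature of one video sub-block might be computed as in the following sketch; numpy.linalg.svd and the lbp_histogram helper from the previous sketch are illustrative choices and not part of the specification.

```python
import numpy as np

def plbp_feature(video_sub_block):
    """Compute the X-dimensional PLBP feature vector of one video sub-block.

    video_sub_block: array of shape (Y, h, w) -- Y spatial sub-blocks, one per frame
    of the time sub-block at the same spatial position.
    """
    # Local texture feature matrix: one X-dimensional LBP histogram per column.
    lbp_columns = [lbp_histogram(frame_patch) for frame_patch in video_sub_block]
    local_texture = np.stack(lbp_columns, axis=1)        # X*Y-order matrix

    # Singular value decomposition: local_texture = u @ np.diag(s) @ vt
    u, s, vt = np.linalg.svd(local_texture, full_matrices=False)
    p_vect = vt[0, :]                                     # first column of v (first row of v^T)

    # Project the local texture feature matrix onto the projection vector.
    return local_texture @ p_vect                         # X-dimensional PLBP vector

# Example: a video sub-block made of 4 spatial sub-blocks of size 16 x 24
vsb = np.random.randint(0, 256, size=(4, 16, 24), dtype=np.uint8)
print(plbp_feature(vsb).shape)    # (256,)
```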
  • after the lip language feature vector of each video sub-block has been calculated, the lip language feature vector of the video can be obtained as follows:
  • F = [f_PLBP1, f_PLBP2, ..., f_PLBPm]
  • where f_PLBPi represents the PLBP feature of the i-th video sub-block in the video, m represents the number of video sub-blocks (M×N) into which the video is divided, and F represents the lip language feature vector of the video. Since the number of video sub-blocks into which different videos are divided is fixed, the dimension of the lip language feature vector F of the video is also fixed, and the lip language feature vector can be used to classify the video.
  • in other words, after the local texture feature matrix of the video sub-block has been extracted, singular value decomposition is performed on it, and the first column vector of the resulting right singular matrix is used as the optimal projection vector.
  • the local texture feature matrix is then projected onto this vector, which yields the PLBP (Projection Local Binary Patterns) feature of the video sub-block.
  • the PLBP feature is a method for extracting features of image sequences based on LBP features.
  • its basic principle is to extract the LBP feature vector corresponding to each frame of the image sequence and to combine the feature vectors of all frames into a feature matrix, in which each column corresponds to the feature vector of one frame, so that a feature of fixed dimension can be extracted from image sequences with different numbers of frames.
  • the optimal projection vector is then found, and the feature matrix is projected based on this optimal projection vector.
  • the embodiment of the present invention adopts the idea of extracting features by block, divides video into several video sub-blocks in time and space, extracts PLBP features of video sub-blocks, and finally combines PLBP features of video sub-blocks. Become a new feature and output as the final feature.
  • the application of the blocking technique increases the spatial and temporal information contained in the finally extracted feature vector, which better describes the movement of the lips in the image sequence and greatly facilitates the optimization of the subsequent classifier training algorithm.
  • in the lip language classification stage this significantly improves the lip-reading recognition rate, and it is of great reference value for related video recognition technologies.
  • Step 303 The feature extraction device combines the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip language feature vector of the video.
  • the feature extraction device may acquire the X-dimensional lip language feature vector of each video sub-block in the video and combine the X-dimensional lip language feature vectors of the M×N video sub-blocks,
  • arranging the X-dimensional lip language feature vector of each video sub-block after that of the previous video sub-block, thereby obtaining an X×M×N-dimensional lip language feature vector.
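  • a minimal sketch of step 303, reusing the illustrative helpers introduced in the earlier sketches (divide_into_video_subblocks and plbp_feature are assumed names, not taken from the specification):

```python
import numpy as np

def video_lip_feature(lip_frames, M=4, Q=4, N_rows=2, N_cols=2):
    """Fixed-dimension lip language feature of a whole video (X * M * N values)."""
    blocks = divide_into_video_subblocks(lip_frames, M, Q, N_rows, N_cols)
    # One X-dimensional PLBP vector per video sub-block, concatenated in order.
    return np.concatenate([plbp_feature(b) for b in blocks])

video = np.random.randint(0, 256, size=(10, 32, 48), dtype=np.uint8)
print(video_lip_feature(video).shape)   # (256 * 4 * 4,) = (4096,)
```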
  • a lip language feature vector of any video can be extracted, and the lip language feature vector can be used to classify the semantic information of the video.
  • the feature extraction device may obtain the lip language feature vectors of a plurality of sample videos using the above steps 301-303 and train a classifier according to these lip language feature vectors; whenever the semantic information of a video is to be classified, steps 301-303 are used to obtain the lip language feature vector of that video, and the lip language feature vector is input to the trained classifier to obtain the classification result.
  • the method provided by the embodiment of the present invention divides a video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks located at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated.
  • the feature extraction method provided by the embodiment of the present invention can extract a lip-feature feature vector with a fixed dimension, and the lip-feature feature vector can be used to classify the semantic information of the video.
  • the classification process is detailed in the next embodiment.
  • FIG. 7 is a flowchart of a lip language classification method according to an embodiment of the present invention.
  • the execution subject of the embodiment of the present invention is a classification device. Referring to FIG. 7, the method includes:
  • Step 701 The classification device extracts a lip feature vector of the video sub-block in the pre-selected D sample videos.
  • specifically, the classification device acquires D sample videos in advance, divides each sample video into M×N video sub-blocks, and extracts the lip language feature vector of each video sub-block; the specific process is similar to steps 301-302 above and is not described again here.
  • Step 702 The classification device trains the classification accuracy of a multi-layer classifier according to a preset training algorithm and the lip language feature vectors of the video sub-blocks in the D sample videos, until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; D is a positive integer, D>1.
  • specifically, the classification device may train the classification accuracy of the multi-layer classifier according to the preset training algorithm and the lip language feature vectors of the video sub-blocks in the D sample videos until the classification accuracy of the multi-layer classifier satisfies the first preset condition.
  • the multi-layer classifier obtained by the training can then be used to classify the semantic information of a video to be classified, thereby implementing lip language recognition.
  • the preset training algorithm is determined in advance by the classification device and may be an SVM (Support Vector Machine) classification algorithm, an artificial neural network algorithm, or the like; the first preset condition is used to determine the training target of the multi-layer classifier and may be set according to the requirement on classification accuracy.
  • for example, the first preset condition may be that the current classification accuracy of the multi-layer classifier reaches a preset classification accuracy, that the difference between the current classification accuracy and the previous classification accuracy is less than a preset difference, or that the number of iterations of the training process reaches the maximum number of iterations, which is not limited in the embodiment of the present invention.
  • the classification device may combine the lip language feature vectors of all the video sub-blocks in each sample video into one lip language feature vector; the specific process is similar to step 303 above, in which the feature extraction device combines the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain the X×M×N-dimensional lip language feature vector of the video, and is not described again here.
  • the classification device can combine the lip language feature vectors of the sample videos into a feature matrix and use the feature matrix as the input of the multi-layer classifier, training the classification accuracy of the multi-layer classifier with an ELM (Extreme Learning Machine) algorithm.
  • the multi-layer classifier includes input layer nodes and at least one hidden layer node; the input weights are used to represent the weights of the input layer nodes, and only the number of hidden layer nodes needs to be determined during training.
  • the multi-layer classifier can be trained based on the input weights, the offset term, and the excitation function; when the training is completed, the multi-layer classifier can be determined according to the input weights and output weights of the current training.
  • however, the input weights and offsets of the hidden layer nodes used by the ELM algorithm are obtained by random assignment; random assignment makes the multi-layer classifier unstable on high-dimensional, small-sample problems, and it is difficult to obtain optimal parameter values.
  • therefore, the classification device may determine the input weights of the multi-layer classifier by means of projection and train the classification accuracy of the multi-layer classifier based on the determined input weights; such a classifier is referred to as a PELM (Projection Extreme Learning Machine) multi-layer classifier.
  • FIG. 8 is a structural diagram of the PELM multi-layer classifier. The PELM multi-layer classifier includes input nodes, hidden layer nodes, and output nodes, where D represents the dimension of the input feature vector, N is the number of hidden layer nodes, and m is the dimension of the output vector (i.e., the number of categories of spoken content to be distinguished).
  • each row of the input corresponds to one input sample (the feature vector of a video sub-block);
  • each row of the output represents the class vector of a sample (the position corresponding to the class that the sample belongs to is 1, and the remaining positions are 0);
  • w_DN represents the input weight from the D-th input node to the N-th hidden layer node, and β_Nm represents the output weight from the N-th hidden layer node to the m-th output node.
  • the training process of the PELM multi-layer classifier may be as follows: the classification device acquires the lip language feature vectors of a plurality of sample videos and combines them into a feature matrix; performs singular value decomposition on the feature matrix and takes the right singular matrix obtained by the decomposition; extracts from the right singular matrix the column vectors corresponding to a preset retention dimension as a projection matrix; uses the projection matrix as the input weights of the multi-layer classifier, i.e. the projection matrix represents the weights of the input layer nodes and the input weights are no longer randomly assigned; and trains the classification accuracy of the multi-layer classifier based on the projection matrix, the current number of hidden layer nodes, and the excitation function.
  • when training is completed, the multi-layer classifier is determined according to the trained input weights and output weights, so that it can be used to classify the semantic information of videos.
  • the preset retention dimension is used to specify the number of columns of the projection matrix; it is less than the number of columns of the right singular matrix and may be 1, 2, or another value, which is not limited in the embodiment of the present invention.
  • for example, the classification device acquires the lip language feature vectors of D sample videos; if the lip language feature vector of each sample video is an R-dimensional vector, the R-dimensional lip language feature vectors of the D sample videos are combined to obtain a feature matrix, based on which the PELM multi-layer classifier is trained.
  • the category corresponding to the maximum value among the outputs t_1, t_2, ..., t_m is the classification result of the video.
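  • the following sketch shows one way the PELM training and classification described above could look in Python/NumPy; the sigmoid excitation function, the Moore-Penrose pseudo-inverse for the output weights, and tying the number of hidden layer nodes to the retention dimension are illustrative simplifications consistent with standard ELM practice, not an exact reproduction of the patented algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_pelm(features, labels, retain_dim):
    """Train a minimal PELM-style single-hidden-layer classifier.

    features: (D, R) feature matrix, one combined lip language feature vector per sample video.
    labels:   (D,) integer class labels in 0..m-1.
    For simplicity this sketch uses one hidden layer node per retained dimension.
    """
    m = labels.max() + 1
    targets = np.eye(m)[labels]                      # one-hot class vectors, shape (D, m)

    # Projection matrix taken from the SVD of the feature matrix replaces
    # the randomly assigned input weights of a plain ELM.
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    w_in = vt[:retain_dim, :].T                      # (R, retain_dim) input weights

    hidden = sigmoid(features @ w_in)                # excitation function on the hidden layer
    beta = np.linalg.pinv(hidden) @ targets          # output weights by least squares
    return w_in, beta

def classify_pelm(w_in, beta, feature):
    t = sigmoid(feature @ w_in) @ beta               # outputs t_1 .. t_m
    return int(np.argmax(t))                         # category with the maximum output

# Example with random stand-in data: 40 sample videos, 4096-dim features, 5 classes
X = np.random.randn(40, 4096)
y = np.random.randint(0, 5, size=40)
w_in, beta = train_pelm(X, y, retain_dim=20)
print(classify_pelm(w_in, beta, X[0]))
```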
  • PELM is an easy-to-use and effective single-hidden-layer neural network learning algorithm. Traditional neural network learning algorithms require a large number of network training parameters to be set manually and easily produce local optimal solutions; in contrast, PELM only needs the projection matrix of the feature matrix composed of the lip language feature vectors of multiple sample videos, used as the input weights of the network, together with the number of hidden layer nodes. The input weights of the network and the biases of the hidden layer nodes do not need to be adjusted during the execution of the algorithm, and a unique optimal solution can be generated.
  • PELM therefore has the advantages of fast learning speed and good generalization performance, so that the trained multi-layer classifier can obtain a stable recognition rate.
  • The classification device may also adopt a preset selection policy to select, from the video sub-blocks of the multiple sample videos, the video sub-blocks that describe the lip-speech information well, recombine the selected video sub-blocks, and take only the selected video sub-blocks as samples; these samples can then be used to train the classification accuracy of the multi-layer classifier, which reduces redundant features and improves the calculation speed.
  • the number of selected video sub-blocks in different sample videos is the same to ensure that the dimension of the lip-feature vector of different sample videos is fixed.
  • the preset selection policy is used to determine the selection mode of the video sub-block, which may be determined in advance by the classification device, which is not limited by the embodiment of the present invention.
  • the classifying device may select a video sub-block in the sample video by using the training method provided in the following steps 702a-702c, and train a multi-layer classifier based on the lip feature vector of the selected video sub-block:
  • Step 702a Constructing L designated identifiers according to a predetermined rule, and acquiring a flag value of each specified identifier, where the L specified identifiers are used to determine the corresponding number of hidden layer nodes and the selected video sub-blocks in the D sample videos.
  • The classification device can construct L designated identifiers according to a predetermined rule, where L is a positive integer and L>1. Then, according to different numbers of hidden layer nodes and different selection modes of the video sub-blocks, each specified identifier is initialized and assigned a flag value; a flag value may be randomly assigned to each specified identifier, or a preset flag value, such as 0000 or 1111, may be assigned to each specified identifier, which is not limited by the embodiment of the present invention.
  • Each flag value includes flag bits for indicating the number of hidden layer nodes and flag bits for indicating whether each video sub-block is selected. Different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks; that is, each flag value corresponds to a number of hidden layer nodes and to the selected video sub-blocks in the D sample videos, so the corresponding number of hidden layer nodes and the video sub-blocks to be selected in each sample video can be determined from each flag value.
  • the classification device may use 10 flag bits as the flag bits indicating the number of hidden layer nodes, and the decimal value corresponding to the binary number formed by the 10 flag bits is the number of hidden layer nodes.
  • The classification device may also use m flag bits as the flag bits indicating whether each video sub-block is selected, where m is the number of video sub-blocks in each sample video. When one of the m flag bits is 1, it indicates that the video sub-block corresponding to that flag bit is selected; when one of the m flag bits is 0, it indicates that the video sub-block corresponding to that flag bit is not selected. For example, when the m flag bits are 1000, the first video sub-block in each video is selected.
  • the classification device takes the binary number composed of the 10+m flag bits as a flag value, 10+m represents the number of bits of the flag value, and assigns the flag value to a specified identifier.
  • Step 702b For each flag value of the specified identifier, according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vector of the selected video sub-block in the corresponding D sample videos, corresponding to the flag value The classification accuracy of the multi-layer classifier is trained.
  • Each flag value is used to train a corresponding multi-layer classifier. For each specified identifier, when the flag value of the specified identifier is obtained, the number of hidden layer nodes corresponding to the flag value and the selected video sub-blocks in the corresponding D sample videos can be determined, and the multi-layer classifier corresponding to the flag value can be trained; the flag values of the L specified identifiers can thus train L multi-layer classifiers.
  • the step 702b may specifically include the following steps 702b-1 to 702b-4:
  • Step 702b-1 Determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the selected H video sub-blocks in the corresponding D sample videos, where H is a positive integer.
  • the classifying device may determine the number of hidden layer nodes corresponding to the flag value and the selected H video sub-blocks of the corresponding D sample videos according to values on the plurality of flag bits in the flag value.
  • For example, if a flag value includes 10+m flag bits, the classification device calculates the decimal value corresponding to the first 10 flag bits of the flag value, which is the number of hidden layer nodes corresponding to the flag value, and obtains the flag bits having a value of 1 among the last m flag bits; the video sub-blocks corresponding to the flag bits having a value of 1 are the selected video sub-blocks in each sample video.
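  • A minimal sketch of decoding such a flag value, with assumed helper names, could look as follows:

```python
# Illustrative decoding of a 10+m-bit flag value (names are assumptions).
def decode_flag(flag_bits: str, m: int):
    """flag_bits: a binary string of length 10 + m; m: number of video sub-blocks."""
    hidden_nodes = int(flag_bits[:10], 2)       # first 10 bits as a decimal value
    selection = flag_bits[10:10 + m]
    selected = [i for i, bit in enumerate(selection) if bit == '1']
    return hidden_nodes, selected               # node count and selected sub-block indices
```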
  • Step 702b-2: For each sample video in the D sample videos, combine the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video to obtain the H×X-dimensional lip feature vector of the sample video, and combine the H×X-dimensional lip feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X).
  • In this case, the classification device trains the multi-layer classifier only according to the selected video sub-blocks. Therefore, for each sample video in the D sample videos, the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video are combined to obtain the H×X-dimensional lip feature vector of the sample video, and the lip feature vectors of all the video sub-blocks in the sample video are no longer combined. An H×X-dimensional lip feature vector is thus obtained for each sample video. Taking the H×X-dimensional lip feature vector of each sample video as one row, the H×X-dimensional lip feature vectors of the D sample videos are combined to obtain a feature matrix of order D*(H×X).
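  • Assuming NumPy and illustrative names, assembling this feature matrix can be sketched as:

```python
# Illustrative assembly of the D x (H*X) feature matrix from selected sub-blocks.
import numpy as np

def build_feature_matrix(videos, selected):
    """videos: list of D arrays, each of shape (num_sub_blocks, X), one X-dimensional
    lip feature vector per video sub-block; selected: indices of the H chosen sub-blocks."""
    rows = [video[selected, :].reshape(-1) for video in videos]  # each row is H*X-dimensional
    return np.vstack(rows)                                       # D x (H*X) feature matrix
```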
  • Step 702b-3 performing singular value decomposition on the feature matrix to obtain a second right singular matrix, and extracting, from the second right singular matrix, a column vector corresponding to the preset retention dimension as a projection matrix.
  • After acquiring the feature matrix, the classification device performs singular value decomposition on the feature matrix to obtain the right singular matrix, taken as the second right singular matrix, and extracts, from the second right singular matrix, the column vectors corresponding to the preset retention dimension as the projection matrix.
  • The projection process is similar to the above step 302b, and details are not described herein again.
  • Step 702b-4: Train the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes; the multi-layer classifier includes at least input layer nodes and at least one hidden layer node, and the projection matrix is used to represent the weights of the input layer nodes.
  • After calculating the projection matrix, the classification device represents the weights of the input layer nodes of the multi-layer classifier by the projection matrix, and trains the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes, until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops.
  • For example, the classification device may use the D sample videos as training sample videos, obtain the projection matrix corresponding to the D sample videos, and train a multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes. It then acquires D' test sample videos, obtains the feature matrix corresponding to the D' test sample videos, inputs the feature matrix into the multi-layer classifier to obtain the classification result of each test sample video, compares the classification result of each test sample video with the category to which the test sample video actually belongs, and calculates the classification accuracy of the multi-layer classifier.
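  • A minimal sketch of this accuracy estimate, reusing the illustrative classify_pelm helper sketched earlier, could be:

```python
# Illustrative classification-accuracy estimate on D' test sample videos.
# classify_pelm is the assumed helper from the PELM sketch above.
import numpy as np

def classification_accuracy(X_test, labels, W, beta):
    """X_test: one test lip feature vector per row; labels: the true categories."""
    predictions = np.array([classify_pelm(x, W, beta) for x in X_test])
    return float(np.mean(predictions == np.array(labels)))
```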
  • Step 702c: According to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with that flag value, obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm, and update the flag value of each specified identifier.
  • The flag value of each specified identifier trains one multi-layer classifier. The classification device can search the flag values of the L specified identifiers according to a preset search algorithm to find the global optimal flag value of the L specified identifiers, thereby obtaining the multi-layer classifier corresponding to the global optimal flag value.
  • the preset search algorithm may be determined in advance by the classification device, which is not limited by the embodiment of the present invention.
  • the classification device may obtain the global optimal flag value by using the search methods provided in the following steps 702c-1 to 702c-4:
  • Step 702c-1: Calculate the average optimal flag value of the L specified identifiers according to the flag value of each specified identifier.
  • The classification device may calculate the average optimal flag value of the L specified identifiers according to the flag value of each specified identifier by applying the following formula, evaluated bit by bit over the n flag bits:
  • m_best(t) = (1/L) * Σ_{i=1}^{L} P_i(t)
  • where n is the number of dimensions of the flag value of a specified identifier and P_i,n(t) represents the value of the n-th flag bit in the flag value of the i-th specified identifier.
  • Step 702c-2 Obtain a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier, and obtain an optimal flag value of each specified identifier and the global optimal flag value.
  • For each specified identifier, the classification device may use the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as the fitness value of the specified identifier, and use the fitness values to adjust and evolve the flag values of the L specified identifiers: according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, the optimal flag value of the specified identifier is updated to obtain the updated optimal flag value of the specified identifier; according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, the global optimal flag value of the L specified identifiers is updated to obtain the updated global optimal flag value.
  • The optimal fitness value of a specified identifier refers to the maximum fitness value among the fitness values corresponding to the plurality of flag values taken by the specified identifier, and the optimal flag value of a specified identifier refers to, over multiple iterations, the flag value with the largest fitness value among the plurality of flag values taken by the specified identifier.
  • Specifically, the classification device obtains the current fitness value and the historical optimal fitness value of the specified identifier; if the current fitness value of the specified identifier is greater than the historical optimal fitness value of the specified identifier, the current flag value of the specified identifier is taken as the optimal flag value of the specified identifier; if the current fitness value of the specified identifier is not greater than the historical optimal fitness value of the specified identifier, the optimal flag value of the specified identifier remains unchanged, that is, it remains the flag value corresponding to the historical optimal fitness value of the specified identifier.
  • The global optimal fitness value of the L specified identifiers refers to the maximum among the optimal fitness values of the L specified identifiers, and the global optimal flag value of the L specified identifiers refers to, over multiple iterations, the flag value with the highest fitness value among all the flag values taken by the specified identifiers.
  • Similarly, the classification device obtains the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers; if the current fitness value of the specified identifier is greater than the historical global optimal fitness value of the L specified identifiers, the current flag value of the specified identifier is taken as the global optimal flag value of the L specified identifiers; if the current fitness value of the specified identifier is not greater than the historical global optimal fitness value of the L specified identifiers, the global optimal flag value of the L specified identifiers remains unchanged, that is, it remains the flag value corresponding to the historical global optimal fitness value of the L specified identifiers.
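  • As a sketch only (assuming NumPy arrays and illustrative names), the update of the optimal and global optimal flag values from the fitness values could be written as:

```python
# Illustrative update of per-identifier optimal flag values and the global optimum.
import numpy as np

def update_bests(flags, fitness, best_flags, best_fitness, g_flag, g_fitness):
    """flags, best_flags: L x n 0/1 arrays; fitness[i]: classification accuracy
    of the classifier trained with flags[i] (the fitness value)."""
    improved = fitness > best_fitness
    best_flags[improved] = flags[improved]      # update each identifier's optimal flag value
    best_fitness[improved] = fitness[improved]
    i = int(np.argmax(fitness))
    if fitness[i] > g_fitness:                  # update the global optimal flag value
        g_flag, g_fitness = flags[i].copy(), float(fitness[i])
    return best_flags, best_fitness, g_flag, g_fitness
```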
  • Step 702c-3 Calculate a local attractor of each specified identifier according to an optimal flag value of each specified identifier and the global optimal flag value.
  • The classification device may calculate the local attractor of the specified identifier according to the optimal flag value of the specified identifier and the global optimal flag value, for example as a random convex combination of the two, p(t) = φ * P_i(t) + (1 - φ) * P_g(t) with φ taken randomly from (0, 1), where
  • p(t) represents the local attractor of the specified identifier,
  • P_i(t) represents the optimal flag value of the specified identifier, and
  • P_g(t) represents the global optimal flag value of the L specified identifiers.
  • Step 702c-4 Update the flag value of each specified identifier according to a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
  • The classification device may apply the following formula to update the flag value of the specified identifier according to the local attractor of the specified identifier and the average optimal flag value:
  • x(t+1) = p(t) ± α * abs(m_best - x(t)) * ln(1/u);
  • where x(t+1) represents the updated flag value of the specified identifier, p(t) represents the local attractor of the specified identifier, α represents the contraction-expansion coefficient, m_best represents the average optimal flag value, x(t) represents the flag value of the specified identifier before the update, and u is a random number uniformly distributed on (0, 1), u ~ U(0, 1).
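  • A minimal sketch of one flag-value update step is given below, assuming NumPy; the random convex combination used for the local attractor, the default contraction-expansion coefficient, and the thresholding back to 0/1 bits are illustrative assumptions rather than details stated in the text.

```python
# Illustrative BQPSO-style update of the flag values (binary arrays).
import numpy as np

def update_flags(flags, personal_best, global_best, alpha=0.75):
    """flags, personal_best: L x n 0/1 arrays; global_best: length-n 0/1 array.
    alpha: contraction-expansion coefficient (value is an assumption)."""
    m_best = flags.mean(axis=0)                           # average optimal flag value (step 702c-1)
    phi = np.random.rand(*flags.shape)
    p = phi * personal_best + (1.0 - phi) * global_best   # local attractor (assumed form)
    u = np.random.rand(*flags.shape)
    sign = np.where(np.random.rand(*flags.shape) < 0.5, 1.0, -1.0)
    x_new = p + sign * alpha * np.abs(m_best - flags) * np.log(1.0 / u)
    return (x_new > 0.5).astype(int)                      # map back to binary flag bits (assumption)
```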
  • After the flag value of each specified identifier is updated, the flag value of each specified identifier changes, and the classification device repeatedly performs the above steps 702b to 702c based on the updated flag value of each specified identifier; when the calculated global optimal flag value satisfies the second preset condition, the iteration stops and the multi-layer classifier trained with the global optimal flag value is acquired.
  • The second preset condition may include at least one of the following: the global optimal flag value reaches a preset optimal flag value; the global optimal fitness value corresponding to the global optimal flag value reaches a maximum fitness value; the difference between the global optimal fitness value corresponding to the global optimal flag value and the global optimal fitness value obtained in the previous iteration is smaller than a preset difference; the current number of iterations reaches the maximum number of iterations; or the classification accuracy of the multi-layer classifier trained with the global optimal flag value reaches a preset accuracy rate. This is not limited by the embodiment of the present invention.
  • Step 703 The classification device classifies the semantic information of the video to be classified based on the multi-layer classifier.
  • After finding the global optimal flag value, the classification device acquires the multi-layer classifier trained with the global optimal flag value and obtains the input weights and output weights of the multi-layer classifier; the multi-layer classifier can then be used to classify the semantic information of videos.
  • The classification device acquires the video to be classified, extracts the lip feature vector of each video sub-block in the video by using the method shown in the above steps 301-303, inputs the lip feature vectors of the video sub-blocks into the multi-layer classifier, and calculates the classification result, thereby realizing the lip recognition process for the video and obtaining the semantic information of the video.
  • The method provided by the embodiment of the present invention divides a video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that the video yields M×N video sub-blocks in total; it then calculates the lip feature vector of each video sub-block and combines the lip feature vectors of the M×N video sub-blocks in the video to obtain the lip feature vector of the video. The number of video sub-blocks obtained from different videos is the same, so the dimension of the finally extracted lip feature vector is the same for all videos and the feature dimension is fixed; the feature dimension does not need to be adjusted dynamically during classifier training, which simplifies the operation and saves training time, and when the trained multi-layer classifier is applied, the feature dimension likewise does not need to be adjusted dynamically, which simplifies the operation, saves classification time, and improves the classification accuracy.
  • In addition, the local texture feature matrix is projected to obtain the lip feature vector, which enhances the robustness of the lip feature.
  • Furthermore, the video sub-blocks are selected by using a preset selection strategy, and the video sub-blocks capable of describing the lip-speech information well are selected, which overcomes the drawback that the video sub-blocks of the block PLBP feature carry different amounts of lip-speech information, reduces redundant features, and improves the computational speed and the classification accuracy.
  • The present invention innovatively proposes a video segmentation method and a description operator, PLBP, which can ensure the integrity of information, effectively capture the spatial and temporal information in the video, and describe the video with a feature of fixed dimension that represents the spatio-temporal characteristics well; this greatly facilitates the optimization of the subsequent recognition algorithm, so that the subsequent lip-language recognition stage achieves a significantly improved lip-reading recognition rate.
  • The embodiment of the invention adopts a combination of the BQPSO (Binary Quantum Particle Swarm Optimization) algorithm and the PELM algorithm: the constructed specified identifiers are used as particles, BQPSO is used as the search algorithm for selecting the feature combination, and the classification accuracy of the PELM multi-layer classifier on the sample videos is used as the fitness value; the flag values of the L specified identifiers are continuously adjusted to search for the flag value that optimizes the fitness value, thereby determining the feature combination that optimizes the fitness value. The video sub-blocks are selected by BQPSO and the PELM multi-layer classifier is trained and used for classification, which can significantly improve the speed of sample training in the lip-speech recognition process and achieve a higher recognition rate; this enhances the applicability of lip recognition in mobile terminals and provides a reference for the application of other biometric technologies on mobile terminals.
  • The HMM (Hidden Markov Model) algorithm of the prior art and the method provided by the embodiment of the present invention were used for experiments. A total of 20 commands were used in the experiments; for each experimental command, 5 samples were used as training samples and 5 samples were used as test samples, so a total of 100 training samples and 100 test samples were obtained.
  • The training time used when training with the HMM algorithm of the prior art on the 100 training samples and the training time used when applying the method provided by the embodiment of the present invention to the 100 training samples are shown in Table 1 below.
  • The classification accuracy obtained when classifying the 100 test samples with the classifier trained by the HMM algorithm of the prior art and the classification accuracy obtained when classifying the 100 test samples with the multi-layer classifier trained by the method provided by the embodiment of the present invention are shown in Table 2 below.
  • As shown in Table 1 below, the training time of the method provided by the embodiment of the present invention is less than 0.1 s, while the training time of the traditional HMM algorithm is as long as 4.538 s. It can be seen from Table 2 below that the average classification accuracy of the method provided by the embodiment of the present invention is as high as 97.2%, while the average classification accuracy of the traditional HMM algorithm is only 84.5%.
  • FIG. 9 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present invention.
  • the apparatus includes:
  • the dividing module 901 is configured to divide the video into a first preset number of time sub-blocks according to a chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames, each video frame Including the lip area;
  • The dividing module 901 is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to form the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block; the video obtains a total of M×N video sub-blocks, and each time sub-block includes N video sub-blocks.
  • The feature calculation module 902 is configured to calculate the lip feature vector of each video sub-block obtained by the dividing module 901, where the lip feature vector is used to describe the texture information of the lip region in the corresponding video sub-block, and the lip feature vector of each video sub-block is an X-dimensional vector.
  • the combination module 903 is configured to combine the lip feature vectors of the plurality of video sub-blocks in the video obtained by the feature calculation module 902 to obtain a lip feature vector of the video.
  • The device provided by the embodiment of the present invention divides the video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that the video obtains a total of M×N video sub-blocks; it then calculates the lip feature vector of each video sub-block. The number of video sub-blocks obtained from different videos is the same, so the dimension of the lip feature vector finally extracted from the video is the same, which fixes the feature dimension; the feature dimension does not need to be adjusted dynamically, which simplifies the operation and saves time.
  • the feature calculation module 902 includes:
  • An extracting unit configured to extract an X-dimensional local binary mode LBP feature vector of each spatial sub-block in a video sub-block, where the one video sub-block includes Y spatial sub-blocks;
  • a combining unit configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extracting unit to obtain an X*Y-order local texture feature matrix
  • a decomposition unit configured to perform singular value decomposition on the local texture feature matrix obtained by the combining unit to obtain a Y*Y order first right singular matrix
  • a projection unit configured to extract a first column vector of the first right singular matrix obtained by the decomposition unit, as a projection vector
  • a calculation unit configured to calculate a product of the local texture feature matrix obtained by the combining unit and the projection vector obtained by the projection unit, to obtain an X-dimensional lip language feature vector of the video sub-block;
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
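  • A minimal sketch of the pipeline implemented by the units listed above is given below, assuming NumPy; lbp_histogram is an assumed helper standing in for a standard LBP extractor that returns the X-dimensional LBP feature vector of one spatial sub-block.

```python
# Illustrative per-video-sub-block feature computation (PLBP-style), assuming NumPy.
import numpy as np

def sub_block_feature(spatial_sub_blocks):
    """spatial_sub_blocks: the Y spatial sub-blocks of one video sub-block."""
    # X*Y local texture feature matrix: one X-dimensional LBP vector per column.
    F = np.column_stack([lbp_histogram(block) for block in spatial_sub_blocks])
    _, _, vt = np.linalg.svd(F, full_matrices=False)
    projection = vt.T[:, 0]          # first column vector of the first right singular matrix
    return F @ projection            # X-dimensional lip feature vector of the video sub-block
```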
  • FIG. 10 is a schematic structural diagram of a lip language sorting apparatus according to an embodiment of the present invention.
  • the apparatus includes:
  • The dividing module 1001 is configured to, for each sample video of the preselected D sample videos, divide the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region.
  • The dividing module 1001 is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to form the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block; the sample video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks.
  • a feature calculation module 1002 configured to calculate a lip language feature vector of each video sub-block obtained by the dividing module 1001, where the lip language feature vector is used to describe texture information of a lip region in a corresponding video sub-block;
  • The training module 1003 is configured to train, according to the lip feature vectors of the video sub-blocks in the D sample videos obtained by the feature calculation module 1002, the classification accuracy of the multi-layer classifier according to a preset training algorithm, until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of videos.
  • M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • The device provided by the embodiment of the present invention divides the video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block; the video obtains a total of M×N video sub-blocks, and the X-dimensional lip feature vector of each video sub-block is then calculated. The number of video sub-blocks obtained from different videos is the same, so the dimension of the finally extracted lip feature vector of the video is the same and the feature dimension is fixed. The classification accuracy of the multi-layer classifier is trained according to the lip feature vectors of the video sub-blocks in the sample videos without dynamic adjustment of the feature dimension, which simplifies the operation and saves training time; when the trained multi-layer classifier is applied to classify videos, the feature dimension likewise does not need to be adjusted dynamically, which simplifies the operation, saves classification time, and improves the classification accuracy.
  • the training module 1003 is configured to perform the following steps:
  • Step 1 According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, and each flag value includes a flag bit for indicating the number of hidden layer nodes and a flag bit for indicating whether each video sub-block is selected, and a different flag value corresponds to the hidden The number of layer nodes is different or the corresponding selected video sub-blocks are different, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of the specified identifier, according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vector of the selected video sub-block in the corresponding D sample videos, the flag value is Training the classification accuracy of the corresponding multi-layer classifier;
  • Step 4 Obtain a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier and the flag value of each specified identifier, and obtain the global optimal flag value of the L designated identifiers according to a preset search algorithm. And update the flag value of each specified identifier;
  • The training module 1003 is further configured to: determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the selected H video sub-blocks in the corresponding D sample videos, where H is a positive integer; for each sample video in the D sample videos, combine the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video to obtain the H×X-dimensional lip feature vector of the sample video; combine the H×X-dimensional lip feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X); perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to the preset retention dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes, where the multi-layer classifier includes at least input layer nodes and at least one hidden layer node and the projection matrix is used to represent the weights of the input layer nodes.
  • The training module 1003 is further configured to: calculate, according to the flag value of each specified identifier, the average optimal flag value of the L specified identifiers; obtain, according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier, the optimal flag value of each specified identifier and the global optimal flag value; calculate the local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and update the flag value of each specified identifier according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value.
  • The training module 1003 is further configured to: for each specified identifier, use the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as the fitness value of the specified identifier; update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, to obtain the updated optimal flag value of the specified identifier; and update the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, to obtain the updated global optimal flag value.
  • FIG. 11 is a schematic structural diagram of a feature extraction device according to an embodiment of the present invention.
  • the device includes: a memory 1101 and a processor 1102.
  • The memory 1101 is connected to the processor 1102, and the memory 1101 stores program code; the processor 1102 is configured to invoke the program code and perform the following operations:
  • each time sub-block includes at least two consecutive video frames, each of which includes a lip region;
  • each time sub-block Dividing a lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block,
  • the video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
  • M, N, and X are positive integers, and × is used to represent the product of numerical values.
  • The device provided by the embodiment of the present invention divides the video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that the video obtains a total of M×N video sub-blocks; it then calculates the lip feature vector of each video sub-block. The number of video sub-blocks obtained from different videos is the same, so the dimension of the lip feature vector finally extracted from the video is the same, which fixes the feature dimension; the feature dimension does not need to be adjusted dynamically, which simplifies the operation and saves time.
  • processor 1102 is further configured to invoke the program code, and perform the following operations:
  • the X*Y order matrix represents a matrix of X rows and Y columns
  • the Y*Y order matrix represents a matrix of Y rows and Y columns
  • FIG. 12 is a schematic structural diagram of a lip language sorting apparatus according to an embodiment of the present invention.
  • the apparatus includes: a memory 1201 and a processor 1202.
  • The memory 1201 is connected to the processor 1202, and the memory 1201 stores program code; the processor 1202 is configured to invoke the program code and perform the following operations:
  • For each of the preselected D sample videos, the sample video is divided into M time sub-blocks according to the chronological order of the video frames in the sample video, and each time sub-block includes at least two consecutive video frames, each of which includes a lip region;
  • the sample video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
  • the classification accuracy of the multi-layer classifier is trained according to a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies the first preset condition, at which point training stops and the trained multi-layer classifier is obtained; the multi-layer classifier is used to classify the semantic information of the video;
  • M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
  • The device provided by the embodiment of the present invention divides the video into M time sub-blocks according to the time dimension, divides the lip region of each video frame in each time sub-block into N spatial sub-blocks according to the spatial dimension, and forms the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block; the video obtains a total of M×N video sub-blocks, and the X-dimensional lip feature vector of each video sub-block is then calculated. The number of video sub-blocks obtained from different videos is the same, so the dimension of the finally extracted lip feature vector of the video is the same and the feature dimension is fixed. The classification accuracy of the multi-layer classifier is trained according to the lip feature vectors of the video sub-blocks in the sample videos without dynamic adjustment of the feature dimension, which simplifies the operation and saves training time; when the trained multi-layer classifier is applied to classify videos, the feature dimension likewise does not need to be adjusted dynamically, which simplifies the operation, saves classification time, and improves the classification accuracy.
  • processor 1202 is further configured to invoke the program code, and perform the following operations:
  • Step 1 According to a predetermined rule, construct L designated identifiers; the L designated identifiers are used to determine the corresponding number of hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
  • Step 2 Obtain a flag value for each specified identifier, and each flag value includes a flag bit for indicating the number of hidden layer nodes and a flag bit for indicating whether each video sub-block is selected, and a different flag value corresponds to the hidden The number of layer nodes is different or the corresponding selected video sub-blocks are different, and each flag value is used to train a corresponding multi-layer classifier;
  • Step 3 For each flag value of the specified identifier, according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vector of the selected video sub-block in the corresponding D sample videos, the flag value is Training the classification accuracy of the corresponding multi-layer classifier;
  • Step 4 Obtain a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier and the flag value of each specified identifier, and obtain the global optimal flag value of the L designated identifiers according to a preset search algorithm. And update the flag value of each specified identifier;
  • processor 1202 is further configured to invoke the program code, and perform the following operations:
  • the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video are combined to obtain an H×X-dimensional lip feature vector of the sample video;
  • the projection matrix is used to represent the weight of the input layer node.
  • processor 1202 is further configured to invoke the program code, and perform the following operations:
  • the flag value of each specified identifier is updated according to a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
  • processor 1202 is further configured to invoke the program code, and perform the following operations:
  • the classification accuracy rate of the multi-layer classifier trained by the flag value of the specified identifier is used as the fitness value of the specified identifier
  • A person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned may be a read-only memory, a magnetic disk, an optical disk, or the like.

Abstract

An embodiment of the present invention provides a feature extraction method and a lip-reading classification method, device and apparatus, relating to the field of feature recognition, the method comprising: according to a time sequence of frames in a video, dividing the video into M time sub-blocks; dividing a lip portion region of each frame of each time sub-block into N space sub-blocks; assembling a video sub-block from space sub-frames at corresponding identical positions in each frame of the same time sub-block, the video comprising in total M×N video sub-blocks; calculating a lip-reading feature vector of each video sub-block, each video sub-block lip-reading feature vector being an X-dimensional vector; and combining the X-dimensional lip-reading feature vectors of the M×N video sub-blocks of the video to obtain an X×M×N-dimensional feature vector of the video. The present invention fixes the number of feature vector dimensions and does not require dynamic adjustment of the number of feature vector dimensions, simplifying operations and saving training time and classification time.

Description

特征提取方法、唇语分类方法、装置及设备Feature extraction method, lip language classification method, device and device 技术领域Technical field
本发明涉及特征识别领域,特别涉及一种特征提取方法、唇语分类方法、装置及设备。The present invention relates to the field of feature recognition, and in particular, to a feature extraction method, a lip language classification method, device and device.
背景技术Background technique
随着计算机技术的智能化发展,生物特征识别技术作为新的人机交互方式,已广泛应用于身份验证等多种领域,而唇语识别技术作为一种新兴的生物特征识别技术,更是成为了人机交互领域的研究热点。With the intelligent development of computer technology, biometrics technology has been widely used in many fields such as identity verification as a new human-computer interaction method. As a new biometric technology, lip recognition technology has become a new technology. Research hotspots in the field of human-computer interaction.
特征提取是唇语识别过程中的重要步骤,在对视频进行唇语识别时,需要提取视频的唇语特征。现有的特征提取方法通常会提取视频帧的唇部轮廓,以若干参数表示该唇部轮廓,并对部分参数进行线性组合,得到视频的唇语特征。或者,将视频中的多帧图像作为二维信号,对该二维信号进行图像变换,得到视频的唇语特征。Feature extraction is an important step in the process of lip language recognition. In the lip language recognition of video, it is necessary to extract the lip language features of the video. The existing feature extraction method usually extracts the lip contour of the video frame, expresses the lip contour with several parameters, and linearly combines some parameters to obtain the lip language feature of the video. Alternatively, the multi-frame image in the video is used as a two-dimensional signal, and the two-dimensional signal is image-converted to obtain a lip-speech feature of the video.
由于不同视频中的帧数不固定,会导致采用上述特征提取方法提取唇语特征时,唇语特征的维数也不固定。然而,大部分分类器要求固定的特征维数,这就会导致在应用视频对分类器进行训练或者应用分类器对视频进行分类时,需要对视频的唇语特征的维数进行动态调整,操作繁琐,训练时间和分类时间都很长。Since the number of frames in different videos is not fixed, the dimension of the lip features is not fixed when the lip feature is extracted by the above feature extraction method. However, most classifiers require a fixed feature dimension, which results in the need to dynamically adjust the dimension of the lip feature of the video when the application video trains the classifier or when the classifier is used to classify the video. It is cumbersome, and the training time and classification time are very long.
发明内容Summary of the invention
为了固定唇语特征的维数,本发明实施例提供了一种特征提取方法、唇语分类方法、装置及设备。所述技术方案如下:In order to fix the dimension of the lip feature, the embodiment of the invention provides a feature extraction method, a lip language classification method, a device and a device. The technical solution is as follows:
第一方面,提供了一种特征提取方法,所述方法包括: In a first aspect, a feature extraction method is provided, the method comprising:
根据视频中视频帧的时间顺序,将所述视频划分为M个时间子块,每个时间子块中包括至少两个连续的视频帧,每个视频帧中包括唇部区域;Dividing the video into M time sub-blocks according to a chronological order of video frames in the video, each time sub-block includes at least two consecutive video frames, each of which includes a lip region;
将每个时间子块中每个视频帧的唇部区域划分为N个空间子块,并将同一时间子块中的各个视频帧中对应相同位置的空间子块组成一个视频子块,所述视频共得到M×N个视频子块,每个时间子块中包括N个视频子块;Dividing a lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block, The video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
计算每个视频子块的唇语特征向量,所述唇语特征向量用于描述对应视频子块中唇部区域的纹理信息,每个视频子块的唇语特征向量为X维的向量;Calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector is used to describe texture information of a lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector;
将所述视频中的M×N个视频子块的X维唇语特征向量进行组合,得到所述视频的X×M×N维唇语特征向量;Combining the X-dimensional lip feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip feature vector of the video;
其中,M、N、X均为正整数,×用于表示数值的乘积运算。Where M, N, and X are positive integers, and × is used to represent the product of numerical values.
结合第一方面,在第一方面的第一种可能实现方式中,所述计算每个视频子块的唇语特征向量,包括:With reference to the first aspect, in a first possible implementation manner of the first aspect, the calculating a lip language feature vector of each video sub-block includes:
提取一个视频子块中每个空间子块的X维局部二值模式LBP特征向量,所述一个视频子块中包括Y个空间子块;Extracting an X-dimensional local binary pattern LBP feature vector of each spatial sub-block in a video sub-block, wherein the one video sub-block includes Y spatial sub-blocks;
将Y个空间子块的LBP特征向量进行组合,得到X*Y阶局部纹理特征矩阵;Combining the LBP feature vectors of Y spatial sub-blocks to obtain an X*Y-order local texture feature matrix;
对所述局部纹理特征矩阵进行奇异值分解,得到Y*Y阶第一右奇异矩阵;Performing singular value decomposition on the local texture feature matrix to obtain a Y*Y order first right singular matrix;
提取所述第一右奇异矩阵的第一个列向量,作为投影向量;Extracting a first column vector of the first right singular matrix as a projection vector;
计算所述局部纹理特征矩阵与所述投影向量的乘积,得到所述视频子块的X维唇语特征向量;Calculating a product of the local texture feature matrix and the projection vector to obtain an X-dimensional lip language feature vector of the video sub-block;
其中,Y为正整数,X*Y阶矩阵表示X行Y列的矩阵,Y*Y阶矩阵表示Y行Y列的矩阵。Where Y is a positive integer, the X*Y order matrix represents a matrix of X rows and Y columns, and the Y*Y order matrix represents a matrix of Y rows and Y columns.
第二方面,提供了一种唇语分类方法,所述方法包括:In a second aspect, a lip language classification method is provided, the method comprising:
对于预先选择的D个样本视频中的每个样本视频,根据样本视频中视频帧的时间顺序,将所述样本视频划分为M个时间子块,每个时间子块中包括至 少两个连续的视频帧,每个视频帧中包括唇部区域;For each of the pre-selected D sample videos, the sample video is divided into M time sub-blocks according to the chronological order of the video frames in the sample video, and each time sub-block includes Two consecutive video frames, each of which includes a lip region;
将每个时间子块中每个视频帧的唇部区域划分为N个空间子块,并将同一时间子块中的各个视频帧中对应相同位置的空间子块组成一个视频子块,所述样本视频共得到M×N个视频子块,每个时间子块中包括N个视频子块;Dividing a lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming spatial sub-blocks corresponding to the same position in each video frame in the same time sub-block into one video sub-block, The sample video obtains M×N video sub-blocks, and each time sub-block includes N video sub-blocks;
计算每个视频子块的唇语特征向量,所述唇语特征向量用于描述对应视频子块中唇部区域的纹理信息;Calculating a lip language feature vector of each video sub-block, the lip language feature vector being used to describe texture information of a lip region in the corresponding video sub-block;
根据所述D个样本视频中视频子块的唇语特征向量,对多层分类器的分类准确率按照预设训练算法进行训练,直至所述多层分类器的分类准确率满足第一预设条件时停止,得到训练完成的所述多层分类器,所述多层分类器用于对视频的语义信息进行分类;According to the lip feature vector of the video sub-block in the D sample videos, the classification accuracy of the multi-layer classifier is trained according to a preset training algorithm until the classification accuracy of the multi-layer classifier satisfies the first preset When the condition is stopped, the multi-layer classifier that is trained to be completed is obtained, and the multi-layer classifier is used to classify the semantic information of the video;
其中,M、N、D均为正整数,D>1,×用于表示数值的乘积运算。Among them, M, N, and D are positive integers, D>1, and × is used to represent the product of numerical values.
结合第二方面,在第二方面的第一种可能实现方式中,所述根据所述D个样本视频中视频子块的唇语特征向量,对多层分类器的分类准确率按照预设训练算法进行训练,包括:With reference to the second aspect, in a first possible implementation manner of the second aspect, the classification accuracy of the multi-layer classifier according to the lip language feature vector of the video sub-blocks in the D sample videos is according to preset training The algorithm is trained to include:
步骤1:按照预定规则,构造L个指定标识,L个指定标识用于确定对应的隐层节点数目和D个样本视频中被选择的视频子块,L为正整数,L>1;Step 1: According to a predetermined rule, construct L specified identifiers, and the L specified identifiers are used to determine the number of corresponding hidden layer nodes and the selected video sub-blocks in the D sample videos, L is a positive integer, L>1;
步骤2:获取每个指定标识的标志值,每个标志值中包括用于表示隐层节点数目的标志位和用于表示每个视频子块是否被选择的标志位,不同标志值对应的隐层节点数目不同或者对应的被选择的视频子块不同,每个标志值用于训练出一个对应的多层分类器;Step 2: Obtain a flag value for each specified identifier, and each flag value includes a flag bit for indicating the number of hidden layer nodes and a flag bit for indicating whether each video sub-block is selected, and a different flag value corresponds to the hidden The number of layer nodes is different or the corresponding selected video sub-blocks are different, and each flag value is used to train a corresponding multi-layer classifier;
步骤3:对于每个指定标识的标志值,根据指定标识的标志值对应的隐层节点数目以及对应的D个样本视频中被选择的视频子块的唇语特征向量,对与所述标志值对应的多层分类器的分类准确率进行训练;Step 3: For each flag value of the specified identifier, according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vector of the selected video sub-block in the corresponding D sample videos, the flag value is Training the classification accuracy of the corresponding multi-layer classifier;
步骤4:根据每个指定标识的标志值以及每个指定标识的标志值训练出的多层分类器的分类准确率,按照预设搜索算法,获取所述L个指定标识的全局最优标志值,并对每个指定标识的标志值进行更新; Step 4: Obtain a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier and the flag value of each specified identifier, and obtain the global optimal flag value of the L designated identifiers according to a preset search algorithm. And update the flag value of each specified identifier;
重复执行上述步骤3至4,直至所述全局最优标志值满足第二预设条件时停止。The above steps 3 to 4 are repeatedly executed until the global optimum flag value satisfies the second preset condition.
结合第二方面的第一种可能实现方式,在第二方面的第二种可能实现方式中,所述根据指定标识的标志值对应的隐层节点数目以及对应的D个样本视频中被选择的视频子块的唇语特征向量,对与所述标志值对应的多层分类器的分类准确率进行训练,包括:With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the number of the hidden layer nodes corresponding to the flag value of the specified identifier and the selected one of the D sample videos A lip language feature vector of the video sub-block, training the classification accuracy of the multi-layer classifier corresponding to the flag value, including:
确定所述指定标识的标志值对应的隐层节点数目以及对应的D个样本视频中被选择的H个视频子块,H为正整数;Determining, by the number of hidden layer nodes corresponding to the flag value of the specified identifier, and the selected H video sub-blocks in the corresponding D sample videos, where H is a positive integer;
对于D个样本视频中的每个样本视频,将样本视频中被选择的H个视频子块的X维唇语特征向量进行组合,得到所述样本视频的H×X维唇语特征向量;For each of the D sample videos, the X-dimensional lip feature vectors of the selected H video sub-blocks in the sample video are combined to obtain an H×X-dimensional lip feature vector of the sample video;
对D个样本视频的H×X维唇语特征向量进行组合,得到D*(H×X)阶的特征矩阵;Combining H×X-dimensional lip feature vectors of D sample videos to obtain a feature matrix of D*(H×X) order;
对所述特征矩阵进行奇异值分解,得到第二右奇异矩阵;Performing singular value decomposition on the feature matrix to obtain a second right singular matrix;
从所述第二右奇异矩阵中,提取与预设保留维数对应的列向量,作为投影矩阵;Extracting, from the second right singular matrix, a column vector corresponding to the preset retention dimension as a projection matrix;
基于所述投影矩阵、激励函数和所述隐层节点数目,对所述多层分类器的分类准确率进行训练,所述多层分类器至少包括输入层节点和至少一个隐层节点,所述投影矩阵用于表示所述输入层节点的权重。Training the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of the hidden layer nodes, the multi-layer classifier including at least an input layer node and at least one hidden layer node, A projection matrix is used to represent the weight of the input layer node.
结合第二方面的第一种可能实现方式,在第二方面的第三种可能实现方式中,所述根据每个指定标识的标志值以及每个指定标识的标志值训练出的多层分类器的分类准确率,按照预设搜索算法,获取所述L个指定标识的全局最优标志值,并对每个指定标识的标志值进行更新,包括:In conjunction with the first possible implementation of the second aspect, in a third possible implementation of the second aspect, the multi-layer classifier trained according to the flag value of each specified identifier and the flag value of each specified identifier The classification accuracy rate is obtained according to a preset search algorithm, and the global optimal flag values of the L specified identifiers are obtained, and the flag values of each specified identifier are updated, including:
根据每个指定标识的标志值,计算所述L个指定标识的平均最优标志值;Calculating an average optimal flag value of the L designated identifiers according to a flag value of each specified identifier;
根据每个指定标识的标志值训练出的多层分类器的分类准确率,获取每个指定标识的最优标志值以及所述全局最优标志值; Obtaining a classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier, obtaining an optimal flag value of each specified identifier and the global optimal flag value;
根据每个指定标识的最优标志值和所述全局最优标志值,计算每个指定标识的局部吸引子;Calculating a local attractor for each specified identifier according to an optimal flag value of each specified identifier and the global optimal flag value;
根据每个指定标识的局部吸引子和所述平均最优标志值,按照预设更新算法对每个指定标识的标志值进行更新。The flag value of each specified identifier is updated according to a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
结合第二方面的第三种可能实现方式,在第二方面的第四种可能实现方式中,In conjunction with the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect,
所述根据每个指定标识的标志值训练出的多层分类器的分类准确率,获取每个指定标识的最优标志值以及所述全局最优标志值,包括:The classification accuracy rate of the multi-layer classifier trained according to the flag value of each specified identifier, obtaining an optimal flag value of each specified identifier and the global optimal flag value, including:
对于每个指定标识,将指定标识的标志值训练出的多层分类器的分类准确率作为所述指定标识的适应度值;For each specified identifier, the classification accuracy rate of the multi-layer classifier trained by the flag value of the specified identifier is used as the fitness value of the specified identifier;
根据所述指定标识当前的适应度值与所述指定标识的历史最优适应度值,对所述指定标识的最优标志值进行更新,得到所述指定标识更新后的最优标志值;And updating, according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, an optimal flag value of the specified identifier, to obtain an optimal flag value after the specified identifier is updated;
根据所述指定标识当前的适应度值与所述L个指定标识的历史全局最优适应度值,对所述L个指定标识的全局最优标志值进行更新,得到更新后的全局最优标志值。And updating, according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, the global optimal flag values of the L specified identifiers to obtain an updated global optimal identifier. value.
In a third aspect, a feature extraction device is provided, the device including:
a dividing module, configured to divide a video into M time sub-blocks according to the chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
the dividing module being further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to group the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the video in total, each time sub-block including N video sub-blocks;
a feature calculation module, configured to calculate a lip-language feature vector of each video sub-block obtained by the dividing module, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block, and the lip-language feature vector of each video sub-block is an X-dimensional vector; and
a combination module, configured to combine the X-dimensional lip-language feature vectors of the M×N video sub-blocks in the video obtained by the feature calculation module, to obtain an X×M×N-dimensional lip-language feature vector of the video;
where M, N, and X are all positive integers, and × denotes numeric multiplication.
In conjunction with the third aspect, in a first possible implementation of the third aspect, the feature calculation module includes:
an extraction unit, configured to extract an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, where the video sub-block includes Y spatial sub-blocks;
a combination unit, configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extraction unit, to obtain an X*Y local texture feature matrix;
a decomposition unit, configured to perform singular value decomposition on the local texture feature matrix obtained by the combination unit, to obtain a Y*Y first right singular matrix;
a projection unit, configured to extract the first column vector of the first right singular matrix obtained by the decomposition unit as a projection vector; and
a calculation unit, configured to calculate the product of the local texture feature matrix obtained by the combination unit and the projection vector obtained by the projection unit, to obtain the X-dimensional lip-language feature vector of the video sub-block;
where Y is a positive integer, an X*Y matrix denotes a matrix with X rows and Y columns, and a Y*Y matrix denotes a matrix with Y rows and Y columns.
In a fourth aspect, a lip-reading classification device is provided, the device including:
a dividing module, configured to: for each of D pre-selected sample videos, divide the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
the dividing module being further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and to group the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the sample video in total, each time sub-block including N video sub-blocks;
a feature calculation module, configured to calculate a lip-language feature vector of each video sub-block obtained by the dividing module, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block; and
a training module, configured to train the classification accuracy of a multi-layer classifier according to a preset training algorithm, based on the lip-language feature vectors of the video sub-blocks of the D sample videos obtained by the feature calculation module, and to stop when the classification accuracy of the multi-layer classifier satisfies a first preset condition, thereby obtaining a trained multi-layer classifier, where the multi-layer classifier is used to classify the semantic information of a video;
where M, N, and D are all positive integers, D>1, and × denotes numeric multiplication.
In conjunction with the fourth aspect, in a first possible implementation of the fourth aspect, the training module is configured to perform the following steps:
Step 1: constructing L specified identifiers according to a predetermined rule, where the L specified identifiers are used to determine a corresponding number of hidden-layer nodes and the selected video sub-blocks in the D sample videos, L being a positive integer and L>1;
Step 2: obtaining a flag value of each specified identifier, where each flag value includes flag bits indicating the number of hidden-layer nodes and flag bits indicating whether each video sub-block is selected, different flag values correspond to different numbers of hidden-layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier;
Step 3: for the flag value of each specified identifier, training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the lip-language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
Step 4: obtaining the global optimal flag value of the L specified identifiers according to a preset search algorithm, based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier, and updating the flag value of each specified identifier; and
repeating Steps 3 and 4 until the global optimal flag value satisfies a second preset condition.
In conjunction with the first possible implementation of the fourth aspect, in a second possible implementation of the fourth aspect, the training module is further configured to: determine the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the selected H video sub-blocks in the corresponding D sample videos, H being a positive integer; for each of the D sample videos, combine the X-dimensional lip-language feature vectors of the selected H video sub-blocks in the sample video, to obtain an H×X-dimensional lip-language feature vector of the sample video; combine the H×X-dimensional lip-language feature vectors of the D sample videos, to obtain a D*(H×X) feature matrix; perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden-layer nodes, where the multi-layer classifier includes at least input-layer nodes and at least one hidden-layer node, and the projection matrix represents the weights of the input-layer nodes.
In conjunction with the first possible implementation of the fourth aspect, in a third possible implementation of the fourth aspect, the training module is further configured to: calculate an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier; obtain the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier; calculate a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and update the flag value of each specified identifier according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value.
In conjunction with the third possible implementation of the fourth aspect, in a fourth possible implementation of the fourth aspect, the training module is further configured to: for each specified identifier, use the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as a fitness value of the specified identifier; update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and update the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value.
In a fifth aspect, a feature extraction apparatus is provided, the apparatus including a memory and a processor, where the memory is connected to the processor, the memory stores program code, and the processor is configured to invoke the program code to perform the following operations:
dividing a video into M time sub-blocks according to the chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and grouping the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the video in total, each time sub-block including N video sub-blocks;
calculating a lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block, and the lip-language feature vector of each video sub-block is an X-dimensional vector; and
combining the X-dimensional lip-language feature vectors of the M×N video sub-blocks in the video, to obtain an X×M×N-dimensional lip-language feature vector of the video;
where M, N, and X are all positive integers, and × denotes numeric multiplication.
In a sixth aspect, a lip-reading classification apparatus is provided, the apparatus including a memory and a processor, where the memory is connected to the processor, the memory stores program code, and the processor is configured to invoke the program code to perform the following operations:
for each of D pre-selected sample videos, dividing the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region;
dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and grouping the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the sample video in total, each time sub-block including N video sub-blocks;
calculating a lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block; and
training the classification accuracy of a multi-layer classifier according to a preset training algorithm, based on the lip-language feature vectors of the video sub-blocks of the D sample videos, until the classification accuracy of the multi-layer classifier satisfies a first preset condition, thereby obtaining a trained multi-layer classifier, where the multi-layer classifier is used to classify the semantic information of a video;
where M, N, and D are all positive integers, D>1, and × denotes numeric multiplication.
The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows:
In the methods, devices and apparatuses provided by the embodiments of the present invention, a video is divided into M time sub-blocks along the time dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks along the spatial dimension, and the spatial sub-blocks at the same position in the video frames of the same time sub-block are grouped into one video sub-block, so that M×N video sub-blocks are obtained for the video in total. The lip-language feature vector of each video sub-block is then calculated, and the lip-language feature vectors of the M×N video sub-blocks in the video are combined into the lip-language feature vector of the video. Because different videos yield the same number of video sub-blocks, the lip-language feature vectors finally extracted for different videos have the same dimension, so the feature dimension is fixed. The classification accuracy of the multi-layer classifier is trained based on the lip-language feature vectors of the video sub-blocks of the sample videos without dynamically adjusting the feature dimension, which simplifies the operation and saves training time; when the trained multi-layer classifier is applied to classify a video, classification time is also saved and classification accuracy is improved.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
FIG. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a lip-reading classification method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a feature extraction method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the lip region of a video frame according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of video blocking according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a pixel neighborhood according to an embodiment of the present invention;
FIG. 7 is a flowchart of a lip-reading classification method according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a multi-layer classifier according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a feature extraction device according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a lip-reading classification device according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present invention; and
FIG. 12 is a schematic structural diagram of a lip-reading classification apparatus according to an embodiment of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS
To make the objectives, technical solutions and advantages of the present invention clearer, the following further describes the embodiments of the present invention in detail with reference to the accompanying drawings.
FIG. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention. Referring to FIG. 1, the method includes:
Step 101: Divide a video into M time sub-blocks according to the chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region.
Step 102: Divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and group the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the video in total, each time sub-block including N video sub-blocks.
Step 103: Calculate a lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block, and the lip-language feature vector of each video sub-block is an X-dimensional vector.
Step 104: Combine the X-dimensional lip-language feature vectors of the M×N video sub-blocks in the video, to obtain an X×M×N-dimensional lip-language feature vector of the video.
Here, M, N, and X are all positive integers, and × denotes numeric multiplication.
In this embodiment of the present invention, a video includes multiple video frames, and each video frame includes a person's lip region. By classifying the lip regions of the video frames and determining the category to which the lip regions belong, the speech content of the person can be determined.
M and N are preset values, so the number of video sub-blocks obtained by dividing each video is M×N, which is a fixed value. When the lip-language feature vectors of the video sub-blocks are calculated, the lip-language feature vectors of different video sub-blocks have the same dimension; assuming that the lip-language feature vector of each video sub-block is an X-dimensional vector, the lip-language feature vector obtained by combining the lip-language feature vectors of the video sub-blocks of the video has a dimension of X×M×N, which is also a fixed value.
That is, in this embodiment of the present invention, the number M of time sub-blocks into which a video is divided and the number N of spatial sub-blocks into which the lip region of each video frame in a time sub-block is divided are preset for different videos, so that the number of video sub-blocks obtained from each video is a fixed value, and the dimension of the lip-language feature vector extracted from each video is also a fixed value.
In the method provided by this embodiment of the present invention, a video is divided into M time sub-blocks along the time dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks along the spatial dimension, and the spatial sub-blocks at the same position in the video frames of the same time sub-block are grouped into one video sub-block, so that M×N video sub-blocks are obtained for the video in total. The lip-language feature vector of each video sub-block is then calculated, and the lip-language feature vectors of the M×N video sub-blocks in the video are combined into the lip-language feature vector of the video. Because different videos yield the same number of video sub-blocks, the lip-language feature vectors finally extracted for different videos have the same dimension, so the feature dimension is fixed and does not need to be dynamically adjusted, which simplifies the operation and saves time.
Optionally, calculating the lip-language feature vector of each video sub-block includes:
extracting an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, where the video sub-block includes Y spatial sub-blocks;
combining the LBP feature vectors of the Y spatial sub-blocks to obtain an X*Y local texture feature matrix;
performing singular value decomposition on the local texture feature matrix to obtain a Y*Y first right singular matrix;
extracting the first column vector of the first right singular matrix as a projection vector; and
calculating the product of the local texture feature matrix and the projection vector to obtain the X-dimensional lip-language feature vector of the video sub-block;
where Y is a positive integer, an X*Y matrix denotes a matrix with X rows and Y columns, and a Y*Y matrix denotes a matrix with Y rows and Y columns.
FIG. 2 is a flowchart of a lip-reading classification method according to an embodiment of the present invention. Referring to FIG. 2, the method includes:
Step 201: For each of D pre-selected sample videos, divide the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, where each time sub-block includes at least two consecutive video frames and each video frame includes a lip region.
Step 202: Divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and group the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that M×N video sub-blocks are obtained for the sample video in total, each time sub-block including N video sub-blocks.
Step 203: Calculate a lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block.
Step 204: Train the classification accuracy of a multi-layer classifier according to a preset training algorithm, based on the lip-language feature vectors of the video sub-blocks of the D sample videos, and stop when the classification accuracy of the multi-layer classifier satisfies a first preset condition, thereby obtaining a trained multi-layer classifier, where the multi-layer classifier is used to classify the semantic information of a video.
Here, M, N, and D are all positive integers, D>1, and × denotes numeric multiplication.
In the method provided by this embodiment of the present invention, a video is divided into M time sub-blocks along the time dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks along the spatial dimension, and the spatial sub-blocks at the same position in the video frames of the same time sub-block are grouped into one video sub-block, so that M×N video sub-blocks are obtained for the video in total; the X-dimensional lip-language feature vector of each video sub-block is then calculated. Because different videos yield the same number of video sub-blocks, the lip-language feature vectors finally extracted for different videos have the same dimension, so the feature dimension is fixed. The classification accuracy of the multi-layer classifier is trained based on the lip-language feature vectors of the video sub-blocks of the sample videos without dynamically adjusting the feature dimension, which simplifies the operation and saves training time; when the trained multi-layer classifier is applied to classify a video, the feature dimension likewise does not need to be dynamically adjusted, which simplifies the operation, saves classification time, and improves classification accuracy.
Optionally, training the classification accuracy of the multi-layer classifier according to the preset training algorithm, based on the lip-language feature vectors of the video sub-blocks of the D sample videos, includes:
Step 1: constructing L specified identifiers according to a predetermined rule, where the L specified identifiers are used to determine a corresponding number of hidden-layer nodes and the selected video sub-blocks in the D sample videos, L being a positive integer and L>1;
Step 2: obtaining a flag value of each specified identifier, where each flag value includes flag bits indicating the number of hidden-layer nodes and flag bits indicating whether each video sub-block is selected, different flag values correspond to different numbers of hidden-layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier;
Step 3: for the flag value of each specified identifier, training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the lip-language feature vectors of the selected video sub-blocks in the corresponding D sample videos;
Step 4: obtaining the global optimal flag value of the L specified identifiers according to a preset search algorithm, based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier, and updating the flag value of each specified identifier; and
repeating Steps 3 and 4 until the global optimal flag value satisfies a second preset condition. An illustrative sketch of this loop is given below.
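The text leaves the predetermined rule, the preset search algorithm and the two preset conditions open. The following is a minimal sketch of how Steps 1 to 4 could be organized, assuming binary flag bits and generic `evaluate`/`update` callables; all names, the bit-field width and the stopping form are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def train_loop(evaluate, update, num_blocks, L=20, hidden_bits=6,
               max_iter=50, target_acc=0.95, rng=np.random.default_rng(0)):
    """Sketch of the optional training loop (Steps 1 to 4).

    evaluate(n_hidden, selected) -> classification accuracy of one candidate classifier
    update(flags, accuracies, global_best) -> updated flag values (the preset search algorithm)
    """
    # Steps 1-2: L specified identifiers, each holding a binary flag value
    flags = rng.integers(0, 2, size=(L, hidden_bits + num_blocks))
    global_best, global_acc = None, -1.0

    for _ in range(max_iter):
        accuracies = []
        for flag in flags:
            # Step 3: decode the flag value and train the corresponding classifier
            n_hidden = int("".join(map(str, flag[:hidden_bits])), 2) + 1
            selected = np.flatnonzero(flag[hidden_bits:])
            accuracies.append(evaluate(n_hidden, selected))

        # Step 4: record the global optimal flag value and update every flag value
        best = int(np.argmax(accuracies))
        if accuracies[best] > global_acc:
            global_acc, global_best = accuracies[best], flags[best].copy()
        flags = update(flags, np.asarray(accuracies), global_best)

        if global_acc >= target_acc:          # second preset condition (assumed form)
            break
    return global_best, global_acc
```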
Optionally, training the classification accuracy of the multi-layer classifier corresponding to the flag value, according to the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the lip-language feature vectors of the selected video sub-blocks in the corresponding D sample videos, includes:
determining the number of hidden-layer nodes corresponding to the flag value of the specified identifier and the selected H video sub-blocks in the corresponding D sample videos, H being a positive integer;
for each of the D sample videos, combining the X-dimensional lip-language feature vectors of the selected H video sub-blocks in the sample video, to obtain an H×X-dimensional lip-language feature vector of the sample video;
combining the H×X-dimensional lip-language feature vectors of the D sample videos, to obtain a D*(H×X) feature matrix;
performing singular value decomposition on the feature matrix to obtain a second right singular matrix;
extracting, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and
training the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden-layer nodes, where the multi-layer classifier includes at least input-layer nodes and at least one hidden-layer node, and the projection matrix represents the weights of the input-layer nodes. A sketch of these steps is shown below.
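These steps build one feature matrix for the D samples, reduce it with SVD and use the retained right singular vectors as fixed input-layer weights. The sketch below follows that outline; the sigmoid activation, the random hidden-layer weights and all function names are assumptions for illustration only.

```python
import numpy as np

def build_projection(sample_features, selected, keep_dims):
    """Assemble the D x (H*X) feature matrix and derive the input-layer projection."""
    # One row per sample video: concatenation of the H selected X-dim sub-block vectors
    A = np.stack([np.concatenate([f[i] for i in selected]) for f in sample_features])
    # SVD of the D x (H*X) matrix; rows of vt are right singular vectors
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    # Keep the first keep_dims right singular vectors as the projection matrix
    projection = vt[:keep_dims].T            # shape (H*X, keep_dims)
    return A, projection

def hidden_activations(A, projection, n_hidden, rng=np.random.default_rng(0)):
    """Project the inputs and apply one hidden layer with an assumed sigmoid activation."""
    Z = A @ projection                        # fixed input-layer weights from the SVD
    W = rng.standard_normal((Z.shape[1], n_hidden))
    return 1.0 / (1.0 + np.exp(-(Z @ W)))
```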
Optionally, obtaining the global optimal flag value of the L specified identifiers according to the preset search algorithm, based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier, and updating the flag value of each specified identifier, includes:
calculating an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier;
obtaining the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier;
calculating a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and
updating the flag value of each specified identifier according to a preset update algorithm, based on the local attractor of each specified identifier and the average optimal flag value; an illustrative form of such an update is sketched below.
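The preset update algorithm is not named in the text. The combination of local attractors, an average optimal value and a global optimum resembles a quantum-behaved particle-swarm update, so the sketch below uses that form purely as an illustrative assumption; the parameter `beta`, the sigmoid re-binarization and all names are not taken from the patent.

```python
import numpy as np

def attractor_update(flags, personal_best, global_best, beta=0.75,
                     rng=np.random.default_rng(1)):
    """Illustrative attractor-based flag update (assumed QPSO-like form)."""
    mean_best = personal_best.mean(axis=0)                        # average optimal flag value
    phi = rng.random(flags.shape)
    attractors = phi * personal_best + (1 - phi) * global_best    # local attractor per identifier
    u = rng.random(flags.shape)
    step = beta * np.abs(mean_best - flags) * np.log(1.0 / u)
    sign = np.where(rng.random(flags.shape) < 0.5, 1.0, -1.0)
    new_flags = attractors + sign * step
    # Flag bits are binary in this scheme, so map the continuous values back to {0, 1}
    return (1.0 / (1.0 + np.exp(-new_flags)) > rng.random(flags.shape)).astype(int)
```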
Optionally, obtaining the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained with the flag value of each specified identifier includes:
for each specified identifier, using the classification accuracy of the multi-layer classifier trained with the flag value of the specified identifier as a fitness value of the specified identifier;
updating the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and
updating the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value. This bookkeeping is sketched below.
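In other words, the fitness of an identifier is simply the accuracy of the classifier its flag value produces, and the per-identifier and global optima are the usual "best so far" records. A small bookkeeping sketch, assuming "better" means strictly greater accuracy (the comparison rule is not stated in the text):

```python
def update_bests(flags, fitness, best_flags, best_fitness, global_flag, global_fitness):
    """Keep, for each identifier, its best flag value so far and the global best."""
    for i, acc in enumerate(fitness):
        if acc > best_fitness[i]:                 # better than this identifier's history
            best_fitness[i], best_flags[i] = acc, flags[i].copy()
        if acc > global_fitness:                  # better than the global history
            global_fitness, global_flag = acc, flags[i].copy()
    return best_flags, best_fitness, global_flag, global_fitness
```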
Any combination of the foregoing optional technical solutions may be used to form optional embodiments of the present invention, and details are not described herein again.
FIG. 3 is a flowchart of a feature extraction method according to an embodiment of the present invention. The method of this embodiment is performed by a feature extraction device. Referring to FIG. 3, the method includes:
Step 301: For each video, the feature extraction device divides the video into M×N video sub-blocks according to the chronological order of video frames in the video.
The feature extraction device may be a computer, a server, or another device, which is not limited in this embodiment of the present invention.
Specifically, the video includes multiple video frames arranged in chronological order. The feature extraction device may divide the video into M time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames and each video frame of the video includes a lip region. The feature extraction device then divides the lip region of each video frame in each time sub-block into N spatial sub-blocks, and groups the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, so that one time sub-block includes N video sub-blocks and the video yields M×N video sub-blocks in total, where M and N are positive integers.
The feature extraction device may first divide the video into M time sub-blocks and then locate and segment each video frame to obtain the lip region of each video frame and divide the lip region of each video frame into N spatial sub-blocks; alternatively, the feature extraction device may first locate and segment each video frame to obtain the lip region of each video frame, then divide the obtained lip regions into M time sub-blocks and divide the lip regions in each time sub-block into N spatial sub-blocks. The timing of the locating and segmentation is not limited in this embodiment of the present invention.
Referring to FIG. 4, t denotes the time of a video frame, and the coordinate system formed by x and y denotes the space in which the video frame lies. The video includes multiple video frames, and by locating and segmenting each video frame, the lip region of each video frame can be obtained.
The feature extraction device may first divide along the time dimension: it groups the video frames according to their chronological order in the video, takes at least two video frames as one time sub-block, and divides the video into M time sub-blocks, each including at least two consecutive video frames; it then divides along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks, so that the same time sub-block includes multiple spatial sub-blocks, and grouping the spatial sub-blocks at the same position in the video frames of the same time sub-block into one video sub-block, thereby obtaining M×N video sub-blocks. Of course, the feature extraction device may also first divide along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks, and then divide along the time dimension, taking the spatial sub-blocks of at least two video frames as one time sub-block, dividing the video into M time sub-blocks, and grouping the spatial sub-blocks at the same position in the video frames of each time sub-block into one video sub-block, thereby obtaining M×N video sub-blocks. This is not limited in this embodiment of the present invention.
Referring to FIG. 5, the feature extraction device may divide the video along the time dimension to obtain M time sub-blocks. Each time sub-block includes multiple video frames; the device then divides along the spatial dimension, dividing the lip region of each video frame into N spatial sub-blocks and grouping the spatial sub-blocks at the same position in the video frames into one video sub-block, so that the time sub-block yields N video sub-blocks and the video yields M×N video sub-blocks in total. In FIG. 5, taking N=4 as an example, each time sub-block includes the lip regions of three video frames, and the lip region of each video frame is divided into four spatial sub-blocks: the upper-left, upper-right, lower-left and lower-right lip regions. The upper-left lip regions of the three video frames form video sub-block (1), the upper-right lip regions form video sub-block (2), the lower-left lip regions form video sub-block (3), and the lower-right lip regions form video sub-block (4), giving four video sub-blocks.
In addition, different videos have different numbers of video frames. To divide a video into M time sub-blocks, the feature extraction device may, according to the number of frames of the video, filter video frames or assign frames to more than one block. Further, for a given video, the number of video frames included in each time sub-block may be the same. For example, if the video includes ten video frames numbered 1-10 and M is 4, the feature extraction device may take frames 1-4 as one group forming a time sub-block, frames 3-6 as one group forming a time sub-block, frames 5-8 as one group forming a time sub-block, and frames 7-10 as one group forming a time sub-block, finally obtaining 4 time sub-blocks, each including 4 video frames.
M is referred to as a first preset number and N as a second preset number. The feature extraction device predetermines the first preset number M and the second preset number N, where the first preset number M specifies the number of time sub-blocks into which each video is divided and the second preset number N specifies the number of spatial sub-blocks into which each video frame in a time sub-block is divided. The feature extraction device may divide different videos using the same first preset number M and second preset number N. M and N may be determined by the feature extraction device according to the required precision of the video sub-blocks; M may be 4, 5 or another value, and N may be 5, 6 or another value, which is not limited in this embodiment of the present invention.
The feature extraction device may also preset the number of video frames included in each time sub-block, for example a third preset number Q of video frames per time sub-block. During the division, when the number of frames of the video is greater than the product M×Q of the first preset number and the third preset number, the feature extraction device may filter the video frames, filtering out a first specified number of video frames equal to the difference between the number of frames of the video and M×Q, so that after filtering the number of frames equals M×Q and the video frames can then be divided into M time sub-blocks each including Q video frames. When the number of frames of the video is less than M×Q, the feature extraction device may select a second specified number of video frames, equal to the difference between M×Q and the number of frames of the video, and assign each of these video frames to two time sub-blocks, so that the video can still be divided into M time sub-blocks each including Q video frames. Here Q is a positive integer that may be determined in advance by the feature extraction device according to the required precision of the video sub-blocks, which is not limited in this embodiment of the present invention.
In this embodiment of the present invention, in order to fix the feature dimension, the feature extraction device may block the video into M×N video sub-blocks before extracting features from the video. Moreover, the feature extraction device blocks the video along both the time dimension and the spatial dimension, dividing the original video into a set of video sub-blocks, which increases the temporal and spatial information contained in the subsequently extracted features and allows the lip-language features to be expressed better.
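As an illustration of Step 301, the following sketch partitions a sequence of same-sized lip-region images into M×N video sub-blocks, using M time sub-blocks of Q frames each and an N1×N2 spatial grid with N=N1×N2. The grid layout, the frame-index interpolation used to drop or reuse frames, and all names are assumptions for illustration, not the only division the embodiment allows.

```python
import numpy as np

def partition_video(lip_frames, M=4, Q=4, N1=2, N2=2):
    """Split a list of lip-region images (equal-sized 2-D arrays) into M*N video sub-blocks."""
    T = len(lip_frames)
    # Choose M*Q frame indices: drop frames if T > M*Q, reuse frames if T < M*Q
    idx = np.linspace(0, T - 1, M * Q).round().astype(int)
    frames = [lip_frames[i] for i in idx]

    sub_blocks = []                      # final list of M*N video sub-blocks
    for m in range(M):                   # time sub-block m holds Q consecutive frames
        chunk = frames[m * Q:(m + 1) * Q]
        for r in range(N1):              # N = N1 * N2 spatial sub-blocks per frame
            for c in range(N2):
                block = []
                for img in chunk:
                    h, w = img.shape[0] // N1, img.shape[1] // N2
                    block.append(img[r * h:(r + 1) * h, c * w:(c + 1) * w])
                sub_blocks.append(np.stack(block))   # same spatial position across the Q frames
    return sub_blocks                    # length M * N1 * N2 = M * N
```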
Step 302: The feature extraction device calculates an X-dimensional lip-language feature vector of each video sub-block, where the lip-language feature vector describes the texture information of the lip region in the corresponding video sub-block.
In this embodiment of the present invention, after the video is divided, each video sub-block includes at least one spatial sub-block, and each spatial sub-block is in fact a part of the lip region of a video frame. For each video sub-block in the video, Step 302 may include the following Steps 302a to 302c:
Step 302a: The feature extraction device extracts an X-dimensional LBP feature vector of each spatial sub-block in the video sub-block; if the video sub-block includes Y spatial sub-blocks, the LBP feature vectors of the Y spatial sub-blocks are combined to obtain an X*Y local texture feature matrix.
The feature extraction device may extract the X-dimensional LBP feature vector of each spatial sub-block in the video sub-block, using the LBP feature vector to describe the texture information of the lip region in the corresponding spatial sub-block, take the LBP feature vector of each spatial sub-block as one column, and combine the LBP feature vectors of the Y spatial sub-blocks to obtain an X*Y matrix, namely the local texture feature matrix of the video sub-block, where Y is a positive integer and an X*Y matrix denotes a matrix with X rows and Y columns.
When obtaining the LBP feature vector of a spatial sub-block, the feature extraction device takes each pixel in the spatial sub-block as a center pixel and obtains each specified pixel adjacent to the center pixel. For each specified pixel of the center pixel, the pixel value of the specified pixel is compared with the pixel value of the center pixel: if the pixel value of the specified pixel is greater than that of the center pixel, the feature value of the specified pixel is set to 1; otherwise, the feature value of the specified pixel is set to 0. A binary feature value is thus set for each specified pixel; the feature values of all specified pixels of the center pixel are combined, and the resulting binary number is converted into a decimal number to obtain the LBP feature value of the center pixel. The feature extraction device may obtain the LBP feature value of each pixel in the spatial sub-block, compute a statistical histogram of the spatial sub-block from the LBP feature values of the pixels, and normalize the statistical histogram to obtain the LBP feature vector of the spatial sub-block.
Referring to FIG. 6, each pixel in a spatial sub-block is taken as the center pixel, and its neighborhood, a 3*3 pixel region, is obtained. With the pixel values of this neighborhood as shown in part (a) of FIG. 6, comparing the pixel value of each specified pixel around the center pixel with the pixel value of the center pixel yields the feature value of each specified pixel, as shown in part (b) of FIG. 6. Taking the specified pixel at the upper-left corner as the least significant bit and combining the feature values of the specified pixels in clockwise order, the LBP pattern of this neighborhood is 11110001. Its decimal value, i.e. the LBP feature value of the center pixel, can also be calculated: starting from the specified pixel at the upper-left corner and proceeding clockwise, the weights of the specified pixels around the center pixel are 1, 2, 4, 8, 16, 32, 64 and 128, so the LBP feature value is 1+16+32+64+128=241. After the LBP feature value of each pixel in the spatial sub-block is calculated, the statistical histogram of the spatial sub-block is computed from the LBP feature values of the pixels and normalized to obtain the LBP feature vector of the spatial sub-block.
In essence, LBP describes the texture information of each spatial sub-block in the video. The extracted LBP features have notable advantages such as rotation invariance and gray-scale invariance, which enhance the robustness of the lip-language feature vector to factors such as illumination conditions, rotation and translation, and improve classification accuracy. Moreover, LBP feature vectors are highly discriminative and simple to compute.
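A minimal sketch of Step 302a, computing the basic 3×3 LBP value of each interior pixel and the normalized 256-bin histogram of one spatial sub-block (so X would be 256 here); the clockwise bit order starting at the upper-left neighbor follows the FIG. 6 example, while the border handling is an assumption.

```python
import numpy as np

def lbp_histogram(patch):
    """Basic 3x3 LBP histogram of one spatial sub-block (a 2-D gray-scale array)."""
    p = patch.astype(np.int32)
    center = p[1:-1, 1:-1]
    # Clockwise neighbors starting from the upper-left pixel (least significant bit)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = p[1 + dy:p.shape[0] - 1 + dy, 1 + dx:p.shape[1] - 1 + dx]
        codes += (neighbor > center).astype(np.int32) << bit   # 1 if neighbor is larger
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()                                    # normalized 256-dim LBP vector
```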
Step 302b: The feature extraction device performs singular value decomposition on the local texture feature matrix to obtain a Y*Y first right singular matrix, and extracts the first column vector of the first right singular matrix as a projection vector, where a Y*Y matrix denotes a matrix with Y rows and Y columns.
After extracting the local texture feature matrix of the video sub-block, the feature extraction device may perform singular value decomposition on the local texture feature matrix, take the right singular matrix obtained from the decomposition as the first right singular matrix, and extract the first column vector of the first right singular matrix as the projection vector.
The feature extraction device may perform the singular value decomposition of the local texture feature matrix with the following formula:
[u,s,v]=svd(X);
where X denotes the local texture feature matrix, the matrix u is the left singular matrix obtained by decomposing the local texture feature matrix X, the matrix s is the singular value matrix obtained by decomposing the local texture feature matrix X, and the matrix v is the right singular matrix obtained by decomposing the local texture feature matrix X. Since the local texture feature matrix is an X*Y matrix, the first right singular matrix obtained by its singular value decomposition is a Y*Y matrix, and the projection vector is a Y-dimensional vector.
Step 302c: The feature extraction device calculates the product of the local texture feature matrix and the projection vector to obtain the X-dimensional lip-language feature vector of the video sub-block.
The feature extraction device may calculate the lip-language feature vector of the video sub-block with the following formula:
fPLBP=X*pVect;
where * denotes matrix multiplication, pVect denotes the first column vector of the matrix v, and fPLBP denotes the lip-language feature vector. Since the local texture feature matrix is an X*Y matrix and the projection vector is a Y-dimensional vector, the calculated lip-language feature vector is an X-dimensional vector.
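Steps 302b and 302c expressed with NumPy, for illustration: note that the symbol X is overloaded in the text (it denotes both the matrix and its number of rows), and that NumPy's `svd` returns V transposed, so the first column of v is the first row of `vt`.

```python
import numpy as np

def plbp_feature(lbp_columns):
    """PLBP feature of one video sub-block.

    lbp_columns: X x Y local texture feature matrix, one LBP histogram per column.
    """
    u, s, vt = np.linalg.svd(lbp_columns, full_matrices=True)  # [u, s, v] = svd(X)
    p_vect = vt[0]                       # first column of v, i.e. the Y-dim projection vector
    return lbp_columns @ p_vect          # f_PLBP = X * pVect, an X-dimensional vector
```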
The lip-language feature vector of the video can then be written as:
F = [fPLBP1, fPLBP2, …, fPLBPm];
where fPLBPi denotes the PLBP feature of the i-th video sub-block in the video, m denotes the number of video sub-blocks into which the video is divided (M×N), and F denotes the lip-language feature vector of the video. Since the number of video sub-blocks obtained from different videos is fixed, the dimension of the lip-language feature vector F of a video is also fixed, and this lip-language feature vector can be used to classify the video.
In this embodiment of the present invention, once the local texture feature matrix of a video sub-block has been extracted, singular value decomposition is performed on it, the first column vector of the resulting right singular matrix is used as the optimal projection vector, the local texture feature matrix is projected onto this vector, and the PLBP (Projection Local Binary Patterns) feature of the video sub-block is thereby extracted.
The PLBP feature is a picture-sequence feature selection method based on the LBP feature. Its basic principle is to extract the LBP feature vector of each frame in a picture sequence, combine the feature vectors of all frames into a feature matrix in which each column corresponds to the feature vector of one frame, and thus extract a feature of fixed dimension from picture sequences with different numbers of frames: the optimal projection vector is found by searching for the optimal projection direction, and the feature matrix is projected onto this optimal projection vector.
Further, this embodiment of the present invention adopts the idea of block-wise feature extraction: the video is divided into several video sub-blocks in time and space, the PLBP feature of each video sub-block is extracted, and the PLBP features of the video sub-blocks are finally combined into a new feature, which is output as the final feature. The use of the blocking technique increases the spatial and temporal information contained in the finally extracted feature vector, describes the motion of the lips in the picture sequence better, and greatly facilitates the optimization of the training algorithm of the subsequent classifier; in the later lip-reading classification stage it significantly improves the lip-reading recognition rate, and it is also of great reference value for related video recognition technologies.
Step 303: The feature extraction apparatus combines the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip language feature vector of the video.
The feature extraction apparatus may obtain the X-dimensional lip language feature vector of each video sub-block in the video and combine the X-dimensional lip language feature vectors of the M×N video sub-blocks, arranging the X-dimensional lip language feature vector of each video sub-block after that of the previous video sub-block, thereby obtaining an X×M×N-dimensional lip language feature vector.
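Putting steps 302 and 303 together, a minimal sketch of assembling the fixed-dimension video feature (assuming the `plbp_feature` function from the sketch above and a hypothetical list holding the M×N local texture feature matrices of the video sub-blocks; names are illustrative):
```python
import numpy as np

def video_lip_feature(sub_block_matrices: list[np.ndarray]) -> np.ndarray:
    """Compute the X*M*N-dimensional lip language feature vector of one video.

    sub_block_matrices: the M*N local texture feature matrices (one per video
    sub-block), each of shape (X, Y).
    """
    # One X-dimensional PLBP vector per sub-block, concatenated in sub-block order.
    return np.concatenate([plbp_feature(m) for m in sub_block_matrices])
```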
Using the foregoing steps 301-303, the lip language feature vector of any video can be extracted, and this lip language feature vector can be used to classify the semantic information of the video. The feature extraction apparatus may use the foregoing steps 301-303 to obtain the lip language feature vectors of multiple sample videos and train a classifier according to the lip language feature vectors of the multiple sample videos; whenever the semantic information of a video is to be classified, the foregoing steps 301-303 are used to obtain the lip language feature vector of the video, and the lip language feature vector is input into the trained classifier to obtain the classification result.
In the method provided by this embodiment of the present invention, the video is divided into M time sub-blocks according to the time dimension; according to the spatial dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks, and the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block form one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks in the video are combined to obtain the lip language feature vector of the video. The number of video sub-blocks obtained from different videos is the same, so the dimensions of the finally extracted lip language feature vectors of the videos are the same, which fixes the feature dimension; the feature dimension does not need to be dynamically adjusted during classification, which simplifies the operation and saves classification time. Moreover, after the local texture feature matrix of a video sub-block is extracted, the local texture feature matrix is projected to obtain the lip language feature vector, which enhances the robustness of the lip language feature.
The feature extraction method provided by this embodiment of the present invention can extract a lip language feature vector of fixed dimension, and this lip language feature vector can be used to classify the semantic information of a video; the classification process is described in detail in the next embodiment.
FIG. 7 is a flowchart of a lip language classification method according to an embodiment of the present invention. The execution entity of this embodiment of the present invention is a classification apparatus. Referring to FIG. 7, the method includes:
Step 701: The classification apparatus extracts the lip language feature vectors of the video sub-blocks in D pre-selected sample videos.
To train the multi-layer classifier, the classification apparatus obtains D sample videos in advance, divides each sample video into M×N video sub-blocks, and then extracts the lip language feature vector of each video sub-block; the specific process is similar to the foregoing steps 301-302 and is not described again here.
Step 702: The classification apparatus trains the classification accuracy of a multi-layer classifier according to a preset training algorithm based on the lip language feature vectors of the video sub-blocks in the D sample videos, and stops when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain a trained multi-layer classifier, where D is a positive integer and D > 1.
In this embodiment of the present invention, the classification apparatus may train the classification accuracy of the multi-layer classifier according to the preset training algorithm based on the lip language feature vectors of the video sub-blocks in the D sample videos, stopping when the classification accuracy of the multi-layer classifier satisfies the first preset condition; the trained multi-layer classifier can then be used to classify the semantic information of a video to be classified, thereby implementing lip language recognition. The preset training algorithm is determined in advance by the classification apparatus and may be an SVM (Support Vector Machine) classification algorithm, an artificial neural network algorithm, or the like. The first preset condition is used to determine the training target of the multi-layer classifier and may be determined according to the required classification accuracy; the first preset condition may be that the current classification accuracy of the multi-layer classifier reaches a preset classification accuracy, or that the difference between the current classification accuracy and the previous classification accuracy of the multi-layer classifier is smaller than a preset difference, or that the number of iterations of the training process reaches a maximum number of iterations, or the like, which is not limited in this embodiment of the present invention.
In actual application, the classification apparatus may combine the lip language feature vectors of all video sub-blocks in each sample video into one lip language feature vector; the specific process is similar to the foregoing step 303 "The feature extraction apparatus combines the X-dimensional lip language feature vectors of the M×N video sub-blocks in the video to obtain an X×M×N-dimensional lip language feature vector of the video" and is not described again here. The classification apparatus may then combine the lip language feature vectors of the sample videos to obtain a feature matrix, use the feature matrix as the input of the multi-layer classifier, and train the classification accuracy of the multi-layer classifier using the ELM (Extreme Learning Machine) algorithm. The multi-layer classifier includes input layer nodes and at least one hidden layer node, and the input weights represent the weights of the input layer nodes; during training, only the number of hidden layer nodes, the input weights, the bias terms, and the excitation function need to be determined to train the multi-layer classifier, and when training is completed, the multi-layer classifier is determined according to the input weights and output weights of the current training. However, the input weights and bias terms of the hidden layer nodes used by the ELM algorithm are obtained by random assignment; random assignment causes the performance of the multi-layer classifier to be unstable on high-dimensional small samples, and it is difficult to obtain optimal parameter values.
For this reason, the classification apparatus may determine the input weights of the multi-layer classifier by projection and train the classification accuracy of the multi-layer classifier based on the determined input weights.
This embodiment of the present invention may use a PELM (Projection Extreme Learning Machine) as the multi-layer classifier for distinguishing different semantic information (speech content). FIG. 8 is a structural diagram of the PELM multi-layer classifier, which includes input nodes, hidden layer nodes, and output nodes, where D denotes the dimension of the input feature vector, N denotes the number of hidden layer nodes, and m denotes the dimension of the output vector (that is, the number of classes of speech content to be distinguished).
Given training samples {X, T}, where X is the input sample matrix, in which each row corresponds to one input sample (the feature vector of a video sub-block); T is the class matrix corresponding to X, in which each row is the class vector of one sample (the position corresponding to the class to which the sample belongs is 1, and the remaining positions are 0); wDN denotes the input weight from the D-th input node to the N-th hidden layer node, and βNm denotes the output weight from the N-th hidden layer node to the m-th output node. Training the model of the PELM multi-layer classifier means finding, from {X, T}, appropriate input weights W and output weights β such that T = g(XW)β, where g(x) is the hidden layer node excitation function.
The training process of the PELM multi-layer classifier may be as follows: the classification apparatus obtains the lip language feature vectors of multiple sample videos and combines them to obtain a feature matrix; performs singular value decomposition on the feature matrix and obtains the right singular matrix resulting from the decomposition; extracts, from the right singular matrix according to a preset retained dimension, the column vectors corresponding to the preset retained dimension as a projection matrix; uses the projection matrix as the input weights of the multi-layer classifier, so that the projection matrix represents the weights of the input layer nodes of the multi-layer classifier instead of the input weights being assigned randomly; trains the classification accuracy of the multi-layer classifier based on the projection matrix, the current number of hidden layer nodes, and the excitation function; and, when training is completed, determines the multi-layer classifier according to the trained input weights and output weights, so that the multi-layer classifier can be used to classify the semantic information of a video. The preset retained dimension specifies the number of columns of the projection matrix and is smaller than the number of columns of the right singular matrix; it may be 1, 2, or another value, which is not limited in this embodiment of the present invention.
Specifically, the classification apparatus obtains the lip language feature vectors of the D sample videos; assuming that the lip language feature vector of each sample video is an R-dimensional vector, the R-dimensional lip language feature vectors of the D sample videos are combined to obtain a D*R order feature matrix. The formula [P, S, QT] = svd(X) is applied to perform singular value decomposition on the feature matrix, the right singular matrix QT obtained from the decomposition is acquired, and the column vectors corresponding to the preset retained dimension K are extracted from the right singular matrix as the projection matrix QK. The projection matrix QK is used as the input weight matrix W of the multi-layer classifier; let H = g(XW), and apply the formula β = H⁻¹T to calculate β. At this point, the PELM multi-layer classifier is trained. When the feature matrix xnew = [x1, x2, …, xD] of a new video is obtained, the formula t = g(xnewW)β is applied, where t = [t1, t2, …, tm]; the class corresponding to the maximum value among t1, t2, …, tm is the classification result of the video.
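A minimal numpy sketch of this training and prediction procedure (assuming a sigmoid excitation function, using the Moore-Penrose pseudo-inverse for H⁻¹T since H is generally not square, and letting the retained dimension k also serve as the number of hidden layer nodes in this simplified sketch; names are illustrative):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pelm(X, T, k):
    """Train a PELM classifier.

    X: (num_samples, D) feature matrix, one row per sample video.
    T: (num_samples, m) class matrix with one-hot rows.
    k: preset retained dimension (number of right singular vectors kept).
    """
    # SVD of the feature matrix; rows of vt are the right singular vectors.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    W = vt[:k, :].T                 # projection matrix used as input weights, shape (D, k)
    H = sigmoid(X @ W)              # hidden layer output, shape (num_samples, k)
    beta = np.linalg.pinv(H) @ T    # output weights, shape (k, m)
    return W, beta

def classify_pelm(x_new, W, beta):
    """Return the predicted class index for a new video's feature vector."""
    t = sigmoid(x_new @ W) @ beta   # (m,) output scores
    return int(np.argmax(t))
```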
PELM is a simple, easy-to-use, and effective single-hidden-layer neural network learning algorithm. Compared with traditional neural network learning algorithms, which require a large number of network training parameters to be set manually and easily produce locally optimal solutions, PELM only needs to use the projection matrix of the feature matrix formed by combining the lip language feature vectors of multiple sample videos as the input weights of the network and to set the number of hidden layer nodes of the network; during execution of the algorithm, the input weights of the network and the bias terms of the hidden layer nodes do not need to be adjusted, and a unique optimal solution can be produced. PELM therefore has the advantages of fast learning speed and good generalization performance, so that the trained multi-layer classifier can achieve a stable recognition rate.
It should be noted that this embodiment of the present invention is described only by using an example in which the classification apparatus uses all video sub-blocks of the D videos as samples. In actual application, the classification apparatus may also select among the video sub-blocks of the multiple sample videos according to a preset selection policy, pick out the video sub-blocks that describe the lip language information well, recombine the selected video sub-blocks, and use only the selected video sub-blocks as samples; these samples can be used to train the classification accuracy of the multi-layer classifier, so as to reduce redundant features and improve the calculation speed. Moreover, the number of selected video sub-blocks is the same in different sample videos, so as to ensure that the dimension of the lip language feature vectors of different sample videos is fixed. The preset selection policy is used to determine how the video sub-blocks are selected and may be determined in advance by the classification apparatus, which is not limited in this embodiment of the present invention.
Optionally, the classification apparatus may use the training method provided in the following steps 702a-702c to select video sub-blocks in the sample videos and train a multi-layer classifier based on the lip language feature vectors of the selected video sub-blocks:
Step 702a: Construct L specified identifiers according to a predetermined rule, and obtain the flag value of each specified identifier, where the L specified identifiers are used to determine the corresponding number of hidden layer nodes and the selected video sub-blocks in the D sample videos.
The classification apparatus may construct L specified identifiers according to the predetermined rule, where L is a positive integer and L > 1. Each specified identifier is then initialized according to different numbers of hidden layer nodes and different selections of video sub-blocks, and a flag value is assigned to each specified identifier; a flag value may be assigned to each specified identifier randomly, or a preset flag value, such as 0000 or 1111, may be assigned to each specified identifier, which is not limited in this embodiment of the present invention. Each flag value includes flag bits indicating the number of hidden layer nodes and flag bits indicating whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, that is, each flag value corresponds to one number of hidden layer nodes and to the selected video sub-blocks in the D sample videos, and the corresponding number of hidden layer nodes and the video sub-blocks to be selected in each sample video can be determined from each flag value.
For example, the classification apparatus may use 10 flag bits as the flag bits indicating the number of hidden layer nodes, and the decimal value corresponding to the binary number formed by these 10 flag bits is the number of hidden layer nodes. The classification apparatus may also use m flag bits as the flag bits indicating whether each video sub-block is selected, where m is the number of video sub-blocks in each sample video; when a flag bit among the m flag bits is 1, the video sub-block corresponding to that flag bit is selected, and when a flag bit among the m flag bits is 0, the video sub-block corresponding to that flag bit is not selected. For example, when the m flag bits are 1000, the first video sub-block in each video is selected. The classification apparatus takes the binary number formed by these 10+m flag bits as one flag value, where 10+m is the number of bits of the flag value, and assigns the flag value to one specified identifier.
Step 702b: For the flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to that flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos.
In this embodiment of the present invention, each flag value is used to train one corresponding multi-layer classifier. For each specified identifier, when the flag value of the specified identifier is obtained, the number of hidden layer nodes corresponding to the flag value and the selected video sub-blocks in the corresponding D sample videos can be determined, and the lip language feature vectors of the selected video sub-blocks in the D sample videos corresponding to the flag value are obtained; the classification accuracy of the multi-layer classifier corresponding to the flag value is then trained according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks in the corresponding D sample videos, and training stops when the classification accuracy of the multi-layer classifier satisfies the first preset condition, yielding a trained multi-layer classifier; the flag values of the L specified identifiers can thus train L multi-layer classifiers.
Step 702b may specifically include the following steps 702b-1 to 702b-4:
Step 702b-1: Determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the H selected video sub-blocks in the corresponding D sample videos, where H is a positive integer.
According to the values of the flag bits in the flag value, the classification apparatus can determine the number of hidden layer nodes corresponding to the flag value and the H selected video sub-blocks in the corresponding D sample videos.
Based on the example in step 702a, the flag value includes 10+m flag bits; the classification apparatus calculates the decimal value corresponding to the first 10 flag bits of the flag value, which is the number of hidden layer nodes corresponding to the flag value, and obtains the flag bits with value 1 among the last m flag bits, where the video sub-blocks corresponding to the flag bits with value 1 are the selected video sub-blocks in each sample video.
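As a small sketch of this decoding step (assuming the flag value is held as a string of '0'/'1' characters with the 10 hidden-node bits first, as in the example above; names are illustrative):
```python
def decode_flag_value(flag_value: str, m: int) -> tuple[int, list[int]]:
    """Decode a (10+m)-bit flag value.

    Returns the number of hidden layer nodes (from the first 10 bits) and the
    indices of the selected video sub-blocks ('1' positions in the last m bits).
    """
    hidden_nodes = int(flag_value[:10], 2)
    selected = [i for i, bit in enumerate(flag_value[10:10 + m]) if bit == "1"]
    return hidden_nodes, selected

# Example: 5 hidden layer nodes, sub-blocks 0 and 3 selected out of m = 4.
print(decode_flag_value("0000000101" + "1001", m=4))  # (5, [0, 3])
```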
Step 702b-2: For each of the D sample videos, combine the X-dimensional lip language feature vectors of the H selected video sub-blocks in the sample video to obtain an H×X-dimensional lip language feature vector of the sample video, and combine the H×X-dimensional lip language feature vectors of the D sample videos to obtain a D*(H×X) order feature matrix.
After the video sub-blocks are selected according to the flag value, the classification apparatus trains the multi-layer classifier based only on the selected video sub-blocks. Therefore, for each of the D sample videos, the X-dimensional lip language feature vectors of the H selected video sub-blocks in the sample video are combined to obtain an H×X-dimensional lip language feature vector of the sample video, and the lip language feature vectors of all video sub-blocks in the sample video are no longer combined. One H×X-dimensional lip language feature vector is obtained for each sample video; taking the H×X-dimensional lip language feature vector of each sample video as one row, the H×X-dimensional lip language feature vectors of the D sample videos are combined to obtain a D*(H×X) order feature matrix.
Step 702b-3: Perform singular value decomposition on the feature matrix to obtain a second right singular matrix, and extract, from the second right singular matrix, the column vectors corresponding to the preset retained dimension as the projection matrix.
After obtaining the feature matrix, the classification apparatus performs singular value decomposition on the feature matrix to obtain a right singular matrix, which serves as the second right singular matrix, and extracts, from the second right singular matrix, the column vectors corresponding to the preset retained dimension as the projection matrix. The projection process is similar to the foregoing step 302b and is not described again here.
Step 702b-4: Train the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes, where the multi-layer classifier includes at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes.
After calculating the projection matrix, the classification apparatus uses the projection matrix to represent the weights of the input layer nodes of the multi-layer classifier and trains the classification accuracy of the multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes, stopping when the classification accuracy of the multi-layer classifier satisfies the first preset condition.
During training, the classification apparatus may use the D sample videos as training sample videos, obtain the projection matrix corresponding to the D sample videos, and train a multi-layer classifier based on the projection matrix, the excitation function, and the number of hidden layer nodes; it then obtains D' test sample videos, obtains the feature matrix corresponding to the D' sample videos, inputs the feature matrix into the multi-layer classifier to obtain the classification result of each test sample video, compares the classification result of each test sample video with the class into which the test sample video is actually divided, and calculates the classification accuracy of the multi-layer classifier.
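A brief sketch of this accuracy evaluation (assuming `classify_pelm`, `W`, and `beta` from the PELM sketch above, `X_test` holding one test sample feature vector per row, and `y_test` holding the true class indices; all names are illustrative):
```python
import numpy as np

def classification_accuracy(X_test, y_test, W, beta):
    """Fraction of test sample videos whose predicted class matches the true class."""
    predictions = [classify_pelm(x, W, beta) for x in X_test]
    return float(np.mean(np.array(predictions) == np.array(y_test)))
```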
Step 702c: Obtain the global optimal flag value of the L specified identifiers according to a preset search algorithm based on the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, and update the flag value of each specified identifier.
Among the L specified identifiers, the flag value of each specified identifier trains one multi-layer classifier; the classification apparatus may therefore search the flag values of the L specified identifiers according to the preset search algorithm to find the global optimal flag value of the L specified identifiers, thereby obtaining the multi-layer classifier corresponding to the global optimal flag value. The preset search algorithm may be determined in advance by the classification apparatus, which is not limited in this embodiment of the present invention.
Optionally, the classification apparatus may obtain the global optimal flag value by using the search method provided in the following steps 702c-1 to 702c-4:
Step 702c-1: Calculate the average optimal flag value of the L specified identifiers according to the flag value of each specified identifier.
The classification apparatus may calculate the average optimal flag value of the L specified identifiers according to the flag value of each specified identifier by applying the following formula:
mbest(t) = (1/L) * (P1(t) + P2(t) + … + PL(t)), whose n-th component is (1/L) * (P1,n(t) + P2,n(t) + … + PL,n(t));
where mbest denotes the average optimal flag value of the L specified identifiers, L denotes the number of specified identifiers, t denotes the current iteration number, n denotes the number of dimensions of the flag value of a specified identifier, and Pi,n(t) denotes the value of the n-th flag bit in the flag value of the i-th specified identifier.
Step 702c-2: Obtain the optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier.
For each specified identifier, the classification apparatus may use the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as the fitness value of the specified identifier, and use the fitness values to adjust and evolve the flag values of the L specified identifiers; according to the current fitness value of the specified identifier and the historical optimal fitness value of the specified identifier, the optimal flag value of the specified identifier is updated to obtain the updated optimal flag value of the specified identifier, and according to the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers, the global optimal flag value of the L specified identifiers is updated to obtain the updated global optimal flag value.
The optimal fitness value of a specified identifier refers to the largest fitness value among the fitness values corresponding to the multiple flag values of the specified identifier over multiple iterations, and the optimal flag value of a specified identifier refers to the flag value with the largest fitness value among the multiple flag values of the specified identifier over multiple iterations.
When updating the optimal flag value of a specified identifier, the classification apparatus obtains the current fitness value and the historical optimal fitness value of the specified identifier; if the current fitness value of the specified identifier is greater than the historical optimal fitness value of the specified identifier, the current flag value of the specified identifier is used as the optimal flag value of the specified identifier, and if the current fitness value of the specified identifier is not greater than the historical optimal fitness value of the specified identifier, the optimal flag value of the specified identifier remains unchanged and is still the historical optimal flag value of the specified identifier.
The global optimal fitness value of the L specified identifiers refers to the largest fitness value among the optimal fitness values of the L specified identifiers, and the global optimal flag value of the L specified identifiers refers to the flag value with the largest fitness value among the multiple flag values of all specified identifiers over multiple iterations.
When updating the global optimal flag value of the L specified identifiers, the classification apparatus obtains the current fitness value of the specified identifier and the historical global optimal fitness value of the L specified identifiers; if the current fitness value of the specified identifier is greater than the historical global optimal fitness value of the L specified identifiers, the current flag value of the specified identifier is used as the global optimal flag value of the L specified identifiers, and if the current fitness value of the specified identifier is not greater than the historical global optimal fitness value of the L specified identifiers, the global optimal flag value of the L specified identifiers remains unchanged and is still the historical global optimal flag value of the L specified identifiers.
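A small sketch of these two update rules (assuming per-identifier records of the current flag value and fitness plus the historical bests; names are illustrative):
```python
def update_bests(flag_value, fitness, best_flag, best_fitness, global_flag, global_fitness):
    """Update one identifier's optimal flag value and the swarm's global optimal flag value."""
    if fitness > best_fitness:                 # personal (per-identifier) best update
        best_flag, best_fitness = flag_value, fitness
    if fitness > global_fitness:               # global best update over all L identifiers
        global_flag, global_fitness = flag_value, fitness
    return best_flag, best_fitness, global_flag, global_fitness
```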
Step 702c-3: Calculate the local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value.
Specifically, the classification apparatus may calculate the local attractor of the specified identifier according to the optimal flag value of the specified identifier and the global optimal flag value by applying the following formula:
p(t) = φ*Pi(t) + (1-φ)*Pg(t);
where φ is a random number uniformly distributed on (0, 1), p(t) denotes the local attractor of the specified identifier, Pi(t) denotes the optimal flag value of the specified identifier, and Pg(t) denotes the global optimal flag value of the L specified identifiers.
Step 702c-4: Update the flag value of each specified identifier according to a preset update algorithm based on the local attractor of each specified identifier and the average optimal flag value.
Specifically, the classification apparatus may update the flag value of the specified identifier according to the local attractor of the specified identifier and the average optimal flag value by applying the following formula:
x(t+1) = p(t) ± β*abs(mbest - x(t))*ln(1/u);
where x(t+1) denotes the updated flag value of the specified identifier, p(t) denotes the local attractor of the specified identifier, β denotes the contraction-expansion coefficient, mbest denotes the average optimal flag value, x(t) denotes the flag value of the specified identifier before the update, and u is a random number uniformly distributed on (0, 1), u ~ U(0, 1).
Further, β = 0.5*(Maxiter - count)/Maxiter + 0.5, where Maxiter denotes the preset maximum number of iterations and count denotes the current iteration number.
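A minimal numpy sketch of one such update, following the formulas above literally (the local attractor is computed as in step 702c-3; how the continuous result is mapped back to binary flag bits is not specified in this passage, so the sketch stops at the continuous update, and the ± sign is chosen with equal probability as an assumption; names are illustrative):
```python
import numpy as np

def qpso_update(x, p_best, g_best, m_best, count, max_iter, rng=np.random.default_rng()):
    """One position update of a specified identifier per the formulas above.

    x, p_best, g_best, m_best: current position, personal optimal, global optimal,
    and average optimal flag values, given as numeric vectors over the flag-bit dimensions.
    """
    phi = rng.uniform(0.0, 1.0, size=x.shape)            # phi ~ U(0, 1)
    p = phi * p_best + (1.0 - phi) * g_best              # local attractor p(t)
    beta = 0.5 * (max_iter - count) / max_iter + 0.5     # contraction-expansion coefficient
    u = rng.uniform(0.0, 1.0, size=x.shape)              # u ~ U(0, 1)
    sign = np.where(rng.uniform(size=x.shape) < 0.5, 1.0, -1.0)  # the ± choice (assumed 50/50)
    return p + sign * beta * np.abs(m_best - x) * np.log(1.0 / u)
```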
After the flag value of each specified identifier is updated, the flag value of each specified identifier changes; the classification apparatus then repeats the foregoing steps 702b to 702c based on the updated flag value of each specified identifier, stopping when the calculated global optimal flag value satisfies a second preset condition, and obtains the multi-layer classifier trained from the global optimal flag value.
The second preset condition may include at least one of the following: the global optimal flag value reaches a preset optimal flag value; the global optimal fitness value corresponding to the global optimal flag value reaches a maximum fitness value; the difference between the global optimal fitness value corresponding to the global optimal flag value and the global optimal fitness value corresponding to the previously obtained global optimal flag value is smaller than a preset global difference; the current number of iterations reaches the maximum number of iterations; or the classification accuracy of the multi-layer classifier trained from the global optimal flag value reaches a preset accuracy. This is not limited in this embodiment of the present invention.
Step 703: The classification apparatus classifies the semantic information of a video to be classified based on the multi-layer classifier.
After finding the global optimal flag value, the classification apparatus obtains the multi-layer classifier trained from the global optimal flag value and obtains the input weights and output weights of the multi-layer classifier; the multi-layer classifier can then be used to classify the semantic information of a video. When the classification apparatus obtains a video to be classified, it uses the method shown in the foregoing steps 301-303 to extract the lip language feature vector of each video sub-block in the video, inputs the lip language feature vector of each video sub-block into the multi-layer classifier, and calculates the classification result according to the input weights and output weights trained for the multi-layer classifier, thereby implementing the lip language recognition process for the video and obtaining the semantic information of the video.
In the method provided by this embodiment of the present invention, the video is divided into M time sub-blocks according to the time dimension; according to the spatial dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks, and the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block form one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks in the video are combined to obtain the lip language feature vector of the video. The number of video sub-blocks obtained from different videos is the same, so the dimensions of the finally extracted lip language feature vectors of the videos are the same, which fixes the feature dimension; the feature dimension does not need to be dynamically adjusted, which simplifies the operation and saves training time, and when the trained multi-layer classifier is applied to classify videos, the feature dimension likewise does not need to be dynamically adjusted, which simplifies the operation, saves classification time, and improves classification accuracy. Moreover, after the local texture feature matrix of a video sub-block is extracted, the local texture feature matrix is projected to obtain the lip language feature vector, which enhances the robustness of the lip language feature. In addition, the video sub-blocks are selected by applying the preset selection policy, and video sub-blocks that describe the lip language information well are picked out, which overcomes the drawback that the lip language information differs among the video sub-blocks of the block-based PLBP feature, reduces redundant features, and improves the calculation speed and classification accuracy.
Addressing the drawbacks that current lip language features have non-fixed dimensions and contain a large amount of redundant information, this embodiment of the present invention innovatively proposes a video division method and a descriptive operator, PLBP, which can effectively increase the spatial and temporal information in the video while ensuring the integrity of the information, represent videos of non-fixed length with features of fixed dimension, and describe spatio-temporal features well, greatly facilitating the optimization of the later recognition algorithm and thereby significantly improving the lip-reading recognition rate in the subsequent lip language recognition stage.
This embodiment of the present invention combines the BQPSO (Binary Quantum Particle Swarm Optimization) algorithm with the PELM algorithm: the constructed specified identifiers are used as particles, BQPSO is used as the search algorithm for selecting feature combinations, the classification accuracy of the PELM multi-layer classifier on the sample videos is used as the fitness value, the flag values of the L specified identifiers are continuously adjusted, and the flag value that optimizes the fitness value is searched for, thereby determining the feature combination that optimizes the fitness value. Selecting video sub-blocks with BQPSO and training and classifying with the PELM multi-layer classifier can significantly improve the speed of sample training in the lip language recognition process and achieve a higher recognition rate, which enhances applicability on mobile terminals and provides a reference for the application of other biometric recognition technologies on mobile terminals.
To demonstrate the effect of the method provided by this embodiment of the present invention, experiments were conducted with the HMM (Hidden Markov Model) algorithm of the prior art and with the method provided by this embodiment of the present invention. A total of 20 experimental commands were used; for each experimental command, 5 samples were used as training samples and 5 samples as test samples, so that 100 training samples and 100 test samples were obtained in total.
The training time required when training with the prior-art HMM algorithm and the 100 training samples, and the training time required when training with the method provided by this embodiment of the present invention and the 100 training samples, are shown in Table 1 below. The classification accuracy achieved when the classifier trained with the prior-art HMM algorithm classifies the 100 test samples, and the classification accuracy achieved when the multi-layer classifier trained with the method provided by this embodiment of the present invention classifies the 100 test samples, are shown in Table 2 below.
As can be seen from Table 1 below, the training time of the method provided by this embodiment of the present invention is below 0.1 s for almost all volunteers, while the average training time of the traditional HMM algorithm is as long as 4.538 s. As can be seen from Table 2 below, the average classification accuracy of the method provided by this embodiment of the present invention is as high as 97.2%, while the average classification accuracy of the traditional HMM algorithm is only 84.5%.
Table 1
Volunteer | Training time of the HMM algorithm (s) | Training time of the present invention (s)
1 | 8.7517 | 0.0780
2 | 3.7284 | 0.0156
3 | 5.3352 | 0.0156
4 | 1.9968 | 0.0780
5 | 2.4180 | 0.1248
6 | 7.1136 | 0.0624
7 | 8.5021 | 0.0156
8 | 3.8220 | 0.0156
9 | 1.7472 | 0.0780
10 | 1.9656 | 0.0312
Table 2
Volunteer | Classification accuracy of the HMM algorithm | Classification accuracy of the present invention
1 | 93% | 89%
2 | 87% | 95%
3 | 96% | 97%
4 | 87% | 100%
5 | 81% | 100%
6 | 84% | 98%
7 | 83% | 100%
8 | 86% | 99%
9 | 81% | 98%
10 | 67% | 96%
FIG. 9 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present invention. Referring to FIG. 9, the apparatus includes:
a dividing module 901, configured to divide a video into a first preset number of time sub-blocks according to the chronological order of the video frames in the video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region;
the dividing module 901 being further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks and form the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block into one video sub-block, where the video yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks;
a feature calculation module 902, configured to calculate the lip language feature vector of each video sub-block obtained by the dividing module 901, where the lip language feature vector describes the texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
a combination module 903, configured to combine the lip language feature vectors of the multiple video sub-blocks in the video obtained by the feature calculation module 902 to obtain the lip language feature vector of the video.
In the apparatus provided by this embodiment of the present invention, the video is divided into M time sub-blocks according to the time dimension; according to the spatial dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks, and the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block form one video sub-block, so that the video yields M×N video sub-blocks in total; the lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks in the video are combined to obtain the lip language feature vector of the video. The number of video sub-blocks obtained from different videos is the same, so the dimensions of the finally extracted lip language feature vectors of the videos are the same, which fixes the feature dimension; the feature dimension does not need to be dynamically adjusted, which simplifies the operation and saves time.
Optionally, the feature calculation module 902 includes:
an extraction unit, configured to extract an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, where the one video sub-block includes Y spatial sub-blocks;
a combination unit, configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extraction unit to obtain an X*Y order local texture feature matrix;
a decomposition unit, configured to perform singular value decomposition on the local texture feature matrix obtained by the combination unit to obtain a Y*Y order first right singular matrix;
a projection unit, configured to extract the first column vector of the first right singular matrix obtained by the decomposition unit as a projection vector; and
a calculation unit, configured to calculate the product of the local texture feature matrix obtained by the combination unit and the projection vector obtained by the projection unit to obtain the X-dimensional lip language feature vector of the video sub-block;
where Y is a positive integer, an X*Y order matrix denotes a matrix with X rows and Y columns, and a Y*Y order matrix denotes a matrix with Y rows and Y columns.
All of the foregoing optional technical solutions may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.
FIG. 10 is a schematic structural diagram of a lip language classification apparatus according to an embodiment of the present invention. Referring to FIG. 10, the apparatus includes:
a dividing module 1001, configured to: for each of D pre-selected sample videos, divide the sample video into M time sub-blocks according to the chronological order of the video frames in the sample video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region;
the dividing module 1001 being further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks and form the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block into one video sub-block, where the sample video yields M×N video sub-blocks in total, and each time sub-block includes N video sub-blocks;
a feature calculation module 1002, configured to calculate the lip language feature vector of each video sub-block obtained by the dividing module 1001, where the lip language feature vector describes the texture information of the lip region in the corresponding video sub-block; and
a training module 1003, configured to train the classification accuracy of a multi-layer classifier according to a preset training algorithm based on the lip language feature vectors of the video sub-blocks in the D sample videos obtained by the feature calculation module 1002, stopping when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain a trained multi-layer classifier, where the multi-layer classifier is used to classify the semantic information of a video;
where M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
In the apparatus provided by this embodiment of the present invention, the video is divided into M time sub-blocks according to the time dimension; according to the spatial dimension, the lip region of each video frame in each time sub-block is divided into N spatial sub-blocks, and the spatial sub-blocks corresponding to the same position in the video frames of the same time sub-block form one video sub-block, so that the video yields M×N video sub-blocks in total; the X-dimensional lip language feature vector of each video sub-block is then calculated. The number of video sub-blocks obtained from different videos is the same, so the dimensions of the finally extracted lip language feature vectors of the videos are the same, which fixes the feature dimension; the classification accuracy of the multi-layer classifier is trained according to the lip language feature vectors of the video sub-blocks in the sample videos, so the feature dimension does not need to be dynamically adjusted, which simplifies the operation and saves training time, and when the trained multi-layer classifier is applied to classify videos, the feature dimension likewise does not need to be dynamically adjusted, which simplifies the operation, saves classification time, and improves classification accuracy.
Optionally, the training module 1003 is configured to perform the following steps:
Step 1: construct L specified identifiers according to a predetermined rule, where the L specified identifiers are used to determine a corresponding number of hidden layer nodes and the selected video sub-blocks of the D sample videos; L is a positive integer and L > 1.
Step 2: obtain a flag value of each specified identifier, where each flag value includes a flag bit used to indicate the number of hidden layer nodes and a flag bit used to indicate whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier.
Step 3: for the flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos.
Step 4: according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, obtain a global optimal flag value of the L specified identifiers by using a preset search algorithm, and update the flag value of each specified identifier.
Steps 3 and 4 are repeated until the global optimal flag value satisfies a second preset condition (one possible encoding of these flag values is sketched below).
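The flag values in steps 1 to 4 act as search particles: each one encodes a candidate number of hidden layer nodes together with a candidate subset of the M×N video sub-blocks. The embodiment does not fix a concrete encoding, so the sketch below shows only one plausible layout, a fixed-width binary field for the hidden-node count followed by one selection bit per video sub-block; the bit width, the random initialization, and the function names are assumptions introduced purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def init_flag_values(L, n_sub_blocks, hidden_bits=8):
        """One flag value per specified identifier: hidden_bits bits encoding the
        hidden-node count, followed by one selection bit per video sub-block."""
        return rng.integers(0, 2, size=(L, hidden_bits + n_sub_blocks))

    def decode(flag_value, hidden_bits=8, min_hidden=1):
        """Recover (number of hidden layer nodes, boolean mask of selected sub-blocks)."""
        hidden = int("".join(map(str, flag_value[:hidden_bits])), 2) + min_hidden
        selected = flag_value[hidden_bits:].astype(bool)
        return hidden, selected

Decoding a flag value yields exactly the two quantities that steps 3 and 4 consume: a hidden-node count and a mask over the video sub-blocks of the sample videos.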
Optionally, the training module 1003 is further configured to: determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and the H selected video sub-blocks of the corresponding D sample videos, where H is a positive integer; for each of the D sample videos, combine the X-dimensional lip language feature vectors of the H selected video sub-blocks of the sample video to obtain an H×X-dimensional lip language feature vector of the sample video; combine the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X); perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden layer nodes, where the multi-layer classifier includes at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes.
Optionally, the training module 1003 is further configured to: calculate an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier; obtain an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier; calculate a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and update the flag value of each specified identifier by using a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
Optionally, the training module 1003 is further configured to: for each specified identifier, use the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as a fitness value of the specified identifier; update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and update the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value.
All of the foregoing optional technical solutions may be combined in any manner to form optional embodiments of the present invention, and details are not described herein again.
FIG. 11 is a schematic structural diagram of a feature extraction device according to an embodiment of the present invention. Referring to FIG. 11, the device includes a memory 1101 and a processor 1102. The memory 1101 is connected to the processor 1102 and stores program code, and the processor 1102 is configured to invoke the program code to perform the following operations:
dividing a video into M time sub-blocks according to the chronological order of video frames in the video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region;
dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, so that M×N video sub-blocks are obtained for the video in total, and each time sub-block includes N video sub-blocks;
calculating a lip language feature vector of each video sub-block, where the lip language feature vector describes texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
combining the X-dimensional lip language feature vectors of the M×N video sub-blocks of the video to obtain an X×M×N-dimensional lip language feature vector of the video,
where M, N, and X are all positive integers, and × denotes a product of numerical values.
The device provided by this embodiment of the present invention divides a video into M time sub-blocks along the time dimension and divides the lip region of each video frame in each time sub-block into N spatial sub-blocks along the spatial dimension; the spatial sub-blocks at the same position in the video frames of the same time sub-block form one video sub-block, so the video yields M×N video sub-blocks in total. The lip language feature vector of each video sub-block is then calculated, and the lip language feature vectors of the M×N video sub-blocks are combined into the lip language feature vector of the video. Because different videos yield the same number of video sub-blocks, the lip language feature vectors finally extracted from different videos have the same dimensionality; the feature dimension is therefore fixed and needs no dynamic adjustment, which simplifies operation and saves time.
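The fixed feature dimensionality comes from the partition itself: however many frames a video contains, it is always reduced to M×N video sub-blocks, each contributing an X-dimensional vector. The following sketch shows one way to perform that spatio-temporal partition with NumPy; splitting the N spatial sub-blocks as an N_rows × N_cols grid and the `lip_feature` callable are assumptions used only for illustration, not details fixed by this embodiment.

    import numpy as np

    def split_video(frames, M, N_rows, N_cols):
        """Partition a lip-region video into M x (N_rows*N_cols) spatio-temporal sub-blocks.

        frames: sequence of T grayscale lip-region images, each of shape (H, W).
        Returns a list of M*N sub-blocks, each an array of shape (t, h, w).
        """
        frames = np.asarray(frames)
        T, H, W = frames.shape
        time_groups = np.array_split(np.arange(T), M)     # M runs of consecutive frames
        row_groups = np.array_split(np.arange(H), N_rows) # spatial grid, N = N_rows * N_cols
        col_groups = np.array_split(np.arange(W), N_cols)
        sub_blocks = []
        for t_idx in time_groups:
            for r in row_groups:
                for c in col_groups:
                    # same spatial cell across one time sub-block = one video sub-block
                    sub_blocks.append(frames[np.ix_(t_idx, r, c)])
        return sub_blocks

    def video_feature(frames, M, N_rows, N_cols, lip_feature):
        """Concatenate the X-dim vector of every sub-block into one X*M*N-dim video feature."""
        blocks = split_video(frames, M, N_rows, N_cols)
        return np.concatenate([lip_feature(b) for b in blocks])

With M, N_rows, and N_cols fixed, `video_feature` always returns a vector of length X × M × N_rows × N_cols, which is the property the embodiment relies on.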
Optionally, the processor 1102 is further configured to invoke the program code to perform the following operations:
extracting an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, where the video sub-block includes Y spatial sub-blocks;
combining the LBP feature vectors of the Y spatial sub-blocks to obtain a local texture feature matrix of order X*Y;
performing singular value decomposition on the local texture feature matrix to obtain a first right singular matrix of order Y*Y;
extracting the first column vector of the first right singular matrix as a projection vector; and
calculating the product of the local texture feature matrix and the projection vector to obtain the X-dimensional lip language feature vector of the video sub-block,
where Y is a positive integer, a matrix of order X*Y is a matrix with X rows and Y columns, and a matrix of order Y*Y is a matrix with Y rows and Y columns (a sketch of these operations follows).
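The operations above compress the Y per-frame LBP vectors of one video sub-block into a single X-dimensional vector through a rank-one projection. A minimal sketch follows; the use of scikit-image's `local_binary_pattern`, the uniform-pattern variant, and the parameters P = 8, R = 1 (giving X = P + 2 = 10 histogram bins) are assumptions for illustration, since the embodiment does not prescribe a particular LBP configuration.

    import numpy as np
    from skimage.feature import local_binary_pattern  # assumed dependency

    def lbp_histogram(patch, P=8, R=1):
        """X-dimensional LBP histogram of one spatial sub-block (one frame patch)."""
        codes = local_binary_pattern(patch, P, R, method="uniform")
        n_bins = P + 2                          # 'uniform' LBP yields P+2 distinct codes
        hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
        return hist.astype(float)

    def sub_block_feature(video_sub_block):
        """Fuse the Y per-frame LBP vectors of one video sub-block into one X-dim vector.

        video_sub_block: array of shape (Y, h, w) -- the Y spatial sub-blocks that share
        the same spatial position inside one time sub-block.
        """
        F = np.stack([lbp_histogram(p) for p in video_sub_block], axis=1)  # X x Y matrix
        _, _, Vt = np.linalg.svd(F, full_matrices=True)  # F = U S V^T, V^T has order Y x Y
        projection = Vt.T[:, 0]                          # first column of the right singular matrix
        return F @ projection                            # X-dimensional lip language feature vector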
FIG. 12 is a schematic structural diagram of a lip language classification device according to an embodiment of the present invention. Referring to FIG. 12, the device includes a memory 1201 and a processor 1202. The memory 1201 is connected to the processor 1202 and stores program code, and the processor 1202 is configured to invoke the program code to perform the following operations:
for each of D preselected sample videos, dividing the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, where each time sub-block includes at least two consecutive video frames, and each video frame includes a lip region;
dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, so that M×N video sub-blocks are obtained for the sample video in total, and each time sub-block includes N video sub-blocks;
calculating a lip language feature vector of each video sub-block, where the lip language feature vector describes texture information of the lip region in the corresponding video sub-block; and
training the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos, and stopping when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain the trained multi-layer classifier, where the multi-layer classifier is used to classify semantic information of a video,
where M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
The device provided by this embodiment of the present invention divides a video into M time sub-blocks along the time dimension and divides the lip region of each video frame in each time sub-block into N spatial sub-blocks along the spatial dimension; the spatial sub-blocks at the same position in the video frames of the same time sub-block form one video sub-block, so the video yields M×N video sub-blocks in total, and the X-dimensional lip language feature vector of each video sub-block is then calculated. Different videos therefore yield the same number of video sub-blocks, so the lip language feature vectors finally extracted from different videos have the same dimensionality, which fixes the feature dimension. Because the classification accuracy of the multi-layer classifier is trained from the lip language feature vectors of the video sub-blocks of the sample videos, the feature dimension needs no dynamic adjustment, which simplifies operation and saves training time; when the trained multi-layer classifier is applied to classify a video, no dynamic adjustment of the feature dimension is needed either, which simplifies operation, saves classification time, and improves classification accuracy.
Optionally, the processor 1202 is further configured to invoke the program code to perform the following operations:
Step 1: construct L specified identifiers according to a predetermined rule, where the L specified identifiers are used to determine a corresponding number of hidden layer nodes and the selected video sub-blocks of the D sample videos; L is a positive integer and L > 1.
Step 2: obtain a flag value of each specified identifier, where each flag value includes a flag bit used to indicate the number of hidden layer nodes and a flag bit used to indicate whether each video sub-block is selected; different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier.
Step 3: for the flag value of each specified identifier, train the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos.
Step 4: according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, obtain a global optimal flag value of the L specified identifiers by using a preset search algorithm, and update the flag value of each specified identifier.
Steps 3 and 4 are repeated until the global optimal flag value satisfies a second preset condition.
Optionally, the processor 1202 is further configured to invoke the program code to perform the following operations:
determining the number of hidden layer nodes corresponding to the flag value of the specified identifier and the H selected video sub-blocks of the corresponding D sample videos, where H is a positive integer;
for each of the D sample videos, combining the X-dimensional lip language feature vectors of the H selected video sub-blocks of the sample video to obtain an H×X-dimensional lip language feature vector of the sample video;
combining the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X);
performing singular value decomposition on the feature matrix to obtain a second right singular matrix;
extracting, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and
training the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden layer nodes, where the multi-layer classifier includes at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes (a sketch of this training step follows these operations).
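Read together, the operations above resemble a projection-based extreme-learning-machine step: the right singular vectors of the D×(H×X) sample matrix supply the input layer weights, and only the remaining layer is fitted. The sketch below follows that reading; the sigmoid activation, the pseudo-inverse (least-squares) solution for the output weights, and the choice to let the hidden layer width equal the retained dimension are assumptions, since this passage names only an activation function and a number of hidden layer nodes without fixing how they relate to the retained dimension.

    import numpy as np

    def train_multilayer(features, targets, keep_dims):
        """Hedged sketch: SVD projection as input-layer weights, output weights by least squares.

        features: D x (H*X) matrix, one row per sample video (selected sub-block features).
        targets:  D x C one-hot matrix of semantic classes.
        """
        _, _, Vt = np.linalg.svd(features, full_matrices=False)
        W_in = Vt.T[:, :keep_dims]                    # projection matrix = input-layer weights
        H = 1.0 / (1.0 + np.exp(-(features @ W_in)))  # hidden-layer response (sigmoid assumed)
        W_out = np.linalg.pinv(H) @ targets           # output-layer weights, least squares
        return W_in, W_out

    def accuracy(features, labels, W_in, W_out):
        """Classification accuracy used as the training/fitness criterion."""
        H = 1.0 / (1.0 + np.exp(-(features @ W_in)))
        pred = np.argmax(H @ W_out, axis=1)
        return float(np.mean(pred == labels))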
Optionally, the processor 1202 is further configured to invoke the program code to perform the following operations:
calculating an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier;
obtaining an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier;
calculating a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and
updating the flag value of each specified identifier by using a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value (one possible form of this update is sketched below).
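The combination of an average optimal flag value, per-identifier optimal values, a global optimal value, and local attractors matches the update rule of quantum-behaved particle swarm optimization (QPSO). The "preset update algorithm" is not given explicitly here, so the sketch below shows a generic QPSO-style step adapted to binary flag values; the contraction–expansion coefficient `alpha` and the sigmoid re-binarization are assumptions for illustration.

    import numpy as np

    def qpso_step(flags, personal_best, global_best, alpha=0.75, rng=np.random.default_rng(2)):
        """One update of all L flag values (assumed QPSO-style rule on binary flags).

        flags, personal_best: L x B binary matrices; global_best: length-B binary vector.
        """
        L, B = flags.shape
        mean_best = personal_best.mean(axis=0)                     # average optimal flag value
        phi = rng.random((L, B))
        attractor = phi * personal_best + (1 - phi) * global_best  # local attractor per identifier
        u = rng.random((L, B))
        sign = np.where(rng.random((L, B)) < 0.5, 1.0, -1.0)
        new_pos = attractor + sign * alpha * np.abs(mean_best - flags) * np.log(1.0 / u)
        # map the real-valued position back to a binary flag value
        return (1.0 / (1.0 + np.exp(-new_pos)) > rng.random((L, B))).astype(int)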
Optionally, the processor 1202 is further configured to invoke the program code to perform the following operations:
for each specified identifier, using the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as a fitness value of the specified identifier;
updating the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and
updating the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value (see the sketch after these operations).
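The fitness bookkeeping described above is the standard personal-best / global-best update driven by classification accuracy. A short sketch follows, assuming the flag values and fitness scores are held in NumPy arrays as in the earlier sketches; the array layout and function name are illustrative assumptions.

    import numpy as np

    def update_bests(flags, fitness, pbest, pbest_fit, gbest, gbest_fit):
        """fitness[i] = accuracy of the classifier trained from flags[i] (the identifier's fitness value)."""
        improved = fitness > pbest_fit            # identifiers whose current fitness beats their history
        pbest[improved] = flags[improved]
        pbest_fit[improved] = fitness[improved]
        best = int(np.argmax(pbest_fit))          # best identifier found so far
        if pbest_fit[best] > gbest_fit:
            gbest, gbest_fit = pbest[best].copy(), pbest_fit[best]
        return pbest, pbest_fit, gbest, gbest_fit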
All of the foregoing optional technical solutions may be combined in any manner to form optional embodiments of the present invention, and details are not described herein again.
A person of ordinary skill in the art may understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (16)

  1. A feature extraction method, wherein the method comprises:
    dividing a video into M time sub-blocks according to the chronological order of video frames in the video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region;
    dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the video in total, and each time sub-block comprises N video sub-blocks;
    calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
    combining the X-dimensional lip language feature vectors of the M×N video sub-blocks of the video to obtain an X×M×N-dimensional lip language feature vector of the video,
    wherein M, N, and X are all positive integers, and × denotes a product of numerical values.
  2. The method according to claim 1, wherein the calculating a lip language feature vector of each video sub-block comprises:
    extracting an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, wherein the video sub-block comprises Y spatial sub-blocks;
    combining the LBP feature vectors of the Y spatial sub-blocks to obtain a local texture feature matrix of order X*Y;
    performing singular value decomposition on the local texture feature matrix to obtain a first right singular matrix of order Y*Y;
    extracting the first column vector of the first right singular matrix as a projection vector; and
    calculating the product of the local texture feature matrix and the projection vector to obtain the X-dimensional lip language feature vector of the video sub-block,
    wherein Y is a positive integer, a matrix of order X*Y is a matrix with X rows and Y columns, and a matrix of order Y*Y is a matrix with Y rows and Y columns.
  3. A lip language classification method, wherein the method comprises:
    for each of D preselected sample videos, dividing the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region;
    dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the sample video in total, and each time sub-block comprises N video sub-blocks;
    calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block; and
    training the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos, and stopping when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain the trained multi-layer classifier, wherein the multi-layer classifier is used to classify semantic information of a video,
    wherein M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
  4. The method according to claim 3, wherein the training the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos comprises:
    step 1: constructing L specified identifiers according to a predetermined rule, wherein the L specified identifiers are used to determine a corresponding number of hidden layer nodes and selected video sub-blocks of the D sample videos, L is a positive integer, and L > 1;
    step 2: obtaining a flag value of each specified identifier, wherein each flag value comprises a flag bit used to indicate the number of hidden layer nodes and a flag bit used to indicate whether each video sub-block is selected, different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier;
    step 3: for the flag value of each specified identifier, training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos;
    step 4: obtaining a global optimal flag value of the L specified identifiers by using a preset search algorithm according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, and updating the flag value of each specified identifier; and
    repeating steps 3 and 4 until the global optimal flag value satisfies a second preset condition.
  5. The method according to claim 4, wherein the training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos comprises:
    determining the number of hidden layer nodes corresponding to the flag value of the specified identifier and H selected video sub-blocks of the corresponding D sample videos, wherein H is a positive integer;
    for each of the D sample videos, combining the X-dimensional lip language feature vectors of the H selected video sub-blocks of the sample video to obtain an H×X-dimensional lip language feature vector of the sample video;
    combining the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X);
    performing singular value decomposition on the feature matrix to obtain a second right singular matrix;
    extracting, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and
    training the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden layer nodes, wherein the multi-layer classifier comprises at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes.
  6. The method according to claim 4, wherein the obtaining a global optimal flag value of the L specified identifiers by using a preset search algorithm according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, and updating the flag value of each specified identifier comprises:
    calculating an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier;
    obtaining an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier;
    calculating a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and
    updating the flag value of each specified identifier by using a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
  7. The method according to claim 6, wherein the obtaining an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier comprises:
    for each specified identifier, using the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as a fitness value of the specified identifier;
    updating the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and
    updating the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value.
  8. A feature extraction apparatus, wherein the apparatus comprises:
    a dividing module, configured to divide a video into M time sub-blocks according to the chronological order of video frames in the video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region,
    wherein the dividing module is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and form one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the video in total, and each time sub-block comprises N video sub-blocks;
    a feature calculation module, configured to calculate a lip language feature vector of each video sub-block obtained by the dividing module, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
    a combination module, configured to combine the X-dimensional lip language feature vectors of the M×N video sub-blocks of the video obtained by the feature calculation module, to obtain an X×M×N-dimensional lip language feature vector of the video,
    wherein M, N, and X are all positive integers, and × denotes a product of numerical values.
  9. The apparatus according to claim 8, wherein the feature calculation module comprises:
    an extraction unit, configured to extract an X-dimensional local binary pattern (LBP) feature vector of each spatial sub-block in one video sub-block, wherein the video sub-block comprises Y spatial sub-blocks;
    a combination unit, configured to combine the LBP feature vectors of the Y spatial sub-blocks obtained by the extraction unit, to obtain a local texture feature matrix of order X*Y;
    a decomposition unit, configured to perform singular value decomposition on the local texture feature matrix obtained by the combination unit, to obtain a first right singular matrix of order Y*Y;
    a projection unit, configured to extract the first column vector of the first right singular matrix obtained by the decomposition unit as a projection vector; and
    a calculation unit, configured to calculate the product of the local texture feature matrix obtained by the combination unit and the projection vector obtained by the projection unit, to obtain the X-dimensional lip language feature vector of the video sub-block,
    wherein Y is a positive integer, a matrix of order X*Y is a matrix with X rows and Y columns, and a matrix of order Y*Y is a matrix with Y rows and Y columns.
  10. A lip language classification apparatus, wherein the apparatus comprises:
    a dividing module, configured to: for each of D preselected sample videos, divide the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region,
    wherein the dividing module is further configured to divide the lip region of each video frame in each time sub-block into N spatial sub-blocks, and form one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the sample video in total, and each time sub-block comprises N video sub-blocks;
    a feature calculation module, configured to calculate a lip language feature vector of each video sub-block obtained by the dividing module, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block; and
    a training module, configured to train the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos obtained by the feature calculation module, and stop when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain the trained multi-layer classifier, wherein the multi-layer classifier is used to classify semantic information of a video,
    wherein M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
  11. The apparatus according to claim 10, wherein the training module is configured to perform the following steps:
    step 1: constructing L specified identifiers according to a predetermined rule, wherein the L specified identifiers are used to determine a corresponding number of hidden layer nodes and selected video sub-blocks of the D sample videos, L is a positive integer, and L > 1;
    step 2: obtaining a flag value of each specified identifier, wherein each flag value comprises a flag bit used to indicate the number of hidden layer nodes and a flag bit used to indicate whether each video sub-block is selected, different flag values correspond to different numbers of hidden layer nodes or to different selected video sub-blocks, and each flag value is used to train one corresponding multi-layer classifier;
    step 3: for the flag value of each specified identifier, training the classification accuracy of the multi-layer classifier corresponding to the flag value according to the number of hidden layer nodes corresponding to the flag value of the specified identifier and the lip language feature vectors of the selected video sub-blocks of the corresponding D sample videos;
    step 4: obtaining a global optimal flag value of the L specified identifiers by using a preset search algorithm according to the flag value of each specified identifier and the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier, and updating the flag value of each specified identifier; and
    repeating steps 3 and 4 until the global optimal flag value satisfies a second preset condition.
  12. The apparatus according to claim 11, wherein the training module is further configured to: determine the number of hidden layer nodes corresponding to the flag value of the specified identifier and H selected video sub-blocks of the corresponding D sample videos, wherein H is a positive integer; for each of the D sample videos, combine the X-dimensional lip language feature vectors of the H selected video sub-blocks of the sample video to obtain an H×X-dimensional lip language feature vector of the sample video; combine the H×X-dimensional lip language feature vectors of the D sample videos to obtain a feature matrix of order D*(H×X); perform singular value decomposition on the feature matrix to obtain a second right singular matrix; extract, from the second right singular matrix, the column vectors corresponding to a preset retained dimension as a projection matrix; and train the classification accuracy of the multi-layer classifier based on the projection matrix, an activation function, and the number of hidden layer nodes, wherein the multi-layer classifier comprises at least input layer nodes and at least one hidden layer node, and the projection matrix represents the weights of the input layer nodes.
  13. The apparatus according to claim 11, wherein the training module is further configured to: calculate an average optimal flag value of the L specified identifiers according to the flag value of each specified identifier; obtain an optimal flag value of each specified identifier and the global optimal flag value according to the classification accuracy of the multi-layer classifier trained from the flag value of each specified identifier; calculate a local attractor of each specified identifier according to the optimal flag value of each specified identifier and the global optimal flag value; and update the flag value of each specified identifier by using a preset update algorithm according to the local attractor of each specified identifier and the average optimal flag value.
  14. The apparatus according to claim 13, wherein the training module is further configured to: for each specified identifier, use the classification accuracy of the multi-layer classifier trained from the flag value of the specified identifier as a fitness value of the specified identifier; update the optimal flag value of the specified identifier according to the current fitness value of the specified identifier and a historical optimal fitness value of the specified identifier, to obtain an updated optimal flag value of the specified identifier; and update the global optimal flag value of the L specified identifiers according to the current fitness value of the specified identifier and a historical global optimal fitness value of the L specified identifiers, to obtain an updated global optimal flag value.
  15. A feature extraction device, wherein the device comprises a memory and a processor, the memory is connected to the processor, the memory stores program code, and the processor is configured to invoke the program code to perform the following operations:
    dividing a video into M time sub-blocks according to the chronological order of video frames in the video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region;
    dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the video in total, and each time sub-block comprises N video sub-blocks;
    calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block, and the lip language feature vector of each video sub-block is an X-dimensional vector; and
    combining the X-dimensional lip language feature vectors of the M×N video sub-blocks of the video to obtain an X×M×N-dimensional lip language feature vector of the video,
    wherein M, N, and X are all positive integers, and × denotes a product of numerical values.
  16. A lip language classification device, wherein the device comprises a memory and a processor, the memory is connected to the processor, the memory stores program code, and the processor is configured to invoke the program code to perform the following operations:
    for each of D preselected sample videos, dividing the sample video into M time sub-blocks according to the chronological order of video frames in the sample video, wherein each time sub-block comprises at least two consecutive video frames, and each video frame comprises a lip region;
    dividing the lip region of each video frame in each time sub-block into N spatial sub-blocks, and forming one video sub-block from the spatial sub-blocks at the same position in the video frames of the same time sub-block, wherein M×N video sub-blocks are obtained for the sample video in total, and each time sub-block comprises N video sub-blocks;
    calculating a lip language feature vector of each video sub-block, wherein the lip language feature vector describes texture information of the lip region in the corresponding video sub-block; and
    training the classification accuracy of a multi-layer classifier by using a preset training algorithm according to the lip language feature vectors of the video sub-blocks of the D sample videos, and stopping when the classification accuracy of the multi-layer classifier satisfies a first preset condition, to obtain the trained multi-layer classifier, wherein the multi-layer classifier is used to classify semantic information of a video,
    wherein M, N, and D are all positive integers, D > 1, and × denotes a product of numerical values.
PCT/CN2015/081824 2015-06-18 2015-06-18 Feature extraction method, lip-reading classification method, device and apparatus WO2016201679A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/081824 WO2016201679A1 (en) 2015-06-18 2015-06-18 Feature extraction method, lip-reading classification method, device and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/081824 WO2016201679A1 (en) 2015-06-18 2015-06-18 Feature extraction method, lip-reading classification method, device and apparatus

Publications (1)

Publication Number Publication Date
WO2016201679A1 true WO2016201679A1 (en) 2016-12-22

Family

ID=57544670

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/081824 WO2016201679A1 (en) 2015-06-18 2015-06-18 Feature extraction method, lip-reading classification method, device and apparatus

Country Status (1)

Country Link
WO (1) WO2016201679A1 (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020161582A1 (en) * 2001-04-27 2002-10-31 International Business Machines Corporation Method and apparatus for presenting images representative of an utterance with corresponding decoded speech
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation
CN103092329A (en) * 2011-10-31 2013-05-08 南开大学 Lip reading technology based lip language input method
CN104680144A (en) * 2015-03-02 2015-06-03 华为技术有限公司 Lip language recognition method and device based on projection extreme learning machine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ONG, E.J. ET AL.: "Learning Temporal Signatures for Lip Reading", 2011 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, 31 December 2011 (2011-12-31), pages 958 - 965, XP032095352 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598189A (en) * 2018-10-17 2019-04-09 天津大学 A kind of video classification methods based on Feature Dimension Reduction
CN109598189B (en) * 2018-10-17 2023-04-28 天津大学 Feature dimension reduction-based video classification method
CN111460880A (en) * 2019-02-28 2020-07-28 杭州芯影科技有限公司 Multimodal biometric fusion method and system
CN111460880B (en) * 2019-02-28 2024-03-05 杭州芯影科技有限公司 Multimode biological feature fusion method and system
CN110113319A (en) * 2019-04-16 2019-08-09 深圳壹账通智能科技有限公司 Identity identifying method, device, computer equipment and storage medium
CN111988652A (en) * 2019-05-23 2020-11-24 北京地平线机器人技术研发有限公司 Method and device for extracting lip language training data
CN110929239B (en) * 2019-10-30 2021-11-19 中科南京人工智能创新研究院 Terminal unlocking method based on lip language instruction
CN110929239A (en) * 2019-10-30 2020-03-27 中国科学院自动化研究所南京人工智能芯片创新研究院 Terminal unlocking method based on lip language instruction
CN111062277A (en) * 2019-12-03 2020-04-24 东华大学 Sign language-lip language conversion method based on monocular vision
CN111062277B (en) * 2019-12-03 2023-07-11 东华大学 Sign language-lip language conversion method based on monocular vision
CN113076942A (en) * 2020-01-03 2021-07-06 上海依图网络科技有限公司 Method, device, chip and computer readable storage medium for detecting preset mark
CN111582195B (en) * 2020-05-12 2024-01-26 中国矿业大学(北京) Construction method of Chinese lip language monosyllabic recognition classifier
CN111582195A (en) * 2020-05-12 2020-08-25 中国矿业大学(北京) Method for constructing Chinese lip language monosyllabic recognition classifier
CN111612056A (en) * 2020-05-16 2020-09-01 青岛鼎信通讯股份有限公司 Low-pressure customer variation relation identification method based on fuzzy clustering and zero-crossing offset
CN111612056B (en) * 2020-05-16 2023-06-02 青岛鼎信通讯股份有限公司 Low-voltage user variable relation recognition method based on fuzzy clustering and zero crossing offset


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15895252

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15895252

Country of ref document: EP

Kind code of ref document: A1