CN114022938A - Method, device, equipment and storage medium for viseme recognition

Method, device, equipment and storage medium for viseme recognition

Info

Publication number
CN114022938A
Authority
CN
China
Prior art keywords
mouth
image
sequence
viseme
image sequence
Prior art date
Legal status
Pending
Application number
CN202111334252.4A
Other languages
Chinese (zh)
Inventor
安入东
丁彧
王钇翔
吕唐杰
范长杰
胡志鹏
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202111334252.4A
Publication of CN114022938A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a method, a device, equipment and a storage medium for viseme recognition. The method comprises: acquiring an image sequence containing a mouth region that corresponds to a target video to be recognized; acquiring a mouth feature sequence corresponding to the mouth region from the image sequence; identifying, according to the mouth feature sequence and a preset temporal convolutional network, a target image belonging to a viseme change boundary in the image sequence; obtaining viseme boundary features of the target video from the mouth feature sequence and the target image; and obtaining a viseme boundary classification result for the target video from the viseme boundary features and a preset first classifier. By introducing this new recognition branch, the viseme boundary features of the image sequence are identified from its temporal features and the viseme boundaries are finely classified, which improves the accuracy of viseme recognition.

Description

Method, device, equipment and storage medium for viseme recognition
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for viseme recognition.
Background
Viseme recognition is a sub-field of VSR (Visual Speech Recognition). Its goal is to recognize, from the mouth image sequence extracted from an input talking-face video, the viseme corresponding to the mouth shape in each video frame. It can serve as an alternative way to perform speech recognition on video in speaking scenarios where the audio signal is weak or heavily corrupted by noise, and thus to recognize what the person in the video is saying.
Compared with word- or sentence-level recognition, viseme recognition does not depend on a specific language or vocabulary; as long as the same phoneme set is shared, the spoken content can be recognized, which makes cross-language recognition of speech content possible. The technique can be applied to lip reading, assessment of mouth-shape generation quality, synthesis of talking mouth-shape videos and similar fields, and can effectively reduce labor and material costs while meeting the needs of these application scenarios. Viseme recognition is still in its infancy: related research is scarce and existing schemes perform poorly, with many defects and shortcomings. For example, the error rate of viseme recognition at viseme change boundaries in a video image sequence is very high, and the predictions often drift by several frames, which limits the overall accuracy of viseme recognition.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, a device, and a storage medium for viseme recognition, in which a new recognition branch is introduced, the viseme boundary features of an image sequence are identified from its temporal features, and the viseme boundaries are finely classified, thereby improving the accuracy of viseme recognition.
A first aspect of the embodiments of the present application provides a method for viseme recognition, including: acquiring an image sequence containing a mouth region that corresponds to a target video to be recognized; acquiring a mouth feature sequence corresponding to the mouth region from the image sequence; identifying, according to the mouth feature sequence and a preset temporal convolutional network, a target image belonging to a viseme change boundary in the image sequence; obtaining viseme boundary features of the target video from the mouth feature sequence and the target image; and obtaining a viseme boundary classification result for the target video from the viseme boundary features and a preset first classifier.
In an embodiment, acquiring the mouth feature sequence corresponding to the mouth region from the image sequence includes: acquiring, based on the image sequence and a preset first convolutional neural network, initial short-time-domain mouth features corresponding to the mouth region; obtaining, based on the initial mouth features and a preset second convolutional neural network, the mouth feature corresponding to the mouth region of each frame of the image sequence, and obtaining the mouth feature sequence from the mouth features corresponding to the mouth regions of the individual frames; wherein the second convolutional neural network is a network containing convolution kernels generated adaptively per spatial position.
In an embodiment, the method further comprises: obtaining, from the mouth feature sequence and the preset temporal convolutional network, mouth features corresponding to different time-domain ranges; and obtaining a viseme classification result for the target video from the mouth features of the different time-domain ranges, the viseme boundary classification result and a preset second classifier.
In an embodiment, identifying, according to the mouth feature sequence and the preset temporal convolutional network, the target image belonging to a viseme change boundary in the image sequence includes: obtaining, from the mouth feature sequence and the preset temporal convolutional network, the confidence that each frame of the image sequence is a target image at a viseme change boundary; and taking the images whose confidence is higher than a preset threshold as the target images.
In an embodiment, obtaining the viseme boundary features of the target video from the mouth feature sequence and the target image includes: acquiring a mask of the target image; and obtaining the viseme boundary features of the target video from the mask of the target image and the mouth features corresponding to the target image in the mouth feature sequence.
In an embodiment, the method further comprises: acquiring an initial sample image sequence and computing statistics of the viseme distribution in the sample image sequence; according to the viseme distribution information, fusing the viseme features whose share of the distribution is smaller than a preset threshold onto subjects with different identity information in the sample image sequence to obtain a face-swapped sample image sequence, and taking the initial sample image sequence together with the face-swapped sample image sequence as the final sample image sequence; and training a neural network model with the final sample image sequence to obtain the classifiers, the classifiers including the first classifier and the second classifier.
In an embodiment, acquiring the image sequence containing the mouth region that corresponds to the target video to be recognized includes: performing face recognition on each frame of the target video, and cropping the image sequence containing the mouth region out of the recognized face images.
A second aspect of the embodiments of the present application provides a device for viseme recognition, including: a first acquisition module for acquiring an image sequence containing a mouth region that corresponds to a target video to be recognized; a first extraction module for acquiring a mouth feature sequence corresponding to the mouth region from the image sequence; a first recognition module for identifying, according to the mouth feature sequence and a preset temporal convolutional network, a target image belonging to a viseme change boundary in the image sequence; a second recognition module for obtaining viseme boundary features of the target video from the mouth feature sequence and the target image; and a first classification module for obtaining a viseme boundary classification result for the target video from the viseme boundary features and a preset first classifier.
In one embodiment, the first extraction module is configured to: acquire, based on the image sequence and a preset first convolutional neural network, initial short-time-domain mouth features corresponding to the mouth region; and obtain, based on the initial mouth features and a preset second convolutional neural network, the mouth feature corresponding to the mouth region of each frame of the image sequence, the mouth feature sequence being obtained from the mouth features of the individual frames; wherein the second convolutional neural network is a network containing convolution kernels generated adaptively per spatial position.
In one embodiment, the device further comprises: a second extraction module for obtaining, from the mouth feature sequence and the preset temporal convolutional network, mouth features corresponding to different time-domain ranges; and a second classification module for obtaining a viseme classification result for the target video from the mouth features of the different time-domain ranges, the viseme boundary classification result and a preset second classifier.
In one embodiment, the first recognition module is configured to: obtain, from the mouth feature sequence and a preset temporal convolutional network, the confidence that each frame of the image sequence is a target image at a viseme change boundary; and take the images whose confidence is higher than a preset threshold as the target images.
In one embodiment, the second recognition module is configured to: acquire a mask of the target image; and obtain the viseme boundary features of the target video from the mask of the target image and the mouth features corresponding to the target image in the mouth feature sequence.
In one embodiment, the device further comprises: a second acquisition module for acquiring an initial sample image sequence and computing statistics of the viseme distribution in the sample image sequence; a fusion module for fusing, according to the viseme distribution information, the viseme features whose share of the distribution is smaller than a preset threshold onto subjects with different identity information in the sample image sequence to obtain a face-swapped sample image sequence, and taking the initial sample image sequence together with the face-swapped sample image sequence as the final sample image sequence; and a training module for training a neural network model with the final sample image sequence to obtain the classifiers, the classifiers including the first classifier and the second classifier.
In one embodiment, the first acquisition module is configured to: perform face recognition on each frame of the target video, and crop the image sequence containing the mouth region out of the recognized face images.
A third aspect of embodiments of the present application provides an electronic device, including: a memory to store a computer program; a processor configured to execute the computer program to implement the method of the first aspect and any embodiment of the present application.
A fourth aspect of embodiments of the present application provides a non-transitory electronic device-readable storage medium, including: a program which, when run by an electronic device, causes the electronic device to perform the method of the first aspect of an embodiment of the present application and any embodiment thereof.
With the method, device, equipment and storage medium for viseme recognition provided by the application, features are extracted from the to-be-recognized image sequence containing the mouth region to obtain the mouth feature sequence corresponding to the mouth region; the target image belonging to a viseme change boundary in the image sequence is then identified from the mouth feature sequence and a preset temporal convolutional network, from which the viseme boundary features of the target video are obtained; finally, the viseme boundary classification result for the target video is obtained from the viseme boundary features and a preset first classifier. Fine-grained classification of the viseme boundaries is thereby achieved, and the accuracy of viseme recognition is improved.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2A is a schematic diagram of a neural network used in a method of viseme recognition according to an embodiment of the present application;
FIG. 2B is a diagram illustrating the correspondence between viseme types and phonemes according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method of viseme recognition according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method of viseme recognition according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a device for viseme recognition according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, the present embodiment provides an electronic device 1 including: at least one processor 11 and a memory 12 (fig. 1 takes one processor as an example). The processor 11 and the memory 12 are connected by a bus 10. The memory 12 stores instructions executable by the processor 11, and when the instructions are executed by the processor 11 the electronic device 1 can execute all or part of the flow of the methods in the embodiments described below, so that the accuracy of viseme recognition is improved.
In an embodiment, the electronic device 1 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, or the like.
Fig. 2A is a schematic diagram of the neural network used in the viseme recognition method according to an embodiment of the present application. The input is a video, which consists of a sequence of images that change over time, for example a sequence of mouth region-of-interest images of a given target face. The mouth region of interest (ROI) in each image is cropped out of the image sequence, and mouth feature extraction is performed on it to obtain the mouth features of each image; for example, a shallow 3D convolution followed by a deep 2D convolutional network can be used to obtain frame-by-frame mouth features. Temporal modelling is then performed: a temporal convolution module captures long-range dependencies between different frames, after which a classifier classifies each frame and outputs the predicted viseme classification result.
In a real scenario the image sequence contains many frames, and the mouth motion in each frame corresponds to one viseme feature. The viseme feature can be a mouth shape, and the viseme category of a mouth shape can be determined from the mouth shape and a preset mouth-shape-to-phoneme correspondence table. Fig. 2B shows such a table of mouth shapes and phonemes: the first row lists the phoneme groups (Sil/sp, B/M/P, F/V, DH/TH, D/S/T/Z/L/N, CH/JH/SH/ZH, G/K/NG/HH, AA/AH/AW/AY, IH/IY, R, EH/EY, AO/OW/OY), and the second row shows the corresponding visemes (i.e. the mouth shape with which each group of phonemes is pronounced). Each mouth shape can correspond to several phonemes, so if mouth-shape recognition is not accurate enough, the final viseme recognition result will not be accurate either. Moreover, the image sequence contains many mouth motions that lie between two mouth shapes. For example, if the mouth shape of the first frame is "AA" and the mouth shape of the third frame is "OW", the second frame is a transition image in which the person's mouth is changing from "AA" to "OW", i.e. a boundary image lying on a viseme change boundary. Compared with the mouth shapes in the first and third frames, the mouth shape in this transition image is easily misrecognized as "AA" or "OW". Such an image is a viseme boundary image, and the recognition accuracy on viseme boundary images directly affects the viseme recognition rate of the whole image sequence.
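The phoneme-to-viseme grouping of fig. 2B can be written out as a simple lookup table, as in the Python sketch below. The grouping follows the listing above; the "Sil/sp" spelling of the silence group and the reverse-lookup helper are assumptions of this sketch, not part of the patent text.

```python
# Phoneme groups from fig. 2B; each group shares one viseme (mouth shape).
# The dictionary layout and the reverse lookup are illustrative only.
VISEME_GROUPS = {
    "Sil/sp": ["SIL", "SP"],
    "B/M/P": ["B", "M", "P"],
    "F/V": ["F", "V"],
    "DH/TH": ["DH", "TH"],
    "D/S/T/Z/L/N": ["D", "S", "T", "Z", "L", "N"],
    "CH/JH/SH/ZH": ["CH", "JH", "SH", "ZH"],
    "G/K/NG/HH": ["G", "K", "NG", "HH"],
    "AA/AH/AW/AY": ["AA", "AH", "AW", "AY"],
    "IH/IY": ["IH", "IY"],
    "R": ["R"],
    "EH/EY": ["EH", "EY"],
    "AO/OW/OY": ["AO", "OW", "OY"],
}

# reverse map: phoneme -> viseme class index
PHONEME_TO_VISEME = {p: i for i, (_, phones) in enumerate(VISEME_GROUPS.items()) for p in phones}

assert PHONEME_TO_VISEME["AW"] == PHONEME_TO_VISEME["AA"]   # same mouth shape
```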
The neural network of fig. 2A introduces an additional branch: the image sequence carrying mouth features and temporal features is fed into a boundary recognition network, which identifies one or more boundary images lying on viseme change boundaries in the sequence and marks them. Based on the marked boundary images, the mouth features of each boundary frame are extracted from the feature sequence and input into the corresponding classifier, which outputs a boundary viseme classification result. This boundary viseme classification result can be used as supervision information for the recognition process of the whole neural network, pushing the final viseme classification result to be more accurate.
Please refer to fig. 3, which shows a method for viseme recognition according to an embodiment of the present application. The method can be executed by the electronic device 1 shown in fig. 1 acting as the current terminal, and can be applied to the neural network shown in figs. 2A-2B to achieve fine-grained classification of viseme boundaries and thereby improve the recognition accuracy at those boundaries. The method comprises the following steps:
step 301: and acquiring an image sequence containing a mouth region corresponding to the target video to be identified.
In this step, the target video to be recognized may be a clip of a person speaking, obtained either by shooting with an image capture device or directly by reading it from a video database or another device. The target video contains images of the mouth moving while the person speaks, and frame-by-frame image recognition can be performed on it to obtain an image sequence containing the mouth region.
In one embodiment, step 301 may include: performing face recognition on each frame of the target video, and cropping an image sequence containing the mouth region out of the face images. A face detector can be used to detect and crop the mouth image sequence from the face in each frame. For example, the dlib library (an open-source machine learning library) can be used to perform face recognition on each frame and locate the facial key points, after which a 96 x 96 mouth region-of-interest image is cropped centred on the lips. The mouth region-of-interest image can also be shifted randomly, yielding mouth-region images with different position distributions and expanding the diversity of the mouth-region image sequence.
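As a concrete illustration of this cropping step, the sketch below uses the dlib face detector and its 68-point landmark model to cut a lip-centred 96 x 96 region of interest with a small random shift. The model file name, the shift range and the lip landmark indices are assumptions of this sketch rather than values given by the application.

```python
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed landmark model

def crop_mouth_roi(gray_frame, roi_size=96, max_shift=6, rng=None):
    """Return a roi_size x roi_size crop centred (with an optional random shift) on the lips."""
    rng = rng or np.random.default_rng()
    faces = detector(gray_frame, 1)
    if not faces:
        return None
    shape = predictor(gray_frame, faces[0])
    # landmarks 48-67 outline the outer and inner lips in the 68-point scheme
    lip_pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = lip_pts.mean(axis=0).astype(int)
    # random offset increases the positional diversity of the mouth crops
    cx += int(rng.integers(-max_shift, max_shift + 1))
    cy += int(rng.integers(-max_shift, max_shift + 1))
    half = roi_size // 2
    h, w = gray_frame.shape[:2]
    x0 = int(np.clip(cx - half, 0, w - roi_size))
    y0 = int(np.clip(cy - half, 0, h - roi_size))
    return gray_frame[y0:y0 + roi_size, x0:x0 + roi_size]
```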
Step 302: and acquiring a mouth feature sequence corresponding to the mouth region according to the image sequence.
In this step, the mouth feature sequence corresponds one-to-one to each frame of image in the image sequence. Mouth features are spatial features that contain temporal feature information. The spatial characteristics represent the variation characteristics of the mouth movement in space, such as the position of the mouth shape, the size of the mouth shape and the like. The time-series characteristic information represents the time-dependent information of the mouth motion in different frame images, such as the temporal correlation of the mouth motion of the current frame image and the previous frame image.
Step 303: identifying a target image belonging to a viseme change boundary in the image sequence according to the mouth feature sequence and a preset temporal convolutional network.
In this step, in a real scenario the image sequence contains multiple frames, the mouth motion of each frame can correspond to one viseme feature, the viseme feature can be a mouth shape, and the viseme category of a mouth shape can be identified through the mouth shape and a preset mouth-shape-to-phoneme correspondence table. However, there is often a lot of mouth motion between two mouth shapes in the image sequence. For example, if the mouth shape of the first frame is "AA" and the mouth shape of the third frame is "OW", the second frame is a transition image in which the person's mouth changes from "AA" to "OW"; compared with the mouth shapes in the first and third frames, the mouth shape in this transition image is easily misrecognized as "AA" or "OW". Such a transition image is a target image belonging to a viseme change boundary. If there is only a slight change between two images but both still correspond to "AA", no viseme change boundary is counted. The recognition accuracy on viseme boundary images directly affects the viseme recognition rate of the whole image sequence. Therefore, to address the confusion and difficulty of viseme recognition at viseme-feature change boundaries, the shallow temporal features of the mouth feature sequence can be used to screen the target images belonging to viseme change boundaries out of the image sequence.
Step 304: obtaining the viseme boundary features of the target video from the mouth feature sequence and the target image.
In this step, the image sequence of a target video (e.g. a clip of a person speaking) may contain several target images belonging to viseme change boundaries; these target images can be processed and classified in a fine-grained way to obtain the mouth motion features in the viseme boundary images.
Step 305: obtaining the viseme boundary classification result for the target video from the viseme boundary features and a preset first classifier.
In this step, the first classifier is used to recognize the viseme boundary classification result of the image sequence; it can be obtained by training a classifier with labelled viseme boundary sample images. As shown in fig. 2A, the result of the boundary recognition, i.e. the viseme boundary features, is input into the first classifier, which yields the viseme boundary classification result of the whole image sequence.
With the viseme recognition method above, features are extracted from the to-be-recognized image sequence containing the mouth region to obtain the mouth feature sequence corresponding to the mouth region; the target images belonging to viseme change boundaries in the image sequence are then identified from the mouth feature sequence and a preset temporal convolutional network, from which the viseme boundary features of the target video are obtained; finally, the viseme boundary classification result for the target video is obtained from the viseme boundary features and a preset first classifier. Fine-grained classification of the viseme boundaries is thereby achieved, and the accuracy of viseme recognition is improved.
Please refer to fig. 4, which shows a method for viseme recognition according to an embodiment of the present application. The method can be executed by the electronic device 1 shown in fig. 1 acting as the current terminal and can be applied to the neural network shown in figs. 2A-2B, so that the viseme boundary features of an image sequence are identified from its temporal features and the viseme boundaries are finely classified, which improves the recognition accuracy at those boundaries. The method comprises the following steps:
step 401: and acquiring an image sequence containing a mouth region corresponding to the target video to be identified. See the description of step 301 in the above embodiments for details.
Step 402: acquiring, based on the image sequence and a preset first convolutional neural network, initial short-time-domain mouth features corresponding to the mouth region.
In this step, feature extraction may be performed on the image sequence of the mouth region with a hybrid convolutional neural network (i.e. the first convolutional neural network): the image sequence of the mouth region can be converted into a grayscale image sequence, and the whole sequence used as the input of the network. For example, in the "mouth feature extraction" part shown in fig. 2A, a single-layer 3D convolutional neural network can be used, and the temporal dependency of the mouth motion is captured with a small time-domain convolution kernel to form the initial mouth features.
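A minimal sketch of such a single-layer 3D convolutional front end is shown below (PyTorch). The kernel size, stride and channel width are illustrative assumptions; the text above only specifies a single 3D layer with a small time-domain kernel.

```python
import torch
import torch.nn as nn

frontend_3d = nn.Sequential(
    # input: (batch, 1, T, 96, 96) grayscale mouth-ROI sequence
    nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
)

clip = torch.randn(2, 1, 29, 96, 96)          # 2 clips of 29 frames each
initial_mouth_feat = frontend_3d(clip)        # (2, 64, 29, 24, 24): short-time-domain features
```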
Step 403: obtaining, based on the initial mouth features and a preset second convolutional neural network, the mouth feature corresponding to the mouth region of each frame of the image sequence, and obtaining the mouth feature sequence from the mouth features of the individual frames.
In this step, a deep convolutional neural network (i.e. the second convolutional neural network) can be used to extract the key spatial information of the mouth in each frame, so as to obtain frame-by-frame mouth region-of-interest features that carry short-time context information. The second convolutional neural network is a network containing convolution kernels generated adaptively per spatial position. For example, a multi-layer 2D deep convolutional neural network based on involution is used to obtain relatively rich semantic information from the image sequence carrying the initial mouth features. By adopting involution, which adaptively generates the convolution kernel at each spatial position, the dependencies between mouth features at different spatial positions can be modelled, and the spatial features of the mouth motion at different positions are extracted on top of the short-time-domain initial mouth features of step 402. Spatial position dependencies are thus introduced alongside the temporal dependency modelling, which greatly improves the expressive power of the model without a significant increase in the number of parameters.
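The position-adaptive convolution described here can be sketched as a simplified involution layer: a small bottleneck predicts one kernel per spatial position, which is then applied to the unfolded local neighbourhood. The group count, reduction ratio and kernel size below are assumptions for illustration, not values from the application.

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Simplified involution: kernels are generated per spatial position from the local feature."""
    def __init__(self, channels, kernel_size=7, groups=16, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # bottleneck that predicts one k*k kernel per group per position
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = self.span(self.reduce(x)).view(b, self.g, 1, self.k * self.k, h, w)
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k * self.k, h, w)
        return (kernels * patches).sum(dim=3).view(b, c, h, w)   # position-adaptive aggregation

# stand-in for the per-frame output of the 3D front end (time folded into the batch dimension)
frames = torch.randn(2 * 29, 64, 24, 24)
mouth_feat = Involution2d(64)(frames)          # (58, 64, 24, 24) frame-wise mouth features
```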
Step 404: obtaining, from the mouth feature sequence and a preset temporal convolutional network, mouth features corresponding to different time-domain ranges.
In this step, the preset temporal convolutional network can be realized by the temporal modelling module shown in fig. 2A, which extracts the temporal feature information. The temporal modelling module mainly models the multi-scale, long-range temporal dependencies of the image sequence containing the mouth region. Specifically, the mouth feature sequence obtained in step 403 is taken as input, and context information over different time-domain ranges is encoded through a multi-layer temporal convolution module, yielding a mouth feature sequence that aggregates richer context dependencies over longer time spans.
For example, a multi-scale TCN (Temporal Convolutional Network) is adopted to extract temporal information over different time ranges from the image sequence carrying the mouth features, and the long-range dependencies between the current frame and the other frames are obtained by feature concatenation and fusion. This complements the short-range dependency of the 3D network in step 402 and improves the recognition accuracy.
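The multi-scale temporal convolution can be sketched as parallel 1D convolutions with different kernel sizes whose outputs are concatenated and fused, as below. The kernel sizes, the residual connection and the number of stacked blocks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleTCNBlock(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # one 1D-convolution branch per temporal receptive field
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, k, padding=k // 2),
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        )
        self.fuse = nn.Conv1d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x):                        # x: (batch, channels, T)
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)
        return torch.relu(self.fuse(multi) + x)  # fuse the scales, keep a residual path

seq = torch.randn(2, 512, 29)                    # mouth feature sequence, 29 frames
tcn = nn.Sequential(*[MultiScaleTCNBlock(512) for _ in range(4)])
long_range_feat = tcn(seq)                       # (2, 512, 29), multi-scale temporal context
```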
Step 405: obtaining, from the mouth feature sequence and a preset temporal convolutional network, the confidence that each frame of the image sequence is a target image at a viseme change boundary.
In this step, to address the confusion and difficulty of viseme recognition at viseme-feature change boundaries, the shallow temporal features of the TCN in step 404 can be used to screen the target images lying on viseme change boundaries out of the image sequence. For example, the output features of the first TCN layer of step 404 are taken as input, passed through a 1 x 1 convolutional neural network, and the boundary confidence of each frame is regressed through a Sigmoid activation.
Step 406: taking the images whose confidence is higher than a preset threshold as the target images.
In this step, building on step 405, a BiasReLU activation function is used to filter out the frames whose confidence is less than or equal to the preset threshold; the images whose confidence is higher than the preset threshold are kept as target images.
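The boundary branch of steps 405-406 can be sketched as below: a 1 x 1 convolution plus Sigmoid regresses a per-frame boundary confidence from the shallow TCN features, and a simple threshold comparison stands in for the BiasReLU filtering mentioned above. The channel width and the threshold value are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    def __init__(self, channels, threshold=0.5):
        super().__init__()
        self.score = nn.Conv1d(channels, 1, kernel_size=1)   # 1x1 temporal convolution
        self.threshold = threshold

    def forward(self, shallow_feat):                          # (batch, channels, T)
        confidence = torch.sigmoid(self.score(shallow_feat)).squeeze(1)    # (batch, T)
        boundary_mask = (confidence > self.threshold).float()              # 1 = boundary frame
        return confidence, boundary_mask

shallow_feat = torch.randn(2, 512, 29)        # output of the first TCN layer
confidence, boundary_mask = BoundaryHead(512)(shallow_feat)
```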
Step 407: acquiring the masks of the target images.
In this step, for example, masks can be obtained for all target images selected in step 406.
Step 408: obtaining the viseme boundary features of the target video from the masks of the target images and the mouth features corresponding to the target images in the mouth feature sequence.
In this step, building on step 407, the mask of each target image is multiplied into the mouth feature sequence output by the second convolutional neural network in step 403 to obtain the mouth features of each target image. The mouth features of each target image are then input into the first classifier to obtain the viseme boundary features of the target video. Because this added branch supervises the shallow temporal features of the TCN, it can effectively improve the network's recognition accuracy on the hard samples near viseme boundaries.
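A minimal sketch of this masking and boundary classification is given below; the feature width and the number of viseme classes are illustrative assumptions, and the first classifier is reduced to a single linear layer.

```python
import torch
import torch.nn as nn

num_visemes = 12                                   # assumed number of viseme classes
first_classifier = nn.Linear(512, num_visemes)

mouth_feat_seq = torch.randn(2, 512, 29)           # frame-wise mouth features (from the 2D network)
boundary_mask = (torch.rand(2, 29) > 0.7).float()  # per-frame boundary mask from the boundary head

# zero out non-boundary frames, then classify every frame; only the masked frames
# carry non-trivial boundary features
boundary_feat = mouth_feat_seq * boundary_mask.unsqueeze(1)         # (2, 512, 29)
boundary_logits = first_classifier(boundary_feat.transpose(1, 2))   # (2, 29, num_visemes)
```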
Step 409: obtaining the viseme boundary classification result for the target video from the viseme boundary features and a preset first classifier. See the description of step 305 in the above embodiments for details.
Step 410: obtaining the viseme classification result for the target video from the mouth features of the different time-domain ranges, the viseme boundary classification result and a preset second classifier.
In this step, the viseme boundary classification result can serve as supervision information: under this supervision, the mouth features of the different time-domain ranges are input into the second classifier, which outputs the viseme classification result for the target video. The viseme boundary classification result covers the viseme features of the image sequence that are most easily misrecognized; taking it into account in the viseme recognition of the whole sequence supervises the overall recognition process and pushes the final viseme recognition result to be more accurate.
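One way to realize this supervision is to add the boundary branch's classification loss, restricted to the boundary frames, to the main classifier's loss, as in the sketch below. The per-frame cross-entropy formulation and the loss weight are assumptions, not values stated by the application.

```python
import torch
import torch.nn as nn

num_visemes = 12                                    # assumed number of viseme classes
second_classifier = nn.Linear(512, num_visemes)
per_frame_ce = nn.CrossEntropyLoss(reduction="none")

fused_feat = torch.randn(2, 29, 512)                # multi-scale temporal features, per frame
viseme_logits = second_classifier(fused_feat)       # (2, 29, num_visemes), main branch
boundary_logits = torch.randn(2, 29, num_visemes)   # from the first (boundary) classifier
boundary_mask = (torch.rand(2, 29) > 0.7).float()   # 1 = frame lies on a viseme boundary
viseme_labels = torch.randint(0, num_visemes, (2, 29))

main_loss = per_frame_ce(viseme_logits.reshape(-1, num_visemes), viseme_labels.reshape(-1)).mean()
b_ce = per_frame_ce(boundary_logits.reshape(-1, num_visemes), viseme_labels.reshape(-1))
boundary_loss = (b_ce * boundary_mask.reshape(-1)).sum() / boundary_mask.sum().clamp(min=1)
total_loss = main_loss + 0.5 * boundary_loss        # boundary branch as auxiliary supervision
```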
The first classifier is used to recognize the viseme boundary classification result of an image sequence; it can be obtained by training a classifier with sample image sequences whose viseme boundaries are labelled. The second classifier is used to recognize the viseme category of the whole image sequence; it can be obtained by training a classifier with labelled sample image sequences.
In an embodiment, the step of building the first classifier or the second classifier may include: acquiring an initial sample image sequence and computing statistics of the viseme distribution in the sample image sequence; according to the viseme distribution information, fusing the viseme features whose share of the distribution is smaller than a preset threshold onto subjects with different identity information in the sample image sequence to obtain a face-swapped sample image sequence, and taking the initial sample image sequence together with the face-swapped sample image sequence as the final sample image sequence; and training a neural network model with the final sample image sequence to obtain the corresponding classifier.
In a real scenario the initial sample image sequence may come from a relatively small data set containing very little face identity information, often only a few dozen identity IDs, so a model very easily over-fits to specific person IDs. For example, the initial sample data set may consist of a few talking videos, each video being the talking image sequence of a single person; the identity ID then refers to the person in each talking video. If there are too few identity IDs, that is, too few speakers in the whole initial sample data set, the recognition model is easily influenced by attributes such as the face shape and skin colour of the people in the data set: it performs well on those few people but generalizes poorly to videos of speakers outside the data set, i.e. its generalization ability is weak. The initial sample image sequence may also suffer from data imbalance: the numbers of samples for the different viseme classes differ greatly, some classes being particularly scarce. A model trained on classes with many samples recognizes those classes comparatively well, while in an unbalanced initial sample set some viseme features occupy only a very small share of the data; during training the recognition model then mainly learns the distribution and features of the viseme classes with many sample images, which hurts the accuracy of the final viseme recognition.
Therefore, to alleviate the data-set imbalance, the distribution of the different viseme features in the initial sample image sequence can be counted, and for the viseme features whose share of the distribution is small, an image-generation-based face-swapping technique is used to swap the faces of the person IDs in the data set, increasing the number of sample IDs. For example, the identity information of person B in the initial sample image sequence is first extracted with a pre-trained face recognition model, the attribute information (facial features) of person A is then extracted with an encoder, and finally the attribute information and the identity information are input into a decoder and fused, generating a new face with A's attributes and B's identity. Since the face swap changes only the identity information and not the attributes, in particular not the expression, the mouth shape does not change before and after the swap, although the shape of the lips changes with the new ID. Image sequence data of different IDs with the same mouth shape are thus obtained, completing the data augmentation. This effectively supplements the under-represented classes according to the distribution of the original data, mitigates the negative impact of data imbalance, generates a large number of new face IDs, reduces the risk that the recognition model over-fits to a particular ID of the training set during training, and improves the generalization ability of the model.
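The balancing step can be sketched as below: count the viseme distribution over the labelled sample sequences, pick the classes whose share falls below a threshold, and generate extra identities for sequences containing those classes with a face-swapping model. The swap model itself (identity extraction, attribute encoder and decoder) is an assumed pretrained component and is only stubbed here; the threshold value is likewise an assumption.

```python
from collections import Counter

def underrepresented_visemes(label_sequences, ratio_threshold=0.03):
    """Return the viseme classes whose share of all frame labels is below the threshold."""
    counts = Counter(v for seq in label_sequences for v in seq)
    total = sum(counts.values())
    return {v for v, n in counts.items() if n / total < ratio_threshold}

def augment_with_face_swap(samples, swap_model, rare_visemes):
    """samples: (frames, labels, identity) tuples; swap_model: assumed pretrained face swapper."""
    augmented = list(samples)                       # keep the initial sample sequences
    for frames, labels, identity in samples:
        if rare_visemes.intersection(labels):
            # fuse the attributes of this sequence with another speaker's identity;
            # the mouth shape is preserved, the identity (and lip shape) changes
            new_identity = swap_model.sample_identity(exclude=identity)   # hypothetical API
            augmented.append((swap_model(frames, new_identity), labels, new_identity))
    return augmented
```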
Please refer to fig. 5, which shows a device 500 for viseme recognition according to an embodiment of the present application. The device can be applied to the electronic device 1 shown in fig. 1 and to the neural network shown in figs. 2A-2B to achieve fine-grained classification of viseme boundaries and thereby improve the recognition accuracy at those boundaries. The device includes: a first acquisition module 501, a first extraction module 502, a first recognition module 503, a second recognition module 504 and a first classification module 505, whose relationships are as follows:
The first acquisition module 501 is configured to acquire an image sequence containing a mouth region that corresponds to a target video to be recognized.
The first extraction module 502 is configured to acquire a mouth feature sequence corresponding to the mouth region from the image sequence.
The first recognition module 503 is configured to identify, according to the mouth feature sequence and a preset temporal convolutional network, a target image belonging to a viseme change boundary in the image sequence.
The second recognition module 504 is configured to obtain the viseme boundary features of the target video from the mouth feature sequence and the target image.
The first classification module 505 is configured to obtain the viseme boundary classification result for the target video from the viseme boundary features and a preset first classifier.
In one embodiment, the first extraction module 502 is configured to: acquire, based on the image sequence and a preset first convolutional neural network, initial short-time-domain mouth features corresponding to the mouth region; and obtain, based on the initial mouth features and a preset second convolutional neural network, the mouth feature corresponding to the mouth region of each frame of the image sequence, the mouth feature sequence being obtained from the mouth features of the individual frames. The second convolutional neural network is a network containing convolution kernels generated adaptively per spatial position.
In one embodiment, the device further comprises: a second extraction module 506, configured to obtain, from the mouth feature sequence and a preset temporal convolutional network, mouth features corresponding to different time-domain ranges; and a second classification module 507, configured to obtain the viseme classification result for the target video from the mouth features of the different time-domain ranges, the viseme boundary classification result and a preset second classifier.
In one embodiment, the first recognition module 503 is configured to: obtain, from the mouth feature sequence and a preset temporal convolutional network, the confidence that each frame of the image sequence is a target image at a viseme change boundary; and take the images whose confidence is higher than a preset threshold as the target images.
In one embodiment, the second recognition module 504 is configured to: acquire the mask of the target image; and obtain the viseme boundary features of the target video from the mask of the target image and the mouth features corresponding to the target image in the mouth feature sequence.
In one embodiment, the device further comprises: a second acquisition module 508, configured to acquire an initial sample image sequence and to compute statistics of the viseme distribution in the sample image sequence; a fusion module 509, configured to fuse, according to the viseme distribution information, the viseme features whose share of the distribution is smaller than a preset threshold onto subjects with different identity information in the sample image sequence to obtain a face-swapped sample image sequence, and to take the initial sample image sequence together with the face-swapped sample image sequence as the final sample image sequence; and a training module 510, configured to train a neural network model with the final sample image sequence to obtain the classifiers, the classifiers including the first classifier and the second classifier.
In an embodiment, the first acquisition module 501 is configured to: perform face recognition on each frame of the target video, and crop the image sequence containing the mouth region out of the recognized face images.
For a detailed description of the above device 500 for viseme recognition, please refer to the description of the related method steps in the above embodiments.
An embodiment of the present invention further provides a non-transitory electronic device readable storage medium, including: a program that, when run on an electronic device, causes the electronic device to perform all or part of the procedures of the methods in the above-described embodiments. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like. The storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method of viseme recognition, comprising:
acquiring an image sequence containing a mouth region that corresponds to a target video to be recognized;
acquiring a mouth feature sequence corresponding to the mouth region from the image sequence;
identifying, according to the mouth feature sequence and a preset temporal convolutional network, a target image belonging to a viseme change boundary in the image sequence;
obtaining viseme boundary features of the target video from the mouth feature sequence and the target image;
and obtaining a viseme boundary classification result for the target video from the viseme boundary features and a preset first classifier.
2. The method according to claim 1, wherein acquiring the mouth feature sequence corresponding to the mouth region from the image sequence comprises:
acquiring, based on the image sequence and a preset first convolutional neural network, initial short-time-domain mouth features corresponding to the mouth region;
obtaining, based on the initial mouth features and a preset second convolutional neural network, the mouth feature corresponding to the mouth region of each frame of the image sequence, and obtaining the mouth feature sequence from the mouth features corresponding to the mouth regions of the individual frames;
wherein the second convolutional neural network is a network containing convolution kernels generated adaptively per spatial position.
3. The method of claim 1, further comprising:
obtaining, from the mouth feature sequence and the preset temporal convolutional network, mouth features corresponding to different time-domain ranges;
and obtaining a viseme classification result for the target video from the mouth features of the different time-domain ranges, the viseme boundary classification result and a preset second classifier.
4. The method according to claim 1, wherein identifying, according to the mouth feature sequence and the preset temporal convolutional network, the target image belonging to a viseme change boundary in the image sequence comprises:
obtaining, from the mouth feature sequence and the preset temporal convolutional network, the confidence that each frame of the image sequence is a target image at a viseme change boundary;
and taking the image whose confidence is higher than a preset threshold as the target image.
5. The method according to claim 1, wherein obtaining the viseme boundary features of the target video from the mouth feature sequence and the target image comprises:
acquiring a mask of the target image;
and obtaining the viseme boundary features of the target video from the mask of the target image and the mouth features corresponding to the target image in the mouth feature sequence.
6. The method of claim 3, further comprising:
acquiring an initial sample image sequence, and computing statistics of the viseme distribution in the sample image sequence;
according to the viseme distribution information, fusing the viseme features whose share of the distribution is smaller than a preset threshold onto subjects with different identity information in the sample image sequence to obtain a face-swapped sample image sequence, and taking the initial sample image sequence and the face-swapped sample image sequence as the final sample image sequence;
and training a neural network model with the final sample image sequence to obtain the classifiers, the classifiers comprising the first classifier and the second classifier.
7. The method according to claim 1, wherein acquiring the image sequence containing the mouth region that corresponds to the target video to be recognized comprises:
performing face recognition on each frame of the target video, and cropping the image sequence containing the mouth region out of the recognized face images.
8. An apparatus for viseme recognition, comprising:
a first acquisition module for acquiring an image sequence containing a mouth region that corresponds to a target video to be recognized;
a first extraction module for acquiring a mouth feature sequence corresponding to the mouth region from the image sequence;
a first recognition module for identifying, according to the mouth feature sequence and a preset temporal convolutional network, a target image belonging to a viseme change boundary in the image sequence;
a second recognition module for obtaining viseme boundary features of the target video from the mouth feature sequence and the target image;
and a first classification module for obtaining a viseme boundary classification result for the target video from the viseme boundary features and a preset first classifier.
9. An electronic device, comprising:
a memory to store a computer program;
a processor to execute the computer program to implement the method of any one of claims 1 to 7.
10. A non-transitory electronic device readable storage medium, comprising: program which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1 to 7.
CN202111334252.4A 2021-11-11 2021-11-11 Method, device, equipment and storage medium for visual element identification Pending CN114022938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111334252.4A CN114022938A (en) 2021-11-11 2021-11-11 Method, device, equipment and storage medium for visual element identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111334252.4A CN114022938A (en) 2021-11-11 2021-11-11 Method, device, equipment and storage medium for visual element identification

Publications (1)

Publication Number Publication Date
CN114022938A true CN114022938A (en) 2022-02-08

Family

ID=80063725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111334252.4A Pending CN114022938A (en) 2021-11-11 2021-11-11 Method, device, equipment and storage medium for visual element identification

Country Status (1)

Country Link
CN (1) CN114022938A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237856A (en) * 2023-11-13 2023-12-15 腾讯科技(深圳)有限公司 Image recognition method, device, computer equipment and storage medium
CN117237856B (en) * 2023-11-13 2024-03-01 腾讯科技(深圳)有限公司 Image recognition method, device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination