WO2023077742A1 - Video processing method and apparatus, and neural network training method and apparatus - Google Patents

Video processing method and apparatus, and neural network training method and apparatus

Info

Publication number
WO2023077742A1
Authority
WO
WIPO (PCT)
Prior art keywords
mouth
network
training
feature
blurred
Prior art date
Application number
PCT/CN2022/088965
Other languages
French (fr)
Chinese (zh)
Inventor
陈奕名
王麒铭
栾鹏龙
兰永亮
贾兆柱
Original Assignee
新东方教育科技集团有限公司
Priority date
Filing date
Publication date
Application filed by 新东方教育科技集团有限公司
Publication of WO2023077742A1 publication Critical patent/WO2023077742A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • Embodiments of the present disclosure relate to a video processing method, a video processing device, a neural network training method, a neural network training device, electronic equipment, and a non-transitory computer-readable storage medium.
  • Lip synchronization has a wide range of applications in scenarios such as game/anime character dubbing, digital avatars, and lip-sync voice translation.
  • a user can provide a piece of audio and a given character image or animated image, and a speech video of the corresponding character can be generated.
  • The mouth shape of the corresponding character in the speech video changes as the audio changes, and the character's mouth shape completely matches the audio.
  • At least one embodiment of the present disclosure provides a video processing method, including: acquiring at least one frame image and an audio clip, wherein each frame image includes at least one object, and each object includes a face area; preprocessing the at least one frame image to obtain mouth feature information of the facial region; and processing the at least one frame image with a video processing network based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth and the positional relationship between each object's face area and mouth.
  • Performing preprocessing on the at least one frame image to obtain mouth feature information of the facial region includes: blurring the mouth of the object in each frame image with a mouth blur model to obtain a mouth blurred picture corresponding to each frame image, wherein the mouth feature information includes at least one mouth blurred picture respectively corresponding to the at least one frame image.
  • Blurring the mouth of the object in each frame image with the mouth blur model to obtain the mouth blurred picture corresponding to each frame image includes: performing a first color space conversion on the frame image to obtain a first converted image; and extracting a mouth area in the first converted image and performing a first filtering process on the mouth area to obtain the mouth blurred picture corresponding to the frame image.
  • Alternatively, blurring the mouth of the object in each frame image with the mouth blur model to obtain the mouth blurred picture corresponding to each frame image includes: performing a first color space conversion on the frame image to obtain a first converted image; extracting a mouth area in the first converted image and performing a first filtering process on the mouth area to obtain a first intermediate blurred image; performing a second color space conversion on the frame image to obtain a second converted image; extracting a skin area in the second converted image and selecting a preset area including the mouth from the skin area; performing a second filtering process on the preset area to obtain a second intermediate blurred image; and synthesizing the first intermediate blurred image and the second intermediate blurred image to obtain the mouth blurred picture corresponding to the frame image.
  • the first color space is an HSI color space
  • the second color space is a YCbCr color space.
  • Performing preprocessing on the at least one frame image to obtain the mouth feature information of the facial region further includes: performing gradient feature extraction on the at least one mouth blurred picture to obtain a gradient feature map corresponding to each mouth blurred picture, wherein the mouth feature information further includes at least one gradient feature map corresponding to the at least one mouth blurred picture.
  • Performing gradient feature extraction on the at least one mouth blurred picture to obtain a gradient feature map corresponding to each mouth blurred picture includes: acquiring a grayscale image corresponding to each mouth blurred picture; acquiring a first convolution kernel and a second convolution kernel, wherein the size of the first convolution kernel is smaller than the size of the second convolution kernel, the sum of all elements in the first convolution kernel is 0, and the sum of all elements in the second convolution kernel is 0; and convolving the grayscale image with the first convolution kernel and the second convolution kernel to obtain the gradient feature map corresponding to each mouth blurred picture.
  • Performing preprocessing on the at least one frame image to obtain mouth feature information of the facial region further includes: processing each frame image with a facial key point detection model to obtain a plurality of facial key points; and extracting a plurality of mouth key points related to the mouth from the plurality of facial key points, wherein the mouth feature information further includes the plurality of mouth key points.
  • The video processing network includes a feature extraction sub-network and a decoding generation sub-network, and processing the at least one frame image with the video processing network based on the mouth feature information and the audio clip includes: performing spectrum conversion processing on the audio clip to obtain a feature spectrum; performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors, wherein the M visual feature vectors match the audio clip, and M is a positive integer less than or equal to the number of the at least one mouth blurred picture; processing the M visual feature vectors with the decoding generation sub-network to obtain M target frames, wherein the M target frames are in one-to-one correspondence with M time points in the audio clip and each of the M target frames has the mouth shape corresponding to the corresponding time point in the audio clip; and obtaining the target video according to the M target frames.
  • Performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors includes: dividing the at least one mouth blurred picture into M groups in sequence, and extracting the visual feature vector corresponding to each group with the feature extraction sub-network, so as to obtain the M visual feature vectors.
  • The mouth feature information further includes at least one gradient feature map corresponding to the at least one mouth blurred picture, and performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors includes: performing feature extraction processing on the at least one mouth blurred picture, the at least one gradient feature map, and the feature spectrum with the feature extraction sub-network to obtain the M visual feature vectors, wherein the at least one gradient feature map is used to provide the feature extraction sub-network with the ranges of the blurred area and the non-blurred area in the corresponding mouth blurred picture.
  • The mouth feature information further includes a plurality of mouth key points, and processing the M visual feature vectors with the decoding generation sub-network to obtain M target frames includes: processing each visual feature vector with the decoding generation sub-network to generate an intermediate frame with a mouth area; and correcting the position and image information of the mouth area of the intermediate frame with the plurality of mouth key points to obtain the target frame corresponding to the visual feature vector.
  • At least one embodiment of the present disclosure provides a neural network training method, wherein the neural network includes a video processing network, and the training method includes: acquiring a training video and a training audio clip matching the training video, wherein the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a facial area; preprocessing the training video to obtain mouth feature information corresponding to the training video; and training the video processing network based on the mouth feature information and the training audio clip.
  • The video processing network includes a feature extraction sub-network, and training the video processing network based on the mouth feature information and the training audio clip includes: performing spectrum conversion processing on the training audio clip to obtain a training feature spectrum; and training the feature extraction sub-network to be trained with the training feature spectrum and the mouth feature information to obtain the trained feature extraction sub-network.
  • The mouth feature information includes at least one mouth blurred picture, and training the feature extraction sub-network to be trained with the training feature spectrum and the mouth feature information to obtain the trained feature extraction sub-network includes: processing the training feature spectrum and the at least one mouth blurred picture with the feature extraction sub-network to be trained to obtain a training visual feature vector and a training audio feature vector; calculating a loss value of the feature extraction sub-network through a loss function corresponding to the feature extraction sub-network according to the training visual feature vector and the training audio feature vector; modifying the parameters of the feature extraction sub-network to be trained based on the loss value; and, when the loss value corresponding to the feature extraction sub-network to be trained does not satisfy a predetermined accuracy condition, continuing to input the training feature spectrum and the at least one mouth blurred picture to repeat the above training process.
  • The mouth feature information includes at least one mouth blurred picture, the video processing network further includes a decoding generation sub-network, and training the video processing network based on the mouth feature information and the training audio clip further includes: processing the training feature spectrum and the at least one mouth blurred picture with the trained feature extraction sub-network to obtain at least one target visual feature vector; and training the decoding generation sub-network according to the at least one target visual feature vector and the training video.
  • The mouth feature information further includes a plurality of mouth key points, and training the decoding generation sub-network according to the at least one target visual feature vector and the training video includes: training the decoding generation sub-network with the mouth position information provided by the plurality of mouth key points in combination with the at least one target visual feature vector.
  • The neural network further includes a discriminative sub-network, the discriminative sub-network and the decoding generation sub-network constitute a generative adversarial network, and during the training process of the decoding generation sub-network, the generative adversarial network is alternately and iteratively trained to obtain the trained decoding generation sub-network.
  • At least one embodiment of the present disclosure provides a video processing device, including: an acquisition unit configured to acquire at least one frame image and an audio clip, wherein each frame image includes at least one object and each object includes a face area; a preprocessing unit configured to preprocess the at least one frame image to obtain the mouth feature information of the facial region; and a video processing unit configured to process the at least one frame image with a video processing network based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with a basic outline of the face area and the mouth of each object and a positional relationship between the face area and the mouth of each object.
  • At least one embodiment of the present disclosure provides a neural network training device, including: a training data acquisition unit configured to acquire a training video and a training audio clip matching the training video, wherein the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a facial area; a preprocessing unit configured to preprocess the training video to obtain mouth feature information of the facial area; and a training unit configured to train the video processing network based on the mouth feature information and the training audio clip, wherein the mouth feature information is at least used to provide the video processing network with the basic outline of the facial area and the mouth of each object and the positional relationship between the facial area and the mouth of each object.
  • At least one embodiment of the present disclosure provides an electronic device, including: a memory storing computer-executable instructions in a non-transitory manner; and a processor configured to run the computer-executable instructions, wherein the computer-executable instructions, when run by the processor, implement the video processing method according to any embodiment of the present disclosure or the training method according to any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the video processing method according to any embodiment of the present disclosure or the training method according to any embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure
  • FIG. 2A is a schematic diagram of a mouth blurring process provided by at least one embodiment of the present disclosure
  • Fig. 2B is a schematic diagram of a frame image provided by at least one embodiment of the present disclosure.
  • Fig. 2C is a blurred mouth picture provided by at least one embodiment of the present disclosure.
  • FIG. 3 is a flowchart of a video processing method provided by at least one embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a characteristic spectrum provided by at least one embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a neural network training method provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a neural network provided by an embodiment of the present disclosure.
  • Fig. 7 is a schematic block diagram of a video processing device provided by at least one embodiment of the present disclosure.
  • Fig. 8 is a schematic block diagram of a training device provided by at least one embodiment of the present disclosure.
  • FIG. 9 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Fig. 10 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure
  • Fig. 11 is a schematic diagram of a hardware environment provided by at least one embodiment of the present disclosure.
  • There are usually two implementations of lip sync.
  • One is manual reconstruction, for example, using image processing software such as Photoshop to modify the mouth shape of every frame image in the video one by one according to the current audio content; however, achieving such an effect requires a very complicated process, takes a long time, and consumes a great deal of manpower and material resources.
  • The other way is to use a lip synchronization model (such as a mouth shape generation model like Wav2Lip) to reconstruct the mouth shape: the mouth area of the image input to the model is cut out, and the mouth shape is then reconstructed.
  • This method requires the network to create the mouth shape from scratch: during model training, the model needs to grasp not only the facial contour area but also the contour of the mouth, so the range the model needs to master is too large and it is difficult to train and converge.
  • At least one embodiment of the present disclosure provides a video processing method, including: acquiring at least one frame image and an audio clip, wherein each frame image includes at least one object and each object includes a face area; preprocessing the at least one frame image to obtain the mouth feature information of the face area; and processing the at least one frame image with a video processing network based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth and the positional relationship between each object's face area and mouth.
  • In the video processing method provided by at least one embodiment of the present disclosure, the mouth feature information is used to assist the video processing network in obtaining the target video, and the target video has mouth shape changes synchronized with the audio clip. Compared with the traditional way of having the network generate the mouth directly from scratch, this method uses the mouth feature information to provide the video processing network with the basic outline of each object's facial area and mouth, as well as the positional relationship between each object's facial area and mouth, so that the network can generate a more accurate mouth area, and the resulting target video has a better-matching and more accurate mouth shape.
  • the video processing method provided in at least one embodiment of the present disclosure can be applied to the video processing device provided in the embodiment of the present disclosure, and the video processing device can be configured on an electronic device.
  • the electronic device may be a personal computer, a mobile terminal, etc.
  • the mobile terminal may be a hardware device such as a mobile phone, a tablet computer, or a notebook computer.
  • FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure.
  • the video processing method provided by at least one embodiment of the present disclosure includes steps S10 to S30.
  • step S10 at least one frame image and an audio segment are acquired.
  • step S20 at least one frame image is preprocessed to obtain mouth feature information of the facial region.
  • step S30 based on the mouth feature information and the audio segment, at least one frame image is processed using a video processing network to obtain a target video.
  • An object in the target video has mouth shape changes that are synchronized with the audio clip.
  • the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth, and the positional relationship between each object's face area and mouth.
  • each frame image includes at least one object, and each object includes a face area.
  • a static image with an object can be obtained as a frame image, and then a target video is generated based on the frame image and an audio clip.
  • the object has a mouth shape change synchronized with the audio clip.
  • For a pre-recorded, generated, or produced video, the video includes a plurality of video frames, each video frame includes at least one object, and the plurality of video frames are used as a plurality of frame images; then, the target video is generated based on the plurality of frame images and the audio clip.
  • objects may include real people, 2D or 3D animated characters, anthropomorphic animals, bionic people, etc., and these objects all have complete facial regions, for example, facial regions include mouth, nose, eyes, chin and other parts.
  • the audio segment is the speech content of the object in the target video.
  • the audio segment may be the dubbing content of the animation character.
  • a video can be pre-recorded.
  • For example, the teacher will first face the camera and say "Hello, everyone from XX", where XX indicates the region.
  • the multiple video frames included in the recorded video are multiple frame images, and the lecturer is the object included in the frame images.
  • When the video is played, region A is determined from the IP address of the playback location; for example, if region A is Beijing, the audio clip is "Hello, children from Beijing"; if region A is Tianjin, the audio clip is "Hello, children from Tianjin".
  • In another example, in the pre-recorded video, the lecturer will face the camera and say "Classmate XXX won the first place, and classmate XXXX won the second place".
  • The multiple video frames included in the recorded video are the multiple frame images, and the lecturer is the object included in the frame images. According to the obtained ranking results, for example, if Zhang San is first and Li Si is second, the audio clip is "Zhang San won the first place, and Li Si won the second place".
  • the audio segment may be a pre-recorded voice segment by the user, or may be a voice segment converted from a text segment, and the present disclosure does not limit the acquisition method of the audio segment.
  • the frame image may be an original image obtained by shooting, or may be a processed image obtained by performing image processing on the original image, which is not limited in the present disclosure.
  • For example, the mouth feature information includes at least one mouth blurred picture; the mouth blurred picture is used to provide the video processing network with the basic outline of the facial area and mouth of each object and the positional relationship between the facial area and the mouth of each object.
  • step S20 may include: using a mouth blur model to blur the mouth of the object in each frame image to obtain a mouth blur picture corresponding to each frame image.
  • The mouth blurred picture is obtained by blurring the mouth of the object in the frame image, that is, blurring the mouth area of the object in the frame image, so as to provide the video processing network with the basic contours of the face area and the mouth area as well as the positional relationship between the facial area and the mouth of each object. The blurred picture retains most of the structure of the image, which makes it easier for the network to generate accurate mouth images, and mouth position regression is added during the processing of the video processing network to enhance the robustness of mouth shape generation.
  • For example, blurring the mouth of the object in each frame image with the mouth blur model to obtain the mouth blurred picture corresponding to each frame image may include: performing the first color space conversion on the frame image to obtain the first converted image; and extracting the mouth area in the first converted image and performing the first filtering process on the mouth area to obtain the mouth blurred picture corresponding to the frame image.
  • the first color space is the HSI color space, where H represents the hue (Hue), S represents the color saturation (Saturation or Chroma), and I represents the brightness (Intensity or Brightness).
  • The HSI color space uses the H component, the S component, and the I component to describe a color.
  • Converting the frame image from the RGB color space to the HSI color space means converting the value of each pixel from the original R component (red component), G component (green component), and B component (blue component) to the H component, S component, and I component. The standard conversion formula is: I = (R + G + B)/3; S = 1 − 3·min(R, G, B)/(R + G + B); H = θ if B ≤ G, and H = 360° − θ otherwise, where cos θ = [(R − G) + (R − B)] / [2·√((R − G)² + (R − B)(G − B))]. Here, I represents the I component in the HSI color space, S represents the S component in the HSI color space, H represents the H component in the HSI color space, R represents the R component in the RGB color space, G represents the G component in the RGB color space, B represents the B component in the RGB color space, min(*) represents the minimum value function, and θ represents the angle parameter.
  • The H component in the HSI color space is more sensitive to red areas, so the H component in the mouth area is relatively large. The area in the first converted image whose H component is greater than a preset threshold can therefore be extracted as the mouth area, mean filtering is performed on the mouth area, and the filtering result is used as the mouth blurred picture corresponding to the frame image.
  • The present disclosure modifies the calculation formula of the angle parameter by adding an (R − B)² component to the denominator of the angle parameter, so as to increase the sensitivity to the difference between the R component and the B component, highlight the weight of the red mouth area in the H component, and improve the accuracy of the determined mouth area.
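  • As an illustrative sketch (assuming OpenCV and NumPy), the HSI-based mouth extraction and blurring described above could look roughly like the following; the H threshold, the mean-filter kernel size, and the exact placement of the added (R − B)² term are assumptions for demonstration:

```python
import cv2
import numpy as np

def rgb_to_hsi(rgb):
    """Convert an RGB image with values in [0, 1] to H, S, I components."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    i = (r + g + b) / 3.0
    s = 1.0 - 3.0 * np.minimum(np.minimum(r, g), b) / (r + g + b + 1e-6)
    num = 0.5 * ((r - g) + (r - b))
    # The extra (r - b)**2 term in the denominator follows the modification
    # described above; its exact placement is an assumption.
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b) + (r - b) ** 2) + 1e-6
    theta = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    h = np.where(b <= g, theta, 360.0 - theta)
    return h, s, i

def blur_mouth_hsi(frame_bgr, h_threshold=300.0, ksize=15):
    """Mean-filter the region whose H component exceeds a preset threshold."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    h, _, _ = rgb_to_hsi(rgb)
    mouth_mask = h > h_threshold                   # threshold value is an assumption
    blurred = cv2.blur(frame_bgr, (ksize, ksize))  # mean filtering
    out = frame_bgr.copy()
    out[mouth_mask] = blurred[mouth_mask]
    return out
```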
  • When the object in the frame image is an object with a skin area, such as a person, the skin area can be further extracted, a preset area including the mouth can be selected from the skin area, and the preset area can be filtered. The results of the two filtering processes are then combined to obtain a mouth blurred picture with an enhanced blurring effect.
  • In this case, blurring the mouth of the object in each frame image with the mouth blur model to obtain the mouth blurred picture corresponding to each frame image may include: performing the first color space conversion on the frame image to obtain the first converted image; extracting the mouth area in the first converted image and performing the first filtering process on the mouth area to obtain the first intermediate blurred image; performing the second color space conversion on the frame image to obtain the second converted image; extracting the skin area in the second converted image and selecting a preset area including the mouth from the skin area; performing a second filtering process on the preset area to obtain a second intermediate blurred image; and synthesizing the first intermediate blurred image and the second intermediate blurred image to obtain the mouth blurred picture corresponding to the frame image.
  • the second color space is the YCbCr color space.
  • "Y” in the YCbCr color space represents the brightness, that is, the grayscale value of the pixel; while “Cr” and “Cb” represent the chroma, which are used to describe the color and saturation of the image, and are used to specify the pixel Among them, “Cr” reflects the difference between the red part of the RGB input signal and the brightness value of the RGB signal, that is, the red chrominance component of the pixel, and “Cb” reflects the blue color of the RGB input signal. The difference between the color part and the luminance value of the RGB signal, that is, the blue chrominance component of the pixel. RGB signal luminance values are obtained by summing specific parts of the RGB input signals together.
  • In the RGB color space, the skin color of a human body image is greatly affected by brightness, so it is difficult to separate skin color points from non-skin color points. That is to say, in a face image processed in the RGB color space, the skin color points are discrete points with many non-skin color points embedded among them, which makes skin color area calibration (such as face calibration and eye calibration) difficult.
  • The YCbCr color space is often used in face detection because the effect of brightness can be ignored after converting from the RGB color space to the YCbCr color space. Since the YCbCr color space is less affected by brightness, skin colors form good clusters; the three-dimensional color space can thus be mapped to the two-dimensional CbCr plane, where the skin color points form a definite shape, so that the human body image can be recognized according to skin color.
  • In other words, the YCbCr color space is a color model in which luminance is separated out, so that skin color points are no longer affected by lighting brightness in a way that makes them difficult to separate.
  • Specifically, the frame image is mapped to the YCbCr color space to obtain a mapped image; then the mapped image is projected onto the CbCr plane to obtain a skin color sample image, where the skin color sample image includes skin color sample points corresponding to the pixels of the frame image; finally, the skin color sample image is traversed. If a skin color sample point is located on the skin pixel ellipse boundary or within the ellipse, the pixel in the frame image corresponding to that skin color sample point is judged to belong to the skin area; if the skin color sample point is not located on the ellipse boundary or within the ellipse, the corresponding pixel is judged not to belong to the skin area. In this way, the skin area in the second converted image is extracted.
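  • A simplified sketch of the CbCr-plane skin test described above is given below; the ellipse center, axes, and rotation angle used here are illustrative assumptions rather than values specified by the disclosure:

```python
import cv2
import numpy as np

# Illustrative skin ellipse parameters on the CbCr plane (assumed values).
CENTER_CB, CENTER_CR = 113.0, 155.6
AXIS_CB, AXIS_CR = 23.4, 15.2
THETA = np.radians(43.0)  # rotation of the ellipse

def skin_mask_ycbcr(frame_bgr):
    """Return a boolean mask of pixels whose (Cb, Cr) lies on or inside the skin ellipse."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]   # note OpenCV channel order: Y, Cr, Cb
    x = np.cos(THETA) * (cb - CENTER_CB) + np.sin(THETA) * (cr - CENTER_CR)
    y = -np.sin(THETA) * (cb - CENTER_CB) + np.cos(THETA) * (cr - CENTER_CR)
    return (x / AXIS_CB) ** 2 + (y / AXIS_CR) ** 2 <= 1.0
```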
  • For example, the facial key point detection model can be used to process the frame image to obtain a plurality of facial key points, and, according to the positions of the facial key points, it is determined whether the eyes of the object's face are above the chin in the frame image, that is, whether the chin is on the lower side of the frame image. If so, the face direction of the object is normal and the mouth area is located on the lower side of the frame image, and a preset coordinate interval in the skin area, for example the lower half of the object's skin area, can be extracted as the preset area including the mouth. If not, the face direction of the object is abnormal; the frame image is rotated first, and then the preset coordinate interval in the skin area is extracted to obtain the preset area including the mouth.
  • the preset area including the mouth may be determined according to the skin ratio in the skin area.
  • For example, the chin part contains only the mouth and has a relatively high skin ratio, while the forehead part contains non-skin areas such as hair and has a low skin ratio. Therefore, it can be determined from the skin ratio whether the eyes of the object's face are above the chin in the frame image; for example, if the part with a high skin ratio is located in the lower part of the frame image, the face direction of the object is normal.
  • mean filtering is performed on the preset area, and the filtering result is used as the second intermediate blurred image.
  • For example, the frame image is converted from the RGB color space to the HSI color space to obtain the first converted image, the area in the first converted image whose H component is greater than the preset threshold is extracted as the mouth area, mean filtering is performed on the mouth area, and the filtering result is used as the first intermediate blurred image.
  • The first intermediate blurred image and the second intermediate blurred image are then synthesized, for example, by summing the pixels at corresponding positions, to obtain the mouth blurred picture corresponding to the frame image.
  • The addition process can use weights to prevent the pixel values from becoming too large. For example, a decimal between 0 and 1 (for example, 0.5) can be set as the weight value, and the pixels at corresponding positions in the first intermediate blurred image and the second intermediate blurred image are multiplied by the weight value and then added together to obtain the pixel values at the corresponding positions in the mouth blurred picture.
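  • For illustration, the weighted synthesis of the two intermediate blurred images can be sketched as follows (the 0.5 weights follow the example above):

```python
import cv2

def synthesize(first_blurred, second_blurred, w1=0.5, w2=0.5):
    """Weighted pixel-wise addition of the two intermediate blurred images."""
    # cv2.addWeighted multiplies each image by its weight and sums them,
    # keeping the result within the valid pixel range.
    return cv2.addWeighted(first_blurred, w1, second_blurred, w2, 0)
```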
  • The above blurring process is performed on each object, so that the mouth of each object is blurred.
  • Fig. 2A is a schematic diagram of a mouth blurring process provided by at least one embodiment of the present disclosure. The execution process of mouth blurring processing will be described in detail below with reference to FIG. 2A .
  • the first color space conversion is performed on the frame image, that is, the frame image is converted to the HSI color space to obtain the first converted image.
  • the specific process is as described above and will not be repeated here.
  • the mouth area in the first converted image is extracted, for example, the mouth area is extracted according to the H component.
  • the specific process is as described above, and will not be repeated here.
  • mean filtering is performed on the mouth region to obtain a first intermediate blurred image.
  • the second color space conversion is performed on the frame image, that is, the frame image is converted to the YCbCr color space to obtain a second converted image.
  • the skin area in the second transformed image is extracted, the specific process is as described above, and will not be repeated here.
  • the preset region including the mouth is extracted, the specific process is as described above, and will not be repeated here.
  • mean filtering is performed on the preset area to obtain a second intermediate blurred image.
  • The first intermediate blurred image and the second intermediate blurred image are synthesized to obtain the mouth blurred picture corresponding to the frame image.
  • Fig. 2B is a schematic diagram of a frame image provided by at least one embodiment of the present disclosure. As shown in FIG. 2B , the frame image includes an object, and the object has a complete face area.
  • FIG. 2C is a blurred mouth picture provided by at least one embodiment of the present disclosure.
  • the blurred mouth picture is obtained by blurring the mouth of the object in the frame image shown in FIG. 2B .
  • the lower part of the subject’s face is blurred, but the basic contours and positions of the face and mouth can still be seen.
  • most of the structure of the image is preserved, which makes it easier for the network to generate more accurate mouth images based on relevant information.
  • In FIG. 2B and FIG. 2C, mosaic processing is applied to the eye part to protect privacy; the actual processing does not involve this step.
  • The input to the video processing network is a mouth blurred picture in which the mouth area is blurred.
  • the blurred mouth image provides the basic contours of the mouth and face, which can help the video processing network generate more accurate mouth images.
  • However, the video processing network does not know which area is blurred and which area is clear, and the position of the mouth in each frame image may differ, which makes it difficult to improve the processing effect of the model.
  • In a blurred area, the outline of the object is not obvious and the gray level at the outline edges changes only weakly, resulting in a weak sense of gradation, while in a clear area the gray level at the outline edges changes significantly and the sense of gradation is strong.
  • The gradient represents the derivative direction at a certain pixel, and the contour edges in the mouth blurred picture can be determined from changes in the gradient value, thereby determining the ranges of the blurred area and the non-blurred area in the mouth blurred picture.
  • Therefore, the mouth feature information may also include at least one gradient feature map corresponding to the at least one mouth blurred picture. The gradient feature map is used to provide the video processing network with the ranges of the blurred area and the non-blurred area in the mouth blurred picture corresponding to the gradient feature map, so that the video processing network can obtain a more accurate mouth position range, reduce the interference caused by image noise, and converge more rapidly during the training phase.
  • For example, step S20 may also include: performing gradient feature extraction on the at least one mouth blurred picture to obtain a gradient feature map corresponding to each mouth blurred picture, wherein the mouth feature information also includes at least one gradient feature map corresponding to the at least one mouth blurred picture.
  • the gradient feature map corresponding to the blurred mouth picture is composed of gradient values corresponding to each pixel included in the blurred mouth picture.
  • For example, performing gradient feature extraction on the at least one mouth blurred picture to obtain a gradient feature map corresponding to each mouth blurred picture may include: obtaining a grayscale image corresponding to each mouth blurred picture; obtaining a first convolution kernel and a second convolution kernel, wherein the size of the first convolution kernel is smaller than the size of the second convolution kernel, the sum of all elements in the first convolution kernel is 0, and the sum of all elements in the second convolution kernel is 0; and convolving the grayscale image with the first convolution kernel and the second convolution kernel to obtain the gradient feature map corresponding to each mouth blurred picture.
  • When the mouth blurred picture is a color picture, grayscale processing is performed on the mouth blurred picture to obtain the grayscale image corresponding to the mouth blurred picture.
  • the first convolution kernel A1 is used to perform convolution processing with the grayscale image.
  • The sum of all elements in the first convolution kernel A1 is 0, and the size of the first convolution kernel A1 is usually 3×3.
  • In addition, the present disclosure provides a second convolution kernel A2 to participate in computing the gradient feature map. The sum of all elements in the second convolution kernel A2 is also 0, and the size of the second convolution kernel A2 is larger than that of the first convolution kernel A1, for example 5×5 or 7×7, so that the second convolution kernel A2 can enlarge the receptive field of the gradient feature extraction, reduce the influence of noise interference, and reduce the noise in the mouth blurred picture, thereby reducing the impact of noise on the feature extraction of the subsequent feature extraction sub-network.
  • For example, the first convolution kernel A1 is a 3×3 kernel whose elements sum to 0, and the second convolution kernel A2 is a larger kernel (for example, 5×5) whose elements also sum to 0. The gradient feature map O is obtained by convolving the grayscale image I with the first convolution kernel A1 and with the second convolution kernel A2, where I represents the grayscale image and the convolution results of the two kernels are combined to form O.
  • The above first convolution kernel A1 and second convolution kernel A2 are only illustrative; any kernels may be used as long as the sum of all elements in the first convolution kernel A1 is 0, the sum of all elements in the second convolution kernel A2 is 0, and the size of the first convolution kernel is smaller than the size of the second convolution kernel, which is not specifically limited in the present disclosure.
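  • A minimal sketch of the gradient feature extraction described above is shown below; the specific kernel values are illustrative Laplacian-style kernels chosen only to satisfy the stated constraints (zero element sum, 3×3 smaller than 5×5), and the way the two responses are combined is an assumption:

```python
import cv2
import numpy as np

# 3x3 first convolution kernel A1: all elements sum to 0 (illustrative values).
A1 = np.array([[-1, -1, -1],
               [-1,  8, -1],
               [-1, -1, -1]], dtype=np.float32)

# Larger 5x5 second convolution kernel A2: elements also sum to 0,
# enlarging the receptive field of the gradient extraction.
A2 = -np.ones((5, 5), dtype=np.float32)
A2[2, 2] = 24.0

def gradient_feature_map(mouth_blurred_bgr):
    """Convolve the grayscale image with A1 and A2 and combine the responses."""
    gray = cv2.cvtColor(mouth_blurred_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    g1 = cv2.filter2D(gray, -1, A1)
    g2 = cv2.filter2D(gray, -1, A2)
    return g1 + g2   # simple summation; the actual combination may differ
```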
  • the mouth feature information may also include a plurality of mouth key points.
  • multiple mouth keypoints are used to assist in determining the precise position of the mouth during the process of generating the mouth shape of the object in the target video. That is to say, when the mouth feature information includes a plurality of mouth key points, the mouth feature information is also used to provide the position of the mouth of each object to the video processing network.
  • Otherwise, the position of the mouth in the target video may not be accurately located; using the mouth key points helps improve the accuracy of the mouth position.
  • The mouth key points allow the video processing network to focus only on the mouth and the surrounding muscle information, without additionally learning the overall facial contour, direction, and structure. Therefore, using mouth blurred pictures combined with mouth key points can effectively improve the accuracy of the object's mouth shape changes and mouth position in the finally generated target video.
  • step S20 may also include: processing each frame of image with a facial key point detection model to obtain multiple facial key points; extracting multiple mouth key points related to the mouth among the multiple facial key points.
  • For example, the facial key point detection model can adopt an existing face key point detection model, which processes the face in the frame image to obtain the corresponding multiple facial key points; these facial key points can include multiple key points related to parts such as the eyes, nose, and mouth.
  • a plurality of mouth key points related to the mouth are extracted from the plurality of facial key points, and position coordinates of the plurality of mouth key points are obtained.
  • the multiple mouth key points here include multiple mouth key points corresponding to all frame images.
  • For example, 25 mouth key points can be obtained from each frame image; if there are 10 frame images in total, there are 250 mouth key points in total. The mouth key points are input into the decoding generation sub-network as an aid to determine the precise position of the mouth.
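  • As an illustrative sketch, mouth key points can be extracted with an off-the-shelf landmark detector such as dlib's 68-point model, in which landmark indices 48–67 describe the mouth; the number of mouth key points therefore depends on the detector used and need not be 25 as in the example above:

```python
import dlib

detector = dlib.get_frontal_face_detector()
# Path to a pre-trained landmark model (assumed to be available locally).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_keypoints(gray_image):
    """Return (x, y) mouth key points for each face detected in the image."""
    points = []
    for face_rect in detector(gray_image):
        shape = predictor(gray_image, face_rect)
        # In the 68-point convention, landmarks 48-67 correspond to the mouth.
        points.append([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    return points
```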
  • a video processing network includes a feature extraction subnetwork and a decoding generation subnetwork.
  • For example, step S30 may include: performing spectrum conversion processing on the audio clip to obtain a feature spectrum; performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors, wherein the M visual feature vectors match the audio clip, and M is a positive integer less than or equal to the number of the at least one mouth blurred picture; processing the M visual feature vectors with the decoding generation sub-network to obtain M target frames, wherein the M target frames are in one-to-one correspondence with M time points in the audio clip and each target frame has a mouth shape corresponding to the corresponding time point in the audio clip; and obtaining the target video according to the M target frames.
  • For example, the feature spectrum may be the MFCC (Mel-scale Frequency Cepstral Coefficients) of the audio clip.
  • MFCC is a set of feature vectors obtained by encoding speech physical information (such as spectral envelope and details). This set of feature vectors can be understood as including m1 n1-dimensional feature vectors.
  • For example, the audio clip includes m1 audio frames and each audio frame is converted into an n1-dimensional feature vector; thus, an n1×m1 matrix is obtained as the feature spectrum.
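  • A minimal sketch of extracting such an MFCC feature spectrum with the librosa library is shown below; the sampling rate and the number of coefficients (n1) are assumptions:

```python
import librosa

def feature_spectrum(audio_path, n_mfcc=13):
    """Return an (n1 x m1) MFCC matrix: n_mfcc coefficients per audio frame."""
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
    # Each column corresponds to one audio frame, each row to one MFCC dimension.
    return librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
```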
  • Fig. 3 is a schematic diagram of a characteristic spectrum provided by at least one embodiment of the present disclosure.
  • The abscissa of the feature spectrum represents time (the audio clip includes 40 audio frames), the ordinate represents the MFCC feature vector, the values in the same column form one feature vector, and different gray levels represent different intensities.
  • the audio segment may also be processed in other manners for extracting spectral features to obtain the characteristic spectrum, which is not limited in the present disclosure.
  • Matching between a video and an audio clip means that the mouth shape of the object in each frame image of the video should be the mouth shape of the audio content at the time point corresponding to that frame image. For example, if the content of the audio clip is "Happy Birthday", the mouth movement in the video should match the mouth movement of the object when saying "Happy Birthday".
  • The M visual feature vectors matching the audio clip means that the M visual feature vectors are synchronized with the audio clip. Since the audio feature vector output by the feature extraction sub-network (which represents the feature information of the audio clip, see the description below) is made consistent with the visual feature vector during the training phase, after the feature spectrum and all the mouth blurred pictures corresponding to all frame images are input into the feature extraction sub-network, the output M visual feature vectors and the audio feature vectors are essentially the same vectors, thereby achieving matching with the audio clip.
  • For example, obtaining the M visual feature vectors may include: dividing the at least one mouth blurred picture into M groups in sequence, and extracting the visual feature vector corresponding to each group with the feature extraction sub-network, so as to obtain the M visual feature vectors.
  • For example, if the number of frame images is y, y mouth blurred pictures are obtained after blurring the y frame images; the y mouth blurred pictures are divided into M groups, and the M groups of mouth blurred pictures are sequentially input into the feature extraction sub-network to obtain the visual feature vector corresponding to each group, thereby obtaining the M visual feature vectors.
  • Otherwise, the training difficulty of the video processing network may increase and the network may be difficult to converge. By grouping the frame images, the difficulty of the network training process is reduced without affecting the final effect, and a convergent network is easier to obtain.
  • For example, performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors may include: performing feature extraction processing on the at least one mouth blurred picture, the at least one gradient feature map, and the feature spectrum with the feature extraction sub-network to obtain the M visual feature vectors, wherein the at least one gradient feature map is used to provide the feature extraction sub-network with the ranges of the blurred area and the non-blurred area in the corresponding mouth blurred picture.
  • For example, the pixel value of each pixel in the mouth blurred picture includes a set of RGB values, so the feature extraction sub-network has at least 3 input channels, corresponding to the R channel, G channel, and B channel respectively. An additional input channel can be added alongside the R, G, and B channels, and the gradient feature map is fed into the feature extraction sub-network through this added channel; that is, the input size of the feature extraction sub-network is M*N*4, where M represents the width of the mouth blurred picture, N represents the height of the mouth blurred picture, and 4 represents the 4 input channels.
  • When the mouth blurred pictures are grouped, the gradient feature maps are also grouped in the same way, and each mouth blurred picture and its corresponding gradient feature map are input into the feature extraction sub-network together for processing.
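  • For example, assembling the 4-channel input (RGB plus the gradient feature map) can be sketched as follows; the normalization is an assumption:

```python
import numpy as np

def make_network_input(mouth_blurred_rgb, gradient_map):
    """Stack the RGB mouth blurred picture with its gradient map into an HxWx4 array."""
    rgb = mouth_blurred_rgb.astype(np.float32) / 255.0        # H x W x 3
    grad = gradient_map.astype(np.float32)
    grad = grad / (np.abs(grad).max() + 1e-6)                 # H x W, roughly normalized
    return np.concatenate([rgb, grad[..., None]], axis=-1)    # H x W x 4
```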
  • For example, processing the M visual feature vectors with the decoding generation sub-network to obtain M target frames may include: processing each visual feature vector with the decoding generation sub-network to generate an intermediate frame with a mouth area; and correcting the position and image information of the mouth area of the intermediate frame with the plurality of mouth key points to obtain the target frame corresponding to the visual feature vector.
  • When the mouth feature information only includes the mouth blurred picture, the mouth represented by the generated visual feature vector is still blurred, and the decoding generation sub-network cannot directly understand the structure and general shape of the face the way human cognition does, so the position of the mouth in the intermediate frame generated by the decoding generation sub-network may not be accurate. Therefore, the plurality of mouth key points can be used to help improve the accuracy of the mouth position and assist the network in generating more realistic pictures.
  • the image information includes image information such as muscles around the mouth area.
  • The mouth key points can be used to locate the position of the mouth in the frame image, so they can assist the decoding generation sub-network in focusing only on image information such as the mouth and its surrounding muscles without additionally learning the overall facial contour, direction, and structure. Therefore, the mouth key points combined with the mouth blurred picture can effectively improve the accuracy of the mouth area generated in the target frame.
  • the feature extraction sub-network and the decoding generation sub-network may use a convolutional neural network, etc., and the present disclosure does not limit the structure of the feature extraction sub-network and the decoding generation sub-network.
  • Fig. 4 is a flowchart of a video processing method provided by at least one embodiment of the present disclosure. The following describes in detail the execution process of the video processing method provided by an embodiment of the present disclosure with reference to FIG. 4 .
  • the audio segment and the frame image are first obtained.
  • the related content of the audio segment and the frame image please refer to the description of step S10 , which will not be repeated here.
  • the feature spectrum and the blurred mouth pictures and gradient feature maps divided into M groups are input into the feature extraction sub-network to obtain M visual feature vectors.
  • The M visual feature vectors and the multiple mouth key points are input into the decoding generation sub-network for processing to obtain M target frames, and each target frame in the M target frames has a mouth shape corresponding to the corresponding time point in the audio clip. For example, if the audio clip is "Happy Birthday", the mouth shapes of the objects in the M target frames follow the audio clip and are sequentially displayed as the mouth shapes of "Happy Birthday".
  • the M target frames are arranged sequentially according to the order of display time points to obtain the target video.
  • FIG. 5 is a flow chart of a neural network training method provided by an embodiment of the present disclosure.
  • the neural network training method includes steps S40 to S60.
  • neural networks include video processing networks.
  • Step S40 acquiring a training video and a training audio segment matched with the training video.
  • the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a face area.
  • Step S50 preprocessing the training video to obtain mouth feature information corresponding to the training video.
  • Step S60 based on the mouth feature information and the training audio clips, the video processing network is trained.
  • For example, the training video may be a video with mouth shape changes, and the mouth shape changes in the training video correspond to the content of the training audio clip.
  • the training video can be a speaker saying "Happy Birthday" to the camera, the object in the training frame image is the speaker, the training frame image includes the speaker's facial area, and the training audio clip is "Happy Birthday”.
  • the mouth feature information may include mouth blurred pictures corresponding to each training frame image.
  • For the process of obtaining the mouth blurred picture, please refer to the related description of step S20, which will not be repeated here.
  • the mouth feature information may include a gradient feature map corresponding to each blurred mouth picture.
  • For the process of obtaining the gradient feature map, please refer to the relevant description of step S20, which will not be repeated here.
  • the mouth feature information may also include a plurality of key points of the mouth.
  • For the acquisition process, please refer to the relevant description of step S20, which will not be repeated here.
  • The mouth feature information is used to provide the approximate outline of the face and mouth, as well as the positional relationship between the face and the mouth. Since the mouth blurred picture still retains the overall outline of the picture, the network does not need to create the mouth from scratch, which facilitates rapid network convergence, speeds up the network training process, and reduces training difficulty and time overhead.
  • The gradient feature map is used to provide the ranges of the blurred area and the non-blurred area in the mouth blurred picture corresponding to the gradient feature map, and provides additional constraints for the video processing network, which helps the feature extraction sub-network determine an accurate mouth position, reduces image noise interference, facilitates rapid network convergence, speeds up the network training process, and reduces training difficulty and time overhead.
  • The mouth key points are used to provide mouth position information, so that the network mainly considers image information such as the mouth and its surrounding muscles during training and does not need to learn information such as the overall facial contour, direction, and structure, which effectively improves training efficiency and yields a more accurate video processing network.
  • a video processing network includes a feature extraction subnetwork and a decoding generation subnetwork.
  • the feature extraction sub-network is trained first, and after the feature extraction sub-network is trained, the decoding generation sub-network is trained in combination with the trained feature extraction sub-network; that is, during the training of the decoding generation sub-network, the weight parameters of the feature extraction sub-network do not change, and only the parameters of the decoding generation sub-network are updated.
  • step S60 may include: performing spectral conversion processing on the training audio segment to obtain a training feature spectrum; and using the training feature spectrum and the mouth feature information to train the feature extraction sub-network to be trained, so as to obtain the trained feature extraction sub-network.
  • the Mel cepstral coefficients of the training audio clips can be extracted as the training feature spectrum.
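  • As one possible way to obtain such a training feature spectrum, the sketch below uses librosa to extract Mel cepstral coefficients from an audio file; the sampling rate, number of coefficients, and hop length are illustrative assumptions.

```python
# Sketch: Mel cepstral coefficients as the training feature spectrum.
import librosa

def training_feature_spectrum(audio_path, sr=16000, n_mfcc=13, hop_ms=10):
    waveform, sr = librosa.load(audio_path, sr=sr)
    hop_length = int(sr * hop_ms / 1000)
    # Returned shape: (n_mfcc, number of audio frames)
    return librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
```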
  • using the training feature spectrum and the mouth feature information to train the feature extraction sub-network to be trained, so as to obtain the trained feature extraction sub-network, may include: using the feature extraction sub-network to be trained to process the training feature spectrum and at least one blurred mouth picture to obtain a training visual feature vector and a training audio feature vector; calculating the loss value of the feature extraction sub-network through the loss function corresponding to the feature extraction sub-network according to the training visual feature vector and the training audio feature vector; modifying the parameters of the feature extraction sub-network to be trained based on the loss value; and, when the loss value corresponding to the feature extraction sub-network to be trained does not satisfy the predetermined accuracy condition, continuing to input the training feature spectrum and the at least one blurred mouth picture so as to repeat the above training process.
  • the training goal of the feature extraction sub-network is to match the output visual feature vector with the audio feature vector.
  • for the concept of matching, refer to the content mentioned above.
  • the i-th feature element in the visual feature vector and the i-th feature element in the audio feature vector should match, which is reflected in the feature values: the corresponding feature values of the visual feature vector and the audio feature vector are very close or identical. Therefore, during training, the loss value is calculated using the training visual feature vector and the training audio feature vector, and the parameters of the feature extraction sub-network are corrected based on the loss value, so that the visual feature vector and the audio feature vector output by the trained feature extraction sub-network are consistent.
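  • The following PyTorch-style sketch shows one training step that pulls paired visual and audio feature vectors together. The encoder architectures and the concrete loss (a simple mean-squared distance here) are assumptions; the disclosure only requires that matching visual and audio feature vectors become very close or identical.

```python
# Sketch of one training step for the feature extraction sub-network (assumptions above).
import torch.nn.functional as F

def feature_extraction_step(visual_encoder, audio_encoder, optimizer,
                            blurred_mouth_batch, feature_spectrum_batch):
    visual_vec = visual_encoder(blurred_mouth_batch)    # training visual feature vectors
    audio_vec = audio_encoder(feature_spectrum_batch)   # training audio feature vectors
    loss = F.mse_loss(visual_vec, audio_vec)            # pull paired vectors together
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```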
  • step S60 may also include: using the trained feature extraction sub-network to process the training feature spectrum and the at least one blurred mouth picture to obtain at least one target visual feature vector; and training the decoding generation sub-network according to the at least one target visual feature vector and the training video.
  • training the decoding generation sub-network may include: using the mouth position information provided by the plurality of mouth key points, in combination with the at least one target visual feature vector, to train the decoding generation sub-network.
  • the key points of the mouth are used to assist training, so that the position of the mouth shape is more accurate.
  • the neural network also includes a discriminative sub-network.
  • the discrimination sub-network and the decoding generation sub-network constitute a generative adversarial network (GAN); during the training of the decoding generation sub-network, the generative adversarial network is trained by alternate iteration to obtain the trained decoding generation sub-network.
  • the decoding generation sub-network acts as the generator in the generative adversarial network and generates images to "fool" the discriminator, while the discrimination sub-network acts as the discriminator in the generative adversarial network and judges the authenticity of the images generated by the decoding generation sub-network.
  • in the training process, the generator first continuously generates image data to be judged by the discriminator; in this process, the parameters of the discriminator are not adjusted, and only the generator is trained and its parameters are adjusted, until the discriminator cannot judge the authenticity of the images generated by the generator.
  • Fig. 6 is a schematic structural diagram of a neural network provided by an embodiment of the present disclosure.
  • the neural network 100 provided by at least one embodiment of the present disclosure includes a video processing network 101 and a discrimination sub-network 102; the video processing network 101 includes a feature extraction sub-network 1011 and a decoding generation sub-network 1012, and the decoding generation sub-network 1012 and the discrimination sub-network 102 constitute a generative adversarial network.
  • the training process of the video processing network 101 will be described in detail below with reference to FIG. 6 .
  • the feature extraction sub-network 1011 is trained first. For example, referring to the description of step S50, a plurality of blurred mouth pictures corresponding to a plurality of training frame images and a plurality of gradient feature maps respectively corresponding to the plurality of blurred mouth pictures are obtained, and spectral conversion processing is performed on the training audio clip to obtain the training feature spectrum; then the plurality of blurred mouth pictures, the plurality of gradient feature maps, and the training feature spectrum are input into the feature extraction sub-network 1011 for processing to obtain visual feature vectors and audio feature vectors.
  • the visual feature vector output by the trained feature extraction sub-network 1011 is consistent with the audio feature vector.
  • the decoding generation sub-network 1012 is trained in combination with the trained feature extraction sub-network 1011 .
  • multiple target visual feature vectors are obtained.
  • the target visual feature vectors are consistent with the audio feature vectors output by the feature extraction subnetwork 1011.
  • a plurality of target visual feature vectors and a plurality of mouth key points are input into the decoding generation sub-network 1012 for processing to obtain output frames, in which the mouth shape of the object changes; however, this mouth shape may differ from the mouth shape in the training frame image corresponding to the same display time point.
  • the output frames and the training frame images are input into the discrimination sub-network 102, and the discrimination sub-network 102 uses the mouth shapes in the training frame images as the standard. Referring to the process described above, the decoding generation sub-network 1012 and the discrimination sub-network 102 are trained alternately: the loss value is calculated based on the binary cross-entropy loss function, and the parameters of the discrimination sub-network 102 and the decoding generation sub-network 1012 are modified alternately until the trained decoding generation sub-network 1012 is obtained.
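  • The sketch below illustrates one alternating iteration of this adversarial training in PyTorch, with the decoding generation sub-network as the generator and the discrimination sub-network as the discriminator. The network interfaces are assumptions, the discriminator is assumed to output sigmoid probabilities, and the binary cross-entropy loss follows the description above.

```python
# Sketch of one alternating GAN iteration (interfaces and shapes are assumptions).
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt,
             target_visual_vectors, mouth_key_points, training_frames):
    real = torch.ones(training_frames.size(0), 1)
    fake = torch.zeros(training_frames.size(0), 1)

    # 1) Update the discriminator with the generator fixed.
    with torch.no_grad():
        generated = generator(target_visual_vectors, mouth_key_points)
    d_loss = (F.binary_cross_entropy(discriminator(training_frames), real) +
              F.binary_cross_entropy(discriminator(generated), fake))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Update the generator with the discriminator fixed.
    generated = generator(target_visual_vectors, mouth_key_points)
    g_loss = F.binary_cross_entropy(discriminator(generated), real)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```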
  • since the blurred mouth picture still retains the overall outline of the picture, the network does not need to create the mouth from scratch, which facilitates rapid convergence of the network, speeds up the training process of the feature extraction sub-network, and reduces training difficulty and time overhead.
  • the gradient feature map is used to provide the range of the blurred area and the non-blurred area in the mouth blurred picture, so that the network can quickly locate the mouth area and facilitate the network to quickly converge.
  • the mouth key points are used to provide mouth position information, so that the decoding generation sub-network mainly considers image information such as the mouth and its surrounding muscles during the training process and does not need to learn information such as the overall facial contour, orientation, and structure, which effectively improves training efficiency and yields a video processing network with higher accuracy.
  • FIG. 7 is a schematic block diagram of a video processing device provided by at least one embodiment of the present disclosure.
  • the video processing apparatus 200 may include an acquisition unit 201 , a preprocessing unit 202 and a video processing unit 203 . These components are interconnected by a bus system and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the video processing device 200 shown in FIG. 7 are exemplary rather than limiting, and the video processing device 200 may also have other components and structures as required.
  • these modules may be implemented by hardware (such as circuit) modules, software modules, or any combination of the two, and the following embodiments are the same as this, and will not be repeated here.
  • for example, these units may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field programmable gate array (FPGA), or other forms of processing units having data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
  • the obtaining unit 201 is configured to obtain at least one frame image and an audio segment, for example, each frame image includes at least one object, and each object includes a face area.
  • the acquiring unit 201 may include a memory storing frame images and audio clips.
  • the acquisition unit 201 may include one or more cameras to shoot or record a video including multiple frame images or a still frame image of an object.
  • the acquisition unit 201 may also include a recording device to obtain audio clips.
  • the acquisition unit 201 may be hardware, software, firmware and any feasible combination thereof.
  • the preprocessing unit 202 is configured to preprocess at least one frame image to obtain mouth feature information of the face area.
  • video processing unit 203 may include video processing network 204 .
  • the video processing unit 203 uses the video processing network 204 to process the at least one frame image based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip.
  • the video processing network 204 includes a feature extraction sub-network and a decoding generation sub-network. It should be noted that the video processing network 204 in the video processing unit 203 has the same structure and function as the video processing network in the embodiments of the above video processing method, which will not be repeated here.
  • the acquiring unit 201 can be used to realize step S10 shown in FIG. 1
  • the preprocessing unit 202 can be used to realize step S20 shown in FIG. 1
  • the video processing unit 203 can be used to realize step S30 shown in FIG. 1. Therefore, for a specific description of the functions that can be realized by the acquisition unit 201, the preprocessing unit 202, and the video processing unit 203, reference may be made to the relevant descriptions of steps S10 to S30 in the embodiments of the above video processing method, which will not be repeated here.
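  • Purely as an illustration of how these units might be organized in software, the following sketch mirrors steps S10 to S30; the class and method names are assumptions, not part of the disclosure.

```python
# Sketch of the video processing apparatus 200 as three cooperating units (assumed names).
class VideoProcessingApparatus:
    def __init__(self, source, preprocessor, video_processing_network):
        self.source = source                      # acquisition unit 201
        self.preprocessor = preprocessor          # preprocessing unit 202
        self.network = video_processing_network   # video processing unit 203 / network 204

    def run(self):
        frames, audio_clip = self.source.acquire()                         # step S10
        mouth_features = self.preprocessor.extract(frames)                 # step S20
        return self.network.generate(frames, mouth_features, audio_clip)   # step S30
```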
  • the video processing apparatus 200 can achieve technical effects similar to those of the aforementioned video processing method, which will not be repeated here.
  • FIG. 8 is a schematic block diagram of a training device provided by at least one embodiment of the present disclosure.
  • the training device 300 may include a training data acquisition unit 301 , a preprocessing unit 302 and a training unit 303 . These components are interconnected by a bus system and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the training device 300 shown in FIG. 8 are exemplary rather than limiting, and the training device 300 may also have other components and structures as required.
  • the training data obtaining unit 301 is configured to obtain a training video and a training audio segment matched with the training video.
  • the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a face area.
  • the preprocessing unit 302 is configured to preprocess the training video to obtain mouth feature information of the facial region.
  • the training unit 303 is configured to train the video processing network based on mouth feature information and training audio clips.
  • the training unit 303 includes a neural network 304 and a loss function (not shown), the neural network 304 includes a video processing network, and the training unit 303 is used to train the neural network 304 to be trained to obtain a trained video processing network.
  • the video processing network includes a feature extraction sub-network and a decoding generation sub-network.
  • the neural network 304 also includes a discrimination sub-network, and the discrimination sub-network and the decoding generation sub-network constitute a generative adversarial network.
  • the structure and function of the neural network 304 in the training unit 303 are the same as those of the neural network 100 in the above embodiment of the neural network training method, and will not be repeated here.
  • the training data acquisition unit 301 can be used to realize the step S40 shown in FIG. 5, the preprocessing unit 302 can be used to realize the step S50 shown in FIG. 5, and the training unit 303 can be used to realize the Step S60. Therefore, for specific descriptions of the functions that can be realized by the training data acquisition unit 301, the preprocessing unit 302, and the training unit 303, reference may be made to the relevant descriptions of steps S40 to S60 in the embodiment of the video processing method above, and repeated descriptions will not be repeated.
  • the training device 300 can achieve technical effects similar to those of the aforementioned training method, which will not be repeated here.
  • Fig. 9 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device 400 is, for example, suitable for implementing the video processing method or the training method provided by the embodiments of the present disclosure. It should be noted that the components of the electronic device 400 shown in FIG. 9 are only exemplary rather than limiting, and the electronic device 400 may also have other components according to actual application requirements.
  • an electronic device 400 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 401, which may perform various appropriate actions and processes according to non-transitory computer-readable instructions stored in a memory, to achieve various functions.
  • when the computer-readable instructions are executed by the processing device 401, one or more steps of the neural network training method according to any of the above embodiments may be executed. It should be noted that, for a detailed description of the processing procedure of the training method, reference may be made to the relevant descriptions in the above embodiments of the training method, which will not be repeated here.
  • the memory may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (RAM) 403 and/or cache memory (cache), etc.; for example, computer-readable instructions may be loaded from the storage device 408 into the random access memory (RAM) 403 so that the computer-readable instructions can be run.
  • Non-volatile memory may include, for example, read-only memory (ROM) 402, hard disks, erasable programmable read-only memory (EPROM), compact disk read-only memory (CD-ROM), USB memory, flash memory, and the like.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to bus 404 .
  • the following devices can be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 408 including, for example, a magnetic tape, a hard disk, a flash memory, etc.; and a communication device 409.
  • the communication means 409 may allow the electronic device 400 to perform wireless or wired communication with other electronic devices to exchange data.
  • the processor 401 can control other components in the electronic device 400 to perform desired functions.
  • the processor 401 may be a device with data processing capabilities and/or program execution capabilities, such as a central processing unit (CPU), a tensor processing unit (TPU), or a graphics processing unit (GPU).
  • the central processing unit (CPU) may be an X86 or ARM architecture or the like.
  • the GPU can be integrated directly on the motherboard alone, or built into the north bridge chip of the motherboard.
  • a GPU can also be built into a central processing unit (CPU).
  • Fig. 10 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure.
  • the storage medium 500 may be a non-transitory computer-readable storage medium, and one or more computer-readable instructions 501 may be stored non-transitory on the storage medium 500 .
  • when the computer-readable instructions 501 are executed by a processor, one or more steps in the above video processing method or training method may be executed.
  • the storage medium 500 may be applied to the above-mentioned electronic device, for example, the storage medium 500 may include a memory in the electronic device.
  • the storage medium may include a memory card of a smartphone, a storage unit of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, or any combination of the above storage media, and may also be other applicable storage media.
  • Fig. 11 is a schematic diagram of a hardware environment provided by at least one embodiment of the present disclosure.
  • the electronic device provided by the present disclosure can be applied in the Internet system.
  • the functions of the image processing apparatus and/or electronic equipment involved in the present disclosure can be realized by using the computer system provided in FIG. 11 .
  • Such computer systems can include personal computers, laptops, tablets, mobile phones, personal digital assistants, smart glasses, smart watches, smart rings, smart helmets, and any smart portable or wearable device.
  • the specific system in this embodiment illustrates a hardware platform including a user interface using functional block diagrams.
  • Such computer equipment may be a general purpose computer equipment or a special purpose computer equipment. Both computer devices can be used to realize the image processing device and/or electronic device in this embodiment.
  • the computer system may include any components needed to implement the image processing described in the present disclosure.
  • a computer system can be realized by a computer device through its hardware devices, software programs, firmware, and combinations thereof.
  • the relevant computer functions for realizing the information processing required for the image processing described in this embodiment can be implemented by a group of similar platforms in a distributed manner, distributing the processing load of the computer system.
  • the computer system can include a communication port 250 connected to a network for data communication; for example, the computer system can send and receive information and data through the communication port 250, that is, the communication port 250 enables the computer system to communicate wirelessly or by wire with other electronic devices to exchange data.
  • the computer system may also include a processor group 220 (ie, the processor described above) for executing program instructions.
  • the processor group 220 may consist of at least one processor (eg, CPU).
  • the computer system may include an internal communication bus 210 .
  • a computer system may include different forms of program storage units and data storage units (i.e., the memory or storage medium described above), such as a hard disk 270, a read-only memory (ROM) 230, and a random access memory (RAM) 240, which can be used to store various data files used by the computer for processing and/or communication, as well as possible program instructions executed by the processor group 220.
  • the computer system may also include an input/output component 260 for enabling input/output data flow between the computer system and other components (eg, user interface 280, etc.).
  • the following devices may be connected to the input/output assembly 260: input devices such as a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices such as a display (e.g., an LCD or OLED display), a speaker, a vibrator, etc.; storage devices including, for example, a magnetic tape, a hard disk, etc.; and a communication interface.
  • Although FIG. 11 shows a computer system with various devices, it should be understood that the computer system is not required to have all of the devices shown; instead, the computer system may have more or fewer devices.

Abstract

A video processing method and apparatus, and a neural network training method and apparatus. The video processing method comprises: obtaining at least one frame image and an audio clip; preprocessing the at least one frame image to obtain mouth feature information of a face region; and on the basis of the mouth feature information and the audio clip, using a video processing network for processing the at least one frame image to obtain a target video, wherein objects in the target video have mouth shape changes synchronous with the audio clip, the mouth feature information is at least used for providing basic contours of the face region and the mouth of each object for the video processing network, and a positional relationship between the face region and the mouth of each object. In the video processing method, the mouth feature information is utilized to provide the approximate contours and positions of the face and the mouth for the video processing network, so that the network can conveniently generate a more accurate mouth region, and the mouth shape part of the obtained target video is higher in matching degree and higher in accuracy.

Description

Video processing method and device, neural network training method and device
This application claims the priority of Chinese Patent Application No. 202111296799.X filed on November 04, 2021, the entirety of which is incorporated by reference as a part of this application.
Technical Field
本公开的实施例涉及一种基于视频处理方法、视频处理装置、神经网络的训练方法、神经网络的训练装置、电子设备以及非瞬时性计算机可读存储介质。Embodiments of the present disclosure relate to a video processing method, a video processing device, a neural network training method, a neural network training device, electronic equipment, and a non-transitory computer-readable storage medium.
Background
嘴型同步在游戏/动漫角色配音、数字虚拟人、音唇同步的语音翻译等场景下具有广泛的应用场景。例如,用户可以提供一段音频和给定人物形象或动画形象,就可以生成对应人物的讲话视频,对应人物在讲话视频中的嘴型跟随音频的变化而相应变化,人物嘴型与音频完全匹配。Lip synchronization has a wide range of application scenarios in scenarios such as game/anime character dubbing, digital avatars, and lip-sync voice translation. For example, a user can provide a piece of audio and a given character image or animated image, and a speech video of the corresponding character can be generated. The mouth shape of the corresponding character in the speech video changes correspondingly with the change of the audio, and the character's mouth shape completely matches the audio.
Summary
本公开至少一实施例提供一种视频处理方法,包括:获取至少一个帧图像和音频片段,其中,每个帧图像包括至少一个对象,每个对象包括面部区域;对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息;基于所述嘴部特征信息和所述音频片段,使用视频处理网络对所述至少一个帧图像进行处理,得到目标视频,其中,所述目标视频中的对象具有与所述音频片段同步的嘴型变化,所述嘴部特征信息至少用于向所述视频处理网络提供所述每个对象的面部区域和嘴部的基本轮廓,以及所述每个对象的所述面部区域和所述嘴部的位置关系。At least one embodiment of the present disclosure provides a video processing method, including: acquiring at least one frame image and an audio clip, wherein each frame image includes at least one object, and each object includes a face area; performing the processing on the at least one frame image Preprocessing to obtain mouth feature information of the facial region; based on the mouth feature information and the audio clip, using a video processing network to process the at least one frame image to obtain a target video, wherein the target Objects in the video have mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth, and the The positional relationship between the facial area and the mouth of each object.
例如,在本公开至少一实施例提供的视频处理方法中,对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息,包括:利用嘴部模糊模型对所述每个帧图像中的对象的嘴部进行模糊处理,得到所述每个帧图像对应的嘴部模糊图片,其中,所述嘴部特征信息包括所述至少一个帧图像 分别对应的至少一个嘴部模糊图片。For example, in the video processing method provided in at least one embodiment of the present disclosure, performing preprocessing on the at least one frame image to obtain mouth feature information of the facial region includes: using a mouth blur model to process each The mouth of the object in the frame image is blurred to obtain a mouth blurred picture corresponding to each frame image, wherein the mouth feature information includes at least one mouth blurred picture corresponding to the at least one frame image respectively .
例如,在本公开至少一实施例提供的视频处理方法中,利用嘴部模糊模型对所述每个帧图像中的对象的嘴部进行模糊处理,得到所述每个帧图像对应的嘴部模糊图片,包括:对所述帧图像进行第一色彩空间转换,得到第一转换图像;提取所述第一转换图像中的嘴部区域,对所述嘴部区域进行第一滤波处理,得到所述帧图像对应的嘴部模糊图片。For example, in the video processing method provided in at least one embodiment of the present disclosure, the mouth of the object in each frame image is blurred by using the mouth blur model to obtain the mouth blur corresponding to each frame image The picture includes: performing a first color space conversion on the frame image to obtain a first converted image; extracting a mouth area in the first converted image, and performing a first filtering process on the mouth area to obtain the The blurred picture of the mouth corresponding to the frame image.
例如,在本公开至少一实施例提供的视频处理方法中,利用嘴部模糊模型对所述每个帧图像中的对象的嘴部进行模糊处理,得到所述每个帧图像对应的嘴部模糊图片,包括:对所述帧图像进行第一色彩空间转换,得到第一转换图像;提取所述第一转换图像中的嘴部区域,对所述嘴部区域进行第一滤波处理,得到第一中间模糊图像;对所述帧图像进行第二色彩空间转换,得到第二转换图像;提取所述第二转换图像中的皮肤区域,从所述皮肤区域中选择包括嘴部的预设区域;对所述预设区域进行第二滤波处理,得到第二中间模糊图像;对所述第一中间模糊图像和所述第二中间模糊图像进行合成处理,得到所述帧图像对应的嘴部模糊图片。For example, in the video processing method provided in at least one embodiment of the present disclosure, the mouth of the object in each frame image is blurred by using the mouth blur model to obtain the mouth blur corresponding to each frame image The picture includes: performing a first color space conversion on the frame image to obtain a first converted image; extracting a mouth area in the first converted image, and performing a first filtering process on the mouth area to obtain a first converted image. The middle blurred image; the second color space conversion is performed on the frame image to obtain a second conversion image; the skin area in the second conversion image is extracted, and a preset area including the mouth is selected from the skin area; performing a second filtering process on the preset area to obtain a second intermediate blurred image; performing synthesis processing on the first intermediate blurred image and the second intermediate blurred image to obtain a mouth blurred picture corresponding to the frame image.
例如,在本公开至少一实施例提供的视频处理方法中,所述第一色彩空间为HSI色彩空间,所述第二色彩空间为YCbCr色彩空间。For example, in the video processing method provided in at least one embodiment of the present disclosure, the first color space is an HSI color space, and the second color space is a YCbCr color space.
例如,在本公开至少一实施例提供的视频处理方法中,对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息,还包括:对所述至少一个嘴部模糊图片进行梯度特征提取,得到每个嘴部模糊图片对应的梯度特征图,其中,所述嘴部特征信息还包括所述至少一个嘴部模糊图片分别对应的至少一个梯度特征图。For example, in the video processing method provided in at least one embodiment of the present disclosure, performing preprocessing on the at least one frame image to obtain the mouth feature information of the facial region further includes: blurring the mouth of the at least one picture Gradient feature extraction is performed to obtain a gradient feature map corresponding to each mouth blur picture, wherein the mouth feature information further includes at least one gradient feature map corresponding to the at least one mouth blur picture.
例如,在本公开至少一实施例提供的视频处理方法中,对所述至少一个嘴部模糊图片进行梯度特征提取,得到每个嘴部模糊图片对应的梯度特征图,包括:获取所述每个嘴部模糊图片对应的灰度图;获取第一卷积核和第二卷积核,其中,所述第一卷积核的尺寸小于所述第二卷积核的尺寸,所述第一卷积核中的所有元素之和为0,所述第二卷积核中的所有元素之和为0;将所述灰度图与所述第一卷积核和所述第二卷积核进行卷积处理,得到所述每个嘴部模糊图片对应的梯度图。For example, in the video processing method provided in at least one embodiment of the present disclosure, performing gradient feature extraction on the at least one blurred mouth picture to obtain a gradient feature map corresponding to each blurred mouth picture includes: acquiring each The grayscale image corresponding to the blurred mouth picture; obtain the first convolution kernel and the second convolution kernel, wherein the size of the first convolution kernel is smaller than the size of the second convolution kernel, and the first convolution kernel The sum of all elements in the product kernel is 0, and the sum of all elements in the second convolution kernel is 0; the grayscale image is combined with the first convolution kernel and the second convolution kernel Convolution processing to obtain the gradient map corresponding to each blurred mouth picture.
例如,在本公开至少一实施例提供的视频处理方法中,对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息,还包括:利用面部 关键点检测模型对所述每个帧图像进行处理,得到多个面部关键点;提取所述多个面部关键点中与嘴部相关的多个嘴部关键点,其中,所述嘴部特征信息还包括所述多个嘴部关键点。For example, in the video processing method provided in at least one embodiment of the present disclosure, performing preprocessing on the at least one frame image to obtain mouth feature information of the facial region further includes: using a facial key point detection model to process the Each frame image is processed to obtain a plurality of facial key points; extract a plurality of mouth key points related to the mouth in the plurality of facial key points, wherein the mouth feature information also includes the plurality of mouth key points.
例如,在本公开至少一实施例提供的视频处理方法中,所述视频处理网络包括特征提取子网络和解码生成子网络,基于所述嘴部特征信息和所述音频片段,使用所述视频处理网络对所述至少一个帧图像进行处理,包括:对所述音频片段进行频谱转换处理,得到特征频谱;利用所述特征提取子网络对所述至少一个嘴部模糊图片和所述特征频谱进行特征提取处理,得到M个视觉特征向量,其中,所述M个视觉特征向量与所述音频片段相匹配,M为正整数且小于等于所述至少一个嘴部模糊图片的数量;利用所述解码生成子网络对所述M个视觉特征向量进行处理,得到M个目标帧,其中,所述M个目标帧与所述音频片段中M个时点一一对应,且所述M个目标帧中每个目标帧具有与所述音频片段中对应时点对应的嘴型;根据所述M个目标帧得到所述目标视频。For example, in the video processing method provided in at least one embodiment of the present disclosure, the video processing network includes a feature extraction subnetwork and a decoding generation subnetwork, and based on the mouth feature information and the audio clip, the video processing network is used to The network processing the at least one frame image includes: performing spectrum conversion processing on the audio clip to obtain a feature spectrum; using the feature extraction sub-network to perform feature extraction on the at least one blurred mouth picture and the feature spectrum The extraction process obtains M visual feature vectors, wherein the M visual feature vectors match the audio clips, M is a positive integer and is less than or equal to the number of the at least one blurred mouth picture; using the decoding to generate The sub-network processes the M visual feature vectors to obtain M target frames, wherein the M target frames are in one-to-one correspondence with the M time points in the audio clip, and each of the M target frames The M target frames have the mouth shape corresponding to the corresponding time point in the audio clip; the target video is obtained according to the M target frames.
例如,在本公开至少一实施例提供的视频处理方法中,利用所述特征提取子网络对所述至少一个嘴部模糊图片和所述特征频谱进行特征提取处理,得到M个视觉特征向量,包括:将所述至少一个嘴部模糊图片依序分成M组,利用所述特征提取子网络提取每组对应的视觉特征向量,以得到所述M个视觉特征向量。For example, in the video processing method provided in at least one embodiment of the present disclosure, the feature extraction sub-network is used to perform feature extraction processing on the at least one mouth blur picture and the feature spectrum to obtain M visual feature vectors, including : Divide the at least one blurred mouth picture into M groups in sequence, and use the feature extraction sub-network to extract visual feature vectors corresponding to each group, so as to obtain the M visual feature vectors.
例如,在本公开至少一实施例提供的视频处理方法中,所述嘴部特征信息还包括所述至少一个嘴部模糊图片分别对应的至少一个梯度特征图,利用所述特征提取子网络对所述至少一个嘴部模糊图片和所述特征频谱进行特征提取处理,得到M个视觉特征向量,包括:利用所述特征提取子网络对所述至少一个嘴部模糊图片、所述至少一个梯度特征图和所述特征频谱进行特征提取处理,得到M个视觉特征向量,其中,所述至少一个梯度特征图用于为所述特征提取子网络提供对应的嘴部模糊图片中模糊区域和非模糊区域的范围。For example, in the video processing method provided in at least one embodiment of the present disclosure, the mouth feature information further includes at least one gradient feature map corresponding to the at least one blurred mouth picture, and the feature extraction sub-network is used to extract Performing feature extraction processing on the at least one blurred mouth picture and the feature spectrum to obtain M visual feature vectors, including: using the feature extraction sub-network to extract the at least one blurred mouth picture and the at least one gradient feature map Perform feature extraction processing with the feature spectrum to obtain M visual feature vectors, wherein the at least one gradient feature map is used to provide the feature extraction sub-network with the blurred area and the non-blurred area in the corresponding mouth blurred picture scope.
例如,在本公开至少一实施例提供的视频处理方法中,所述嘴部特征信息还包括多个嘴部关键点,利用所述解码生成子网络对所述M个视觉特征向量进行处理,得到M个目标帧,包括:利用所述解码生成子网络对每个视觉特征向量进行处理,生成带有嘴部区域的中间帧;利用所述多个嘴部关 键点对所述中间帧的嘴部区域的位置和图像信息进行修正,得到所述视觉特征向量对应的目标帧。For example, in the video processing method provided in at least one embodiment of the present disclosure, the mouth feature information further includes a plurality of mouth key points, and the M visual feature vectors are processed by using the decoding generation sub-network to obtain M target frames, including: using the decoding generation sub-network to process each visual feature vector to generate an intermediate frame with a mouth area; using the multiple mouth key points to process the mouth of the intermediate frame The location of the region and the image information are corrected to obtain the target frame corresponding to the visual feature vector.
本公开至少一实施例提供一种神经网络的训练方法,其中,所述神经网络包括视频处理网络,所述训练方法包括:获取训练视频和与所述训练视频匹配的训练音频片段,其中,所述训练视频包括至少一个训练帧图像,每个训练帧图像包括至少一个对象,每个对象包括面部区域;对所述训练视频进行预处理,得到所述训练视频对应的嘴部特征信息;基于所述嘴部特征信息和所述训练音频片段,对所述视频处理网络进行训练。At least one embodiment of the present disclosure provides a neural network training method, wherein the neural network includes a video processing network, and the training method includes: acquiring a training video and a training audio segment matching the training video, wherein the The training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a facial area; the training video is preprocessed to obtain mouth feature information corresponding to the training video; based on the The mouth feature information and the training audio clips are used to train the video processing network.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述视频处理网络包括特征提取子网络,基于所述嘴部特征信息和所述训练音频片段,对所述视频处理网络进行训练,包括:对所述训练音频片段进行频谱转换处理,得到训练特征频谱;利用所述训练特征频谱和所述嘴部特征信息,对待训练的特征提取子网络进行训练,以得到训练好的所述特征提取子网络。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the video processing network includes a feature extraction sub-network, and based on the mouth feature information and the training audio clip, the video processing network is The training includes: performing spectral conversion processing on the training audio segment to obtain the training feature spectrum; using the training feature spectrum and the mouth feature information to train the feature extraction sub-network to be trained to obtain the trained The feature extraction sub-network described above.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述嘴部特征信息包括至少一个嘴部模糊图片,利用所述训练特征频谱和所述嘴部特征信息对待训练的所述特征提取子网络进行训练,以得到训练好的所述特征提取子网络,包括:利用所述待训练的特征提取子网络对所述训练特征频谱和所述至少一个嘴部模糊图片进行处理,得到训练视觉特征向量和训练音频特征向量;根据所述训练视觉特征向量和所述训练音频特征向量,通过所述特征提取子网络对应的损失函数计算所述特征提取子网络的损失值;基于所述损失值对所述待训练的特征提取子网络的参数进行修正;以及在所述待训练的特征提取子网络对应的损失值不满足预定准确率条件时,继续输入所述训练特征频谱和所述至少一个嘴部模糊图片以重复执行上述训练过程。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the mouth feature information includes at least one blurred mouth picture, and the training feature spectrum and the mouth feature information are used to train the The feature extraction sub-network is trained to obtain the trained feature extraction sub-network, including: using the feature extraction sub-network to be trained to process the training feature spectrum and the at least one blurred mouth picture to obtain Training visual feature vectors and training audio feature vectors; according to the training visual feature vectors and the training audio feature vectors, calculating the loss value of the feature extraction sub-network through the loss function corresponding to the feature extraction sub-network; based on the The loss value modifies the parameters of the feature extraction sub-network to be trained; and when the loss value corresponding to the feature extraction sub-network to be trained does not meet the predetermined accuracy rate condition, continue to input the training feature spectrum and the At least one mouth blur picture to repeat the above training process.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述嘴部特征信息包括至少一个嘴部模糊图片,所述视频处理网络还包括解码生成子网络,基于所述嘴部特征信息和所述训练音频片段,对所述视频处理网络进行训练,还包括:利用训练好的所述特征提取子网络对所述训练特征频谱和所述至少一个嘴部模糊图片进行处理,得到至少一个目标视觉特征向量;根据所述至少一个目标视觉特征向量以及所述训练视频,对所述解码生成子网络进行训练。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the mouth feature information includes at least one blurred mouth picture, and the video processing network further includes a decoding generation sub-network, based on the mouth feature Information and the training audio clip, training the video processing network, further includes: using the trained feature extraction sub-network to process the training feature spectrum and the at least one blurred mouth picture to obtain at least A target visual feature vector; according to the at least one target visual feature vector and the training video, the decoding generation sub-network is trained.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述嘴部特征信息还包括多个嘴部关键点,根据所述至少一个目标视觉特征向量以及所述训练视频,对所述解码生成子网络进行训练,包括:利用所述多个嘴部关键点提供的嘴部位置信息,结合所述至少一个目标视觉特征向量对所述解码生成子网络进行训练。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the mouth feature information further includes a plurality of mouth key points, and according to the at least one target visual feature vector and the training video, all The decoding generation sub-network is trained, including: using the mouth position information provided by the plurality of mouth key points, combined with the at least one target visual feature vector to train the decoding generation sub-network.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述神经网络还包括判别子网络,所述判别子网络和所述解码生成子网络构成生成式对抗网络,在对所述解码生成子网络训练的过程中,对所述生成式对抗网络进行交替迭代训练,以得到训练好的所述解码生成子网络。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the neural network further includes a discriminant sub-network, the discriminant sub-network and the decoding-generating sub-network constitute a generative confrontation network, and the During the training process of the decoding generation subnetwork, the generative confrontation network is alternately and iteratively trained to obtain the trained decoding generation subnetwork.
本公开至少一实施例提供一种视频处理装置,包括:获取单元,配置为获取至少一个帧图像和音频片段,其中,每个帧图像包括至少一个对象,每个对象包括面部区域;预处理单元,配置为对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息;视频处理单元,配置为基于所述嘴部特征信息和所述音频片段,使用视频处理网络对所述至少一个帧图像进行处理,得到目标视频,其中,所述目标视频中的对象与所述音频片段具有同步的嘴型变化,其中,所述嘴部特征信息至少用于向所述视频处理网络提供所述每个对象的面部区域和嘴部的基本轮廓,以及所述每个对象的所述面部区域和所述嘴部的位置关系。At least one embodiment of the present disclosure provides a video processing device, including: an acquisition unit configured to acquire at least one frame image and an audio clip, wherein each frame image includes at least one object, and each object includes a face area; a preprocessing unit , configured to preprocess the at least one frame image to obtain the mouth feature information of the facial region; the video processing unit is configured to use a video processing network to process the mouth feature information and the audio clip based on the mouth feature information The at least one frame image is processed to obtain a target video, wherein the object in the target video has a synchronous mouth shape change with the audio clip, and wherein the mouth feature information is at least used for reporting to the video processing network A basic outline of the face area and the mouth of each object, and a positional relationship between the face area and the mouth of each object are provided.
本公开至少一实施例提供一种神经网络的训练装置,包括:训练数据获取单元,配置为获取训练视频和与所述训练视频匹配的训练音频片段,其中,所述训练视频包括至少一个训练帧图像,每个训练帧图像包括至少一个对象,每个对象包括面部区域;预处理单元,配置为对所述训练视频进行预处理,得到所述面部区域的嘴部特征信息;训练单元,配置为基于所述嘴部特征信息和所述训练音频片段,对所述视频处理网络进行训练,其中,所述嘴部特征信息至少用于向所述视频处理网络提供所述每个对象的面部区域和嘴部的基本轮廓,以及所述每个对象的所述面部区域和所述嘴部的位置关系。At least one embodiment of the present disclosure provides a neural network training device, including: a training data acquisition unit configured to acquire a training video and a training audio segment matching the training video, wherein the training video includes at least one training frame Image, each training frame image includes at least one object, and each object includes a facial area; a preprocessing unit is configured to preprocess the training video to obtain mouth feature information of the facial area; a training unit is configured to Based on the mouth feature information and the training audio clips, the video processing network is trained, wherein the mouth feature information is at least used to provide the video processing network with the facial area and The basic outline of the mouth, and the positional relationship between the facial area and the mouth of each object.
本公开至少一实施例提供一种电子设备,包括:存储器,非瞬时性地存储有计算机可执行指令;处理器,配置为运行所述计算机可执行指令,其中,所述计算机可执行指令被所述处理器运行时实现根据本公开任一实施例所述的视频处理方法或本公开任一实施例所述的训练方法。At least one embodiment of the present disclosure provides an electronic device, including: a memory storing computer-executable instructions in a non-transitory manner; a processor configured to run the computer-executable instructions, wherein the computer-executable instructions are executed by the The processor implements the video processing method according to any embodiment of the present disclosure or the training method described in any embodiment of the present disclosure when running.
本公开至少一实施例提供一种非瞬时性计算机可读存储介质,其中,所述非瞬时性计算机可读存储介质存储有计算机可执行指令,所述计算机可执行指令被处理器执行时实现根据本公开任一实施例所述的视频处理方法或本公开任一实施例所述的训练方法。At least one embodiment of the present disclosure provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are executed by a processor, the computer-executable instructions according to The video processing method described in any embodiment of the present disclosure or the training method described in any embodiment of the present disclosure.
Brief Description of the Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings of the embodiments will be briefly introduced below. Obviously, the accompanying drawings in the following description relate only to some embodiments of the present disclosure rather than limiting the present disclosure.
FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure;
FIG. 2A is a schematic diagram of a mouth blurring process provided by at least one embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a frame image provided by at least one embodiment of the present disclosure;
FIG. 2C is a blurred mouth picture provided by at least one embodiment of the present disclosure;
FIG. 3 is a flowchart of a video processing method provided by at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a feature spectrum provided by at least one embodiment of the present disclosure;
FIG. 5 is a flowchart of a neural network training method provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a neural network provided by an embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of a video processing device provided by at least one embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of a training device provided by at least one embodiment of the present disclosure;
FIG. 9 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a hardware environment provided by at least one embodiment of the present disclosure.
Detailed Description
为了使得本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例的附图,对本公开实施例的技术方案进行清楚、完整地描述。显然,所描述的实施例是本公开的一部分实施例,而不是全部的实施例。基于所描述的本公开的实施例,本领域普通技术人员在无需创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below in conjunction with the drawings of the embodiments of the present disclosure. Apparently, the described embodiments are some of the embodiments of the present disclosure, not all of them. Based on the described embodiments of the present disclosure, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
除非另外定义,本公开使用的技术术语或者科学术语应当为本公开所属 领域内具有一般技能的人士所理解的通常意义。本公开中使用的“第一”、“第二”以及类似的词语并不表示任何顺序、数量或者重要性,而只是用来区分不同的组成部分。“包括”或者“包含”等类似的词语意指出现该词前面的元件或者物件涵盖出现在该词后面列举的元件或者物件及其等同,而不排除其他元件或者物件。“连接”或者“相连”等类似的词语并非限定于物理的或者机械的连接,而是可以包括电性的连接,不管是直接的还是间接的。“上”、“下”、“左”、“右”等仅用于表示相对位置关系,当被描述对象的绝对位置改变后,则该相对位置关系也可能相应地改变。为了保持本公开实施例的以下说明清楚且简明,本公开省略了部分已知功能和已知部件的详细说明。Unless otherwise defined, the technical terms or scientific terms used in the present disclosure shall have the ordinary meanings understood by those having ordinary skill in the art to which the present disclosure belongs. "First", "second" and similar words used in the present disclosure do not indicate any order, quantity or importance, but are only used to distinguish different components. "Comprising" or "comprising" and similar words mean that the elements or items appearing before the word include the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "Down", "Left", "Right" and so on are only used to indicate the relative positional relationship. When the absolute position of the described object changes, the relative positional relationship may also change accordingly. In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits detailed descriptions of some known functions and known components.
目前,嘴型同步通常有两种实现方式。一种是人工方式进行重建,例如通过图像处理软件,例如photoshop等,对视频中的所有帧图像按照当前音频内容逐个修改嘴型状态,但实现这样的效果需要经历非常复杂的实现过程,耗时长且需要耗费巨大的人力物力。另一种方式是利用嘴型同步模型(例如Wav2Lip等嘴型生成模型)进行嘴型重建,输入模型的图像的嘴部区域被执行抠图处理,之后再进行嘴型重建,这种方式需要网络对嘴型进行从无到有的创造,由于在模型训练过程中,模型既要掌握脸部轮廓的区域,又要掌握嘴部的轮廓,那么模型需要掌握的范围过大,难以训练收敛。Currently, there are usually two implementations of lip sync. One is manual reconstruction, for example, through image processing software, such as photoshop, etc., to modify the mouth shape state of all frame images in the video one by one according to the current audio content, but to achieve such an effect requires a very complicated implementation process and takes a long time And it needs to consume huge manpower and material resources. Another way is to use a lip synchronization model (such as a mouth shape generation model such as Wav2Lip) to reconstruct the mouth shape. The mouth area of the image input to the model is cut out, and then the mouth shape is reconstructed. This method requires a network Create the mouth shape from scratch, because in the process of model training, the model needs to grasp not only the area of the facial contour, but also the contour of the mouth, so the range that the model needs to master is too large, and it is difficult to train and converge.
本公开至少一实施例提供一种视频处理方法,包括:获取至少一个帧图像和音频片段,其中,每个帧图像包括至少一个对象,每个对象包括面部区域;对至少一个帧图像进行预处理,得到面部区域的嘴部特征信息;基于嘴部特征信息和音频片段,使用视频处理网络对至少一个帧图像进行处理,得到目标视频,其中,目标视频中的对象具有与音频片段同步的嘴型变化,嘴部特征信息至少用于向视频处理网络提供每个对象的面部区域和嘴部的基本轮廓,以及每个对象的面部区域和嘴部的位置关系。At least one embodiment of the present disclosure provides a video processing method, including: acquiring at least one frame image and an audio segment, wherein each frame image includes at least one object, and each object includes a face area; preprocessing the at least one frame image , to obtain the mouth feature information of the face area; based on the mouth feature information and the audio clip, use the video processing network to process at least one frame image to obtain the target video, wherein the object in the target video has a mouth shape synchronized with the audio clip Change, the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth, and the positional relationship between each object's face area and mouth.
在该实施例的视频处理方法中,利用嘴部特征信息辅助视频处理网络得到目标视频,目标视频具有对应于音频片段的同步嘴型变化,相比于传统方式直接利用网络去做从无到有的创造,该方法利用嘴部特征信息向视频处理网络提供每个对象的面部区域与嘴部的基本轮廓,以及每个对象的面部区域和嘴部的位置关系,方便网络生成更加准确的嘴部区域,所得到的目标视频的嘴型部分匹配度更高,准确度也更高。In the video processing method of this embodiment, the mouth feature information is used to assist the video processing network to obtain the target video, and the target video has a synchronous mouth shape change corresponding to the audio clip, which is compared to the traditional way of directly using the network to do it from scratch The method uses mouth feature information to provide the video processing network with the basic outline of each object's facial area and mouth, as well as the positional relationship between each object's facial area and mouth, so that the network can generate more accurate mouths. area, the resulting target video has a higher matching degree of mouth shape and higher accuracy.
本公开至少一实施例提供的视频处理方法可应用于本公开实施例提供 的视频处理装置,该视频处理装置可被配置于电子设备上。该电子设备可以是个人计算机、移动终端等,该移动终端可以是手机、平板电脑、笔记本电脑等硬件设备。The video processing method provided in at least one embodiment of the present disclosure can be applied to the video processing device provided in the embodiment of the present disclosure, and the video processing device can be configured on an electronic device. The electronic device may be a personal computer, a mobile terminal, etc., and the mobile terminal may be a hardware device such as a mobile phone, a tablet computer, or a notebook computer.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.
FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure.
As shown in FIG. 1, the video processing method provided by at least one embodiment of the present disclosure includes steps S10 to S30.
In step S10, at least one frame image and an audio clip are acquired.
In step S20, the at least one frame image is preprocessed to obtain mouth feature information of the facial region.
In step S30, based on the mouth feature information and the audio clip, the at least one frame image is processed using a video processing network to obtain a target video.
For example, the object in the target video has mouth shape changes synchronized with the audio clip.
For example, the mouth feature information is at least used to provide the video processing network with the basic outline of the face area and the mouth of each object, and the positional relationship between the face area and the mouth of each object.
For example, each frame image includes at least one object, and each object includes a face area.
例如,可以获取一张静态的带有对象的图像作为帧图像,之后,基于该帧图像和音频片段,生成目标视频,在目标视频中,对象具有与音频片段同步的嘴型变化。For example, a static image with an object can be obtained as a frame image, and then a target video is generated based on the frame image and an audio clip. In the target video, the object has a mouth shape change synchronized with the audio clip.
例如,也可以获取一段预先录制、生成或制作的视频,该视频包括多个视频帧,视频帧中包括至少一个对象,将多个视频帧作为多个帧图像,之后,基于多个帧图像和音频片段,生成目标视频。For example, it is also possible to obtain a pre-recorded, generated or produced video, the video includes a plurality of video frames, the video frame includes at least one object, the plurality of video frames are used as a plurality of frame images, and then, based on the plurality of frame images and audio clips to generate the target video.
例如,对象可以包括真实人物、二维或三维动画人物、拟人化动物、仿生人等,这些对象均具有完整的面部区域,例如,面部区域包括嘴部、鼻子、眼睛、下巴等部位。For example, objects may include real people, 2D or 3D animated characters, anthropomorphic animals, bionic people, etc., and these objects all have complete facial regions, for example, facial regions include mouth, nose, eyes, chin and other parts.
例如,音频片段为目标视频中对象讲话的内容,例如,在动画配音领域,音频片段可以是动画人物的配音内容。For example, the audio segment is the speech content of the object in the target video. For example, in the field of animation dubbing, the audio segment may be the dubbing content of the animation character.
For example, in one embodiment, a video may be recorded in advance in which the lecturer faces the camera and says "Hello, children of ××", where ×× denotes a region. The plurality of video frames included in the recorded video are then the plurality of frame images, and the lecturer is the object included in the frame images. When the IP address obtained at the playback location belongs to region A, for example Beijing, the audio clip is "Hello, children of Beijing"; if region A is Tianjin, the audio clip is "Hello, children of Tianjin".
For example, in other embodiments, in the pre-recorded video the lecturer faces the camera and says "Student ××× wins first place, and student ××× wins second place". The plurality of video frames included in the recorded video are then the plurality of frame images, and the lecturer is the object included in the frame images. According to the obtained ranking, for example Zhang San in first place and Li Si in second place, the audio clip is "Zhang San wins first place, and Li Si wins second place".
For example, the audio clip may be a voice segment pre-recorded by a user, or a voice segment converted from a text segment; the present disclosure does not limit the manner in which the audio clip is acquired.
For example, the frame image may be an original image obtained by shooting, or a processed image obtained by performing image processing on an original image; the present disclosure does not limit this.
For example, the mouth feature information includes at least one mouth-blurred picture. The mouth-blurred picture is used to provide the video processing network with the basic outline of each object's facial region and mouth, and the positional relationship between each object's facial region and mouth.
For example, step S20 may include: blurring the mouth of the object in each frame image by using a mouth blur model to obtain the mouth-blurred picture corresponding to each frame image.
For example, the mouth-blurred picture is obtained by blurring the mouth of the object in the frame image, that is, by blurring the mouth region of the object in the frame image. This provides the video processing network with the basic outlines of the facial region and the mouth region, and the positional relationship between each object's facial region and mouth, while retaining most of the structure of the picture. It thus helps the network generate a more accurate mouth image, adds mouth position regression to the processing of the video processing network, and enhances the robustness of mouth shape generation.
For example, blurring the mouth of the object in each frame image by using the mouth blur model to obtain the mouth-blurred picture corresponding to each frame image may include: performing a first color space conversion on the frame image to obtain a first converted image; and extracting the mouth region in the first converted image and performing a first filtering process on the mouth region to obtain the mouth-blurred picture corresponding to the frame image.
For example, the first color space is the HSI color space, where H denotes hue, S denotes saturation (or chroma), and I denotes intensity (or brightness); the HSI color space describes color by the H component, the S component, and the I component.
For example, the frame image is converted from the RGB color space to the HSI color space, that is, the value of each pixel is converted from the original R component (red), G component (green), and B component (blue) into the H component, S component, and I component. The conversion formulas are as follows:

$$
I = \frac{1}{3}(R+G+B),\qquad
S = 1 - \frac{3}{R+G+B}\min(R,G,B),\qquad
H = \begin{cases}\theta, & B \le G\\ 360^{\circ}-\theta, & B > G\end{cases}
$$

$$
\theta = \arccos\!\left(\frac{\tfrac{1}{2}\left[(R-G)+(R-B)\right]}{\left[(R-G)^{2}+(R-B)(G-B)\right]^{1/2}}\right)
$$

where I denotes the I component in the HSI color space, S denotes the S component in the HSI color space, H denotes the H component in the HSI color space, R denotes the R component in the RGB color space, G denotes the G component in the RGB color space, B denotes the B component in the RGB color space, min(*) denotes the minimum function, and θ denotes an angle parameter.
After the HSI color space conversion, since lips are usually red and the H component in the HSI color space is more sensitive to red regions, the H component of the mouth region is relatively large. Therefore, the region of the first converted image whose H component is greater than a preset threshold can be extracted as the mouth region, mean filtering is performed on the mouth region, and the filtering result is taken as the mouth-blurred picture corresponding to the frame image.
For example, in order to increase the weight of red regions in the H component, the present disclosure modifies the calculation formula of the angle parameter as follows:

$$
\theta = \arccos\!\left(\frac{\tfrac{1}{2}\left[(R-G)+(R-B)\right]}{\left[(R-G)^{2}+(R-B)(G-B)+(R-B)^{2}\right]^{1/2}}\right)
$$

That is, an (R−B)² term is added to the denominator of the angle parameter to increase the sensitivity of the R component relative to the B component, emphasize the weight of the red part of the mouth region in the H component, and improve the accuracy of the determined mouth region.
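As a non-authoritative illustration of the processing described above, the following Python sketch computes the modified angle parameter, treats pixels whose H component exceeds a preset threshold as the mouth region, and mean-filters that region. The threshold value, the filter kernel size, and the use of OpenCV are assumptions for illustration only and are not fixed by this disclosure.

```python
import cv2
import numpy as np

def mouth_blur_hsi(frame_bgr, h_threshold=0.3, kernel_size=25):
    # Illustrative sketch only; h_threshold and kernel_size are placeholder values.
    img = frame_bgr.astype(np.float64) / 255.0
    B, G, R = cv2.split(img)
    # Angle parameter with the extra (R-B)^2 term in the denominator,
    # which emphasizes red (lip) pixels.
    num = 0.5 * ((R - G) + (R - B))
    den = np.sqrt((R - G) ** 2 + (R - B) * (G - B) + (R - B) ** 2) + 1e-8
    theta = np.arccos(np.clip(num / den, -1.0, 1.0))
    # H component, normalized to [0, 1].
    H = np.where(B <= G, theta, 2.0 * np.pi - theta) / (2.0 * np.pi)
    # Pixels whose H component exceeds the preset threshold are treated as
    # the mouth region and replaced by their mean-filtered (blurred) values.
    mouth_mask = H > h_threshold
    blurred = cv2.blur(frame_bgr, (kernel_size, kernel_size))
    out = frame_bgr.copy()
    out[mouth_mask] = blurred[mouth_mask]
    return out
```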
For example, if the object in the frame image is an object having a skin region, such as a person, the skin region can be further extracted on the basis of the above process, a preset region including the mouth is selected from the skin region, the preset region is filtered, and the two filtering results are combined to obtain the mouth-blurred picture with the mouth blurred, which enhances the blurring effect.
For example, blurring the mouth of the object in each frame image by using the mouth blur model to obtain the mouth-blurred picture corresponding to each frame image may include: performing a first color space conversion on the frame image to obtain a first converted image; extracting the mouth region in the first converted image and performing a first filtering process on the mouth region to obtain a first intermediate blurred image; performing a second color space conversion on the frame image to obtain a second converted image; extracting the skin region in the second converted image and selecting a preset region including the mouth from the skin region; performing a second filtering process on the preset region to obtain a second intermediate blurred image; and combining the first intermediate blurred image and the second intermediate blurred image to obtain the mouth-blurred picture corresponding to the frame image.
For example, the second color space is the YCbCr color space. In the YCbCr color space, "Y" denotes luminance, that is, the gray level of a pixel, while "Cb" and "Cr" denote chrominance, which describes the color and saturation of the image and specifies the color of a pixel. "Cr" reflects the difference between the red part of the RGB input signal and the luminance of the RGB signal, that is, the red chrominance component of the pixel, and "Cb" reflects the difference between the blue part of the RGB input signal and the luminance of the RGB signal, that is, the blue chrominance component of the pixel. The luminance of the RGB signal is obtained by adding specific parts of the RGB input signal together.
At present, images are generally based on the RGB (red, green, blue) color space. In the RGB color space, the skin color of a human image is strongly affected by luminance, so skin-color points are difficult to separate from non-skin-color points; in a face image processed in the RGB color space, the skin-color points are discrete points with many non-skin-color points embedded among them, which makes skin-region calibration (such as face calibration and eye calibration) difficult. The YCbCr color space is often used in face detection, because converting from the RGB color space to the YCbCr color space allows the influence of luminance to be ignored. Since the YCbCr color space is only slightly affected by luminance, skin colors cluster well, so the three-dimensional color space can be mapped onto the two-dimensional CbCr plane, where the skin-color points form a definite shape, thereby enabling a human image to be recognized according to skin color. In other words, the YCbCr color space is a color model that separates out luminance, so that skin-color points are not made difficult to separate by the brightness of the lighting.
For example, the frame image is mapped into the YCbCr color space to obtain a mapped image; the mapped image is then projected onto the CbCr plane to obtain a skin-color sample image, which includes skin-color sample points corresponding to the pixels of the frame image. Finally, the skin-color sample image is traversed, and for each skin-color sample point, if the sample point lies on or inside the elliptical boundary of skin pixels, the corresponding pixel in the frame image is judged to belong to the skin region, and if the sample point does not lie on or inside the elliptical boundary, the corresponding pixel is judged not to belong to the skin region. The skin region in the second converted image is thereby extracted.
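A minimal sketch of the CbCr-plane skin test described above is given below. The elliptical skin-color boundary's center, semi-axes, and rotation are placeholder values assumed for illustration (the disclosure does not specify the ellipse parameters), and OpenCV's YCrCb conversion is used only as an example.

```python
import cv2
import numpy as np

def skin_mask_ycbcr(frame_bgr,
                    center=(113.0, 155.6),  # (Cb, Cr) ellipse center: assumed values
                    axes=(23.4, 15.2),      # ellipse semi-axes: assumed values
                    angle_rad=0.76):        # ellipse rotation: assumed value
    # Map the frame image into the YCbCr color space and project each pixel
    # onto the CbCr plane.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    Cr, Cb = ycrcb[..., 1], ycrcb[..., 2]
    # Test whether each projected point lies on or inside the skin-color ellipse.
    cos_a, sin_a = np.cos(angle_rad), np.sin(angle_rad)
    x = cos_a * (Cb - center[0]) + sin_a * (Cr - center[1])
    y = -sin_a * (Cb - center[0]) + cos_a * (Cr - center[1])
    return (x / axes[0]) ** 2 + (y / axes[1]) ** 2 <= 1.0  # boolean skin mask
```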
For example, in some embodiments, a facial key point detection model may be used to process the frame image to obtain a plurality of facial key points, and the positions of the facial key points are used to determine whether the face of the object in the frame image is oriented with the eyes toward the upper side of the frame image and the chin toward the lower side. If so, the face orientation of the object is normal and the mouth region is located in the lower part of the frame image; in this case, a preset coordinate interval in the skin region can be extracted, for example the lower half of the object's skin region, as the preset region including the mouth. If not, the face orientation of the object is not normal, and the preset coordinate interval in the skin region is extracted after the frame image is rotated, so as to obtain the preset region including the mouth.
For example, in some embodiments, the preset region including the mouth may be determined according to the skin proportion in the skin region. For example, the chin part contains only the mouth and has a relatively high skin proportion, while the forehead part contains non-skin areas such as hair and has a relatively low skin proportion. Accordingly, the skin proportion can be used to determine whether the face of the object in the frame image is oriented with the eyes up and the chin down. If the part with a high skin proportion is located in the lower part of the frame image, the face orientation of the object is normal, and the preset region including the mouth is extracted from the skin region by the extraction process described above; if the part with a high skin proportion is located in the upper part of the frame image, the face orientation of the object is not normal, and the preset region including the mouth is extracted from the skin region by the extraction process described above after the frame image is rotated.
For example, after the preset region is extracted, mean filtering is performed on the preset region, and the filtering result is taken as the second intermediate blurred image.
For example, the frame image is converted from the RGB color space to the HSI color space to obtain the first converted image, the region of the first converted image whose H component is greater than the preset threshold is extracted as the mouth region, mean filtering is performed on the mouth region, and the filtering result is taken as the first intermediate blurred image.
For example, after the first intermediate blurred image and the second intermediate blurred image are obtained, they are combined, for example by adding the pixel values at corresponding positions, to obtain the mouth-blurred picture corresponding to the frame image. The addition may use equal weights to prevent the pixel values from becoming too large; for example, a decimal between 0 and 1 (such as 0.5) may be set as the weight, and the pixels at corresponding positions of the first intermediate blurred image and the second intermediate blurred image are each multiplied by the weight and then added, so as to obtain the pixel value at the corresponding position of the mouth-blurred picture.
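The equal-weight composition of the two intermediate blurred images can be sketched as follows; the 0.5 weight matches the example above, and the function name is illustrative.

```python
import numpy as np

def compose_blurred(first_intermediate, second_intermediate, weight=0.5):
    # Multiply corresponding pixels of both intermediate blurred images by the
    # weight and add them, keeping the result inside the valid pixel range.
    merged = weight * first_intermediate.astype(np.float64) \
             + weight * second_intermediate.astype(np.float64)
    return np.clip(merged, 0, 255).astype(np.uint8)
```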
For example, when the frame image includes a plurality of objects, the above blurring process is performed on each object separately, so that the mouth of every object is blurred.
FIG. 2A is a schematic diagram of the mouth blurring process provided by at least one embodiment of the present disclosure. The execution of the mouth blurring process is described in detail below with reference to FIG. 2A.
After the frame image is obtained, the first color space conversion is performed on the frame image, that is, the frame image is converted into the HSI color space to obtain the first converted image; the specific process is as described above and is not repeated here.
Then, the mouth region in the first converted image is extracted, for example according to the H component; the specific process is as described above and is not repeated here.
Then, mean filtering is performed on the mouth region to obtain the first intermediate blurred image.
Meanwhile, the second color space conversion is performed on the frame image, that is, the frame image is converted into the YCbCr color space to obtain the second converted image.
Then, the skin region in the second converted image is extracted; the specific process is as described above and is not repeated here.
Then, the preset region including the mouth is extracted; the specific process is as described above and is not repeated here.
Then, mean filtering is performed on the preset region to obtain the second intermediate blurred image.
Finally, the first intermediate blurred image and the second intermediate blurred image are combined to obtain the mouth-blurred picture corresponding to the frame image.
FIG. 2B is a schematic diagram of a frame image provided by at least one embodiment of the present disclosure. As shown in FIG. 2B, the frame image includes one object, and the object has a complete facial region.
FIG. 2C is a mouth-blurred picture provided by at least one embodiment of the present disclosure, obtained by blurring the mouth of the object in the frame image shown in FIG. 2B. As shown in FIG. 2C, in the mouth-blurred picture the lower half of the object's face is blurred, but the basic outline and position of the face and mouth can still be seen. Compared with the conventional approach of cutting out the mouth, most of the structure of the picture is retained, which makes it easier for the network to generate a more accurate mouth image based on the relevant information.
It should be noted that in FIG. 2B and FIG. 2C the eye area is mosaicked to protect privacy; the actual processing does not involve this step.
Since the input to the video processing network is a mouth-blurred picture in which the mouth region has been blurred, the mouth-blurred picture provides the basic outlines of the mouth and face and, compared with other methods, can help the video processing network generate a more accurate mouth image. However, the video processing network does not know which region is blurred and which region is clear, and the mouth position may differ from one frame image to another, which makes it difficult to improve the processing effect of the model.
For example, in a blurred region the outline of an object is not obvious and the gray level at the outline edge changes only weakly, giving a weak sense of depth, whereas in a clear region the gray level at the outline edge changes markedly and the sense of depth is strong. The gradient represents the directional derivative at a pixel, and changes in the gradient values can be used to determine the outline edges in the mouth-blurred picture, thereby determining the extent of the blurred region (the region of the mouth-blurred picture that has been blurred) and the non-blurred region (the region that has not been blurred).
For example, the mouth feature information may further include at least one gradient feature map corresponding respectively to the at least one mouth-blurred picture. The gradient feature map is used to provide the video processing network with the extent of the blurred region and the non-blurred region in the corresponding mouth-blurred picture, so that the video processing network can obtain a more accurate mouth position range, reduce interference from image noise, and converge more quickly during training.
For example, step S20 may further include: performing gradient feature extraction on the at least one mouth-blurred picture to obtain the gradient feature map corresponding to each mouth-blurred picture, where the mouth feature information further includes the at least one gradient feature map corresponding respectively to the at least one mouth-blurred picture.
For example, for each mouth-blurred picture, the corresponding gradient feature map consists of the gradient values corresponding to the individual pixels of that mouth-blurred picture.
For example, performing gradient feature extraction on the at least one mouth-blurred picture to obtain the gradient feature map corresponding to each mouth-blurred picture may include: obtaining a grayscale image corresponding to each mouth-blurred picture; obtaining a first convolution kernel and a second convolution kernel, where the size of the first convolution kernel is smaller than the size of the second convolution kernel, the sum of all elements of the first convolution kernel is 0, and the sum of all elements of the second convolution kernel is 0; and convolving the grayscale image with the first convolution kernel and the second convolution kernel to obtain the gradient feature map corresponding to each mouth-blurred picture.
For example, if the mouth-blurred picture is a color picture, grayscale processing is performed on it to obtain the corresponding grayscale image.
For example, a gradient map is usually computed by convolving the grayscale image with a first convolution kernel A1 whose elements sum to 0 and whose size is usually 3×3. On this basis, the present disclosure provides a second convolution kernel A2 that also participates in the computation of the gradient feature map; the sum of all elements of A2 is also 0, and the size of A2 is larger than that of A1, for example 5×5 or 7×7. The second convolution kernel A2 thereby enlarges the receptive field of the gradient feature extraction, reduces the influence of noise interference, reduces the noise in the mouth-blurred picture, and reduces the impact of noise on the feature extraction performed by the subsequent feature extraction sub-network.
For example, the first convolution kernel A1 is a 3×3 kernel whose elements sum to 0 (the specific kernel is given as a formula image, PCTCN2022088965-appb-000003).
For example, the second convolution kernel A2 is a larger kernel, for example 5×5, whose elements also sum to 0 (the specific kernel is given as a formula image, PCTCN2022088965-appb-000004).
For example, the gradient feature map O is computed by convolving the grayscale image I with the first convolution kernel A1 and the second convolution kernel A2 (the specific combination formula is given as a formula image, PCTCN2022088965-appb-000005), where I denotes the grayscale image and ⊗ denotes the convolution operation.
It should be noted that the above first convolution kernel A1 and second convolution kernel A2 are merely illustrative; it suffices that the sum of all elements of A1 is 0, the sum of all elements of A2 is 0, and the size of the first convolution kernel is smaller than the size of the second convolution kernel. The present disclosure imposes no specific limitation on this.
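Since only zero-sum kernels with the first smaller than the second are required, the following sketch uses example Laplacian-style kernels as assumptions and combines the two convolution responses by summing their magnitudes, which is likewise only one possible choice rather than the formula of the disclosure.

```python
import cv2
import numpy as np

# Example zero-sum kernels (assumed values, shown for illustration only):
# a 3x3 kernel A1 and a 5x5 kernel A2, each of whose elements sum to 0.
A1 = np.array([[-1, -1, -1],
               [-1,  8, -1],
               [-1, -1, -1]], dtype=np.float64)
A2 = -np.ones((5, 5), dtype=np.float64)
A2[2, 2] = 24.0

def gradient_feature_map(mouth_blur_bgr):
    # Convert the mouth-blurred picture to grayscale and convolve it with
    # both kernels; combining the responses by absolute-value summation is
    # an assumption.
    gray = cv2.cvtColor(mouth_blur_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    g1 = cv2.filter2D(gray, -1, A1)
    g2 = cv2.filter2D(gray, -1, A2)
    return np.abs(g1) + np.abs(g2)
```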
For example, the mouth feature information may further include a plurality of mouth key points. The plurality of mouth key points are used to assist in determining the precise position of the mouth when generating the mouth shape of the object in the target video. That is, when the mouth feature information further includes the plurality of mouth key points, the mouth feature information is also used to provide the video processing network with the position of each object's mouth.
If only the mouth-blurred picture were used to assist in generating the target video, the mouth position in the target video might not be located very accurately; the mouth key points help improve the accuracy of the mouth position. In addition, the mouth key points allow the video processing network to focus only on information about the mouth and the surrounding muscles, without additionally learning information such as the overall facial contour, orientation, and structure. Therefore, using the mouth-blurred picture in combination with the mouth key points can effectively improve the accuracy of the mouth shape changes and mouth positions of the object in the finally generated target video.
For example, step S20 may further include: processing each frame image with a facial key point detection model to obtain a plurality of facial key points; and extracting, from the plurality of facial key points, a plurality of mouth key points related to the mouth.
For example, when the object in the frame image is a person, the facial key point detection model may be a face key point detection model, which processes the face in each frame image to obtain a plurality of facial key points corresponding to each frame image; these facial key points may include key points related to the eyes, nose, mouth, and other parts. Then, the plurality of mouth key points related to the mouth are extracted from the plurality of facial key points, and the position coordinates of the mouth key points are obtained. Here, the plurality of mouth key points include the mouth key points corresponding to all of the frame images; for example, if 25 mouth key points are obtained from each frame image and there are 10 frame images in total, then a total of 250 mouth key points are input into the decoding-generation sub-network as an aid to determining the precise position of the mouth.
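For illustration, the sketch below assumes a common 68-point facial landmark layout in which indices 48 to 67 cover the mouth; the disclosure itself does not fix the landmark model or the number of mouth key points (the example above uses 25 per frame), so these indices are an assumption.

```python
import numpy as np

# Hypothetical layout: indices 48-67 of a 68-point landmark set are the mouth.
MOUTH_INDICES = list(range(48, 68))

def extract_mouth_keypoints(facial_keypoints):
    # facial_keypoints: array of shape (68, 2) produced by any facial key
    # point detection model; returns only the mouth-related coordinates.
    points = np.asarray(facial_keypoints, dtype=np.float64)
    return points[MOUTH_INDICES]
```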
For example, the video processing network includes a feature extraction sub-network and a decoding-generation sub-network.
For example, step S30 may include: performing spectrum conversion on the audio clip to obtain a feature spectrum; performing feature extraction on the at least one mouth-blurred picture and the feature spectrum by using the feature extraction sub-network to obtain M visual feature vectors, where the M visual feature vectors match the audio clip and M is a positive integer less than or equal to the number of the at least one mouth-blurred picture; processing the M visual feature vectors by using the decoding-generation sub-network to obtain M target frames, where the M target frames correspond one-to-one to M time points in the audio clip and each target frame has a mouth shape corresponding to the corresponding time point in the audio clip; and obtaining the target video from the M target frames.
For example, when performing spectrum conversion on the audio clip, the Mel-scale Frequency Cepstral Coefficients (MFCC) of the audio clip may be extracted as the feature spectrum. In the field of speech recognition, MFCC is a set of feature vectors obtained by encoding the physical information of speech (such as the spectral envelope and spectral details). This set of feature vectors can be understood as m1 feature vectors of dimension n1: the audio clip includes m1 audio frames, each audio frame is converted into an n1-dimensional feature vector, and an n1×m1 matrix is thereby obtained as the feature spectrum.
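One way to obtain such a feature spectrum is sketched below using the librosa library; the sampling rate and the number of coefficients are assumptions, not values fixed by this disclosure.

```python
import librosa

def audio_to_feature_spectrum(wav_path, n_mfcc=13, sample_rate=16000):
    # Load the audio clip and extract its Mel-scale frequency cepstral
    # coefficients: one n_mfcc-dimensional vector per audio frame, giving an
    # (n_mfcc x m1) matrix as the feature spectrum.
    audio, sr = librosa.load(wav_path, sr=sample_rate)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
```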
FIG. 3 is a schematic diagram of a feature spectrum provided by at least one embodiment of the present disclosure. As shown in FIG. 3, the horizontal axis of the feature spectrum represents time, indicating that the audio clip includes 40 audio frames, and the vertical axis represents the MFCC feature vectors; the values in one column form one feature vector, and different gray levels represent different intensities.
Of course, other spectral feature extraction methods may also be used to process the audio clip to obtain the feature spectrum; the present disclosure does not limit this.
It should be noted that in the present disclosure, a video matching an audio clip means that the mouth shape of the object in a frame image of the video should be the shape corresponding to the audio content at the same time point as that frame image. For example, if the content of the audio clip is "Happy Birthday", the mouth shape changes in the video should match the mouth shapes of the object saying "Happy Birthday".
For example, the M visual feature vectors matching the audio clip means that the M visual feature vectors are synchronized with the audio clip. Because the training stage makes the audio feature vector output by the feature extraction sub-network (which represents the feature information of the audio clip, as described later) consistent with the visual feature vector, after the feature spectrum and all mouth-blurred pictures corresponding to all frame images are input into the feature extraction sub-network, the M output visual feature vectors are essentially the same vectors as the audio feature vectors, so that they match the audio clip.
For example, performing feature extraction on the at least one mouth-blurred picture and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors may include: dividing the at least one mouth-blurred picture into M groups in order, and extracting the visual feature vector corresponding to each group by using the feature extraction sub-network, so as to obtain the M visual feature vectors.
For example, if the number of frame images is y, blurring the y frame images yields y mouth-blurred pictures. The y mouth-blurred pictures are then grouped in display order, with every x mouth-blurred pictures forming one group, giving M = y/x groups of mouth-blurred pictures, where x and y are both positive integers. The M groups of mouth-blurred pictures are then input into the feature extraction sub-network in sequence to obtain the visual feature vector corresponding to each group, thereby obtaining the M visual feature vectors.
When the number of frame images is large, omitting this grouping may make the video processing network more difficult to train and less likely to converge. Considering that the mouth shape of the object does not change rapidly during speech and each articulation lasts for a period of time, the frame images can be grouped, which reduces the difficulty of network training and makes the network easier to converge without affecting the final result.
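The grouping itself is straightforward; a minimal sketch, assuming y is divisible by x, is:

```python
def group_mouth_pictures(mouth_blur_pictures, x):
    # Split y mouth-blurred pictures, kept in display order, into M = y / x
    # groups of x pictures each.
    y = len(mouth_blur_pictures)
    return [mouth_blur_pictures[i:i + x] for i in range(0, y, x)]
```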
For example, when the mouth feature information further includes at least one gradient feature map corresponding respectively to the at least one mouth-blurred picture, performing feature extraction on the at least one mouth-blurred picture and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors may include: performing feature extraction on the at least one mouth-blurred picture, the at least one gradient feature map, and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors, where the at least one gradient feature map provides the feature extraction sub-network with the extent of the blurred region and the non-blurred region in the corresponding mouth-blurred picture.
For example, if the mouth-blurred picture is a color image, the value of each pixel includes a set of RGB values, so the feature extraction sub-network has at least three input channels, corresponding to the R channel, the G channel, and the B channel. For example, one input channel is added alongside the R, G, and B channels; after the gradient feature map corresponding to the mouth-blurred picture is obtained, it is fed into the feature extraction sub-network through this added input channel. The input size of the feature extraction sub-network is then M×N×4, where M is the width of the mouth-blurred picture, N is the height of the mouth-blurred picture, and 4 is the number of input channels.
For example, if the plurality of mouth-blurred pictures are grouped in order, the gradient feature maps are grouped in the same way, and each mouth-blurred picture is input into the feature extraction sub-network together with its corresponding gradient feature map.
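Appending the gradient feature map as a fourth input channel can be sketched as follows; the channel order and data type are assumptions.

```python
import numpy as np

def make_four_channel_input(mouth_blur_rgb, gradient_map):
    # Stack the three color channels of the mouth-blurred picture with its
    # gradient feature map, producing an M x N x 4 input for the feature
    # extraction sub-network.
    grad = gradient_map.astype(np.float32)[..., np.newaxis]
    return np.concatenate([mouth_blur_rgb.astype(np.float32), grad], axis=-1)
```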
For example, when the mouth feature information further includes the plurality of mouth key points, processing the M visual feature vectors by using the decoding-generation sub-network to obtain the M target frames may include: processing each visual feature vector by using the decoding-generation sub-network to generate an intermediate frame with a mouth region; and correcting the position and image information of the mouth region of the intermediate frame by using the plurality of mouth key points to obtain the target frame corresponding to that visual feature vector.
If the mouth feature information includes only the mouth-blurred pictures, the mouth represented by the generated visual feature vectors remains blurred, and the decoding-generation sub-network cannot directly understand the structure and approximate shape of the face the way human perception does, so the mouth position in the pictures with mouth regions generated by the decoding-generation sub-network may not be very accurate. Therefore, the plurality of mouth key points can be used to help improve the accuracy of the mouth position and help the network generate more realistic pictures.
For example, the image information includes image information such as the muscles around the mouth region. The mouth key points can be used to locate the position of the mouth in the frame image, so they help the decoding-generation sub-network focus only on image information such as the mouth and its surrounding muscles, without additionally learning information such as the overall facial contour, orientation, and structure. Therefore, the mouth key points combined with the mouth-blurred pictures can effectively improve the accuracy of the mouth region generated in the target frames.
For example, the feature extraction sub-network and the decoding-generation sub-network may be convolutional neural networks or the like; the present disclosure does not limit the structures of the feature extraction sub-network and the decoding-generation sub-network.
FIG. 4 is a flowchart of a video processing method provided by at least one embodiment of the present disclosure. The execution of the video processing method provided by an embodiment of the present disclosure is described in detail below with reference to FIG. 4.
As shown in FIG. 4, the audio clip and the frame images are first acquired; for details about the audio clip and the frame images, reference may be made to the description of step S10, which is not repeated here.
The mouths of all objects included in each frame image are blurred to obtain the mouth-blurred picture corresponding to each frame image, gradient feature extraction is performed on each mouth-blurred picture to obtain the corresponding gradient feature map, and each frame image is processed with the facial key point detection model to obtain the plurality of mouth key points. For the generation of the mouth-blurred pictures, the gradient feature maps, and the mouth key points, reference may be made to the description of step S20, and repeated details are omitted.
Then, the feature spectrum and the mouth-blurred pictures and gradient feature maps, divided into M groups in order, are input into the feature extraction sub-network to obtain the M visual feature vectors.
Then, the M visual feature vectors and the plurality of mouth key points are input into the decoding-generation sub-network for processing to obtain the M target frames, each of which has a mouth shape corresponding to the corresponding time point in the audio clip. For example, if the audio clip is "Happy Birthday", the mouth shapes of the object in the M target frames follow the audio clip and successively show the mouth shapes of saying "Happy Birthday".
Then, the M target frames are arranged in order of display time to obtain the target video.
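Assembling the M target frames into the target video can be sketched with OpenCV as follows; the codec and frame rate are assumptions, and muxing of the audio track is omitted.

```python
import cv2

def frames_to_video(target_frames, out_path, fps=25):
    # Write the M target frames, in display order, into the target video file.
    height, width = target_frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    for frame in target_frames:
        writer.write(frame)
    writer.release()
```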
At least one embodiment of the present disclosure further provides a training method for a neural network. FIG. 5 is a flowchart of a neural network training method provided by an embodiment of the present disclosure.
As shown in FIG. 5, the neural network training method provided by at least one embodiment of the present disclosure includes steps S40 to S60. For example, the neural network includes a video processing network.
In step S40, a training video and a training audio clip matching the training video are acquired.
For example, the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a facial region.
In step S50, the training video is preprocessed to obtain mouth feature information corresponding to the training video.
In step S60, the video processing network is trained based on the mouth feature information and the training audio clip.
For example, the training video may be a video with mouth shape changes, and the mouth shape changes in the training video correspond to the content of the training audio clip. For example, the training video may show a speaker saying "Happy Birthday" to the camera; the object in the training frame images is the speaker, the training frame images include the speaker's facial region, and the training audio clip is "Happy Birthday".
For example, for the specific concepts of training frame image, object, and facial region, reference may be made to the description of frame image, object, and facial region in step S10, and repeated details are omitted.
For example, the mouth feature information may include the mouth-blurred pictures corresponding respectively to the training frame images; for the process of obtaining the mouth-blurred pictures, reference may be made to the description of step S20, which is not repeated here.
For example, the mouth feature information may include the gradient feature maps corresponding respectively to the mouth-blurred pictures; for the process of obtaining the gradient feature maps, reference may be made to the description of step S20, which is not repeated here.
For example, the mouth feature information may further include a plurality of mouth key points; for the process of obtaining the mouth key points, reference may be made to the description of step S20, which is not repeated here.
As described above, the mouth feature information provides the approximate outlines of the face and mouth and the positional relationship between the face and the mouth. Because the mouth-blurred picture still retains the overall outline of the picture, the network does not need to create the mouth from scratch, which helps the network converge quickly, speeds up the training process, and reduces training difficulty and time cost.
For example, as described above, the gradient feature map provides the extent of the blurred region and the non-blurred region in the corresponding mouth-blurred picture, giving the video processing network additional constraints that help the feature extraction sub-network determine an accurate mouth position, reduce interference from image noise, help the network converge quickly, speed up the training process, and reduce training difficulty and time cost.
In addition, as described above, the mouth key points provide mouth position information, so that during training the network mainly considers image information such as the mouth and its surrounding muscles and does not need to learn information such as the overall facial contour, orientation, and structure, which effectively improves training efficiency and yields a video processing network with higher accuracy.
For example, the video processing network includes the feature extraction sub-network and the decoding-generation sub-network. When training the video processing network, the feature extraction sub-network is trained first; after the training of the feature extraction sub-network is completed, the decoding-generation sub-network is trained in combination with the trained feature extraction sub-network. That is, during the training of the decoding-generation sub-network, the weight parameters of the feature extraction sub-network are not changed, and only the parameters of the decoding-generation sub-network are updated.
For example, step S60 may include: performing spectrum conversion on the training audio clip to obtain a training feature spectrum; and training the feature extraction sub-network to be trained by using the training feature spectrum and the mouth feature information, so as to obtain the trained feature extraction sub-network.
For example, the Mel-scale frequency cepstral coefficients of the training audio clip may be extracted as the training feature spectrum.
For example, training the feature extraction sub-network to be trained by using the training feature spectrum and the mouth feature information to obtain the trained feature extraction sub-network may include: processing the training feature spectrum and the at least one mouth-blurred picture by using the feature extraction sub-network to be trained to obtain a training visual feature vector and a training audio feature vector; calculating a loss value of the feature extraction sub-network from the training visual feature vector and the training audio feature vector via the loss function corresponding to the feature extraction sub-network; correcting the parameters of the feature extraction sub-network to be trained based on the loss value; and, when the loss value corresponding to the feature extraction sub-network to be trained does not satisfy a predetermined accuracy condition, continuing to input the training feature spectrum and the at least one mouth-blurred picture to repeat the above training process.
For example, during the training of the feature extraction sub-network, the gradient feature map corresponding to each mouth-blurred picture may also be input; for the specific input process, reference may be made to the description in the video processing method, which is not repeated here.
The training goal of the feature extraction sub-network is that the output visual feature vectors match the audio feature vectors; for the concept of matching, reference may be made to the description above. For example, the i-th feature element of the visual feature vector and the i-th feature element of the audio feature vector should match, meaning that the corresponding feature values of the visual feature vector and the audio feature vector are very close or identical. Therefore, during training, the loss value is calculated from the training visual feature vector and the training audio feature vector, and the parameters of the feature extraction sub-network are corrected based on the loss value, so that the visual feature vectors output by the trained feature extraction sub-network are consistent with the audio feature vectors.
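A minimal PyTorch-style sketch of one training step is given below; the mean-squared-error loss and the interface of the feature extraction sub-network (returning a visual feature vector and an audio feature vector for the same time points) are assumptions, since the disclosure does not fix the exact loss function or interface.

```python
import torch.nn.functional as F

def feature_extraction_train_step(feature_extractor, optimizer,
                                  mouth_blur_batch, gradient_batch, spectrum_batch):
    # The sub-network is assumed to return a training visual feature vector
    # and a training audio feature vector for the same time points.
    visual_vecs, audio_vecs = feature_extractor(mouth_blur_batch,
                                                gradient_batch, spectrum_batch)
    # Pull the visual feature vectors toward the audio feature vectors so
    # that the two become consistent (one possible choice of loss).
    loss = F.mse_loss(visual_vecs, audio_vecs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```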
After the training of the feature extraction sub-network is completed, step S60 may further include: processing the training feature spectrum and the at least one mouth-blurred picture by using the trained feature extraction sub-network to obtain at least one target visual feature vector; and training the decoding-generation sub-network based on the at least one target visual feature vector and the training video.
For example, training the decoding-generation sub-network based on the at least one target visual feature vector and the training video may include: training the decoding-generation sub-network using the mouth position information provided by the plurality of mouth key points in combination with the at least one target visual feature vector. In this process the mouth key points assist the training so that the mouth shape position becomes more accurate; for the specific technical effects of the mouth key points, reference may be made to the description above, which is not repeated here.
For example, the neural network further includes a discrimination sub-network, and the discrimination sub-network and the decoding-generation sub-network constitute a generative adversarial network (GAN). During the training of the decoding-generation sub-network, the generative adversarial network is trained by alternating iterations to obtain the trained decoding-generation sub-network.
For example, the decoding-generation sub-network acts as the generator of the generative adversarial network and generates images to "fool" the discriminator, while the discrimination sub-network acts as the discriminator of the generative adversarial network and judges the authenticity of the images generated by the decoding-generation sub-network. For example, during training, the generator first continuously generates image data that is judged by the discriminator; in this stage the discriminator's parameters are not adjusted, and only the generator is trained and its parameters adjusted, until the discriminator cannot judge the authenticity of the images generated by the generator. Then, the generator's parameters are fixed and the discriminator is trained until it can accurately judge the authenticity of the generator's images. This process is then repeated, with the generation and discrimination capabilities of the generator and the discriminator steadily improving, until a generator with the best generation effect is obtained.
图6为本公开一实施例提供的一种神经网络的结构示意图。Fig. 6 is a schematic structural diagram of a neural network provided by an embodiment of the present disclosure.
如图6所示,本公开至少一实施例提供的神经网络100包括视频处理网络101和判别子网络102,视频处理网络101包括特征提取子网络1011和解码生成子网络1012,并且,解码生成子网络1012和判别子网络102构成生成式对抗网络。As shown in FIG. 6 , the neural network 100 provided by at least one embodiment of the present disclosure includes a video processing network 101 and a discrimination subnetwork 102, the video processing network 101 includes a feature extraction subnetwork 1011 and a decoding generation subnetwork 1012, and the decoding generation subnetwork 1012 The network 1012 and the discriminative sub-network 102 constitute a generative adversarial network.
下面结合图6,具体说明视频处理网络101的训练过程。The training process of the video processing network 101 will be described in detail below with reference to FIG. 6 .
首先,先对特征提取子网络1011进行训练。例如,参考步骤S50的描述得到多个训练帧图像分别对应的多个嘴部模糊图片,以及多个嘴部模糊图片分别对应的多个梯度特征图,对训练音频片段进行频谱转换处理,得到训练特征频谱,将多个嘴部模糊图片、多个梯度特征图和特征频谱一起输入特征提取子网络1011进行处理,得到视觉特征向量和音频特征向量。之后,根据视觉特征向量和音频特征向量进行损失值计算,根据损失值调整特征提取子网络的参数,直到特征提取子网络对应的损失值满足预定准确率条件,得到训练好的特征提取子网络1011。First, the feature extraction sub-network 1011 is trained. For example, refer to the description of step S50 to obtain a plurality of blurred mouth pictures corresponding to a plurality of training frame images, and a plurality of gradient feature maps corresponding to a plurality of blurred mouth pictures respectively, and perform spectral conversion processing on the training audio clips to obtain training The feature spectrum is to input multiple blurred mouth pictures, multiple gradient feature maps and feature spectrum into the feature extraction sub-network 1011 for processing to obtain visual feature vectors and audio feature vectors. After that, calculate the loss value according to the visual feature vector and the audio feature vector, adjust the parameters of the feature extraction sub-network according to the loss value, until the loss value corresponding to the feature extraction sub-network meets the predetermined accuracy rate condition, and obtain the trained feature extraction sub-network 1011 .
At this point, the visual feature vector and the audio feature vector output by the trained feature extraction sub-network 1011 are consistent with each other.
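As a minimal sketch of such a training loop, the code below uses a cosine-distance term to push the visual and audio feature vectors toward consistency. The specific loss function and stopping condition are not fixed by this passage, so the cosine form, the `target_loss` threshold, and the batch layout of `loader` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def audio_visual_consistency_loss(visual_vecs, audio_vecs):
    # visual_vecs, audio_vecs: (batch, dim) embeddings produced by the feature
    # extraction sub-network from blurred mouth pictures (plus gradient feature
    # maps) and from the training feature spectrum, respectively.
    cos = F.cosine_similarity(visual_vecs, audio_vecs, dim=-1)
    # Small when the two embeddings agree, i.e. when they are "consistent".
    return (1.0 - cos).mean()

def train_feature_extractor(extractor, optimizer, loader, target_loss=0.05):
    # Repeat until the loss meets a predetermined accuracy condition.
    loss = float("inf")
    while loss > target_loss:
        for mouth_pics, grad_maps, spectrum in loader:
            optimizer.zero_grad()
            visual_vecs, audio_vecs = extractor(mouth_pics, grad_maps, spectrum)
            batch_loss = audio_visual_consistency_loss(visual_vecs, audio_vecs)
            batch_loss.backward()
            optimizer.step()
            loss = batch_loss.item()
    return extractor
```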
Afterwards, the decoding generation sub-network 1012 is trained in combination with the trained feature extraction sub-network 1011.
For example, after the blurred mouth pictures are input into the feature extraction sub-network 1011, a plurality of target visual feature vectors are obtained; at this point, the target visual feature vectors are consistent with the audio feature vectors output by the feature extraction sub-network 1011.
The target visual feature vectors and the mouth key points are input into the decoding generation sub-network 1012 for processing to obtain output frames. The mouth shape of the object in an output frame changes, but the change may differ from the mouth shape in the training frame image corresponding to the same display time point.
The output frames and the training frame images are input into the discrimination sub-network 102, which takes the mouth shapes in the training frame images as the standard. The decoding generation sub-network 1012 and the discrimination sub-network 102 are trained alternately with reference to the process described above, loss values are computed based on a binary cross-entropy loss function, and the parameters of the discrimination sub-network 102 and the decoding generation sub-network 1012 are corrected in alternation until the trained decoding generation sub-network 1012 is obtained.
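For this particular decoder/discriminator pair, the two phase losses referenced in the earlier scheduling sketch could be built from binary cross-entropy as follows. This is only an assumed illustration: the names `decoder` and `discriminator`, the batch fields, and the exact way the key points enter the decoder are not taken from the original disclosure.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_phase_loss(decoder, discriminator, batch):
    visual_vecs, mouth_keypoints, training_frames = batch
    # Generated output frames; detached so the decoder is not updated in this phase.
    output_frames = decoder(visual_vecs, mouth_keypoints).detach()
    real_logits = discriminator(training_frames)   # training frames are the "real" standard
    fake_logits = discriminator(output_frames)     # generated output frames are "fake"
    return (bce(real_logits, torch.ones_like(real_logits))
            + bce(fake_logits, torch.zeros_like(fake_logits)))

def generator_phase_loss(decoder, discriminator, batch):
    visual_vecs, mouth_keypoints, _ = batch
    output_frames = decoder(visual_vecs, mouth_keypoints)
    logits = discriminator(output_frames)
    # The decoder is rewarded when its output frames are judged as real.
    return bce(logits, torch.ones_like(logits))
```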
In the above embodiments, since a blurred mouth picture still retains the overall outline of the picture, the network does not need to create content from scratch, which helps the network converge quickly, speeds up the training of the feature extraction sub-network, and reduces training difficulty and time overhead. The gradient feature map provides the extent of the blurred region and the non-blurred region in the blurred mouth picture, so that the network can quickly locate the mouth region and converge faster. In addition, the mouth key points provide mouth position information, so that during training the decoding generation sub-network mainly considers image information of the mouth and its surrounding muscles and does not need to learn the overall facial contour, orientation and structure, which effectively improves training efficiency and yields a more accurate video processing network.
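One simple way to compute such a gradient feature map, assuming the two zero-sum convolution kernels of different sizes described elsewhere in this disclosure, is sketched below in Python. The particular 3x3 and 5x5 Laplacian-like kernels and the way the two responses are combined are assumptions made only for illustration.

```python
import cv2
import numpy as np

def gradient_feature_map(blurred_mouth_bgr):
    # Grayscale version of the blurred mouth picture.
    gray = cv2.cvtColor(blurred_mouth_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # Two zero-sum convolution kernels; the first is smaller than the second.
    k1 = np.full((3, 3), -1.0, dtype=np.float32); k1[1, 1] = 8.0    # elements sum to 0
    k2 = np.full((5, 5), -1.0, dtype=np.float32); k2[2, 2] = 24.0   # elements sum to 0

    # Convolve the grayscale image with both kernels.
    g1 = cv2.filter2D(gray, cv2.CV_32F, k1)
    g2 = cv2.filter2D(gray, cv2.CV_32F, k2)

    # Combining the absolute responses by averaging is an illustrative choice.
    grad = (np.abs(g1) + np.abs(g2)) / 2.0

    # Strong responses mark sharp (non-blurred) areas, weak responses mark the
    # blurred mouth area, so the map delimits the two regions.
    return cv2.normalize(grad, None, 0.0, 1.0, cv2.NORM_MINMAX)
```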
At least one embodiment of the present disclosure further provides a video processing apparatus. FIG. 7 is a schematic block diagram of a video processing apparatus provided by at least one embodiment of the present disclosure.
As shown in FIG. 7, the video processing apparatus 200 may include an acquisition unit 201, a preprocessing unit 202 and a video processing unit 203. These components are interconnected by a bus system and/or other forms of connection mechanism (not shown). It should be noted that the components and structure of the video processing apparatus 200 shown in FIG. 7 are exemplary rather than limiting, and the video processing apparatus 200 may also have other components and structures as required.
For example, these modules may be implemented by hardware (for example, circuit) modules, software modules, or any combination of the two; the same applies to the following embodiments and will not be repeated. For example, these units may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field-programmable gate array (FPGA) or other forms of processing units having data processing capability and/or instruction execution capability, together with corresponding computer instructions.
For example, the acquisition unit 201 is configured to acquire at least one frame image and an audio clip; for example, each frame image includes at least one object, and each object includes a face region.
For example, the acquisition unit 201 may include a memory that stores the frame images and the audio clip. For example, the acquisition unit 201 may include one or more cameras to shoot or record a video including multiple frame images, or a static frame image of an object. In addition, the acquisition unit 201 may also include a recording device to obtain the audio clip. For example, the acquisition unit 201 may be hardware, software, firmware, or any feasible combination thereof.
For example, the preprocessing unit 202 is configured to preprocess the at least one frame image to obtain mouth feature information of the face region.
For example, the video processing unit 203 may include a video processing network 204. Based on the mouth feature information and the audio clip, the video processing unit 203 uses the video processing network 204 to process the at least one frame image to obtain a target video, where the object in the target video has mouth shape changes synchronized with the audio clip.
The video processing network 204 includes a feature extraction sub-network and a decoding generation sub-network. It should be noted that the video processing network 204 in the video processing unit 203 has the same structure and function as the video processing network in the embodiments of the video processing method described above, which will not be repeated here.
It should be noted that the acquisition unit 201 may be used to implement step S10 shown in FIG. 1, the preprocessing unit 202 may be used to implement step S20 shown in FIG. 1, and the video processing unit 203 may be used to implement step S30 shown in FIG. 1. Therefore, for specific descriptions of the functions that can be realized by the acquisition unit 201, the preprocessing unit 202 and the video processing unit 203, reference may be made to the descriptions of steps S10 to S30 in the embodiments of the video processing method above, and repeated descriptions are omitted. In addition, the video processing apparatus 200 can achieve technical effects similar to those of the aforementioned video processing method, which will not be repeated here.
At least one embodiment of the present disclosure further provides a training apparatus for a neural network. FIG. 8 is a schematic block diagram of a training apparatus provided by at least one embodiment of the present disclosure.
As shown in FIG. 8, the training apparatus 300 may include a training data acquisition unit 301, a preprocessing unit 302 and a training unit 303. These components are interconnected by a bus system and/or other forms of connection mechanism (not shown). It should be noted that the components and structure of the training apparatus 300 shown in FIG. 8 are exemplary rather than limiting, and the training apparatus 300 may also have other components and structures as required.
For example, the training data acquisition unit 301 is configured to acquire a training video and a training audio clip matched with the training video. For example, the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a face region.
For example, the preprocessing unit 302 is configured to preprocess the training video to obtain mouth feature information of the face region.
For example, the training unit 303 is configured to train the video processing network based on the mouth feature information and the training audio clip.
For example, the training unit 303 includes a neural network 304 and a loss function (not shown), the neural network 304 includes the video processing network, and the training unit 303 is used to train the neural network 304 to be trained so as to obtain a trained video processing network.
For example, the video processing network includes a feature extraction sub-network and a decoding generation sub-network, and the neural network 304 further includes a discrimination sub-network; the discrimination sub-network and the decoding generation sub-network constitute a generative adversarial network. It should be noted that the neural network 304 in the training unit 303 has the same structure and function as the neural network 100 in the embodiments of the neural network training method described above, which will not be repeated here.
It should be noted that the training data acquisition unit 301 may be used to implement step S40 shown in FIG. 5, the preprocessing unit 302 may be used to implement step S50 shown in FIG. 5, and the training unit 303 may be used to implement step S60 shown in FIG. 5. Therefore, for specific descriptions of the functions that can be realized by the training data acquisition unit 301, the preprocessing unit 302 and the training unit 303, reference may be made to the descriptions of steps S40 to S60 in the embodiments of the training method above, and repeated descriptions are omitted. In addition, the training apparatus 300 can achieve technical effects similar to those of the aforementioned training method, which will not be repeated here.
FIG. 9 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 9, the electronic device 400 is, for example, suitable for implementing the video processing method or the training method provided by the embodiments of the present disclosure. It should be noted that the components of the electronic device 400 shown in FIG. 9 are only exemplary rather than limiting, and the electronic device 400 may also have other components according to actual application requirements.
As shown in FIG. 9, the electronic device 400 may include a processing device (for example, a central processing unit, a graphics processing unit, etc.) 401, which may perform various appropriate actions and processing according to non-transitory computer-readable instructions stored in a memory, so as to realize various functions.
For example, when the computer-readable instructions are run by the processing device 401, one or more steps of the video processing method according to any of the above embodiments may be executed. It should be noted that, for a detailed description of the processing procedure of the video processing method, reference may be made to the relevant descriptions in the embodiments of the video processing method above, and repeated descriptions are omitted.
For example, when the computer-readable instructions are run by the processing device 401, one or more steps of the neural network training method according to any of the above embodiments may be executed. It should be noted that, for a detailed description of the processing procedure of the training method, reference may be made to the relevant descriptions in the embodiments of the training method above, and repeated descriptions are omitted.
For example, the memory may include any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, a random access memory (RAM) 403 and/or a cache; for example, the computer-readable instructions may be loaded from a storage device 408 into the random access memory (RAM) 403 to be run. The non-volatile memory may include, for example, a read-only memory (ROM) 402, a hard disk, an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, a flash memory, and the like. Various applications and various data, such as style images and various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
For example, the processing device 401, the ROM 402 and the RAM 403 are connected to each other through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following devices may be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 408 including, for example, a magnetic tape, a hard disk, a flash memory, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to perform wireless or wired communication with other electronic devices to exchange data. Although FIG. 9 shows the electronic device 400 having various devices, it should be understood that it is not required to implement or include all of the devices shown, and the electronic device 400 may alternatively implement or include more or fewer devices. For example, the processor 401 may control other components in the electronic device 400 to perform desired functions. The processor 401 may be a device having data processing capability and/or program execution capability, such as a central processing unit (CPU), a tensor processing unit (TPU) or a graphics processing unit (GPU). The central processing unit (CPU) may be of an X86 or ARM architecture, etc. The GPU may be integrated directly on the motherboard as a separate component, or built into the north bridge chip of the motherboard; the GPU may also be built into the central processing unit (CPU).
FIG. 10 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure. For example, as shown in FIG. 10, the storage medium 500 may be a non-transitory computer-readable storage medium, and one or more computer-readable instructions 501 may be stored non-transitorily on the storage medium 500. For example, when the computer-readable instructions 501 are executed by a processor, one or more steps of the video processing method or the training method described above may be executed.
For example, the storage medium 500 may be applied to the above-mentioned electronic device; for example, the storage medium 500 may include the memory in the electronic device.
For example, the storage medium may include a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a flash memory, or any combination of the above storage media, and may also be other applicable storage media.
For example, for the description of the storage medium 500, reference may be made to the description of the memory in the embodiments of the electronic device, and repeated descriptions are omitted.
FIG. 11 is a schematic diagram of a hardware environment provided by at least one embodiment of the present disclosure. The electronic device provided by the present disclosure may be applied in an Internet system.
The functions of the image processing apparatus and/or electronic device involved in the present disclosure may be realized by the computer system provided in FIG. 11. Such a computer system may include a personal computer, a notebook computer, a tablet computer, a mobile phone, a personal digital assistant, smart glasses, a smart watch, a smart ring, a smart helmet, and any smart portable or wearable device. The specific system in this embodiment uses a functional block diagram to illustrate a hardware platform including a user interface. Such a computer device may be a general-purpose computer device or a special-purpose computer device, and both may be used to realize the image processing apparatus and/or electronic device in this embodiment. The computer system may include any component needed to implement the image processing described herein. For example, the computer system may be realized by a computer device through its hardware devices, software programs, firmware, and combinations thereof. For convenience, only one computer device is drawn in FIG. 11, but the computer functions related to the image processing described in this embodiment may be implemented in a distributed manner by a group of similar platforms, spreading the processing load of the computer system.
As shown in FIG. 11, the computer system may include a communication port 250, which is connected to a network for data communication; for example, the computer system may send and receive information and data through the communication port 250, that is, the communication port 250 allows the computer system to perform wireless or wired communication with other electronic devices to exchange data. The computer system may also include a processor group 220 (that is, the processor described above) for executing program instructions; the processor group 220 may consist of at least one processor (for example, a CPU). The computer system may include an internal communication bus 210. The computer system may include different forms of program storage units and data storage units (that is, the memory or storage medium described above), such as a hard disk 270, a read-only memory (ROM) 230 and a random access memory (RAM) 240, which can be used to store various data files used by the computer for processing and/or communication, as well as possible program instructions executed by the processor group 220. The computer system may also include an input/output component 260, which is used to realize the input/output data flow between the computer system and other components (for example, a user interface 280).
Generally, the following devices may be connected to the input/output component 260: input devices such as a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; output devices such as a display (for example, an LCD or OLED display), a speaker and a vibrator; storage devices including, for example, a magnetic tape and a hard disk; and a communication interface.
Although FIG. 11 shows a computer system having various devices, it should be understood that the computer system is not required to include all of the devices shown; alternatively, the computer system may include more or fewer devices.
For the present disclosure, the following points also need to be explained:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure, and other structures may refer to general designs.
(2) For the sake of clarity, in the drawings used to describe the embodiments of the present disclosure, the thickness and size of layers or structures are exaggerated. It will be understood that when an element such as a layer, film, region or substrate is referred to as being "on" or "under" another element, the element may be "directly on" or "directly under" the other element, or intervening elements may be present.
(3) In case of no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.
The above descriptions are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (22)

  1. A video processing method, comprising:
    acquiring at least one frame image and an audio clip, wherein each frame image comprises at least one object, and each object comprises a face region;
    preprocessing the at least one frame image to obtain mouth feature information of the face region;
    based on the mouth feature information and the audio clip, processing the at least one frame image by using a video processing network to obtain a target video,
    wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with a basic outline of the face region and the mouth of each object, and a positional relationship between the face region and the mouth of each object.
  2. The video processing method according to claim 1, wherein preprocessing the at least one frame image to obtain the mouth feature information of the face region comprises:
    blurring the mouth of the object in each frame image by using a mouth blur model to obtain a blurred mouth picture corresponding to each frame image,
    wherein the mouth feature information comprises at least one blurred mouth picture respectively corresponding to the at least one frame image.
  3. The video processing method according to claim 2, wherein blurring the mouth of the object in each frame image by using the mouth blur model to obtain the blurred mouth picture corresponding to each frame image comprises:
    performing a first color space conversion on the frame image to obtain a first converted image;
    extracting a mouth region in the first converted image, and performing a first filtering process on the mouth region to obtain the blurred mouth picture corresponding to the frame image.
  4. The video processing method according to claim 2, wherein blurring the mouth of the object in each frame image by using the mouth blur model to obtain the blurred mouth picture corresponding to each frame image comprises:
    performing a first color space conversion on the frame image to obtain a first converted image;
    extracting a mouth region in the first converted image, and performing a first filtering process on the mouth region to obtain a first intermediate blurred image;
    performing a second color space conversion on the frame image to obtain a second converted image;
    extracting a skin region in the second converted image, and selecting a preset region including the mouth from the skin region;
    performing a second filtering process on the preset region to obtain a second intermediate blurred image;
    synthesizing the first intermediate blurred image and the second intermediate blurred image to obtain the blurred mouth picture corresponding to the frame image.
  5. The video processing method according to claim 4, wherein the first color space is an HSI color space, and the second color space is a YCbCr color space.
  6. The video processing method according to any one of claims 2-5, wherein preprocessing the at least one frame image to obtain the mouth feature information of the face region further comprises:
    performing gradient feature extraction on the at least one blurred mouth picture to obtain a gradient feature map corresponding to each blurred mouth picture, wherein the mouth feature information further comprises at least one gradient feature map respectively corresponding to the at least one blurred mouth picture.
  7. The video processing method according to claim 6, wherein performing gradient feature extraction on the at least one blurred mouth picture to obtain the gradient feature map corresponding to each blurred mouth picture comprises:
    acquiring a grayscale image corresponding to each blurred mouth picture;
    acquiring a first convolution kernel and a second convolution kernel, wherein the size of the first convolution kernel is smaller than the size of the second convolution kernel, the sum of all elements in the first convolution kernel is 0, and the sum of all elements in the second convolution kernel is 0;
    convolving the grayscale image with the first convolution kernel and the second convolution kernel to obtain the gradient map corresponding to each blurred mouth picture.
  8. The video processing method according to any one of claims 2-7, wherein preprocessing the at least one frame image to obtain the mouth feature information of the face region further comprises:
    processing each frame image by using a facial key point detection model to obtain a plurality of facial key points;
    extracting a plurality of mouth key points related to the mouth from the plurality of facial key points, wherein the mouth feature information further comprises the plurality of mouth key points.
  9. The video processing method according to any one of claims 2-8, wherein the video processing network comprises a feature extraction sub-network and a decoding generation sub-network,
    and processing the at least one frame image by using the video processing network based on the mouth feature information and the audio clip comprises:
    performing spectral conversion processing on the audio clip to obtain a feature spectrum;
    performing feature extraction processing on the at least one blurred mouth picture and the feature spectrum by using the feature extraction sub-network to obtain M visual feature vectors, wherein the M visual feature vectors match the audio clip, and M is a positive integer less than or equal to the number of the at least one blurred mouth picture;
    processing the M visual feature vectors by using the decoding generation sub-network to obtain M target frames, wherein the M target frames correspond one-to-one to M time points in the audio clip, and each of the M target frames has a mouth shape corresponding to the corresponding time point in the audio clip;
    obtaining the target video according to the M target frames.
  10. The video processing method according to claim 9, wherein performing feature extraction processing on the at least one blurred mouth picture and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors comprises:
    dividing the at least one blurred mouth picture into M groups in sequence, and extracting a visual feature vector corresponding to each group by using the feature extraction sub-network, so as to obtain the M visual feature vectors.
  11. The video processing method according to claim 9 or 10, wherein the mouth feature information further comprises at least one gradient feature map respectively corresponding to the at least one blurred mouth picture,
    and performing feature extraction processing on the at least one blurred mouth picture and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors comprises:
    performing feature extraction processing on the at least one blurred mouth picture, the at least one gradient feature map and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors, wherein the at least one gradient feature map is used to provide the feature extraction sub-network with the extent of the blurred region and the non-blurred region in the corresponding blurred mouth picture.
  12. The video processing method according to any one of claims 9-11, wherein the mouth feature information further comprises a plurality of mouth key points,
    and processing the M visual feature vectors by using the decoding generation sub-network to obtain the M target frames comprises:
    processing each visual feature vector by using the decoding generation sub-network to generate an intermediate frame with a mouth region;
    correcting the position and image information of the mouth region of the intermediate frame by using the plurality of mouth key points to obtain the target frame corresponding to the visual feature vector.
  13. A training method for a neural network, wherein the neural network comprises a video processing network, and the training method comprises:
    acquiring a training video and a training audio clip matched with the training video, wherein the training video comprises at least one training frame image, each training frame image comprises at least one object, and each object comprises a face region;
    preprocessing the training video to obtain mouth feature information corresponding to the training video;
    training the video processing network based on the mouth feature information and the training audio clip,
    wherein the mouth feature information is at least used to provide the video processing network with a basic outline of the face region and the mouth of each object, and a positional relationship between the face region and the mouth of each object.
  14. The training method according to claim 13, wherein the video processing network comprises a feature extraction sub-network,
    and training the video processing network based on the mouth feature information and the training audio clip comprises:
    performing spectral conversion processing on the training audio clip to obtain a training feature spectrum;
    training the feature extraction sub-network to be trained by using the training feature spectrum and the mouth feature information, so as to obtain the trained feature extraction sub-network.
  15. The training method according to claim 14, wherein the mouth feature information comprises at least one blurred mouth picture,
    and training the feature extraction sub-network to be trained by using the training feature spectrum and the mouth feature information to obtain the trained feature extraction sub-network comprises:
    processing the training feature spectrum and the at least one blurred mouth picture by using the feature extraction sub-network to be trained to obtain a training visual feature vector and a training audio feature vector;
    calculating a loss value of the feature extraction sub-network through a loss function corresponding to the feature extraction sub-network according to the training visual feature vector and the training audio feature vector;
    correcting parameters of the feature extraction sub-network to be trained based on the loss value; and
    when the loss value corresponding to the feature extraction sub-network to be trained does not meet a predetermined accuracy condition, continuing to input the training feature spectrum and the at least one blurred mouth picture to repeat the above training process.
  16. The training method according to claim 15, wherein the mouth feature information comprises at least one blurred mouth picture,
    the video processing network further comprises a decoding generation sub-network,
    and training the video processing network based on the mouth feature information and the training audio clip further comprises:
    processing the training feature spectrum and the at least one blurred mouth picture by using the trained feature extraction sub-network to obtain at least one target visual feature vector;
    training the decoding generation sub-network according to the at least one target visual feature vector and the training video.
  17. The training method according to claim 16, wherein the mouth feature information further comprises a plurality of mouth key points,
    and training the decoding generation sub-network according to the at least one target visual feature vector and the training video comprises:
    training the decoding generation sub-network by using the mouth position information provided by the plurality of mouth key points in combination with the at least one target visual feature vector.
  18. The training method according to claim 16 or 17, wherein the neural network further comprises a discrimination sub-network, and the discrimination sub-network and the decoding generation sub-network constitute a generative adversarial network,
    and in the process of training the decoding generation sub-network, the generative adversarial network is trained alternately and iteratively to obtain the trained decoding generation sub-network.
  19. A video processing apparatus, comprising:
    an acquisition unit, configured to acquire at least one frame image and an audio clip, wherein each frame image comprises at least one object, and each object comprises a face region;
    a preprocessing unit, configured to preprocess the at least one frame image to obtain mouth feature information of the face region;
    a video processing unit, configured to process the at least one frame image by using a video processing network based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with a basic outline of the face region and the mouth of each object, and a positional relationship between the face region and the mouth of each object.
  20. A training apparatus for a neural network, wherein the neural network comprises a video processing network,
    and the training apparatus comprises:
    a training data acquisition unit, configured to acquire a training video and a training audio clip matched with the training video, wherein the training video comprises at least one training frame image, each training frame image comprises at least one object, and each object comprises a face region;
    a preprocessing unit, configured to preprocess the training video to obtain mouth feature information of the face region;
    a training unit, configured to train the video processing network based on the mouth feature information and the training audio clip,
    wherein the mouth feature information is at least used to provide the video processing network with a basic outline of the face region and the mouth of each object, and a positional relationship between the face region and the mouth of each object.
  21. An electronic device, comprising:
    a memory, non-transitorily storing computer-executable instructions;
    a processor, configured to run the computer-executable instructions,
    wherein, when the computer-executable instructions are run by the processor, the video processing method according to any one of claims 1-12 or the training method for a neural network according to any one of claims 13-18 is implemented.
  22. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions,
    and when the computer-executable instructions are executed by a processor, the video processing method according to any one of claims 1-12 or the training method for a neural network according to any one of claims 13-18 is implemented.
PCT/CN2022/088965 2021-11-04 2022-04-25 Video processing method and apparatus, and neural network training method and apparatus WO2023077742A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111296799.X 2021-11-04
CN202111296799.XA CN113723385B (en) 2021-11-04 2021-11-04 Video processing method and device and neural network training method and device

Publications (1)

Publication Number Publication Date
WO2023077742A1 true WO2023077742A1 (en) 2023-05-11

Family

ID=78686675

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/088965 WO2023077742A1 (en) 2021-11-04 2022-04-25 Video processing method and apparatus, and neural network training method and apparatus

Country Status (2)

Country Link
CN (1) CN113723385B (en)
WO (1) WO2023077742A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723385B (en) * 2021-11-04 2022-05-17 新东方教育科技集团有限公司 Video processing method and device and neural network training method and device
CN114419702B (en) * 2021-12-31 2023-12-01 南京硅基智能科技有限公司 Digital person generation model, training method of model, and digital person generation method
CN116668611A (en) * 2023-07-27 2023-08-29 小哆智能科技(北京)有限公司 Virtual digital human lip synchronization method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024156B (en) * 2010-11-16 2012-07-04 中国人民解放军国防科学技术大学 Method for positioning lip region in color face image
KR20210048441A (en) * 2018-05-24 2021-05-03 워너 브로스. 엔터테인먼트 인크. Matching mouth shape and movement in digital video to alternative audio
CN111212245B (en) * 2020-01-15 2022-03-25 北京猿力未来科技有限公司 Method and device for synthesizing video
CN111783566B (en) * 2020-06-15 2023-10-31 神思电子技术股份有限公司 Video synthesis method based on lip synchronization and enhancement of mental adaptation effect

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102271241A (en) * 2011-09-02 2011-12-07 北京邮电大学 Image communication method and system based on facial expression/action recognition
US20160343389A1 (en) * 2015-05-19 2016-11-24 Bxb Electronics Co., Ltd. Voice Control System, Voice Control Method, Computer Program Product, and Computer Readable Medium
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN113378697A (en) * 2021-06-08 2021-09-10 安徽大学 Method and device for generating speaking face video based on convolutional neural network
CN113723385A (en) * 2021-11-04 2021-11-30 新东方教育科技集团有限公司 Video processing method and device and neural network training method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FUDONG NIAN, WANG WENTAO, WANG YAN, ZHANG JINGJING, HU GUIHENG, LI TENG: "Speech Driven Talking Face Video Generation via Landmarks Representation", PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, vol. 34, no. 6, 15 June 2021 (2021-06-15), XP093063128 *
QIU XIAOXIN; ZHANG WENQIANG: "Adaptive Facial Skin Region Extraction Based on Lip Color and Complexion", WEIXING DIANNAO YINGYONG - MICROCOMPUTER APPLICATIONS, SHANGHAI SHI WEIXING DIANNAO YINGYONG XUEHUI, CN, vol. 31, no. 8, 20 August 2015 (2015-08-20), CN , pages 1 - 4, XP009545346, ISSN: 1007-757X *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557626A (en) * 2024-01-12 2024-02-13 泰安大陆医疗器械有限公司 Auxiliary positioning method for spray head installation of aerosol sprayer
CN117557626B (en) * 2024-01-12 2024-04-05 泰安大陆医疗器械有限公司 Auxiliary positioning method for spray head installation of aerosol sprayer

Also Published As

Publication number Publication date
CN113723385B (en) 2022-05-17
CN113723385A (en) 2021-11-30

Similar Documents

Publication Publication Date Title
WO2023077742A1 (en) Video processing method and apparatus, and neural network training method and apparatus
US9811894B2 (en) Image processing method and apparatus
Grishchenko et al. Attention mesh: High-fidelity face mesh prediction in real-time
US11410457B2 (en) Face reenactment
JP6636154B2 (en) Face image processing method and apparatus, and storage medium
CN107993216B (en) Image fusion method and equipment, storage medium and terminal thereof
WO2017035966A1 (en) Method and device for processing facial image
US11900557B2 (en) Three-dimensional face model generation method and apparatus, device, and medium
CN107771336A (en) Feature detection and mask in image based on distribution of color
KR102045575B1 (en) Smart mirror display device
JP2006163871A (en) Image processor and processing method, and program
CN115699114A (en) Image augmentation for analysis
WO2022151655A1 (en) Data set generation method and apparatus, forgery detection method and apparatus, device, medium and program
CN112995534B (en) Video generation method, device, equipment and readable storage medium
WO2023066120A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN113052783A (en) Face image fusion method based on face key points
Chen et al. Sound to visual: Hierarchical cross-modal talking face video generation
US20200082609A1 (en) Image processing method and image processing device
CN110059739B (en) Image synthesis method, image synthesis device, electronic equipment and computer-readable storage medium
CN110675438A (en) Lightweight rapid face exchange algorithm
KR100422470B1 (en) Method and apparatus for replacing a model face of moving image
WO2021155666A1 (en) Method and apparatus for generating image
WO2022266878A1 (en) Scene determining method and apparatus, and computer readable storage medium
WO2022022260A1 (en) Image style transfer method and apparatus therefor
US9563940B2 (en) Smart image enhancements