WO2023077742A1 - Video processing method and apparatus, and neural network training method and apparatus - Google Patents

Video processing method and apparatus, and neural network training method and apparatus

Info

Publication number
WO2023077742A1
Authority
WO
WIPO (PCT)
Prior art keywords
mouth
network
training
feature
blurred
Prior art date
Application number
PCT/CN2022/088965
Other languages
French (fr)
Chinese (zh)
Inventor
陈奕名
王麒铭
栾鹏龙
兰永亮
贾兆柱
Original Assignee
新东方教育科技集团有限公司
Priority date
Filing date
Publication date
Application filed by 新东方教育科技集团有限公司
Publication of WO2023077742A1 publication Critical patent/WO2023077742A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • Embodiments of the present disclosure relate to a video processing method, a video processing device, a neural network training method, a neural network training device, electronic equipment, and a non-transitory computer-readable storage medium.
  • Lip synchronization has a wide range of applications in scenarios such as game/anime character dubbing, digital avatars, and lip-sync voice translation.
  • a user can provide a piece of audio and a given character image or animated image, and a speech video of the corresponding character can be generated.
  • The mouth shape of the corresponding character in the speech video changes as the audio changes, and the character's mouth shape completely matches the audio.
  • At least one embodiment of the present disclosure provides a video processing method, including: acquiring at least one frame image and an audio clip, wherein each frame image includes at least one object, and each object includes a face area; preprocessing the at least one frame image to obtain mouth feature information of the facial region; and processing the at least one frame image with a video processing network based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth and the positional relationship between each object's face area and mouth.
  • Performing preprocessing on the at least one frame image to obtain mouth feature information of the facial region includes: blurring the mouth of the object in each frame image with a mouth blur model to obtain a mouth blurred picture corresponding to each frame image, wherein the mouth feature information includes at least one mouth blurred picture respectively corresponding to the at least one frame image.
  • Blurring the mouth of the object in each frame image with the mouth blur model to obtain the mouth blurred picture corresponding to each frame image includes: performing a first color space conversion on the frame image to obtain a first converted image; and extracting a mouth area in the first converted image and performing a first filtering process on the mouth area to obtain the mouth blurred picture corresponding to the frame image.
  • Alternatively, blurring the mouth of the object in each frame image with the mouth blur model to obtain the mouth blurred picture corresponding to each frame image includes: performing a first color space conversion on the frame image to obtain a first converted image; extracting a mouth area in the first converted image and performing a first filtering process on the mouth area to obtain a first intermediate blurred image; performing a second color space conversion on the frame image to obtain a second converted image; extracting a skin area in the second converted image and selecting a preset area including the mouth from the skin area; performing a second filtering process on the preset area to obtain a second intermediate blurred image; and synthesizing the first intermediate blurred image and the second intermediate blurred image to obtain the mouth blurred picture corresponding to the frame image.
  • the first color space is an HSI color space
  • the second color space is a YCbCr color space.
  • Performing preprocessing on the at least one frame image to obtain the mouth feature information of the facial region further includes: performing gradient feature extraction on the at least one mouth blurred picture to obtain a gradient feature map corresponding to each mouth blurred picture, wherein the mouth feature information further includes at least one gradient feature map corresponding to the at least one mouth blurred picture.
  • Performing gradient feature extraction on the at least one mouth blurred picture to obtain a gradient feature map corresponding to each mouth blurred picture includes: acquiring a grayscale image corresponding to each mouth blurred picture; acquiring a first convolution kernel and a second convolution kernel, wherein the size of the first convolution kernel is smaller than the size of the second convolution kernel, the sum of all elements in the first convolution kernel is 0, and the sum of all elements in the second convolution kernel is 0; and convolving the grayscale image with the first convolution kernel and the second convolution kernel to obtain the gradient feature map corresponding to each mouth blurred picture.
  • Performing preprocessing on the at least one frame image to obtain mouth feature information of the facial region further includes: processing each frame image with a facial key point detection model to obtain a plurality of facial key points; and extracting a plurality of mouth key points related to the mouth from the plurality of facial key points, wherein the mouth feature information further includes the plurality of mouth key points.
  • The video processing network includes a feature extraction sub-network and a decoding generation sub-network, and processing the at least one frame image with the video processing network based on the mouth feature information and the audio clip includes: performing spectrum conversion processing on the audio clip to obtain a feature spectrum; performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors, wherein the M visual feature vectors match the audio clip, and M is a positive integer less than or equal to the number of the at least one mouth blurred picture; processing the M visual feature vectors with the decoding generation sub-network to obtain M target frames, wherein the M target frames are in one-to-one correspondence with M time points in the audio clip and each of the M target frames has the mouth shape corresponding to the corresponding time point in the audio clip; and obtaining the target video according to the M target frames.
  • Performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors includes: dividing the at least one mouth blurred picture into M groups in sequence, and extracting the visual feature vector corresponding to each group with the feature extraction sub-network, so as to obtain the M visual feature vectors.
  • The mouth feature information further includes at least one gradient feature map corresponding to the at least one mouth blurred picture, and performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors includes: performing feature extraction processing on the at least one mouth blurred picture, the at least one gradient feature map, and the feature spectrum with the feature extraction sub-network to obtain the M visual feature vectors, wherein the at least one gradient feature map is used to provide the feature extraction sub-network with the ranges of the blurred area and the non-blurred area in the corresponding mouth blurred picture.
  • The mouth feature information further includes a plurality of mouth key points, and processing the M visual feature vectors with the decoding generation sub-network to obtain M target frames includes: processing each visual feature vector with the decoding generation sub-network to generate an intermediate frame with a mouth area; and correcting the position and image information of the mouth area of the intermediate frame with the plurality of mouth key points to obtain the target frame corresponding to the visual feature vector.
  • At least one embodiment of the present disclosure provides a neural network training method, wherein the neural network includes a video processing network, and the training method includes: acquiring a training video and a training audio clip matching the training video, wherein the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a facial area; preprocessing the training video to obtain mouth feature information corresponding to the training video; and training the video processing network based on the mouth feature information and the training audio clip.
  • The video processing network includes a feature extraction sub-network, and training the video processing network based on the mouth feature information and the training audio clip includes: performing spectrum conversion processing on the training audio clip to obtain a training feature spectrum; and training the feature extraction sub-network to be trained with the training feature spectrum and the mouth feature information to obtain the trained feature extraction sub-network.
  • The mouth feature information includes at least one mouth blurred picture, and training the feature extraction sub-network to be trained with the training feature spectrum and the mouth feature information to obtain the trained feature extraction sub-network includes: processing the training feature spectrum and the at least one mouth blurred picture with the feature extraction sub-network to be trained to obtain a training visual feature vector and a training audio feature vector; calculating a loss value of the feature extraction sub-network through a loss function corresponding to the feature extraction sub-network according to the training visual feature vector and the training audio feature vector; modifying the parameters of the feature extraction sub-network to be trained based on the loss value; and, when the loss value corresponding to the feature extraction sub-network to be trained does not satisfy a predetermined accuracy condition, continuing to input the training feature spectrum and the at least one mouth blurred picture to repeat the above training process.
  • The mouth feature information includes at least one mouth blurred picture, the video processing network further includes a decoding generation sub-network, and training the video processing network based on the mouth feature information and the training audio clip further includes: processing the training feature spectrum and the at least one mouth blurred picture with the trained feature extraction sub-network to obtain at least one target visual feature vector; and training the decoding generation sub-network according to the at least one target visual feature vector and the training video.
  • The mouth feature information further includes a plurality of mouth key points, and training the decoding generation sub-network according to the at least one target visual feature vector and the training video includes: training the decoding generation sub-network with the mouth position information provided by the plurality of mouth key points in combination with the at least one target visual feature vector.
  • The neural network further includes a discriminative sub-network, the discriminative sub-network and the decoding generation sub-network constitute a generative adversarial network, and during the training process of the decoding generation sub-network, the generative adversarial network is alternately and iteratively trained to obtain the trained decoding generation sub-network.
  • At least one embodiment of the present disclosure provides a video processing device, including: an acquisition unit configured to acquire at least one frame image and an audio clip, wherein each frame image includes at least one object and each object includes a face area; a preprocessing unit configured to preprocess the at least one frame image to obtain the mouth feature information of the facial region; and a video processing unit configured to process the at least one frame image with a video processing network based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with a basic outline of the face area and the mouth of each object and a positional relationship between the face area and the mouth of each object.
  • At least one embodiment of the present disclosure provides a neural network training device, including: a training data acquisition unit configured to acquire a training video and a training audio clip matching the training video, wherein the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a facial area; a preprocessing unit configured to preprocess the training video to obtain mouth feature information of the facial area; and a training unit configured to train the video processing network based on the mouth feature information and the training audio clip, wherein the mouth feature information is at least used to provide the video processing network with the basic outline of the facial area and the mouth of each object and the positional relationship between the facial area and the mouth of each object.
  • At least one embodiment of the present disclosure provides an electronic device, including: a memory storing computer-executable instructions in a non-transitory manner; and a processor configured to run the computer-executable instructions, wherein the computer-executable instructions, when run by the processor, implement the video processing method according to any embodiment of the present disclosure or the training method according to any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the video processing method according to any embodiment of the present disclosure or the training method according to any embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure
  • FIG. 2A is a schematic diagram of a mouth blurring process provided by at least one embodiment of the present disclosure
  • Fig. 2B is a schematic diagram of a frame image provided by at least one embodiment of the present disclosure.
  • Fig. 2C is a blurred mouth picture provided by at least one embodiment of the present disclosure.
  • FIG. 3 is a flowchart of a video processing method provided by at least one embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a characteristic spectrum provided by at least one embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a neural network training method provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a neural network provided by an embodiment of the present disclosure.
  • Fig. 7 is a schematic block diagram of a video processing device provided by at least one embodiment of the present disclosure.
  • Fig. 8 is a schematic block diagram of a training device provided by at least one embodiment of the present disclosure.
  • FIG. 9 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Fig. 10 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure
  • Fig. 11 is a schematic diagram of a hardware environment provided by at least one embodiment of the present disclosure.
  • There are usually two implementations of lip sync.
  • One is manual reconstruction, for example, using image processing software such as Photoshop to modify the mouth shape of every frame image in the video one by one according to the current audio content; however, achieving such an effect requires a very complicated process, takes a long time, and consumes a great deal of manpower and material resources.
  • The other way is to use a lip synchronization model (such as a mouth shape generation model like Wav2Lip) to reconstruct the mouth shape: the mouth area of the image input to the model is cut out, and the mouth shape is then reconstructed.
  • This method requires the network to create the mouth shape from scratch: during model training, the model needs to grasp not only the facial contour area but also the contour of the mouth, so the range the model needs to master is too large and it is difficult to train and converge.
  • At least one embodiment of the present disclosure provides a video processing method, including: acquiring at least one frame image and an audio clip, wherein each frame image includes at least one object and each object includes a face area; preprocessing the at least one frame image to obtain the mouth feature information of the face area; and processing the at least one frame image with a video processing network based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth and the positional relationship between each object's face area and mouth.
  • In the video processing method provided by at least one embodiment of the present disclosure, the mouth feature information is used to assist the video processing network in obtaining the target video, and the target video has mouth shape changes synchronized with the audio clip. Compared with the traditional way of having the network generate the mouth directly from scratch, this method uses the mouth feature information to provide the video processing network with the basic outline of each object's facial area and mouth, as well as the positional relationship between each object's facial area and mouth, so that the network can generate a more accurate mouth area, and the resulting target video has a better-matching and more accurate mouth shape.
  • the video processing method provided in at least one embodiment of the present disclosure can be applied to the video processing device provided in the embodiment of the present disclosure, and the video processing device can be configured on an electronic device.
  • the electronic device may be a personal computer, a mobile terminal, etc.
  • the mobile terminal may be a hardware device such as a mobile phone, a tablet computer, or a notebook computer.
  • FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure.
  • the video processing method provided by at least one embodiment of the present disclosure includes steps S10 to S30.
  • step S10 at least one frame image and an audio segment are acquired.
  • step S20 at least one frame image is preprocessed to obtain mouth feature information of the facial region.
  • step S30 based on the mouth feature information and the audio segment, at least one frame image is processed using a video processing network to obtain a target video.
  • An object in the target video has mouth shape changes that are synchronized with the audio clip.
  • the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth, and the positional relationship between each object's face area and mouth.
  • each frame image includes at least one object, and each object includes a face area.
  • a static image with an object can be obtained as a frame image, and then a target video is generated based on the frame image and an audio clip.
  • the object has a mouth shape change synchronized with the audio clip.
  • For a pre-recorded, generated, or produced video, the video includes a plurality of video frames, each video frame includes at least one object, and the plurality of video frames are used as a plurality of frame images; then, the target video is generated based on the plurality of frame images and the audio clip.
  • objects may include real people, 2D or 3D animated characters, anthropomorphic animals, bionic people, etc., and these objects all have complete facial regions, for example, facial regions include mouth, nose, eyes, chin and other parts.
  • the audio segment is the speech content of the object in the target video.
  • the audio segment may be the dubbing content of the animation character.
  • a video can be pre-recorded.
  • For example, the teacher will first face the camera and say "Hello, everyone from XX", where XX indicates the region.
  • the multiple video frames included in the recorded video are multiple frame images, and the lecturer is the object included in the frame images.
  • When the video is played, region A is determined from the IP address of the playback location; for example, if region A is Beijing, the audio clip is "Hello, children from Beijing"; if region A is Tianjin, the audio clip is "Hello, children from Tianjin".
  • In another example, in the pre-recorded video, the lecturer will face the camera and say "Classmate XXX won the first place, and classmate XXXX won the second place".
  • The multiple video frames included in the recorded video are the multiple frame images, and the lecturer is the object included in the frame images. According to the obtained ranking results, for example, if Zhang San is first and Li Si is second, the audio clip is "Zhang San won the first place, and Li Si won the second place".
  • the audio segment may be a pre-recorded voice segment by the user, or may be a voice segment converted from a text segment, and the present disclosure does not limit the acquisition method of the audio segment.
  • the frame image may be an original image obtained by shooting, or may be a processed image obtained by performing image processing on the original image, which is not limited in the present disclosure.
  • For example, the mouth feature information includes at least one mouth blurred picture; the mouth blurred picture is used to provide the video processing network with the basic outline of the facial area and mouth of each object and the positional relationship between the facial area and the mouth of each object.
  • step S20 may include: using a mouth blur model to blur the mouth of the object in each frame image to obtain a mouth blur picture corresponding to each frame image.
  • The mouth blurred picture is obtained by blurring the mouth of the object in the frame image, that is, blurring the mouth area of the object in the frame image, so as to provide the video processing network with the basic contours of the face area and the mouth area as well as the positional relationship between the facial area and the mouth of each object. The blurred picture retains most of the structure of the image, which makes it easier for the network to generate accurate mouth images, and mouth position regression is added during the processing of the video processing network to enhance the robustness of mouth shape generation.
  • For example, blurring the mouth of the object in each frame image with the mouth blur model to obtain the mouth blurred picture corresponding to each frame image may include: performing the first color space conversion on the frame image to obtain the first converted image; and extracting the mouth area in the first converted image and performing the first filtering process on the mouth area to obtain the mouth blurred picture corresponding to the frame image.
  • the first color space is the HSI color space, where H represents the hue (Hue), S represents the color saturation (Saturation or Chroma), and I represents the brightness (Intensity or Brightness).
  • The HSI color space uses the H component, the S component, and the I component to describe a color.
  • Converting the frame image from the RGB color space to the HSI color space means converting the value of each pixel from the original R component (red component), G component (green component), and B component (blue component) to the H component, S component, and I component. The standard conversion formula is: I = (R + G + B)/3; S = 1 − 3·min(R, G, B)/(R + G + B); H = θ if B ≤ G, and H = 360° − θ otherwise, where cos θ = [(R − G) + (R − B)] / [2·√((R − G)² + (R − B)(G − B))]. Here, I represents the I component in the HSI color space, S represents the S component in the HSI color space, H represents the H component in the HSI color space, R represents the R component in the RGB color space, G represents the G component in the RGB color space, B represents the B component in the RGB color space, min(*) represents the minimum value function, and θ represents the angle parameter.
  • The H component in the HSI color space is more sensitive to red areas, so the H component in the mouth area is relatively large. The area in the first converted image whose H component is greater than a preset threshold can therefore be extracted as the mouth area, mean filtering is performed on the mouth area, and the filtering result is used as the mouth blurred picture corresponding to the frame image.
  • The present disclosure modifies the calculation formula of the angle parameter by adding an (R − B)² component to the denominator of the angle parameter, so as to increase the sensitivity to the difference between the R component and the B component, highlight the weight of the red mouth area in the H component, and improve the accuracy of the determined mouth area.
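  • As an illustrative sketch (assuming OpenCV and NumPy), the HSI-based mouth extraction and blurring described above could look roughly like the following; the H threshold, the mean-filter kernel size, and the exact placement of the added (R − B)² term are assumptions for demonstration:

```python
import cv2
import numpy as np

def rgb_to_hsi(rgb):
    """Convert an RGB image with values in [0, 1] to H, S, I components."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    i = (r + g + b) / 3.0
    s = 1.0 - 3.0 * np.minimum(np.minimum(r, g), b) / (r + g + b + 1e-6)
    num = 0.5 * ((r - g) + (r - b))
    # The extra (r - b)**2 term in the denominator follows the modification
    # described above; its exact placement is an assumption.
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b) + (r - b) ** 2) + 1e-6
    theta = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    h = np.where(b <= g, theta, 360.0 - theta)
    return h, s, i

def blur_mouth_hsi(frame_bgr, h_threshold=300.0, ksize=15):
    """Mean-filter the region whose H component exceeds a preset threshold."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    h, _, _ = rgb_to_hsi(rgb)
    mouth_mask = h > h_threshold                   # threshold value is an assumption
    blurred = cv2.blur(frame_bgr, (ksize, ksize))  # mean filtering
    out = frame_bgr.copy()
    out[mouth_mask] = blurred[mouth_mask]
    return out
```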
  • When the object in the frame image is an object with a skin area, such as a person, the skin area can be further extracted, a preset area including the mouth can be selected from the skin area, and the preset area can be filtered. The results of the two filtering processes are then combined to obtain a mouth blurred picture with an enhanced blurring effect.
  • In this case, blurring the mouth of the object in each frame image with the mouth blur model to obtain the mouth blurred picture corresponding to each frame image may include: performing the first color space conversion on the frame image to obtain the first converted image; extracting the mouth area in the first converted image and performing the first filtering process on the mouth area to obtain the first intermediate blurred image; performing the second color space conversion on the frame image to obtain the second converted image; extracting the skin area in the second converted image and selecting a preset area including the mouth from the skin area; performing a second filtering process on the preset area to obtain a second intermediate blurred image; and synthesizing the first intermediate blurred image and the second intermediate blurred image to obtain the mouth blurred picture corresponding to the frame image.
  • the second color space is the YCbCr color space.
  • "Y” in the YCbCr color space represents the brightness, that is, the grayscale value of the pixel; while “Cr” and “Cb” represent the chroma, which are used to describe the color and saturation of the image, and are used to specify the pixel Among them, “Cr” reflects the difference between the red part of the RGB input signal and the brightness value of the RGB signal, that is, the red chrominance component of the pixel, and “Cb” reflects the blue color of the RGB input signal. The difference between the color part and the luminance value of the RGB signal, that is, the blue chrominance component of the pixel. RGB signal luminance values are obtained by summing specific parts of the RGB input signals together.
  • In the RGB color space, the skin color of a human body image is greatly affected by brightness, so it is difficult to separate skin color points from non-skin color points. That is to say, in a face image processed in the RGB color space, the skin color points are discrete points with many non-skin color points embedded among them, which makes skin color area calibration (such as face calibration and eye calibration) difficult.
  • The YCbCr color space is often used in face detection because the effect of brightness can be ignored after converting from the RGB color space to the YCbCr color space. Since the YCbCr color space is less affected by brightness, skin colors form good clusters; the three-dimensional color space can thus be mapped to the two-dimensional CbCr plane, where the skin color points form a definite shape, so that the human body image can be recognized according to skin color.
  • In other words, the YCbCr color space is a color model in which luminance is separated out, so that skin color points are no longer affected by lighting brightness in a way that makes them difficult to separate.
  • Specifically, the frame image is mapped to the YCbCr color space to obtain a mapped image; then the mapped image is projected onto the CbCr plane to obtain a skin color sample image, where the skin color sample image includes skin color sample points corresponding to the pixels of the frame image; finally, the skin color sample image is traversed. If a skin color sample point is located on the skin pixel ellipse boundary or within the ellipse, the pixel in the frame image corresponding to that skin color sample point is judged to belong to the skin area; if the skin color sample point is not located on the ellipse boundary or within the ellipse, the corresponding pixel is judged not to belong to the skin area. In this way, the skin area in the second converted image is extracted.
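  • A simplified sketch of the CbCr-plane skin test described above is given below; the ellipse center, axes, and rotation angle used here are illustrative assumptions rather than values specified by the disclosure:

```python
import cv2
import numpy as np

# Illustrative skin ellipse parameters on the CbCr plane (assumed values).
CENTER_CB, CENTER_CR = 113.0, 155.6
AXIS_CB, AXIS_CR = 23.4, 15.2
THETA = np.radians(43.0)  # rotation of the ellipse

def skin_mask_ycbcr(frame_bgr):
    """Return a boolean mask of pixels whose (Cb, Cr) lies on or inside the skin ellipse."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]   # note OpenCV channel order: Y, Cr, Cb
    x = np.cos(THETA) * (cb - CENTER_CB) + np.sin(THETA) * (cr - CENTER_CR)
    y = -np.sin(THETA) * (cb - CENTER_CB) + np.cos(THETA) * (cr - CENTER_CR)
    return (x / AXIS_CB) ** 2 + (y / AXIS_CR) ** 2 <= 1.0
```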
  • For example, the facial key point detection model can be used to process the frame image to obtain a plurality of facial key points, and, according to the positions of the facial key points, it is determined whether the eyes of the object's face are above the chin in the frame image, that is, whether the chin is on the lower side of the frame image. If so, the face direction of the object is normal and the mouth area is located on the lower side of the frame image, and a preset coordinate interval in the skin area, for example the lower half of the object's skin area, can be extracted as the preset area including the mouth. If not, the face direction of the object is abnormal; the frame image is rotated first, and then the preset coordinate interval in the skin area is extracted to obtain the preset area including the mouth.
  • the preset area including the mouth may be determined according to the skin ratio in the skin area.
  • For example, the chin part contains only the mouth and has a relatively high skin ratio, while the forehead part contains non-skin areas such as hair and has a low skin ratio. Therefore, it can be determined from the skin ratio whether the eyes of the object's face are above the chin in the frame image; for example, if the part with a high skin ratio is located in the lower part of the frame image, the face direction of the object is normal.
  • mean filtering is performed on the preset area, and the filtering result is used as the second intermediate blurred image.
  • For example, the frame image is converted from the RGB color space to the HSI color space to obtain the first converted image, the area in the first converted image whose H component is greater than the preset threshold is extracted as the mouth area, mean filtering is performed on the mouth area, and the filtering result is used as the first intermediate blurred image.
  • The first intermediate blurred image and the second intermediate blurred image are then synthesized, for example, by summing the pixels at corresponding positions, to obtain the mouth blurred picture corresponding to the frame image.
  • The addition process can use weights to prevent the pixel values from becoming too large. For example, a decimal between 0 and 1 (for example, 0.5) can be set as the weight value, and the pixels at corresponding positions in the first intermediate blurred image and the second intermediate blurred image are multiplied by the weight value and then added together to obtain the pixel values at the corresponding positions in the mouth blurred picture.
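  • For illustration, the weighted synthesis of the two intermediate blurred images can be sketched as follows (the 0.5 weights follow the example above):

```python
import cv2

def synthesize(first_blurred, second_blurred, w1=0.5, w2=0.5):
    """Weighted pixel-wise addition of the two intermediate blurred images."""
    # cv2.addWeighted multiplies each image by its weight and sums them,
    # keeping the result within the valid pixel range.
    return cv2.addWeighted(first_blurred, w1, second_blurred, w2, 0)
```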
  • The above blurring process is performed on each object, so that the mouth of each object is blurred.
  • Fig. 2A is a schematic diagram of a mouth blurring process provided by at least one embodiment of the present disclosure. The execution process of mouth blurring processing will be described in detail below with reference to FIG. 2A .
  • the first color space conversion is performed on the frame image, that is, the frame image is converted to the HSI color space to obtain the first converted image.
  • the specific process is as described above and will not be repeated here.
  • the mouth area in the first converted image is extracted, for example, the mouth area is extracted according to the H component.
  • the specific process is as described above, and will not be repeated here.
  • mean filtering is performed on the mouth region to obtain a first intermediate blurred image.
  • the second color space conversion is performed on the frame image, that is, the frame image is converted to the YCbCr color space to obtain a second converted image.
  • the skin area in the second transformed image is extracted, the specific process is as described above, and will not be repeated here.
  • the preset region including the mouth is extracted, the specific process is as described above, and will not be repeated here.
  • mean filtering is performed on the preset area to obtain a second intermediate blurred image.
  • The first intermediate blurred image and the second intermediate blurred image are synthesized to obtain the mouth blurred picture corresponding to the frame image.
  • Fig. 2B is a schematic diagram of a frame image provided by at least one embodiment of the present disclosure. As shown in FIG. 2B , the frame image includes an object, and the object has a complete face area.
  • FIG. 2C is a blurred mouth picture provided by at least one embodiment of the present disclosure.
  • the blurred mouth picture is obtained by blurring the mouth of the object in the frame image shown in FIG. 2B .
  • the lower part of the subject’s face is blurred, but the basic contours and positions of the face and mouth can still be seen.
  • most of the structure of the image is preserved, which makes it easier for the network to generate more accurate mouth images based on relevant information.
  • In FIG. 2B and FIG. 2C, mosaic processing is applied to the eye part to protect privacy; the actual processing does not involve this step.
  • The input to the video processing network is a mouth blurred picture in which the mouth area is blurred.
  • the blurred mouth image provides the basic contours of the mouth and face, which can help the video processing network generate more accurate mouth images.
  • However, the video processing network does not know which area is blurred and which area is clear, and the position of the mouth in each frame image may differ, which makes it difficult to improve the processing effect of the model.
  • In a blurred area, the outline of the object is not obvious and the gray level at the outline edges changes only weakly, resulting in a weak sense of gradation, while in a clear area the gray level at the outline edges changes significantly and the sense of gradation is strong.
  • The gradient represents the derivative direction at a certain pixel, and the contour edges in the mouth blurred picture can be determined from changes in the gradient value, thereby determining the ranges of the blurred area and the non-blurred area in the mouth blurred picture.
  • Therefore, the mouth feature information may also include at least one gradient feature map corresponding to the at least one mouth blurred picture. The gradient feature map is used to provide the video processing network with the ranges of the blurred area and the non-blurred area in the mouth blurred picture corresponding to the gradient feature map, so that the video processing network can obtain a more accurate mouth position range, reduce the interference caused by image noise, and converge more rapidly during the training phase.
  • For example, step S20 may also include: performing gradient feature extraction on the at least one mouth blurred picture to obtain a gradient feature map corresponding to each mouth blurred picture, wherein the mouth feature information also includes at least one gradient feature map corresponding to the at least one mouth blurred picture.
  • the gradient feature map corresponding to the blurred mouth picture is composed of gradient values corresponding to each pixel included in the blurred mouth picture.
  • For example, performing gradient feature extraction on the at least one mouth blurred picture to obtain a gradient feature map corresponding to each mouth blurred picture may include: obtaining a grayscale image corresponding to each mouth blurred picture; obtaining a first convolution kernel and a second convolution kernel, wherein the size of the first convolution kernel is smaller than the size of the second convolution kernel, the sum of all elements in the first convolution kernel is 0, and the sum of all elements in the second convolution kernel is 0; and convolving the grayscale image with the first convolution kernel and the second convolution kernel to obtain the gradient feature map corresponding to each mouth blurred picture.
  • When the mouth blurred picture is a color picture, grayscale processing is performed on the mouth blurred picture to obtain the grayscale image corresponding to the mouth blurred picture.
  • the first convolution kernel A1 is used to perform convolution processing with the grayscale image.
  • The sum of all elements in the first convolution kernel A1 is 0, and the size of the first convolution kernel A1 is usually 3×3.
  • In addition, the present disclosure provides a second convolution kernel A2 to participate in computing the gradient feature map. The sum of all elements in the second convolution kernel A2 is also 0, and the size of the second convolution kernel A2 is larger than that of the first convolution kernel A1, for example 5×5 or 7×7, so that the second convolution kernel A2 can enlarge the receptive field of the gradient feature extraction, reduce the influence of noise interference, and reduce the noise in the mouth blurred picture, thereby reducing the impact of noise on the feature extraction of the subsequent feature extraction sub-network.
  • For example, the first convolution kernel A1 is a 3×3 kernel whose elements sum to 0, and the second convolution kernel A2 is a larger kernel (for example, 5×5) whose elements also sum to 0. The gradient feature map O is obtained by convolving the grayscale image I with the first convolution kernel A1 and with the second convolution kernel A2, where I represents the grayscale image and the convolution results of the two kernels are combined to form O.
  • The above first convolution kernel A1 and second convolution kernel A2 are only illustrative; any kernels may be used as long as the sum of all elements in the first convolution kernel A1 is 0, the sum of all elements in the second convolution kernel A2 is 0, and the size of the first convolution kernel is smaller than the size of the second convolution kernel, which is not specifically limited in the present disclosure.
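  • A minimal sketch of the gradient feature extraction described above is shown below; the specific kernel values are illustrative Laplacian-style kernels chosen only to satisfy the stated constraints (zero element sum, 3×3 smaller than 5×5), and the way the two responses are combined is an assumption:

```python
import cv2
import numpy as np

# 3x3 first convolution kernel A1: all elements sum to 0 (illustrative values).
A1 = np.array([[-1, -1, -1],
               [-1,  8, -1],
               [-1, -1, -1]], dtype=np.float32)

# Larger 5x5 second convolution kernel A2: elements also sum to 0,
# enlarging the receptive field of the gradient extraction.
A2 = -np.ones((5, 5), dtype=np.float32)
A2[2, 2] = 24.0

def gradient_feature_map(mouth_blurred_bgr):
    """Convolve the grayscale image with A1 and A2 and combine the responses."""
    gray = cv2.cvtColor(mouth_blurred_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    g1 = cv2.filter2D(gray, -1, A1)
    g2 = cv2.filter2D(gray, -1, A2)
    return g1 + g2   # simple summation; the actual combination may differ
```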
  • the mouth feature information may also include a plurality of mouth key points.
  • multiple mouth keypoints are used to assist in determining the precise position of the mouth during the process of generating the mouth shape of the object in the target video. That is to say, when the mouth feature information includes a plurality of mouth key points, the mouth feature information is also used to provide the position of the mouth of each object to the video processing network.
  • Otherwise, the position of the mouth in the target video may not be accurately located; using the mouth key points helps improve the accuracy of the mouth position.
  • The mouth key points allow the video processing network to focus only on the mouth and the surrounding muscle information, without additionally learning the overall facial contour, direction, and structure. Therefore, using mouth blurred pictures combined with mouth key points can effectively improve the accuracy of the object's mouth shape changes and mouth position in the finally generated target video.
  • step S20 may also include: processing each frame of image with a facial key point detection model to obtain multiple facial key points; extracting multiple mouth key points related to the mouth among the multiple facial key points.
  • For example, the facial key point detection model can adopt an existing face key point detection model, which processes the face in the frame image to obtain the corresponding multiple facial key points; these facial key points can include multiple key points related to parts such as the eyes, nose, and mouth.
  • a plurality of mouth key points related to the mouth are extracted from the plurality of facial key points, and position coordinates of the plurality of mouth key points are obtained.
  • the multiple mouth key points here include multiple mouth key points corresponding to all frame images.
  • For example, 25 mouth key points can be obtained from each frame image; if there are 10 frame images in total, there are 250 mouth key points in total. The mouth key points are input into the decoding generation sub-network as an aid to determine the precise position of the mouth.
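  • As an illustrative sketch, mouth key points can be extracted with an off-the-shelf landmark detector such as dlib's 68-point model, in which landmark indices 48–67 describe the mouth; the number of mouth key points therefore depends on the detector used and need not be 25 as in the example above:

```python
import dlib

detector = dlib.get_frontal_face_detector()
# Path to a pre-trained landmark model (assumed to be available locally).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_keypoints(gray_image):
    """Return (x, y) mouth key points for each face detected in the image."""
    points = []
    for face_rect in detector(gray_image):
        shape = predictor(gray_image, face_rect)
        # In the 68-point convention, landmarks 48-67 correspond to the mouth.
        points.append([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    return points
```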
  • a video processing network includes a feature extraction subnetwork and a decoding generation subnetwork.
  • For example, step S30 may include: performing spectrum conversion processing on the audio clip to obtain a feature spectrum; performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors, wherein the M visual feature vectors match the audio clip, and M is a positive integer less than or equal to the number of the at least one mouth blurred picture; processing the M visual feature vectors with the decoding generation sub-network to obtain M target frames, wherein the M target frames are in one-to-one correspondence with M time points in the audio clip and each target frame has a mouth shape corresponding to the corresponding time point in the audio clip; and obtaining the target video according to the M target frames.
  • For example, the feature spectrum may be the MFCC (Mel-scale Frequency Cepstral Coefficients) of the audio clip.
  • MFCC is a set of feature vectors obtained by encoding speech physical information (such as spectral envelope and details). This set of feature vectors can be understood as including m1 n1-dimensional feature vectors.
  • For example, the audio clip includes m1 audio frames and each audio frame is converted into an n1-dimensional feature vector; thus, an n1×m1 matrix is obtained as the feature spectrum.
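  • A minimal sketch of extracting such an MFCC feature spectrum with the librosa library is shown below; the sampling rate and the number of coefficients (n1) are assumptions:

```python
import librosa

def feature_spectrum(audio_path, n_mfcc=13):
    """Return an (n1 x m1) MFCC matrix: n_mfcc coefficients per audio frame."""
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
    # Each column corresponds to one audio frame, each row to one MFCC dimension.
    return librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
```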
  • Fig. 3 is a schematic diagram of a characteristic spectrum provided by at least one embodiment of the present disclosure.
  • The abscissa of the feature spectrum represents time (the audio clip includes 40 audio frames), the ordinate represents the MFCC feature vector, the values in the same column form one feature vector, and different gray levels represent different intensities.
  • the audio segment may also be processed in other manners for extracting spectral features to obtain the characteristic spectrum, which is not limited in the present disclosure.
  • Matching between a video and an audio clip means that the mouth shape of the object in each frame image of the video should be the mouth shape of the audio content at the time point corresponding to that frame image. For example, if the content of the audio clip is "Happy Birthday", the mouth movement in the video should match the mouth movement of the object when saying "Happy Birthday".
  • The M visual feature vectors matching the audio clip means that the M visual feature vectors are synchronized with the audio clip. Since the audio feature vector output by the feature extraction sub-network (which represents the feature information of the audio clip, see the description below) is made consistent with the visual feature vector during the training phase, after the feature spectrum and all the mouth blurred pictures corresponding to all frame images are input into the feature extraction sub-network, the output M visual feature vectors and the audio feature vectors are essentially the same vectors, thereby achieving matching with the audio clip.
  • For example, obtaining the M visual feature vectors may include: dividing the at least one mouth blurred picture into M groups in sequence, and extracting the visual feature vector corresponding to each group with the feature extraction sub-network, so as to obtain the M visual feature vectors.
  • For example, if the number of frame images is y, y mouth blurred pictures are obtained after blurring the y frame images; the y mouth blurred pictures are divided into M groups, and the M groups of mouth blurred pictures are sequentially input into the feature extraction sub-network to obtain the visual feature vector corresponding to each group, thereby obtaining the M visual feature vectors.
  • Otherwise, the training difficulty of the video processing network may increase and the network may be difficult to converge. By grouping the frame images, the difficulty of the network training process is reduced without affecting the final effect, and a convergent network is easier to obtain.
  • For example, performing feature extraction processing on the at least one mouth blurred picture and the feature spectrum with the feature extraction sub-network to obtain M visual feature vectors may include: performing feature extraction processing on the at least one mouth blurred picture, the at least one gradient feature map, and the feature spectrum with the feature extraction sub-network to obtain the M visual feature vectors, wherein the at least one gradient feature map is used to provide the feature extraction sub-network with the ranges of the blurred area and the non-blurred area in the corresponding mouth blurred picture.
  • For example, the pixel value of each pixel in the mouth blurred picture includes a set of RGB values, so the feature extraction sub-network has at least 3 input channels, corresponding to the R channel, G channel, and B channel respectively. An additional input channel can be added alongside the R, G, and B channels, and the gradient feature map is fed into the feature extraction sub-network through this added channel; that is, the input size of the feature extraction sub-network is M*N*4, where M represents the width of the mouth blurred picture, N represents the height of the mouth blurred picture, and 4 represents the 4 input channels.
  • When the mouth blurred pictures are grouped, the gradient feature maps are also grouped in the same way, and each mouth blurred picture and its corresponding gradient feature map are input into the feature extraction sub-network together for processing.
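  • For example, assembling the 4-channel input (RGB plus the gradient feature map) can be sketched as follows; the normalization is an assumption:

```python
import numpy as np

def make_network_input(mouth_blurred_rgb, gradient_map):
    """Stack the RGB mouth blurred picture with its gradient map into an HxWx4 array."""
    rgb = mouth_blurred_rgb.astype(np.float32) / 255.0        # H x W x 3
    grad = gradient_map.astype(np.float32)
    grad = grad / (np.abs(grad).max() + 1e-6)                 # H x W, roughly normalized
    return np.concatenate([rgb, grad[..., None]], axis=-1)    # H x W x 4
```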
  • For example, processing the M visual feature vectors with the decoding generation sub-network to obtain M target frames may include: processing each visual feature vector with the decoding generation sub-network to generate an intermediate frame with a mouth area; and correcting the position and image information of the mouth area of the intermediate frame with the plurality of mouth key points to obtain the target frame corresponding to the visual feature vector.
  • When the mouth feature information only includes the mouth blurred picture, the mouth represented by the generated visual feature vector is still blurred, and the decoding generation sub-network cannot directly understand the structure and general shape of the face the way human cognition does, so the position of the mouth in the intermediate frame generated by the decoding generation sub-network may not be accurate. Therefore, the plurality of mouth key points can be used to help improve the accuracy of the mouth position and assist the network in generating more realistic pictures.
  • the image information includes image information such as muscles around the mouth area.
  • The mouth key points can be used to locate the position of the mouth in the frame image, so they can assist the decoding generation sub-network in focusing only on image information such as the mouth and its surrounding muscles without additionally learning the overall facial contour, direction, and structure. Therefore, the mouth key points combined with the mouth blurred picture can effectively improve the accuracy of the mouth area generated in the target frame.
  • the feature extraction sub-network and the decoding generation sub-network may use a convolutional neural network, etc., and the present disclosure does not limit the structure of the feature extraction sub-network and the decoding generation sub-network.
  • Fig. 4 is a flowchart of a video processing method provided by at least one embodiment of the present disclosure. The following describes in detail the execution process of the video processing method provided by an embodiment of the present disclosure with reference to FIG. 4 .
  • the audio segment and the frame image are first obtained.
  • the related content of the audio segment and the frame image please refer to the description of step S10 , which will not be repeated here.
  • the feature spectrum and the blurred mouth pictures and gradient feature maps divided into M groups are input into the feature extraction sub-network to obtain M visual feature vectors.
  • The M visual feature vectors and the multiple mouth key points are input into the decoding generation sub-network for processing to obtain M target frames, and each target frame in the M target frames has a mouth shape corresponding to the corresponding time point in the audio clip. For example, if the audio clip is "Happy Birthday", the mouth shapes of the objects in the M target frames follow the audio clip and are sequentially displayed as the mouth shapes of "Happy Birthday".
  • the M target frames are arranged sequentially according to the order of display time points to obtain the target video.
  • FIG. 5 is a flow chart of a neural network training method provided by an embodiment of the present disclosure.
  • the neural network training method includes steps S40 to S60.
  • neural networks include video processing networks.
  • Step S40 acquiring a training video and a training audio segment matched with the training video.
  • the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a face area.
  • Step S50 preprocessing the training video to obtain mouth feature information corresponding to the training video.
  • Step S60 based on the mouth feature information and the training audio clips, the video processing network is trained.
  • For example, the training video may be a video with mouth shape changes, and the mouth shape changes in the training video correspond to the content of the training audio clip.
  • the training video can be a speaker saying "Happy Birthday" to the camera, the object in the training frame image is the speaker, the training frame image includes the speaker's facial area, and the training audio clip is "Happy Birthday”.
  • the mouth feature information may include mouth blurred pictures corresponding to each training frame image.
  • For the process of obtaining the mouth blurred picture, please refer to the related description of step S20, which will not be repeated here.
  • the mouth feature information may include a gradient feature map corresponding to each blurred mouth picture.
  • For the process of obtaining the gradient feature map, please refer to the relevant description of step S20, which will not be repeated here.
  • the mouth feature information may also include a plurality of key points of the mouth.
  • For the acquisition process, please refer to the relevant description of step S20, which will not be repeated here.
  • The mouth feature information is used to provide the approximate outline of the face and mouth, as well as the positional relationship between the face and the mouth. Since the mouth blurred picture still retains the overall outline of the picture, the network does not need to create the mouth from scratch, which facilitates rapid network convergence, speeds up the network training process, and reduces training difficulty and time overhead.
  • The gradient feature map is used to provide the ranges of the blurred area and the non-blurred area in the mouth blurred picture corresponding to the gradient feature map, and provides additional constraints for the video processing network, which helps the feature extraction sub-network determine an accurate mouth position, reduces image noise interference, facilitates rapid network convergence, speeds up the network training process, and reduces training difficulty and time overhead.
  • The mouth key points are used to provide mouth position information, so that the network mainly considers image information such as the mouth and its surrounding muscles during training and does not need to learn information such as the overall facial contour, direction, and structure, which effectively improves training efficiency and yields a more accurate video processing network.
  • a video processing network includes a feature extraction subnetwork and a decoding generation subnetwork.
  • the feature extraction sub-network is trained first, and after the feature extraction sub-network is trained, the decoding generation sub-network is trained in combination with the trained feature extraction sub-network; that is, during the training of the decoding generation sub-network, the weight parameters of the feature extraction sub-network do not change, and only the parameters of the decoding generation sub-network are updated.
  • step S60 may include: performing spectral conversion processing on the training audio segment to obtain a training feature spectrum; and using the training feature spectrum and the mouth feature information to train the feature extraction sub-network to be trained, so as to obtain the trained feature extraction sub-network.
  • the Mel cepstral coefficients of the training audio clips can be extracted as the training feature spectrum.
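  • As one possible way to obtain such a training feature spectrum, the sketch below uses librosa to extract Mel cepstral coefficients from an audio file; the sampling rate, number of coefficients, and hop length are illustrative assumptions.

```python
# Sketch: Mel cepstral coefficients as the training feature spectrum.
import librosa

def training_feature_spectrum(audio_path, sr=16000, n_mfcc=13, hop_ms=10):
    waveform, sr = librosa.load(audio_path, sr=sr)
    hop_length = int(sr * hop_ms / 1000)
    # Returned shape: (n_mfcc, number of audio frames)
    return librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
```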
  • using the training feature spectrum and the mouth feature information to train the feature extraction sub-network to be trained, so as to obtain the trained feature extraction sub-network, may include: using the feature extraction sub-network to be trained to process the training feature spectrum and at least one blurred mouth picture to obtain a training visual feature vector and a training audio feature vector; calculating the loss value of the feature extraction sub-network through the loss function corresponding to the feature extraction sub-network according to the training visual feature vector and the training audio feature vector; modifying the parameters of the feature extraction sub-network to be trained based on the loss value; and, when the loss value corresponding to the feature extraction sub-network to be trained does not satisfy the predetermined accuracy condition, continuing to input the training feature spectrum and the at least one blurred mouth picture so as to repeat the above training process.
  • the training goal of the feature extraction sub-network is to match the output visual feature vector with the audio feature vector.
  • for the concept of matching, refer to the content mentioned above.
  • the i-th feature element in the visual feature vector and the i-th feature element in the audio feature vector should match, which is reflected in the feature values: the corresponding feature values of the visual feature vector and the audio feature vector are very close or identical. Therefore, during training, the loss value is calculated using the training visual feature vector and the training audio feature vector, and the parameters of the feature extraction sub-network are corrected based on the loss value, so that the visual feature vector and the audio feature vector output by the trained feature extraction sub-network are consistent.
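  • The following PyTorch-style sketch shows one training step that pulls paired visual and audio feature vectors together. The encoder architectures and the concrete loss (a simple mean-squared distance here) are assumptions; the disclosure only requires that matching visual and audio feature vectors become very close or identical.

```python
# Sketch of one training step for the feature extraction sub-network (assumptions above).
import torch.nn.functional as F

def feature_extraction_step(visual_encoder, audio_encoder, optimizer,
                            blurred_mouth_batch, feature_spectrum_batch):
    visual_vec = visual_encoder(blurred_mouth_batch)    # training visual feature vectors
    audio_vec = audio_encoder(feature_spectrum_batch)   # training audio feature vectors
    loss = F.mse_loss(visual_vec, audio_vec)            # pull paired vectors together
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```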
  • step S60 may also include: using the trained feature extraction sub-network to process the training feature spectrum and the at least one blurred mouth picture to obtain at least one target visual feature vector; and training the decoding generation sub-network according to the at least one target visual feature vector and the training video.
  • training the decoding generation sub-network may include: using the mouth position information provided by the plurality of mouth key points, in combination with the at least one target visual feature vector, to train the decoding generation sub-network.
  • the key points of the mouth are used to assist training, so that the position of the mouth shape is more accurate.
  • the neural network also includes a discriminative sub-network.
  • the discrimination sub-network and the decoding generation sub-network constitute a generative adversarial network (GAN); during the training of the decoding generation sub-network, the generative adversarial network is trained by alternate iteration to obtain the trained decoding generation sub-network.
  • the decoding generation sub-network acts as the generator in the generative adversarial network and generates images to "fool" the discriminator, while the discrimination sub-network acts as the discriminator in the generative adversarial network and judges the authenticity of the images generated by the decoding generation sub-network.
  • in the training process, the generator first continuously generates image data to be judged by the discriminator; in this process, the parameters of the discriminator are not adjusted, and only the generator is trained and its parameters are adjusted, until the discriminator cannot judge the authenticity of the images generated by the generator.
  • Fig. 6 is a schematic structural diagram of a neural network provided by an embodiment of the present disclosure.
  • the neural network 100 provided by at least one embodiment of the present disclosure includes a video processing network 101 and a discrimination sub-network 102; the video processing network 101 includes a feature extraction sub-network 1011 and a decoding generation sub-network 1012, and the decoding generation sub-network 1012 and the discrimination sub-network 102 constitute a generative adversarial network.
  • the training process of the video processing network 101 will be described in detail below with reference to FIG. 6 .
  • the feature extraction sub-network 1011 is trained first. For example, referring to the description of step S50, a plurality of blurred mouth pictures corresponding to a plurality of training frame images and a plurality of gradient feature maps respectively corresponding to the plurality of blurred mouth pictures are obtained, and spectral conversion processing is performed on the training audio clip to obtain the training feature spectrum; then the plurality of blurred mouth pictures, the plurality of gradient feature maps, and the training feature spectrum are input into the feature extraction sub-network 1011 for processing to obtain visual feature vectors and audio feature vectors.
  • the visual feature vector output by the trained feature extraction sub-network 1011 is consistent with the audio feature vector.
  • the decoding generation sub-network 1012 is trained in combination with the trained feature extraction sub-network 1011 .
  • multiple target visual feature vectors are obtained.
  • the target visual feature vectors are consistent with the audio feature vectors output by the feature extraction subnetwork 1011.
  • a plurality of target visual feature vectors and a plurality of mouth key points are input into the decoding generation sub-network 1012 for processing to obtain output frames, in which the mouth shape of the object changes; however, this mouth shape may differ from the mouth shape in the training frame image corresponding to the same display time point.
  • the output frames and the training frame images are input into the discrimination sub-network 102, and the discrimination sub-network 102 uses the mouth shapes in the training frame images as the standard. Referring to the process described above, the decoding generation sub-network 1012 and the discrimination sub-network 102 are trained alternately: the loss value is calculated based on the binary cross-entropy loss function, and the parameters of the discrimination sub-network 102 and the decoding generation sub-network 1012 are modified alternately until the trained decoding generation sub-network 1012 is obtained.
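  • The sketch below illustrates one alternating iteration of this adversarial training in PyTorch, with the decoding generation sub-network as the generator and the discrimination sub-network as the discriminator. The network interfaces are assumptions, the discriminator is assumed to output sigmoid probabilities, and the binary cross-entropy loss follows the description above.

```python
# Sketch of one alternating GAN iteration (interfaces and shapes are assumptions).
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt,
             target_visual_vectors, mouth_key_points, training_frames):
    real = torch.ones(training_frames.size(0), 1)
    fake = torch.zeros(training_frames.size(0), 1)

    # 1) Update the discriminator with the generator fixed.
    with torch.no_grad():
        generated = generator(target_visual_vectors, mouth_key_points)
    d_loss = (F.binary_cross_entropy(discriminator(training_frames), real) +
              F.binary_cross_entropy(discriminator(generated), fake))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Update the generator with the discriminator fixed.
    generated = generator(target_visual_vectors, mouth_key_points)
    g_loss = F.binary_cross_entropy(discriminator(generated), real)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```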
  • since the blurred mouth picture still retains the overall outline of the picture, the network does not need to create the mouth from scratch, which facilitates rapid convergence of the network, speeds up the training process of the feature extraction sub-network, and reduces training difficulty and time overhead.
  • the gradient feature map is used to provide the range of the blurred area and the non-blurred area in the mouth blurred picture, so that the network can quickly locate the mouth area and facilitate the network to quickly converge.
  • the mouth key points are used to provide mouth position information, so that the decoding generation sub-network mainly considers image information such as the mouth and its surrounding muscles during the training process and does not need to learn information such as the overall facial contour, orientation, and structure, which effectively improves training efficiency and yields a video processing network with higher accuracy.
  • FIG. 7 is a schematic block diagram of a video processing device provided by at least one embodiment of the present disclosure.
  • the video processing apparatus 200 may include an acquisition unit 201 , a preprocessing unit 202 and a video processing unit 203 . These components are interconnected by a bus system and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the video processing device 200 shown in FIG. 7 are exemplary rather than limiting, and the video processing device 200 may also have other components and structures as required.
  • these modules may be implemented by hardware (such as circuit) modules, software modules, or any combination of the two, and the following embodiments are the same as this, and will not be repeated here.
  • for example, these units may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field programmable gate array (FPGA), or other forms of processing units having data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
  • the obtaining unit 201 is configured to obtain at least one frame image and an audio segment, for example, each frame image includes at least one object, and each object includes a face area.
  • the acquiring unit 201 may include a memory storing frame images and audio clips.
  • the acquisition unit 201 may include one or more cameras to shoot or record a video including multiple frame images or a still frame image of an object.
  • the acquisition unit 201 may also include a recording device to obtain audio clips.
  • the acquisition unit 201 may be hardware, software, firmware and any feasible combination thereof.
  • the preprocessing unit 202 is configured to preprocess at least one frame image to obtain mouth feature information of the face area.
  • video processing unit 203 may include video processing network 204 .
  • the video processing unit 203 uses the video processing network 204 to process the at least one frame image based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip.
  • the video processing network 204 includes a feature extraction sub-network and a decoding generation sub-network. It should be noted that the video processing network 204 in the video processing unit 203 has the same structure and function as the video processing network in the embodiments of the above video processing method, which will not be repeated here.
  • the acquiring unit 201 can be used to realize step S10 shown in FIG. 1
  • the preprocessing unit 202 can be used to realize step S20 shown in FIG. 1
  • the video processing unit 203 can be used to realize step S30 shown in FIG. 1. Therefore, for a specific description of the functions that can be realized by the acquisition unit 201, the preprocessing unit 202, and the video processing unit 203, reference may be made to the relevant descriptions of steps S10 to S30 in the embodiments of the above video processing method, which will not be repeated here.
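  • Purely as an illustration of how these units might be organized in software, the following sketch mirrors steps S10 to S30; the class and method names are assumptions, not part of the disclosure.

```python
# Sketch of the video processing apparatus 200 as three cooperating units (assumed names).
class VideoProcessingApparatus:
    def __init__(self, source, preprocessor, video_processing_network):
        self.source = source                      # acquisition unit 201
        self.preprocessor = preprocessor          # preprocessing unit 202
        self.network = video_processing_network   # video processing unit 203 / network 204

    def run(self):
        frames, audio_clip = self.source.acquire()                         # step S10
        mouth_features = self.preprocessor.extract(frames)                 # step S20
        return self.network.generate(frames, mouth_features, audio_clip)   # step S30
```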
  • the video processing apparatus 200 can achieve technical effects similar to those of the aforementioned video processing method, which will not be repeated here.
  • FIG. 8 is a schematic block diagram of a training device provided by at least one embodiment of the present disclosure.
  • the training device 300 may include a training data acquisition unit 301 , a preprocessing unit 302 and a training unit 303 . These components are interconnected by a bus system and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the training device 300 shown in FIG. 8 are exemplary rather than limiting, and the training device 300 may also have other components and structures as required.
  • the training data obtaining unit 301 is configured to obtain a training video and a training audio segment matched with the training video.
  • the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a face area.
  • the preprocessing unit 302 is configured to preprocess the training video to obtain mouth feature information of the facial region.
  • the training unit 303 is configured to train the video processing network based on mouth feature information and training audio clips.
  • the training unit 303 includes a neural network 304 and a loss function (not shown), the neural network 304 includes a video processing network, and the training unit 303 is used to train the neural network 304 to be trained to obtain a trained video processing network.
  • the video processing network includes a feature extraction sub-network and a decoding generation sub-network.
  • the neural network 304 also includes a discrimination sub-network, and the discrimination sub-network and the decoding generation sub-network constitute a generative adversarial network.
  • the structure and function of the neural network 304 in the training unit 303 are the same as those of the neural network 100 in the above embodiment of the neural network training method, and will not be repeated here.
  • the training data acquisition unit 301 can be used to realize the step S40 shown in FIG. 5, the preprocessing unit 302 can be used to realize the step S50 shown in FIG. 5, and the training unit 303 can be used to realize the Step S60. Therefore, for specific descriptions of the functions that can be realized by the training data acquisition unit 301, the preprocessing unit 302, and the training unit 303, reference may be made to the relevant descriptions of steps S40 to S60 in the embodiment of the video processing method above, and repeated descriptions will not be repeated.
  • the training device 300 can achieve technical effects similar to those of the aforementioned training method, which will not be repeated here.
  • Fig. 9 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device 400 is, for example, suitable for implementing the video processing method or the training method provided by the embodiments of the present disclosure. It should be noted that the components of the electronic device 400 shown in FIG. 9 are only exemplary rather than limiting, and the electronic device 400 may also have other components according to actual application requirements.
  • an electronic device 400 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 401, which may perform various appropriate actions and processes according to non-transitory computer-readable instructions stored in a memory, to achieve various functions.
  • when the computer-readable instructions are executed by the processing device 401, one or more steps of the neural network training method according to any of the above embodiments may be executed. It should be noted that, for a detailed description of the processing procedure of the training method, reference may be made to the relevant descriptions in the above embodiments of the training method, which will not be repeated here.
  • the memory may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (RAM) 403 and/or cache memory (cache), etc.; for example, computer-readable instructions may be loaded from the storage device 408 into the random access memory (RAM) 403 so that the computer-readable instructions can be run.
  • Non-volatile memory may include, for example, read-only memory (ROM) 402, hard disks, erasable programmable read-only memory (EPROM), compact disk read-only memory (CD-ROM), USB memory, flash memory, and the like.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to bus 404 .
  • the following devices can be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 408 including, for example, a magnetic tape, a hard disk, a flash memory, etc.; and a communication device 409.
  • the communication means 409 may allow the electronic device 400 to perform wireless or wired communication with other electronic devices to exchange data.
  • the processor 401 can control other components in the electronic device 400 to perform desired functions.
  • the processor 401 may be a device with data processing capabilities and/or program execution capabilities, such as a central processing unit (CPU), a tensor processing unit (TPU), or a graphics processing unit (GPU).
  • the central processing unit (CPU) may be an X86 or ARM architecture or the like.
  • the GPU can be integrated directly on the motherboard alone, or built into the north bridge chip of the motherboard.
  • a GPU can also be built into a central processing unit (CPU).
  • Fig. 10 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure.
  • the storage medium 500 may be a non-transitory computer-readable storage medium, and one or more computer-readable instructions 501 may be stored non-transitory on the storage medium 500 .
  • when the computer-readable instructions 501 are executed by a processor, one or more steps in the above video processing method or training method may be executed.
  • the storage medium 500 may be applied to the above-mentioned electronic device, for example, the storage medium 500 may include a memory in the electronic device.
  • the storage medium may include a memory card of a smartphone, a storage unit of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, or any combination of the above storage media, and may also be other applicable storage media.
  • Fig. 11 is a schematic diagram of a hardware environment provided by at least one embodiment of the present disclosure.
  • the electronic device provided by the present disclosure can be applied in the Internet system.
  • the functions of the image processing apparatus and/or electronic equipment involved in the present disclosure can be realized by using the computer system provided in FIG. 11 .
  • Such computer systems can include personal computers, laptops, tablets, mobile phones, personal digital assistants, smart glasses, smart watches, smart rings, smart helmets, and any smart portable or wearable device.
  • the specific system in this embodiment illustrates a hardware platform including a user interface using functional block diagrams.
  • Such computer equipment may be a general purpose computer equipment or a special purpose computer equipment. Both computer devices can be used to realize the image processing device and/or electronic device in this embodiment.
  • the computer system may include any components needed to implement the image processing described in the present disclosure.
  • a computer system can be realized by a computer device through its hardware devices, software programs, firmware, and combinations thereof.
  • the relevant computer functions for realizing the information processing required for the image processing described in this embodiment can be implemented by a group of similar platforms in a distributed manner, distributing the processing load of the computer system.
  • the computer system can include a communication port 250 connected to a network for data communication; for example, the computer system can send and receive information and data through the communication port 250, that is, the communication port 250 enables the computer system to communicate wirelessly or by wire with other electronic devices to exchange data.
  • the computer system may also include a processor group 220 (ie, the processor described above) for executing program instructions.
  • the processor group 220 may consist of at least one processor (eg, CPU).
  • the computer system may include an internal communication bus 210 .
  • a computer system may include different forms of program storage units and data storage units (i.e., the memory or storage medium described above), such as a hard disk 270, a read-only memory (ROM) 230, and a random access memory (RAM) 240, which can be used to store various data files used by the computer for processing and/or communication, as well as possible program instructions executed by the processor group 220.
  • the computer system may also include an input/output component 260 for enabling input/output data flow between the computer system and other components (eg, user interface 280, etc.).
  • the following devices may be connected to the input/output assembly 260: input devices such as a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices such as a display (e.g., an LCD or OLED display), a speaker, a vibrator, etc.; storage devices including, for example, a magnetic tape, a hard disk, etc.; and a communication interface.
  • Although FIG. 11 shows a computer system with various devices, it should be understood that the computer system is not required to have all of the devices shown; instead, the computer system may have more or fewer devices.

Abstract

A video processing method and apparatus, and a neural network training method and apparatus. The video processing method comprises: obtaining at least one frame image and an audio clip; preprocessing the at least one frame image to obtain mouth feature information of a face region; and on the basis of the mouth feature information and the audio clip, using a video processing network for processing the at least one frame image to obtain a target video, wherein objects in the target video have mouth shape changes synchronous with the audio clip, the mouth feature information is at least used for providing basic contours of the face region and the mouth of each object for the video processing network, and a positional relationship between the face region and the mouth of each object. In the video processing method, the mouth feature information is utilized to provide the approximate contours and positions of the face and the mouth for the video processing network, so that the network can conveniently generate a more accurate mouth region, and the mouth shape part of the obtained target video is higher in matching degree and higher in accuracy.

Description

Video processing method and device, neural network training method and device
This application claims the priority of Chinese Patent Application No. 202111296799.X filed on November 04, 2021, the entirety of which is incorporated by reference as a part of this application.
Technical Field
本公开的实施例涉及一种基于视频处理方法、视频处理装置、神经网络的训练方法、神经网络的训练装置、电子设备以及非瞬时性计算机可读存储介质。Embodiments of the present disclosure relate to a video processing method, a video processing device, a neural network training method, a neural network training device, electronic equipment, and a non-transitory computer-readable storage medium.
Background
嘴型同步在游戏/动漫角色配音、数字虚拟人、音唇同步的语音翻译等场景下具有广泛的应用场景。例如,用户可以提供一段音频和给定人物形象或动画形象,就可以生成对应人物的讲话视频,对应人物在讲话视频中的嘴型跟随音频的变化而相应变化,人物嘴型与音频完全匹配。Lip synchronization has a wide range of application scenarios in scenarios such as game/anime character dubbing, digital avatars, and lip-sync voice translation. For example, a user can provide a piece of audio and a given character image or animated image, and a speech video of the corresponding character can be generated. The mouth shape of the corresponding character in the speech video changes correspondingly with the change of the audio, and the character's mouth shape completely matches the audio.
Summary
本公开至少一实施例提供一种视频处理方法,包括:获取至少一个帧图像和音频片段,其中,每个帧图像包括至少一个对象,每个对象包括面部区域;对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息;基于所述嘴部特征信息和所述音频片段,使用视频处理网络对所述至少一个帧图像进行处理,得到目标视频,其中,所述目标视频中的对象具有与所述音频片段同步的嘴型变化,所述嘴部特征信息至少用于向所述视频处理网络提供所述每个对象的面部区域和嘴部的基本轮廓,以及所述每个对象的所述面部区域和所述嘴部的位置关系。At least one embodiment of the present disclosure provides a video processing method, including: acquiring at least one frame image and an audio clip, wherein each frame image includes at least one object, and each object includes a face area; performing the processing on the at least one frame image Preprocessing to obtain mouth feature information of the facial region; based on the mouth feature information and the audio clip, using a video processing network to process the at least one frame image to obtain a target video, wherein the target Objects in the video have mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth, and the The positional relationship between the facial area and the mouth of each object.
例如,在本公开至少一实施例提供的视频处理方法中,对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息,包括:利用嘴部模糊模型对所述每个帧图像中的对象的嘴部进行模糊处理,得到所述每个帧图像对应的嘴部模糊图片,其中,所述嘴部特征信息包括所述至少一个帧图像 分别对应的至少一个嘴部模糊图片。For example, in the video processing method provided in at least one embodiment of the present disclosure, performing preprocessing on the at least one frame image to obtain mouth feature information of the facial region includes: using a mouth blur model to process each The mouth of the object in the frame image is blurred to obtain a mouth blurred picture corresponding to each frame image, wherein the mouth feature information includes at least one mouth blurred picture corresponding to the at least one frame image respectively .
例如,在本公开至少一实施例提供的视频处理方法中,利用嘴部模糊模型对所述每个帧图像中的对象的嘴部进行模糊处理,得到所述每个帧图像对应的嘴部模糊图片,包括:对所述帧图像进行第一色彩空间转换,得到第一转换图像;提取所述第一转换图像中的嘴部区域,对所述嘴部区域进行第一滤波处理,得到所述帧图像对应的嘴部模糊图片。For example, in the video processing method provided in at least one embodiment of the present disclosure, the mouth of the object in each frame image is blurred by using the mouth blur model to obtain the mouth blur corresponding to each frame image The picture includes: performing a first color space conversion on the frame image to obtain a first converted image; extracting a mouth area in the first converted image, and performing a first filtering process on the mouth area to obtain the The blurred picture of the mouth corresponding to the frame image.
例如,在本公开至少一实施例提供的视频处理方法中,利用嘴部模糊模型对所述每个帧图像中的对象的嘴部进行模糊处理,得到所述每个帧图像对应的嘴部模糊图片,包括:对所述帧图像进行第一色彩空间转换,得到第一转换图像;提取所述第一转换图像中的嘴部区域,对所述嘴部区域进行第一滤波处理,得到第一中间模糊图像;对所述帧图像进行第二色彩空间转换,得到第二转换图像;提取所述第二转换图像中的皮肤区域,从所述皮肤区域中选择包括嘴部的预设区域;对所述预设区域进行第二滤波处理,得到第二中间模糊图像;对所述第一中间模糊图像和所述第二中间模糊图像进行合成处理,得到所述帧图像对应的嘴部模糊图片。For example, in the video processing method provided in at least one embodiment of the present disclosure, the mouth of the object in each frame image is blurred by using the mouth blur model to obtain the mouth blur corresponding to each frame image The picture includes: performing a first color space conversion on the frame image to obtain a first converted image; extracting a mouth area in the first converted image, and performing a first filtering process on the mouth area to obtain a first converted image. The middle blurred image; the second color space conversion is performed on the frame image to obtain a second conversion image; the skin area in the second conversion image is extracted, and a preset area including the mouth is selected from the skin area; performing a second filtering process on the preset area to obtain a second intermediate blurred image; performing synthesis processing on the first intermediate blurred image and the second intermediate blurred image to obtain a mouth blurred picture corresponding to the frame image.
例如,在本公开至少一实施例提供的视频处理方法中,所述第一色彩空间为HSI色彩空间,所述第二色彩空间为YCbCr色彩空间。For example, in the video processing method provided in at least one embodiment of the present disclosure, the first color space is an HSI color space, and the second color space is a YCbCr color space.
例如,在本公开至少一实施例提供的视频处理方法中,对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息,还包括:对所述至少一个嘴部模糊图片进行梯度特征提取,得到每个嘴部模糊图片对应的梯度特征图,其中,所述嘴部特征信息还包括所述至少一个嘴部模糊图片分别对应的至少一个梯度特征图。For example, in the video processing method provided in at least one embodiment of the present disclosure, performing preprocessing on the at least one frame image to obtain the mouth feature information of the facial region further includes: blurring the mouth of the at least one picture Gradient feature extraction is performed to obtain a gradient feature map corresponding to each mouth blur picture, wherein the mouth feature information further includes at least one gradient feature map corresponding to the at least one mouth blur picture.
例如,在本公开至少一实施例提供的视频处理方法中,对所述至少一个嘴部模糊图片进行梯度特征提取,得到每个嘴部模糊图片对应的梯度特征图,包括:获取所述每个嘴部模糊图片对应的灰度图;获取第一卷积核和第二卷积核,其中,所述第一卷积核的尺寸小于所述第二卷积核的尺寸,所述第一卷积核中的所有元素之和为0,所述第二卷积核中的所有元素之和为0;将所述灰度图与所述第一卷积核和所述第二卷积核进行卷积处理,得到所述每个嘴部模糊图片对应的梯度图。For example, in the video processing method provided in at least one embodiment of the present disclosure, performing gradient feature extraction on the at least one blurred mouth picture to obtain a gradient feature map corresponding to each blurred mouth picture includes: acquiring each The grayscale image corresponding to the blurred mouth picture; obtain the first convolution kernel and the second convolution kernel, wherein the size of the first convolution kernel is smaller than the size of the second convolution kernel, and the first convolution kernel The sum of all elements in the product kernel is 0, and the sum of all elements in the second convolution kernel is 0; the grayscale image is combined with the first convolution kernel and the second convolution kernel Convolution processing to obtain the gradient map corresponding to each blurred mouth picture.
例如,在本公开至少一实施例提供的视频处理方法中,对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息,还包括:利用面部 关键点检测模型对所述每个帧图像进行处理,得到多个面部关键点;提取所述多个面部关键点中与嘴部相关的多个嘴部关键点,其中,所述嘴部特征信息还包括所述多个嘴部关键点。For example, in the video processing method provided in at least one embodiment of the present disclosure, performing preprocessing on the at least one frame image to obtain mouth feature information of the facial region further includes: using a facial key point detection model to process the Each frame image is processed to obtain a plurality of facial key points; extract a plurality of mouth key points related to the mouth in the plurality of facial key points, wherein the mouth feature information also includes the plurality of mouth key points.
例如,在本公开至少一实施例提供的视频处理方法中,所述视频处理网络包括特征提取子网络和解码生成子网络,基于所述嘴部特征信息和所述音频片段,使用所述视频处理网络对所述至少一个帧图像进行处理,包括:对所述音频片段进行频谱转换处理,得到特征频谱;利用所述特征提取子网络对所述至少一个嘴部模糊图片和所述特征频谱进行特征提取处理,得到M个视觉特征向量,其中,所述M个视觉特征向量与所述音频片段相匹配,M为正整数且小于等于所述至少一个嘴部模糊图片的数量;利用所述解码生成子网络对所述M个视觉特征向量进行处理,得到M个目标帧,其中,所述M个目标帧与所述音频片段中M个时点一一对应,且所述M个目标帧中每个目标帧具有与所述音频片段中对应时点对应的嘴型;根据所述M个目标帧得到所述目标视频。For example, in the video processing method provided in at least one embodiment of the present disclosure, the video processing network includes a feature extraction subnetwork and a decoding generation subnetwork, and based on the mouth feature information and the audio clip, the video processing network is used to The network processing the at least one frame image includes: performing spectrum conversion processing on the audio clip to obtain a feature spectrum; using the feature extraction sub-network to perform feature extraction on the at least one blurred mouth picture and the feature spectrum The extraction process obtains M visual feature vectors, wherein the M visual feature vectors match the audio clips, M is a positive integer and is less than or equal to the number of the at least one blurred mouth picture; using the decoding to generate The sub-network processes the M visual feature vectors to obtain M target frames, wherein the M target frames are in one-to-one correspondence with the M time points in the audio clip, and each of the M target frames The M target frames have the mouth shape corresponding to the corresponding time point in the audio clip; the target video is obtained according to the M target frames.
例如,在本公开至少一实施例提供的视频处理方法中,利用所述特征提取子网络对所述至少一个嘴部模糊图片和所述特征频谱进行特征提取处理,得到M个视觉特征向量,包括:将所述至少一个嘴部模糊图片依序分成M组,利用所述特征提取子网络提取每组对应的视觉特征向量,以得到所述M个视觉特征向量。For example, in the video processing method provided in at least one embodiment of the present disclosure, the feature extraction sub-network is used to perform feature extraction processing on the at least one mouth blur picture and the feature spectrum to obtain M visual feature vectors, including : Divide the at least one blurred mouth picture into M groups in sequence, and use the feature extraction sub-network to extract visual feature vectors corresponding to each group, so as to obtain the M visual feature vectors.
例如,在本公开至少一实施例提供的视频处理方法中,所述嘴部特征信息还包括所述至少一个嘴部模糊图片分别对应的至少一个梯度特征图,利用所述特征提取子网络对所述至少一个嘴部模糊图片和所述特征频谱进行特征提取处理,得到M个视觉特征向量,包括:利用所述特征提取子网络对所述至少一个嘴部模糊图片、所述至少一个梯度特征图和所述特征频谱进行特征提取处理,得到M个视觉特征向量,其中,所述至少一个梯度特征图用于为所述特征提取子网络提供对应的嘴部模糊图片中模糊区域和非模糊区域的范围。For example, in the video processing method provided in at least one embodiment of the present disclosure, the mouth feature information further includes at least one gradient feature map corresponding to the at least one blurred mouth picture, and the feature extraction sub-network is used to extract Performing feature extraction processing on the at least one blurred mouth picture and the feature spectrum to obtain M visual feature vectors, including: using the feature extraction sub-network to extract the at least one blurred mouth picture and the at least one gradient feature map Perform feature extraction processing with the feature spectrum to obtain M visual feature vectors, wherein the at least one gradient feature map is used to provide the feature extraction sub-network with the blurred area and the non-blurred area in the corresponding mouth blurred picture scope.
例如,在本公开至少一实施例提供的视频处理方法中,所述嘴部特征信息还包括多个嘴部关键点,利用所述解码生成子网络对所述M个视觉特征向量进行处理,得到M个目标帧,包括:利用所述解码生成子网络对每个视觉特征向量进行处理,生成带有嘴部区域的中间帧;利用所述多个嘴部关 键点对所述中间帧的嘴部区域的位置和图像信息进行修正,得到所述视觉特征向量对应的目标帧。For example, in the video processing method provided in at least one embodiment of the present disclosure, the mouth feature information further includes a plurality of mouth key points, and the M visual feature vectors are processed by using the decoding generation sub-network to obtain M target frames, including: using the decoding generation sub-network to process each visual feature vector to generate an intermediate frame with a mouth area; using the multiple mouth key points to process the mouth of the intermediate frame The location of the region and the image information are corrected to obtain the target frame corresponding to the visual feature vector.
本公开至少一实施例提供一种神经网络的训练方法,其中,所述神经网络包括视频处理网络,所述训练方法包括:获取训练视频和与所述训练视频匹配的训练音频片段,其中,所述训练视频包括至少一个训练帧图像,每个训练帧图像包括至少一个对象,每个对象包括面部区域;对所述训练视频进行预处理,得到所述训练视频对应的嘴部特征信息;基于所述嘴部特征信息和所述训练音频片段,对所述视频处理网络进行训练。At least one embodiment of the present disclosure provides a neural network training method, wherein the neural network includes a video processing network, and the training method includes: acquiring a training video and a training audio segment matching the training video, wherein the The training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a facial area; the training video is preprocessed to obtain mouth feature information corresponding to the training video; based on the The mouth feature information and the training audio clips are used to train the video processing network.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述视频处理网络包括特征提取子网络,基于所述嘴部特征信息和所述训练音频片段,对所述视频处理网络进行训练,包括:对所述训练音频片段进行频谱转换处理,得到训练特征频谱;利用所述训练特征频谱和所述嘴部特征信息,对待训练的特征提取子网络进行训练,以得到训练好的所述特征提取子网络。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the video processing network includes a feature extraction sub-network, and based on the mouth feature information and the training audio clip, the video processing network is The training includes: performing spectral conversion processing on the training audio segment to obtain the training feature spectrum; using the training feature spectrum and the mouth feature information to train the feature extraction sub-network to be trained to obtain the trained The feature extraction sub-network described above.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述嘴部特征信息包括至少一个嘴部模糊图片,利用所述训练特征频谱和所述嘴部特征信息对待训练的所述特征提取子网络进行训练,以得到训练好的所述特征提取子网络,包括:利用所述待训练的特征提取子网络对所述训练特征频谱和所述至少一个嘴部模糊图片进行处理,得到训练视觉特征向量和训练音频特征向量;根据所述训练视觉特征向量和所述训练音频特征向量,通过所述特征提取子网络对应的损失函数计算所述特征提取子网络的损失值;基于所述损失值对所述待训练的特征提取子网络的参数进行修正;以及在所述待训练的特征提取子网络对应的损失值不满足预定准确率条件时,继续输入所述训练特征频谱和所述至少一个嘴部模糊图片以重复执行上述训练过程。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the mouth feature information includes at least one blurred mouth picture, and the training feature spectrum and the mouth feature information are used to train the The feature extraction sub-network is trained to obtain the trained feature extraction sub-network, including: using the feature extraction sub-network to be trained to process the training feature spectrum and the at least one blurred mouth picture to obtain Training visual feature vectors and training audio feature vectors; according to the training visual feature vectors and the training audio feature vectors, calculating the loss value of the feature extraction sub-network through the loss function corresponding to the feature extraction sub-network; based on the The loss value modifies the parameters of the feature extraction sub-network to be trained; and when the loss value corresponding to the feature extraction sub-network to be trained does not meet the predetermined accuracy rate condition, continue to input the training feature spectrum and the At least one mouth blur picture to repeat the above training process.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述嘴部特征信息包括至少一个嘴部模糊图片,所述视频处理网络还包括解码生成子网络,基于所述嘴部特征信息和所述训练音频片段,对所述视频处理网络进行训练,还包括:利用训练好的所述特征提取子网络对所述训练特征频谱和所述至少一个嘴部模糊图片进行处理,得到至少一个目标视觉特征向量;根据所述至少一个目标视觉特征向量以及所述训练视频,对所述解码生成子网络进行训练。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the mouth feature information includes at least one blurred mouth picture, and the video processing network further includes a decoding generation sub-network, based on the mouth feature Information and the training audio clip, training the video processing network, further includes: using the trained feature extraction sub-network to process the training feature spectrum and the at least one blurred mouth picture to obtain at least A target visual feature vector; according to the at least one target visual feature vector and the training video, the decoding generation sub-network is trained.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述嘴部特征信息还包括多个嘴部关键点,根据所述至少一个目标视觉特征向量以及所述训练视频,对所述解码生成子网络进行训练,包括:利用所述多个嘴部关键点提供的嘴部位置信息,结合所述至少一个目标视觉特征向量对所述解码生成子网络进行训练。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the mouth feature information further includes a plurality of mouth key points, and according to the at least one target visual feature vector and the training video, all The decoding generation sub-network is trained, including: using the mouth position information provided by the plurality of mouth key points, combined with the at least one target visual feature vector to train the decoding generation sub-network.
例如,在本公开至少一实施例提供的神经网络的训练方法中,所述神经网络还包括判别子网络,所述判别子网络和所述解码生成子网络构成生成式对抗网络,在对所述解码生成子网络训练的过程中,对所述生成式对抗网络进行交替迭代训练,以得到训练好的所述解码生成子网络。For example, in the neural network training method provided in at least one embodiment of the present disclosure, the neural network further includes a discriminant sub-network, the discriminant sub-network and the decoding-generating sub-network constitute a generative confrontation network, and the During the training process of the decoding generation subnetwork, the generative confrontation network is alternately and iteratively trained to obtain the trained decoding generation subnetwork.
本公开至少一实施例提供一种视频处理装置,包括:获取单元,配置为获取至少一个帧图像和音频片段,其中,每个帧图像包括至少一个对象,每个对象包括面部区域;预处理单元,配置为对所述至少一个帧图像进行预处理,得到所述面部区域的嘴部特征信息;视频处理单元,配置为基于所述嘴部特征信息和所述音频片段,使用视频处理网络对所述至少一个帧图像进行处理,得到目标视频,其中,所述目标视频中的对象与所述音频片段具有同步的嘴型变化,其中,所述嘴部特征信息至少用于向所述视频处理网络提供所述每个对象的面部区域和嘴部的基本轮廓,以及所述每个对象的所述面部区域和所述嘴部的位置关系。At least one embodiment of the present disclosure provides a video processing device, including: an acquisition unit configured to acquire at least one frame image and an audio clip, wherein each frame image includes at least one object, and each object includes a face area; a preprocessing unit , configured to preprocess the at least one frame image to obtain the mouth feature information of the facial region; the video processing unit is configured to use a video processing network to process the mouth feature information and the audio clip based on the mouth feature information The at least one frame image is processed to obtain a target video, wherein the object in the target video has a synchronous mouth shape change with the audio clip, and wherein the mouth feature information is at least used for reporting to the video processing network A basic outline of the face area and the mouth of each object, and a positional relationship between the face area and the mouth of each object are provided.
本公开至少一实施例提供一种神经网络的训练装置,包括:训练数据获取单元,配置为获取训练视频和与所述训练视频匹配的训练音频片段,其中,所述训练视频包括至少一个训练帧图像,每个训练帧图像包括至少一个对象,每个对象包括面部区域;预处理单元,配置为对所述训练视频进行预处理,得到所述面部区域的嘴部特征信息;训练单元,配置为基于所述嘴部特征信息和所述训练音频片段,对所述视频处理网络进行训练,其中,所述嘴部特征信息至少用于向所述视频处理网络提供所述每个对象的面部区域和嘴部的基本轮廓,以及所述每个对象的所述面部区域和所述嘴部的位置关系。At least one embodiment of the present disclosure provides a neural network training device, including: a training data acquisition unit configured to acquire a training video and a training audio segment matching the training video, wherein the training video includes at least one training frame Image, each training frame image includes at least one object, and each object includes a facial area; a preprocessing unit is configured to preprocess the training video to obtain mouth feature information of the facial area; a training unit is configured to Based on the mouth feature information and the training audio clips, the video processing network is trained, wherein the mouth feature information is at least used to provide the video processing network with the facial area and The basic outline of the mouth, and the positional relationship between the facial area and the mouth of each object.
本公开至少一实施例提供一种电子设备,包括:存储器,非瞬时性地存储有计算机可执行指令;处理器,配置为运行所述计算机可执行指令,其中,所述计算机可执行指令被所述处理器运行时实现根据本公开任一实施例所述的视频处理方法或本公开任一实施例所述的训练方法。At least one embodiment of the present disclosure provides an electronic device, including: a memory storing computer-executable instructions in a non-transitory manner; a processor configured to run the computer-executable instructions, wherein the computer-executable instructions are executed by the The processor implements the video processing method according to any embodiment of the present disclosure or the training method described in any embodiment of the present disclosure when running.
本公开至少一实施例提供一种非瞬时性计算机可读存储介质,其中,所述非瞬时性计算机可读存储介质存储有计算机可执行指令,所述计算机可执行指令被处理器执行时实现根据本公开任一实施例所述的视频处理方法或本公开任一实施例所述的训练方法。At least one embodiment of the present disclosure provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are executed by a processor, the computer-executable instructions according to The video processing method described in any embodiment of the present disclosure or the training method described in any embodiment of the present disclosure.
Brief Description of the Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings of the embodiments will be briefly introduced below. Obviously, the accompanying drawings in the following description relate only to some embodiments of the present disclosure rather than limiting the present disclosure.
FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure;
FIG. 2A is a schematic diagram of a mouth blurring process provided by at least one embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a frame image provided by at least one embodiment of the present disclosure;
FIG. 2C is a blurred mouth picture provided by at least one embodiment of the present disclosure;
FIG. 3 is a flowchart of a video processing method provided by at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a feature spectrum provided by at least one embodiment of the present disclosure;
FIG. 5 is a flowchart of a neural network training method provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a neural network provided by an embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of a video processing device provided by at least one embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of a training device provided by at least one embodiment of the present disclosure;
FIG. 9 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a hardware environment provided by at least one embodiment of the present disclosure.
Detailed Description
为了使得本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例的附图,对本公开实施例的技术方案进行清楚、完整地描述。显然,所描述的实施例是本公开的一部分实施例,而不是全部的实施例。基于所描述的本公开的实施例,本领域普通技术人员在无需创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below in conjunction with the drawings of the embodiments of the present disclosure. Apparently, the described embodiments are some of the embodiments of the present disclosure, not all of them. Based on the described embodiments of the present disclosure, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
除非另外定义,本公开使用的技术术语或者科学术语应当为本公开所属 领域内具有一般技能的人士所理解的通常意义。本公开中使用的“第一”、“第二”以及类似的词语并不表示任何顺序、数量或者重要性,而只是用来区分不同的组成部分。“包括”或者“包含”等类似的词语意指出现该词前面的元件或者物件涵盖出现在该词后面列举的元件或者物件及其等同,而不排除其他元件或者物件。“连接”或者“相连”等类似的词语并非限定于物理的或者机械的连接,而是可以包括电性的连接,不管是直接的还是间接的。“上”、“下”、“左”、“右”等仅用于表示相对位置关系,当被描述对象的绝对位置改变后,则该相对位置关系也可能相应地改变。为了保持本公开实施例的以下说明清楚且简明,本公开省略了部分已知功能和已知部件的详细说明。Unless otherwise defined, the technical terms or scientific terms used in the present disclosure shall have the ordinary meanings understood by those having ordinary skill in the art to which the present disclosure belongs. "First", "second" and similar words used in the present disclosure do not indicate any order, quantity or importance, but are only used to distinguish different components. "Comprising" or "comprising" and similar words mean that the elements or items appearing before the word include the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "Down", "Left", "Right" and so on are only used to indicate the relative positional relationship. When the absolute position of the described object changes, the relative positional relationship may also change accordingly. In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits detailed descriptions of some known functions and known components.
目前,嘴型同步通常有两种实现方式。一种是人工方式进行重建,例如通过图像处理软件,例如photoshop等,对视频中的所有帧图像按照当前音频内容逐个修改嘴型状态,但实现这样的效果需要经历非常复杂的实现过程,耗时长且需要耗费巨大的人力物力。另一种方式是利用嘴型同步模型(例如Wav2Lip等嘴型生成模型)进行嘴型重建,输入模型的图像的嘴部区域被执行抠图处理,之后再进行嘴型重建,这种方式需要网络对嘴型进行从无到有的创造,由于在模型训练过程中,模型既要掌握脸部轮廓的区域,又要掌握嘴部的轮廓,那么模型需要掌握的范围过大,难以训练收敛。Currently, there are usually two implementations of lip sync. One is manual reconstruction, for example, through image processing software, such as photoshop, etc., to modify the mouth shape state of all frame images in the video one by one according to the current audio content, but to achieve such an effect requires a very complicated implementation process and takes a long time And it needs to consume huge manpower and material resources. Another way is to use a lip synchronization model (such as a mouth shape generation model such as Wav2Lip) to reconstruct the mouth shape. The mouth area of the image input to the model is cut out, and then the mouth shape is reconstructed. This method requires a network Create the mouth shape from scratch, because in the process of model training, the model needs to grasp not only the area of the facial contour, but also the contour of the mouth, so the range that the model needs to master is too large, and it is difficult to train and converge.
本公开至少一实施例提供一种视频处理方法,包括:获取至少一个帧图像和音频片段,其中,每个帧图像包括至少一个对象,每个对象包括面部区域;对至少一个帧图像进行预处理,得到面部区域的嘴部特征信息;基于嘴部特征信息和音频片段,使用视频处理网络对至少一个帧图像进行处理,得到目标视频,其中,目标视频中的对象具有与音频片段同步的嘴型变化,嘴部特征信息至少用于向视频处理网络提供每个对象的面部区域和嘴部的基本轮廓,以及每个对象的面部区域和嘴部的位置关系。At least one embodiment of the present disclosure provides a video processing method, including: acquiring at least one frame image and an audio segment, wherein each frame image includes at least one object, and each object includes a face area; preprocessing the at least one frame image , to obtain the mouth feature information of the face area; based on the mouth feature information and the audio clip, use the video processing network to process at least one frame image to obtain the target video, wherein the object in the target video has a mouth shape synchronized with the audio clip Change, the mouth feature information is at least used to provide the video processing network with the basic outline of each object's face area and mouth, and the positional relationship between each object's face area and mouth.
在该实施例的视频处理方法中,利用嘴部特征信息辅助视频处理网络得到目标视频,目标视频具有对应于音频片段的同步嘴型变化,相比于传统方式直接利用网络去做从无到有的创造,该方法利用嘴部特征信息向视频处理网络提供每个对象的面部区域与嘴部的基本轮廓,以及每个对象的面部区域和嘴部的位置关系,方便网络生成更加准确的嘴部区域,所得到的目标视频的嘴型部分匹配度更高,准确度也更高。In the video processing method of this embodiment, the mouth feature information is used to assist the video processing network to obtain the target video, and the target video has a synchronous mouth shape change corresponding to the audio clip, which is compared to the traditional way of directly using the network to do it from scratch The method uses mouth feature information to provide the video processing network with the basic outline of each object's facial area and mouth, as well as the positional relationship between each object's facial area and mouth, so that the network can generate more accurate mouths. area, the resulting target video has a higher matching degree of mouth shape and higher accuracy.
本公开至少一实施例提供的视频处理方法可应用于本公开实施例提供 的视频处理装置,该视频处理装置可被配置于电子设备上。该电子设备可以是个人计算机、移动终端等,该移动终端可以是手机、平板电脑、笔记本电脑等硬件设备。The video processing method provided in at least one embodiment of the present disclosure can be applied to the video processing device provided in the embodiment of the present disclosure, and the video processing device can be configured on an electronic device. The electronic device may be a personal computer, a mobile terminal, etc., and the mobile terminal may be a hardware device such as a mobile phone, a tablet computer, or a notebook computer.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.
FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure.
As shown in FIG. 1, the video processing method provided by at least one embodiment of the present disclosure includes steps S10 to S30.
In step S10, at least one frame image and an audio clip are acquired.
In step S20, the at least one frame image is preprocessed to obtain mouth feature information of the facial region.
In step S30, based on the mouth feature information and the audio clip, the at least one frame image is processed using a video processing network to obtain a target video.
For example, the object in the target video has mouth shape changes synchronized with the audio clip.
For example, the mouth feature information is at least used to provide the video processing network with the basic outline of the face area and the mouth of each object, and the positional relationship between the face area and the mouth of each object.
For example, each frame image includes at least one object, and each object includes a face area.
例如,可以获取一张静态的带有对象的图像作为帧图像,之后,基于该帧图像和音频片段,生成目标视频,在目标视频中,对象具有与音频片段同步的嘴型变化。For example, a static image with an object can be obtained as a frame image, and then a target video is generated based on the frame image and an audio clip. In the target video, the object has a mouth shape change synchronized with the audio clip.
例如,也可以获取一段预先录制、生成或制作的视频,该视频包括多个视频帧,视频帧中包括至少一个对象,将多个视频帧作为多个帧图像,之后,基于多个帧图像和音频片段,生成目标视频。For example, it is also possible to obtain a pre-recorded, generated or produced video, the video includes a plurality of video frames, the video frame includes at least one object, the plurality of video frames are used as a plurality of frame images, and then, based on the plurality of frame images and audio clips to generate the target video.
例如,对象可以包括真实人物、二维或三维动画人物、拟人化动物、仿生人等,这些对象均具有完整的面部区域,例如,面部区域包括嘴部、鼻子、眼睛、下巴等部位。For example, objects may include real people, 2D or 3D animated characters, anthropomorphic animals, bionic people, etc., and these objects all have complete facial regions, for example, facial regions include mouth, nose, eyes, chin and other parts.
例如,音频片段为目标视频中对象讲话的内容,例如,在动画配音领域,音频片段可以是动画人物的配音内容。For example, the audio segment is the speech content of the object in the target video. For example, in the field of animation dubbing, the audio segment may be the dubbing content of the animation character.
For example, in one embodiment, a video may be recorded in advance in which the lecturer faces the camera and says "Hello, children of ××", where ×× denotes a region. The plurality of video frames included in the recorded video are then the plurality of frame images, and the lecturer is the object included in the frame images. When the IP address obtained at the playback location belongs to region A, for example Beijing, the audio clip is "Hello, children of Beijing"; if region A is Tianjin, the audio clip is "Hello, children of Tianjin".
For example, in other embodiments, in the pre-recorded video the lecturer faces the camera and says "Student ××× wins first place, and student ××× wins second place". The plurality of video frames included in the recorded video are then the plurality of frame images, and the lecturer is the object included in the frame images. According to the obtained ranking, for example Zhang San in first place and Li Si in second place, the audio clip is "Zhang San wins first place, and Li Si wins second place".
For example, the audio clip may be a voice segment pre-recorded by a user, or a voice segment converted from a text segment; the present disclosure does not limit the manner in which the audio clip is acquired.
For example, the frame image may be an original image obtained by shooting, or a processed image obtained by performing image processing on an original image; the present disclosure does not limit this.
For example, the mouth feature information includes at least one mouth-blurred picture. The mouth-blurred picture is used to provide the video processing network with the basic outline of each object's facial region and mouth, and the positional relationship between each object's facial region and mouth.
For example, step S20 may include: blurring the mouth of the object in each frame image by using a mouth blur model to obtain the mouth-blurred picture corresponding to each frame image.
For example, the mouth-blurred picture is obtained by blurring the mouth of the object in the frame image, that is, by blurring the mouth region of the object in the frame image. This provides the video processing network with the basic outlines of the facial region and the mouth region, and the positional relationship between each object's facial region and mouth, while retaining most of the structure of the picture. It thus helps the network generate a more accurate mouth image, adds mouth position regression to the processing of the video processing network, and enhances the robustness of mouth shape generation.
For example, blurring the mouth of the object in each frame image by using the mouth blur model to obtain the mouth-blurred picture corresponding to each frame image may include: performing a first color space conversion on the frame image to obtain a first converted image; and extracting the mouth region in the first converted image and performing a first filtering process on the mouth region to obtain the mouth-blurred picture corresponding to the frame image.
For example, the first color space is the HSI color space, where H denotes hue, S denotes saturation (or chroma), and I denotes intensity (or brightness); the HSI color space describes color by the H component, the S component, and the I component.
For example, the frame image is converted from the RGB color space to the HSI color space, that is, the value of each pixel is converted from the original R component (red), G component (green), and B component (blue) into the H component, S component, and I component. The conversion formulas are as follows:

$$
I = \frac{1}{3}(R+G+B),\qquad
S = 1 - \frac{3}{R+G+B}\min(R,G,B),\qquad
H = \begin{cases}\theta, & B \le G\\ 360^{\circ}-\theta, & B > G\end{cases}
$$

$$
\theta = \arccos\!\left(\frac{\tfrac{1}{2}\left[(R-G)+(R-B)\right]}{\left[(R-G)^{2}+(R-B)(G-B)\right]^{1/2}}\right)
$$

where I denotes the I component in the HSI color space, S denotes the S component in the HSI color space, H denotes the H component in the HSI color space, R denotes the R component in the RGB color space, G denotes the G component in the RGB color space, B denotes the B component in the RGB color space, min(*) denotes the minimum function, and θ denotes an angle parameter.
After the HSI color space conversion, since lips are usually red and the H component in the HSI color space is more sensitive to red regions, the H component of the mouth region is relatively large. Therefore, the region of the first converted image whose H component is greater than a preset threshold can be extracted as the mouth region, mean filtering is performed on the mouth region, and the filtering result is taken as the mouth-blurred picture corresponding to the frame image.
For example, in order to increase the weight of red regions in the H component, the present disclosure modifies the calculation formula of the angle parameter as follows:

$$
\theta = \arccos\!\left(\frac{\tfrac{1}{2}\left[(R-G)+(R-B)\right]}{\left[(R-G)^{2}+(R-B)(G-B)+(R-B)^{2}\right]^{1/2}}\right)
$$

That is, an (R−B)² term is added to the denominator of the angle parameter to increase the sensitivity of the R component relative to the B component, emphasize the weight of the red part of the mouth region in the H component, and improve the accuracy of the determined mouth region.
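As a non-authoritative illustration of the processing described above, the following Python sketch computes the modified angle parameter, treats pixels whose H component exceeds a preset threshold as the mouth region, and mean-filters that region. The threshold value, the filter kernel size, and the use of OpenCV are assumptions for illustration only and are not fixed by this disclosure.

```python
import cv2
import numpy as np

def mouth_blur_hsi(frame_bgr, h_threshold=0.3, kernel_size=25):
    # Illustrative sketch only; h_threshold and kernel_size are placeholder values.
    img = frame_bgr.astype(np.float64) / 255.0
    B, G, R = cv2.split(img)
    # Angle parameter with the extra (R-B)^2 term in the denominator,
    # which emphasizes red (lip) pixels.
    num = 0.5 * ((R - G) + (R - B))
    den = np.sqrt((R - G) ** 2 + (R - B) * (G - B) + (R - B) ** 2) + 1e-8
    theta = np.arccos(np.clip(num / den, -1.0, 1.0))
    # H component, normalized to [0, 1].
    H = np.where(B <= G, theta, 2.0 * np.pi - theta) / (2.0 * np.pi)
    # Pixels whose H component exceeds the preset threshold are treated as
    # the mouth region and replaced by their mean-filtered (blurred) values.
    mouth_mask = H > h_threshold
    blurred = cv2.blur(frame_bgr, (kernel_size, kernel_size))
    out = frame_bgr.copy()
    out[mouth_mask] = blurred[mouth_mask]
    return out
```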
For example, if the object in the frame image is an object having a skin region, such as a person, the skin region can be further extracted on the basis of the above process, a preset region including the mouth is selected from the skin region, the preset region is filtered, and the two filtering results are combined to obtain the mouth-blurred picture with the mouth blurred, which enhances the blurring effect.
For example, blurring the mouth of the object in each frame image by using the mouth blur model to obtain the mouth-blurred picture corresponding to each frame image may include: performing a first color space conversion on the frame image to obtain a first converted image; extracting the mouth region in the first converted image and performing a first filtering process on the mouth region to obtain a first intermediate blurred image; performing a second color space conversion on the frame image to obtain a second converted image; extracting the skin region in the second converted image and selecting a preset region including the mouth from the skin region; performing a second filtering process on the preset region to obtain a second intermediate blurred image; and combining the first intermediate blurred image and the second intermediate blurred image to obtain the mouth-blurred picture corresponding to the frame image.
For example, the second color space is the YCbCr color space. In the YCbCr color space, "Y" denotes luminance, that is, the gray level of a pixel, while "Cb" and "Cr" denote chrominance, which describes the color and saturation of the image and specifies the color of a pixel. "Cr" reflects the difference between the red part of the RGB input signal and the luminance of the RGB signal, that is, the red chrominance component of the pixel, and "Cb" reflects the difference between the blue part of the RGB input signal and the luminance of the RGB signal, that is, the blue chrominance component of the pixel. The luminance of the RGB signal is obtained by adding specific parts of the RGB input signal together.
At present, images are generally based on the RGB (red, green, blue) color space. In the RGB color space, the skin color of a human image is strongly affected by luminance, so skin-color points are difficult to separate from non-skin-color points; in a face image processed in the RGB color space, the skin-color points are discrete points with many non-skin-color points embedded among them, which makes skin-region calibration (such as face calibration and eye calibration) difficult. The YCbCr color space is often used in face detection, because converting from the RGB color space to the YCbCr color space allows the influence of luminance to be ignored. Since the YCbCr color space is only slightly affected by luminance, skin colors cluster well, so the three-dimensional color space can be mapped onto the two-dimensional CbCr plane, where the skin-color points form a definite shape, thereby enabling a human image to be recognized according to skin color. In other words, the YCbCr color space is a color model that separates out luminance, so that skin-color points are not made difficult to separate by the brightness of the lighting.
For example, the frame image is mapped into the YCbCr color space to obtain a mapped image; the mapped image is then projected onto the CbCr plane to obtain a skin-color sample image, which includes skin-color sample points corresponding to the pixels of the frame image. Finally, the skin-color sample image is traversed, and for each skin-color sample point, if the sample point lies on or inside the elliptical boundary of skin pixels, the corresponding pixel in the frame image is judged to belong to the skin region, and if the sample point does not lie on or inside the elliptical boundary, the corresponding pixel is judged not to belong to the skin region. The skin region in the second converted image is thereby extracted.
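A minimal sketch of the CbCr-plane skin test described above is given below. The elliptical skin-color boundary's center, semi-axes, and rotation are placeholder values assumed for illustration (the disclosure does not specify the ellipse parameters), and OpenCV's YCrCb conversion is used only as an example.

```python
import cv2
import numpy as np

def skin_mask_ycbcr(frame_bgr,
                    center=(113.0, 155.6),  # (Cb, Cr) ellipse center: assumed values
                    axes=(23.4, 15.2),      # ellipse semi-axes: assumed values
                    angle_rad=0.76):        # ellipse rotation: assumed value
    # Map the frame image into the YCbCr color space and project each pixel
    # onto the CbCr plane.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    Cr, Cb = ycrcb[..., 1], ycrcb[..., 2]
    # Test whether each projected point lies on or inside the skin-color ellipse.
    cos_a, sin_a = np.cos(angle_rad), np.sin(angle_rad)
    x = cos_a * (Cb - center[0]) + sin_a * (Cr - center[1])
    y = -sin_a * (Cb - center[0]) + cos_a * (Cr - center[1])
    return (x / axes[0]) ** 2 + (y / axes[1]) ** 2 <= 1.0  # boolean skin mask
```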
For example, in some embodiments, a facial key point detection model may be used to process the frame image to obtain a plurality of facial key points, and the positions of the facial key points are used to determine whether the face of the object in the frame image is oriented with the eyes toward the upper side of the frame image and the chin toward the lower side. If so, the face orientation of the object is normal and the mouth region is located in the lower part of the frame image; in this case, a preset coordinate interval in the skin region can be extracted, for example the lower half of the object's skin region, as the preset region including the mouth. If not, the face orientation of the object is not normal, and the preset coordinate interval in the skin region is extracted after the frame image is rotated, so as to obtain the preset region including the mouth.
For example, in some embodiments, the preset region including the mouth may be determined according to the skin proportion in the skin region. For example, the chin part contains only the mouth and has a relatively high skin proportion, while the forehead part contains non-skin areas such as hair and has a relatively low skin proportion. Accordingly, the skin proportion can be used to determine whether the face of the object in the frame image is oriented with the eyes up and the chin down. If the part with a high skin proportion is located in the lower part of the frame image, the face orientation of the object is normal, and the preset region including the mouth is extracted from the skin region by the extraction process described above; if the part with a high skin proportion is located in the upper part of the frame image, the face orientation of the object is not normal, and the preset region including the mouth is extracted from the skin region by the extraction process described above after the frame image is rotated.
For example, after the preset region is extracted, mean filtering is performed on the preset region, and the filtering result is taken as the second intermediate blurred image.
For example, the frame image is converted from the RGB color space to the HSI color space to obtain the first converted image, the region of the first converted image whose H component is greater than the preset threshold is extracted as the mouth region, mean filtering is performed on the mouth region, and the filtering result is taken as the first intermediate blurred image.
For example, after the first intermediate blurred image and the second intermediate blurred image are obtained, they are combined, for example by adding the pixel values at corresponding positions, to obtain the mouth-blurred picture corresponding to the frame image. The addition may use equal weights to prevent the pixel values from becoming too large; for example, a decimal between 0 and 1 (such as 0.5) may be set as the weight, and the pixels at corresponding positions of the first intermediate blurred image and the second intermediate blurred image are each multiplied by the weight and then added, so as to obtain the pixel value at the corresponding position of the mouth-blurred picture.
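The equal-weight composition of the two intermediate blurred images can be sketched as follows; the 0.5 weight matches the example above, and the function name is illustrative.

```python
import numpy as np

def compose_blurred(first_intermediate, second_intermediate, weight=0.5):
    # Multiply corresponding pixels of both intermediate blurred images by the
    # weight and add them, keeping the result inside the valid pixel range.
    merged = weight * first_intermediate.astype(np.float64) \
             + weight * second_intermediate.astype(np.float64)
    return np.clip(merged, 0, 255).astype(np.uint8)
```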
For example, when the frame image includes a plurality of objects, the above blurring process is performed on each object separately, so that the mouth of every object is blurred.
FIG. 2A is a schematic diagram of the mouth blurring process provided by at least one embodiment of the present disclosure. The execution of the mouth blurring process is described in detail below with reference to FIG. 2A.
After the frame image is obtained, the first color space conversion is performed on the frame image, that is, the frame image is converted into the HSI color space to obtain the first converted image; the specific process is as described above and is not repeated here.
Then, the mouth region in the first converted image is extracted, for example according to the H component; the specific process is as described above and is not repeated here.
Then, mean filtering is performed on the mouth region to obtain the first intermediate blurred image.
Meanwhile, the second color space conversion is performed on the frame image, that is, the frame image is converted into the YCbCr color space to obtain the second converted image.
Then, the skin region in the second converted image is extracted; the specific process is as described above and is not repeated here.
Then, the preset region including the mouth is extracted; the specific process is as described above and is not repeated here.
Then, mean filtering is performed on the preset region to obtain the second intermediate blurred image.
Finally, the first intermediate blurred image and the second intermediate blurred image are combined to obtain the mouth-blurred picture corresponding to the frame image.
FIG. 2B is a schematic diagram of a frame image provided by at least one embodiment of the present disclosure. As shown in FIG. 2B, the frame image includes one object, and the object has a complete facial region.
FIG. 2C is a mouth-blurred picture provided by at least one embodiment of the present disclosure, obtained by blurring the mouth of the object in the frame image shown in FIG. 2B. As shown in FIG. 2C, in the mouth-blurred picture the lower half of the object's face is blurred, but the basic outline and position of the face and mouth can still be seen. Compared with the conventional approach of cutting out the mouth, most of the structure of the picture is retained, which makes it easier for the network to generate a more accurate mouth image based on the relevant information.
It should be noted that in FIG. 2B and FIG. 2C the eye area is mosaicked to protect privacy; the actual processing does not involve this step.
Since the input to the video processing network is a mouth-blurred picture in which the mouth region has been blurred, the mouth-blurred picture provides the basic outlines of the mouth and face and, compared with other methods, can help the video processing network generate a more accurate mouth image. However, the video processing network does not know which region is blurred and which region is clear, and the mouth position may differ from one frame image to another, which makes it difficult to improve the processing effect of the model.
For example, in a blurred region the outline of an object is not obvious and the gray level at the outline edge changes only weakly, giving a weak sense of depth, whereas in a clear region the gray level at the outline edge changes markedly and the sense of depth is strong. The gradient represents the directional derivative at a pixel, and changes in the gradient values can be used to determine the outline edges in the mouth-blurred picture, thereby determining the extent of the blurred region (the region of the mouth-blurred picture that has been blurred) and the non-blurred region (the region that has not been blurred).
For example, the mouth feature information may further include at least one gradient feature map corresponding respectively to the at least one mouth-blurred picture. The gradient feature map is used to provide the video processing network with the extent of the blurred region and the non-blurred region in the corresponding mouth-blurred picture, so that the video processing network can obtain a more accurate mouth position range, reduce interference from image noise, and converge more quickly during training.
For example, step S20 may further include: performing gradient feature extraction on the at least one mouth-blurred picture to obtain the gradient feature map corresponding to each mouth-blurred picture, where the mouth feature information further includes the at least one gradient feature map corresponding respectively to the at least one mouth-blurred picture.
For example, for each mouth-blurred picture, the corresponding gradient feature map consists of the gradient values corresponding to the individual pixels of that mouth-blurred picture.
For example, performing gradient feature extraction on the at least one mouth-blurred picture to obtain the gradient feature map corresponding to each mouth-blurred picture may include: obtaining a grayscale image corresponding to each mouth-blurred picture; obtaining a first convolution kernel and a second convolution kernel, where the size of the first convolution kernel is smaller than the size of the second convolution kernel, the sum of all elements of the first convolution kernel is 0, and the sum of all elements of the second convolution kernel is 0; and convolving the grayscale image with the first convolution kernel and the second convolution kernel to obtain the gradient feature map corresponding to each mouth-blurred picture.
For example, if the mouth-blurred picture is a color picture, grayscale processing is performed on it to obtain the corresponding grayscale image.
For example, a gradient map is usually computed by convolving the grayscale image with a first convolution kernel A1 whose elements sum to 0 and whose size is usually 3×3. On this basis, the present disclosure provides a second convolution kernel A2 that also participates in the computation of the gradient feature map; the sum of all elements of A2 is also 0, and the size of A2 is larger than that of A1, for example 5×5 or 7×7. The second convolution kernel A2 thereby enlarges the receptive field of the gradient feature extraction, reduces the influence of noise interference, reduces the noise in the mouth-blurred picture, and reduces the impact of noise on the feature extraction performed by the subsequent feature extraction sub-network.
For example, the first convolution kernel A1 is a 3×3 kernel whose elements sum to 0 (the specific kernel is given as a formula image, PCTCN2022088965-appb-000003).
For example, the second convolution kernel A2 is a larger kernel, for example 5×5, whose elements also sum to 0 (the specific kernel is given as a formula image, PCTCN2022088965-appb-000004).
For example, the gradient feature map O is computed by convolving the grayscale image I with the first convolution kernel A1 and the second convolution kernel A2 (the specific combination formula is given as a formula image, PCTCN2022088965-appb-000005), where I denotes the grayscale image and ⊗ denotes the convolution operation.
It should be noted that the above first convolution kernel A1 and second convolution kernel A2 are merely illustrative; it suffices that the sum of all elements of A1 is 0, the sum of all elements of A2 is 0, and the size of the first convolution kernel is smaller than the size of the second convolution kernel. The present disclosure imposes no specific limitation on this.
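Since only zero-sum kernels with the first smaller than the second are required, the following sketch uses example Laplacian-style kernels as assumptions and combines the two convolution responses by summing their magnitudes, which is likewise only one possible choice rather than the formula of the disclosure.

```python
import cv2
import numpy as np

# Example zero-sum kernels (assumed values, shown for illustration only):
# a 3x3 kernel A1 and a 5x5 kernel A2, each of whose elements sum to 0.
A1 = np.array([[-1, -1, -1],
               [-1,  8, -1],
               [-1, -1, -1]], dtype=np.float64)
A2 = -np.ones((5, 5), dtype=np.float64)
A2[2, 2] = 24.0

def gradient_feature_map(mouth_blur_bgr):
    # Convert the mouth-blurred picture to grayscale and convolve it with
    # both kernels; combining the responses by absolute-value summation is
    # an assumption.
    gray = cv2.cvtColor(mouth_blur_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    g1 = cv2.filter2D(gray, -1, A1)
    g2 = cv2.filter2D(gray, -1, A2)
    return np.abs(g1) + np.abs(g2)
```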
For example, the mouth feature information may further include a plurality of mouth key points. The plurality of mouth key points are used to assist in determining the precise position of the mouth when generating the mouth shape of the object in the target video. That is, when the mouth feature information further includes the plurality of mouth key points, the mouth feature information is also used to provide the video processing network with the position of each object's mouth.
If only the mouth-blurred picture were used to assist in generating the target video, the mouth position in the target video might not be located very accurately; the mouth key points help improve the accuracy of the mouth position. In addition, the mouth key points allow the video processing network to focus only on information about the mouth and the surrounding muscles, without additionally learning information such as the overall facial contour, orientation, and structure. Therefore, using the mouth-blurred picture in combination with the mouth key points can effectively improve the accuracy of the mouth shape changes and mouth positions of the object in the finally generated target video.
For example, step S20 may further include: processing each frame image with a facial key point detection model to obtain a plurality of facial key points; and extracting, from the plurality of facial key points, a plurality of mouth key points related to the mouth.
For example, when the object in the frame image is a person, the facial key point detection model may be a face key point detection model, which processes the face in each frame image to obtain a plurality of facial key points corresponding to each frame image; these facial key points may include key points related to the eyes, nose, mouth, and other parts. Then, the plurality of mouth key points related to the mouth are extracted from the plurality of facial key points, and the position coordinates of the mouth key points are obtained. Here, the plurality of mouth key points include the mouth key points corresponding to all of the frame images; for example, if 25 mouth key points are obtained from each frame image and there are 10 frame images in total, then a total of 250 mouth key points are input into the decoding-generation sub-network as an aid to determining the precise position of the mouth.
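For illustration, the sketch below assumes a common 68-point facial landmark layout in which indices 48 to 67 cover the mouth; the disclosure itself does not fix the landmark model or the number of mouth key points (the example above uses 25 per frame), so these indices are an assumption.

```python
import numpy as np

# Hypothetical layout: indices 48-67 of a 68-point landmark set are the mouth.
MOUTH_INDICES = list(range(48, 68))

def extract_mouth_keypoints(facial_keypoints):
    # facial_keypoints: array of shape (68, 2) produced by any facial key
    # point detection model; returns only the mouth-related coordinates.
    points = np.asarray(facial_keypoints, dtype=np.float64)
    return points[MOUTH_INDICES]
```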
For example, the video processing network includes a feature extraction sub-network and a decoding-generation sub-network.
For example, step S30 may include: performing spectrum conversion on the audio clip to obtain a feature spectrum; performing feature extraction on the at least one mouth-blurred picture and the feature spectrum by using the feature extraction sub-network to obtain M visual feature vectors, where the M visual feature vectors match the audio clip and M is a positive integer less than or equal to the number of the at least one mouth-blurred picture; processing the M visual feature vectors by using the decoding-generation sub-network to obtain M target frames, where the M target frames correspond one-to-one to M time points in the audio clip and each target frame has a mouth shape corresponding to the corresponding time point in the audio clip; and obtaining the target video from the M target frames.
For example, when performing spectrum conversion on the audio clip, the Mel-scale Frequency Cepstral Coefficients (MFCC) of the audio clip may be extracted as the feature spectrum. In the field of speech recognition, MFCC is a set of feature vectors obtained by encoding the physical information of speech (such as the spectral envelope and spectral details). This set of feature vectors can be understood as m1 feature vectors of dimension n1: the audio clip includes m1 audio frames, each audio frame is converted into an n1-dimensional feature vector, and an n1×m1 matrix is thereby obtained as the feature spectrum.
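One way to obtain such a feature spectrum is sketched below using the librosa library; the sampling rate and the number of coefficients are assumptions, not values fixed by this disclosure.

```python
import librosa

def audio_to_feature_spectrum(wav_path, n_mfcc=13, sample_rate=16000):
    # Load the audio clip and extract its Mel-scale frequency cepstral
    # coefficients: one n_mfcc-dimensional vector per audio frame, giving an
    # (n_mfcc x m1) matrix as the feature spectrum.
    audio, sr = librosa.load(wav_path, sr=sample_rate)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
```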
FIG. 3 is a schematic diagram of a feature spectrum provided by at least one embodiment of the present disclosure. As shown in FIG. 3, the horizontal axis of the feature spectrum represents time, indicating that the audio clip includes 40 audio frames, and the vertical axis represents the MFCC feature vectors; the values in one column form one feature vector, and different gray levels represent different intensities.
Of course, other spectral feature extraction methods may also be used to process the audio clip to obtain the feature spectrum; the present disclosure does not limit this.
It should be noted that in the present disclosure, a video matching an audio clip means that the mouth shape of the object in a frame image of the video should be the shape corresponding to the audio content at the same time point as that frame image. For example, if the content of the audio clip is "Happy Birthday", the mouth shape changes in the video should match the mouth shapes of the object saying "Happy Birthday".
For example, the M visual feature vectors matching the audio clip means that the M visual feature vectors are synchronized with the audio clip. Because the training stage makes the audio feature vector output by the feature extraction sub-network (which represents the feature information of the audio clip, as described later) consistent with the visual feature vector, after the feature spectrum and all mouth-blurred pictures corresponding to all frame images are input into the feature extraction sub-network, the M output visual feature vectors are essentially the same vectors as the audio feature vectors, so that they match the audio clip.
For example, performing feature extraction on the at least one mouth-blurred picture and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors may include: dividing the at least one mouth-blurred picture into M groups in order, and extracting the visual feature vector corresponding to each group by using the feature extraction sub-network, so as to obtain the M visual feature vectors.
For example, if the number of frame images is y, blurring the y frame images yields y mouth-blurred pictures. The y mouth-blurred pictures are then grouped in display order, with every x mouth-blurred pictures forming one group, giving M = y/x groups of mouth-blurred pictures, where x and y are both positive integers. The M groups of mouth-blurred pictures are then input into the feature extraction sub-network in sequence to obtain the visual feature vector corresponding to each group, thereby obtaining the M visual feature vectors.
When the number of frame images is large, omitting this grouping may make the video processing network more difficult to train and less likely to converge. Considering that the mouth shape of the object does not change rapidly during speech and each articulation lasts for a period of time, the frame images can be grouped, which reduces the difficulty of network training and makes the network easier to converge without affecting the final result.
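The grouping itself is straightforward; a minimal sketch, assuming y is divisible by x, is:

```python
def group_mouth_pictures(mouth_blur_pictures, x):
    # Split y mouth-blurred pictures, kept in display order, into M = y / x
    # groups of x pictures each.
    y = len(mouth_blur_pictures)
    return [mouth_blur_pictures[i:i + x] for i in range(0, y, x)]
```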
For example, when the mouth feature information further includes at least one gradient feature map corresponding respectively to the at least one mouth-blurred picture, performing feature extraction on the at least one mouth-blurred picture and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors may include: performing feature extraction on the at least one mouth-blurred picture, the at least one gradient feature map, and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors, where the at least one gradient feature map provides the feature extraction sub-network with the extent of the blurred region and the non-blurred region in the corresponding mouth-blurred picture.
For example, if the mouth-blurred picture is a color image, the value of each pixel includes a set of RGB values, so the feature extraction sub-network has at least three input channels, corresponding to the R channel, the G channel, and the B channel. For example, one input channel is added alongside the R, G, and B channels; after the gradient feature map corresponding to the mouth-blurred picture is obtained, it is fed into the feature extraction sub-network through this added input channel. The input size of the feature extraction sub-network is then M×N×4, where M is the width of the mouth-blurred picture, N is the height of the mouth-blurred picture, and 4 is the number of input channels.
For example, if the plurality of mouth-blurred pictures are grouped in order, the gradient feature maps are grouped in the same way, and each mouth-blurred picture is input into the feature extraction sub-network together with its corresponding gradient feature map.
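Appending the gradient feature map as a fourth input channel can be sketched as follows; the channel order and data type are assumptions.

```python
import numpy as np

def make_four_channel_input(mouth_blur_rgb, gradient_map):
    # Stack the three color channels of the mouth-blurred picture with its
    # gradient feature map, producing an M x N x 4 input for the feature
    # extraction sub-network.
    grad = gradient_map.astype(np.float32)[..., np.newaxis]
    return np.concatenate([mouth_blur_rgb.astype(np.float32), grad], axis=-1)
```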
For example, when the mouth feature information further includes the plurality of mouth key points, processing the M visual feature vectors by using the decoding-generation sub-network to obtain the M target frames may include: processing each visual feature vector by using the decoding-generation sub-network to generate an intermediate frame with a mouth region; and correcting the position and image information of the mouth region of the intermediate frame by using the plurality of mouth key points to obtain the target frame corresponding to that visual feature vector.
If the mouth feature information includes only the mouth-blurred pictures, the mouth represented by the generated visual feature vectors remains blurred, and the decoding-generation sub-network cannot directly understand the structure and approximate shape of the face the way human perception does, so the mouth position in the pictures with mouth regions generated by the decoding-generation sub-network may not be very accurate. Therefore, the plurality of mouth key points can be used to help improve the accuracy of the mouth position and help the network generate more realistic pictures.
For example, the image information includes image information such as the muscles around the mouth region. The mouth key points can be used to locate the position of the mouth in the frame image, so they help the decoding-generation sub-network focus only on image information such as the mouth and its surrounding muscles, without additionally learning information such as the overall facial contour, orientation, and structure. Therefore, the mouth key points combined with the mouth-blurred pictures can effectively improve the accuracy of the mouth region generated in the target frames.
For example, the feature extraction sub-network and the decoding-generation sub-network may be convolutional neural networks or the like; the present disclosure does not limit the structures of the feature extraction sub-network and the decoding-generation sub-network.
FIG. 4 is a flowchart of a video processing method provided by at least one embodiment of the present disclosure. The execution of the video processing method provided by an embodiment of the present disclosure is described in detail below with reference to FIG. 4.
As shown in FIG. 4, the audio clip and the frame images are first acquired; for details about the audio clip and the frame images, reference may be made to the description of step S10, which is not repeated here.
The mouths of all objects included in each frame image are blurred to obtain the mouth-blurred picture corresponding to each frame image, gradient feature extraction is performed on each mouth-blurred picture to obtain the corresponding gradient feature map, and each frame image is processed with the facial key point detection model to obtain the plurality of mouth key points. For the generation of the mouth-blurred pictures, the gradient feature maps, and the mouth key points, reference may be made to the description of step S20, and repeated details are omitted.
Then, the feature spectrum and the mouth-blurred pictures and gradient feature maps, divided into M groups in order, are input into the feature extraction sub-network to obtain the M visual feature vectors.
Then, the M visual feature vectors and the plurality of mouth key points are input into the decoding-generation sub-network for processing to obtain the M target frames, each of which has a mouth shape corresponding to the corresponding time point in the audio clip. For example, if the audio clip is "Happy Birthday", the mouth shapes of the object in the M target frames follow the audio clip and successively show the mouth shapes of saying "Happy Birthday".
Then, the M target frames are arranged in order of display time to obtain the target video.
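Assembling the M target frames into the target video can be sketched with OpenCV as follows; the codec and frame rate are assumptions, and muxing of the audio track is omitted.

```python
import cv2

def frames_to_video(target_frames, out_path, fps=25):
    # Write the M target frames, in display order, into the target video file.
    height, width = target_frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    for frame in target_frames:
        writer.write(frame)
    writer.release()
```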
At least one embodiment of the present disclosure further provides a training method for a neural network. FIG. 5 is a flowchart of a neural network training method provided by an embodiment of the present disclosure.
As shown in FIG. 5, the neural network training method provided by at least one embodiment of the present disclosure includes steps S40 to S60. For example, the neural network includes a video processing network.
In step S40, a training video and a training audio clip matching the training video are acquired.
For example, the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a facial region.
In step S50, the training video is preprocessed to obtain mouth feature information corresponding to the training video.
In step S60, the video processing network is trained based on the mouth feature information and the training audio clip.
For example, the training video may be a video with mouth shape changes, and the mouth shape changes in the training video correspond to the content of the training audio clip. For example, the training video may show a speaker saying "Happy Birthday" to the camera; the object in the training frame images is the speaker, the training frame images include the speaker's facial region, and the training audio clip is "Happy Birthday".
For example, for the specific concepts of training frame image, object, and facial region, reference may be made to the description of frame image, object, and facial region in step S10, and repeated details are omitted.
For example, the mouth feature information may include the mouth-blurred pictures corresponding respectively to the training frame images; for the process of obtaining the mouth-blurred pictures, reference may be made to the description of step S20, which is not repeated here.
For example, the mouth feature information may include the gradient feature maps corresponding respectively to the mouth-blurred pictures; for the process of obtaining the gradient feature maps, reference may be made to the description of step S20, which is not repeated here.
For example, the mouth feature information may further include a plurality of mouth key points; for the process of obtaining the mouth key points, reference may be made to the description of step S20, which is not repeated here.
As described above, the mouth feature information provides the approximate outlines of the face and mouth and the positional relationship between the face and the mouth. Because the mouth-blurred picture still retains the overall outline of the picture, the network does not need to create the mouth from scratch, which helps the network converge quickly, speeds up the training process, and reduces training difficulty and time cost.
For example, as described above, the gradient feature map provides the extent of the blurred region and the non-blurred region in the corresponding mouth-blurred picture, giving the video processing network additional constraints that help the feature extraction sub-network determine an accurate mouth position, reduce interference from image noise, help the network converge quickly, speed up the training process, and reduce training difficulty and time cost.
In addition, as described above, the mouth key points provide mouth position information, so that during training the network mainly considers image information such as the mouth and its surrounding muscles and does not need to learn information such as the overall facial contour, orientation, and structure, which effectively improves training efficiency and yields a video processing network with higher accuracy.
For example, the video processing network includes the feature extraction sub-network and the decoding-generation sub-network. When training the video processing network, the feature extraction sub-network is trained first; after the training of the feature extraction sub-network is completed, the decoding-generation sub-network is trained in combination with the trained feature extraction sub-network. That is, during the training of the decoding-generation sub-network, the weight parameters of the feature extraction sub-network are not changed, and only the parameters of the decoding-generation sub-network are updated.
For example, step S60 may include: performing spectrum conversion on the training audio clip to obtain a training feature spectrum; and training the feature extraction sub-network to be trained by using the training feature spectrum and the mouth feature information, so as to obtain the trained feature extraction sub-network.
For example, the Mel-scale frequency cepstral coefficients of the training audio clip may be extracted as the training feature spectrum.
For example, training the feature extraction sub-network to be trained by using the training feature spectrum and the mouth feature information to obtain the trained feature extraction sub-network may include: processing the training feature spectrum and the at least one mouth-blurred picture by using the feature extraction sub-network to be trained to obtain a training visual feature vector and a training audio feature vector; calculating a loss value of the feature extraction sub-network from the training visual feature vector and the training audio feature vector via the loss function corresponding to the feature extraction sub-network; correcting the parameters of the feature extraction sub-network to be trained based on the loss value; and, when the loss value corresponding to the feature extraction sub-network to be trained does not satisfy a predetermined accuracy condition, continuing to input the training feature spectrum and the at least one mouth-blurred picture to repeat the above training process.
For example, during the training of the feature extraction sub-network, the gradient feature map corresponding to each mouth-blurred picture may also be input; for the specific input process, reference may be made to the description in the video processing method, which is not repeated here.
The training goal of the feature extraction sub-network is that the output visual feature vectors match the audio feature vectors; for the concept of matching, reference may be made to the description above. For example, the i-th feature element of the visual feature vector and the i-th feature element of the audio feature vector should match, meaning that the corresponding feature values of the visual feature vector and the audio feature vector are very close or identical. Therefore, during training, the loss value is calculated from the training visual feature vector and the training audio feature vector, and the parameters of the feature extraction sub-network are corrected based on the loss value, so that the visual feature vectors output by the trained feature extraction sub-network are consistent with the audio feature vectors.
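A minimal PyTorch-style sketch of one training step is given below; the mean-squared-error loss and the interface of the feature extraction sub-network (returning a visual feature vector and an audio feature vector for the same time points) are assumptions, since the disclosure does not fix the exact loss function or interface.

```python
import torch.nn.functional as F

def feature_extraction_train_step(feature_extractor, optimizer,
                                  mouth_blur_batch, gradient_batch, spectrum_batch):
    # The sub-network is assumed to return a training visual feature vector
    # and a training audio feature vector for the same time points.
    visual_vecs, audio_vecs = feature_extractor(mouth_blur_batch,
                                                gradient_batch, spectrum_batch)
    # Pull the visual feature vectors toward the audio feature vectors so
    # that the two become consistent (one possible choice of loss).
    loss = F.mse_loss(visual_vecs, audio_vecs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```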
After the training of the feature extraction sub-network is completed, step S60 may further include: processing the training feature spectrum and the at least one mouth-blurred picture by using the trained feature extraction sub-network to obtain at least one target visual feature vector; and training the decoding-generation sub-network based on the at least one target visual feature vector and the training video.
For example, training the decoding-generation sub-network based on the at least one target visual feature vector and the training video may include: training the decoding-generation sub-network using the mouth position information provided by the plurality of mouth key points in combination with the at least one target visual feature vector. In this process the mouth key points assist the training so that the mouth shape position becomes more accurate; for the specific technical effects of the mouth key points, reference may be made to the description above, which is not repeated here.
For example, the neural network further includes a discrimination sub-network, and the discrimination sub-network and the decoding-generation sub-network constitute a generative adversarial network (GAN). During the training of the decoding-generation sub-network, the generative adversarial network is trained by alternating iterations to obtain the trained decoding-generation sub-network.
For example, the decoding-generation sub-network acts as the generator of the generative adversarial network and generates images to "fool" the discriminator, while the discrimination sub-network acts as the discriminator of the generative adversarial network and judges the authenticity of the images generated by the decoding-generation sub-network. For example, during training, the generator first continuously generates image data that is judged by the discriminator; in this stage the discriminator's parameters are not adjusted, and only the generator is trained and its parameters adjusted, until the discriminator cannot judge the authenticity of the images generated by the generator. Then, the generator's parameters are fixed and the discriminator is trained until it can accurately judge the authenticity of the generator's images. This process is then repeated, with the generation and discrimination capabilities of the generator and the discriminator steadily improving, until a generator with the best generation effect is obtained.
图6为本公开一实施例提供的一种神经网络的结构示意图。Fig. 6 is a schematic structural diagram of a neural network provided by an embodiment of the present disclosure.
如图6所示,本公开至少一实施例提供的神经网络100包括视频处理网络101和判别子网络102,视频处理网络101包括特征提取子网络1011和解码生成子网络1012,并且,解码生成子网络1012和判别子网络102构成生成式对抗网络。As shown in FIG. 6 , the neural network 100 provided by at least one embodiment of the present disclosure includes a video processing network 101 and a discrimination subnetwork 102, the video processing network 101 includes a feature extraction subnetwork 1011 and a decoding generation subnetwork 1012, and the decoding generation subnetwork 1012 The network 1012 and the discriminative sub-network 102 constitute a generative adversarial network.
下面结合图6,具体说明视频处理网络101的训练过程。The training process of the video processing network 101 will be described in detail below with reference to FIG. 6 .
首先,先对特征提取子网络1011进行训练。例如,参考步骤S50的描述得到多个训练帧图像分别对应的多个嘴部模糊图片,以及多个嘴部模糊图片分别对应的多个梯度特征图,对训练音频片段进行频谱转换处理,得到训练特征频谱,将多个嘴部模糊图片、多个梯度特征图和特征频谱一起输入特征提取子网络1011进行处理,得到视觉特征向量和音频特征向量。之后,根据视觉特征向量和音频特征向量进行损失值计算,根据损失值调整特征提取子网络的参数,直到特征提取子网络对应的损失值满足预定准确率条件,得到训练好的特征提取子网络1011。First, the feature extraction sub-network 1011 is trained. For example, refer to the description of step S50 to obtain a plurality of blurred mouth pictures corresponding to a plurality of training frame images, and a plurality of gradient feature maps corresponding to a plurality of blurred mouth pictures respectively, and perform spectral conversion processing on the training audio clips to obtain training The feature spectrum is to input multiple blurred mouth pictures, multiple gradient feature maps and feature spectrum into the feature extraction sub-network 1011 for processing to obtain visual feature vectors and audio feature vectors. After that, calculate the loss value according to the visual feature vector and the audio feature vector, adjust the parameters of the feature extraction sub-network according to the loss value, until the loss value corresponding to the feature extraction sub-network meets the predetermined accuracy rate condition, and obtain the trained feature extraction sub-network 1011 .
At this point, the visual feature vector and the audio feature vector output by the trained feature extraction sub-network 1011 are consistent with each other.
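As a minimal sketch of such a training loop, the code below uses a cosine-distance term to push the visual and audio feature vectors toward consistency. The specific loss function and stopping condition are not fixed by this passage, so the cosine form, the `target_loss` threshold, and the batch layout of `loader` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def audio_visual_consistency_loss(visual_vecs, audio_vecs):
    # visual_vecs, audio_vecs: (batch, dim) embeddings produced by the feature
    # extraction sub-network from blurred mouth pictures (plus gradient feature
    # maps) and from the training feature spectrum, respectively.
    cos = F.cosine_similarity(visual_vecs, audio_vecs, dim=-1)
    # Small when the two embeddings agree, i.e. when they are "consistent".
    return (1.0 - cos).mean()

def train_feature_extractor(extractor, optimizer, loader, target_loss=0.05):
    # Repeat until the loss meets a predetermined accuracy condition.
    loss = float("inf")
    while loss > target_loss:
        for mouth_pics, grad_maps, spectrum in loader:
            optimizer.zero_grad()
            visual_vecs, audio_vecs = extractor(mouth_pics, grad_maps, spectrum)
            batch_loss = audio_visual_consistency_loss(visual_vecs, audio_vecs)
            batch_loss.backward()
            optimizer.step()
            loss = batch_loss.item()
    return extractor
```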
Afterwards, the decoding generation sub-network 1012 is trained in combination with the trained feature extraction sub-network 1011.
For example, after the blurred mouth pictures are input into the feature extraction sub-network 1011, a plurality of target visual feature vectors are obtained; at this point, the target visual feature vectors are consistent with the audio feature vectors output by the feature extraction sub-network 1011.
The target visual feature vectors and the mouth key points are input into the decoding generation sub-network 1012 for processing to obtain output frames. The mouth shape of the object in an output frame changes, but the change may differ from the mouth shape in the training frame image corresponding to the same display time point.
The output frames and the training frame images are input into the discrimination sub-network 102, which takes the mouth shapes in the training frame images as the standard. The decoding generation sub-network 1012 and the discrimination sub-network 102 are trained alternately with reference to the process described above, loss values are computed based on a binary cross-entropy loss function, and the parameters of the discrimination sub-network 102 and the decoding generation sub-network 1012 are corrected in alternation until the trained decoding generation sub-network 1012 is obtained.
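For this particular decoder/discriminator pair, the two phase losses referenced in the earlier scheduling sketch could be built from binary cross-entropy as follows. This is only an assumed illustration: the names `decoder` and `discriminator`, the batch fields, and the exact way the key points enter the decoder are not taken from the original disclosure.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_phase_loss(decoder, discriminator, batch):
    visual_vecs, mouth_keypoints, training_frames = batch
    # Generated output frames; detached so the decoder is not updated in this phase.
    output_frames = decoder(visual_vecs, mouth_keypoints).detach()
    real_logits = discriminator(training_frames)   # training frames are the "real" standard
    fake_logits = discriminator(output_frames)     # generated output frames are "fake"
    return (bce(real_logits, torch.ones_like(real_logits))
            + bce(fake_logits, torch.zeros_like(fake_logits)))

def generator_phase_loss(decoder, discriminator, batch):
    visual_vecs, mouth_keypoints, _ = batch
    output_frames = decoder(visual_vecs, mouth_keypoints)
    logits = discriminator(output_frames)
    # The decoder is rewarded when its output frames are judged as real.
    return bce(logits, torch.ones_like(logits))
```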
In the above embodiments, since a blurred mouth picture still retains the overall outline of the picture, the network does not need to create content from scratch, which helps the network converge quickly, speeds up the training of the feature extraction sub-network, and reduces training difficulty and time overhead. The gradient feature map provides the extent of the blurred region and the non-blurred region in the blurred mouth picture, so that the network can quickly locate the mouth region and converge faster. In addition, the mouth key points provide mouth position information, so that during training the decoding generation sub-network mainly considers image information of the mouth and its surrounding muscles and does not need to learn the overall facial contour, orientation and structure, which effectively improves training efficiency and yields a more accurate video processing network.
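One simple way to compute such a gradient feature map, assuming the two zero-sum convolution kernels of different sizes described elsewhere in this disclosure, is sketched below in Python. The particular 3x3 and 5x5 Laplacian-like kernels and the way the two responses are combined are assumptions made only for illustration.

```python
import cv2
import numpy as np

def gradient_feature_map(blurred_mouth_bgr):
    # Grayscale version of the blurred mouth picture.
    gray = cv2.cvtColor(blurred_mouth_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # Two zero-sum convolution kernels; the first is smaller than the second.
    k1 = np.full((3, 3), -1.0, dtype=np.float32); k1[1, 1] = 8.0    # elements sum to 0
    k2 = np.full((5, 5), -1.0, dtype=np.float32); k2[2, 2] = 24.0   # elements sum to 0

    # Convolve the grayscale image with both kernels.
    g1 = cv2.filter2D(gray, cv2.CV_32F, k1)
    g2 = cv2.filter2D(gray, cv2.CV_32F, k2)

    # Combining the absolute responses by averaging is an illustrative choice.
    grad = (np.abs(g1) + np.abs(g2)) / 2.0

    # Strong responses mark sharp (non-blurred) areas, weak responses mark the
    # blurred mouth area, so the map delimits the two regions.
    return cv2.normalize(grad, None, 0.0, 1.0, cv2.NORM_MINMAX)
```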
At least one embodiment of the present disclosure further provides a video processing apparatus. FIG. 7 is a schematic block diagram of a video processing apparatus provided by at least one embodiment of the present disclosure.
As shown in FIG. 7, the video processing apparatus 200 may include an acquisition unit 201, a preprocessing unit 202 and a video processing unit 203. These components are interconnected by a bus system and/or other forms of connection mechanism (not shown). It should be noted that the components and structure of the video processing apparatus 200 shown in FIG. 7 are exemplary rather than limiting, and the video processing apparatus 200 may also have other components and structures as required.
For example, these modules may be implemented by hardware (for example, circuit) modules, software modules, or any combination of the two; the same applies to the following embodiments and will not be repeated. For example, these units may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field-programmable gate array (FPGA) or other forms of processing units having data processing capability and/or instruction execution capability, together with corresponding computer instructions.
For example, the acquisition unit 201 is configured to acquire at least one frame image and an audio clip; for example, each frame image includes at least one object, and each object includes a face region.
For example, the acquisition unit 201 may include a memory that stores the frame images and the audio clip. For example, the acquisition unit 201 may include one or more cameras to shoot or record a video including multiple frame images, or a static frame image of an object. In addition, the acquisition unit 201 may also include a recording device to obtain the audio clip. For example, the acquisition unit 201 may be hardware, software, firmware, or any feasible combination thereof.
For example, the preprocessing unit 202 is configured to preprocess the at least one frame image to obtain mouth feature information of the face region.
For example, the video processing unit 203 may include a video processing network 204. Based on the mouth feature information and the audio clip, the video processing unit 203 uses the video processing network 204 to process the at least one frame image to obtain a target video, where the object in the target video has mouth shape changes synchronized with the audio clip.
The video processing network 204 includes a feature extraction sub-network and a decoding generation sub-network. It should be noted that the video processing network 204 in the video processing unit 203 has the same structure and function as the video processing network in the embodiments of the video processing method described above, which will not be repeated here.
It should be noted that the acquisition unit 201 may be used to implement step S10 shown in FIG. 1, the preprocessing unit 202 may be used to implement step S20 shown in FIG. 1, and the video processing unit 203 may be used to implement step S30 shown in FIG. 1. Therefore, for specific descriptions of the functions that can be realized by the acquisition unit 201, the preprocessing unit 202 and the video processing unit 203, reference may be made to the descriptions of steps S10 to S30 in the embodiments of the video processing method above, and repeated descriptions are omitted. In addition, the video processing apparatus 200 can achieve technical effects similar to those of the aforementioned video processing method, which will not be repeated here.
At least one embodiment of the present disclosure further provides a training apparatus for a neural network. FIG. 8 is a schematic block diagram of a training apparatus provided by at least one embodiment of the present disclosure.
As shown in FIG. 8, the training apparatus 300 may include a training data acquisition unit 301, a preprocessing unit 302 and a training unit 303. These components are interconnected by a bus system and/or other forms of connection mechanism (not shown). It should be noted that the components and structure of the training apparatus 300 shown in FIG. 8 are exemplary rather than limiting, and the training apparatus 300 may also have other components and structures as required.
For example, the training data acquisition unit 301 is configured to acquire a training video and a training audio clip matched with the training video. For example, the training video includes at least one training frame image, each training frame image includes at least one object, and each object includes a face region.
For example, the preprocessing unit 302 is configured to preprocess the training video to obtain mouth feature information of the face region.
For example, the training unit 303 is configured to train the video processing network based on the mouth feature information and the training audio clip.
For example, the training unit 303 includes a neural network 304 and a loss function (not shown), the neural network 304 includes the video processing network, and the training unit 303 is used to train the neural network 304 to be trained so as to obtain a trained video processing network.
For example, the video processing network includes a feature extraction sub-network and a decoding generation sub-network, and the neural network 304 further includes a discrimination sub-network; the discrimination sub-network and the decoding generation sub-network constitute a generative adversarial network. It should be noted that the neural network 304 in the training unit 303 has the same structure and function as the neural network 100 in the embodiments of the neural network training method described above, which will not be repeated here.
It should be noted that the training data acquisition unit 301 may be used to implement step S40 shown in FIG. 5, the preprocessing unit 302 may be used to implement step S50 shown in FIG. 5, and the training unit 303 may be used to implement step S60 shown in FIG. 5. Therefore, for specific descriptions of the functions that can be realized by the training data acquisition unit 301, the preprocessing unit 302 and the training unit 303, reference may be made to the descriptions of steps S40 to S60 in the embodiments of the training method above, and repeated descriptions are omitted. In addition, the training apparatus 300 can achieve technical effects similar to those of the aforementioned training method, which will not be repeated here.
FIG. 9 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 9, the electronic device 400 is, for example, suitable for implementing the video processing method or the training method provided by the embodiments of the present disclosure. It should be noted that the components of the electronic device 400 shown in FIG. 9 are only exemplary rather than limiting, and the electronic device 400 may also have other components according to actual application requirements.
As shown in FIG. 9, the electronic device 400 may include a processing device (for example, a central processing unit, a graphics processing unit, etc.) 401, which may perform various appropriate actions and processing according to non-transitory computer-readable instructions stored in a memory, so as to realize various functions.
For example, when the computer-readable instructions are run by the processing device 401, one or more steps of the video processing method according to any of the above embodiments may be executed. It should be noted that, for a detailed description of the processing procedure of the video processing method, reference may be made to the relevant descriptions in the embodiments of the video processing method above, and repeated descriptions are omitted.
For example, when the computer-readable instructions are run by the processing device 401, one or more steps of the neural network training method according to any of the above embodiments may be executed. It should be noted that, for a detailed description of the processing procedure of the training method, reference may be made to the relevant descriptions in the embodiments of the training method above, and repeated descriptions are omitted.
For example, the memory may include any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, a random access memory (RAM) 403 and/or a cache; for example, the computer-readable instructions may be loaded from a storage device 408 into the random access memory (RAM) 403 to be run. The non-volatile memory may include, for example, a read-only memory (ROM) 402, a hard disk, an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, a flash memory, and the like. Various applications and various data, such as style images and various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
For example, the processing device 401, the ROM 402 and the RAM 403 are connected to each other through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following devices may be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 408 including, for example, a magnetic tape, a hard disk, a flash memory, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to perform wireless or wired communication with other electronic devices to exchange data. Although FIG. 9 shows the electronic device 400 having various devices, it should be understood that it is not required to implement or include all of the devices shown, and the electronic device 400 may alternatively implement or include more or fewer devices. For example, the processor 401 may control other components in the electronic device 400 to perform desired functions. The processor 401 may be a device having data processing capability and/or program execution capability, such as a central processing unit (CPU), a tensor processing unit (TPU) or a graphics processing unit (GPU). The central processing unit (CPU) may be of an X86 or ARM architecture, etc. The GPU may be integrated directly on the motherboard as a separate component, or built into the north bridge chip of the motherboard; the GPU may also be built into the central processing unit (CPU).
FIG. 10 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure. For example, as shown in FIG. 10, the storage medium 500 may be a non-transitory computer-readable storage medium, and one or more computer-readable instructions 501 may be stored non-transitorily on the storage medium 500. For example, when the computer-readable instructions 501 are executed by a processor, one or more steps of the video processing method or the training method described above may be executed.
For example, the storage medium 500 may be applied to the above-mentioned electronic device; for example, the storage medium 500 may include the memory in the electronic device.
For example, the storage medium may include a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a flash memory, or any combination of the above storage media, and may also be other applicable storage media.
For example, for the description of the storage medium 500, reference may be made to the description of the memory in the embodiments of the electronic device, and repeated descriptions are omitted.
FIG. 11 is a schematic diagram of a hardware environment provided by at least one embodiment of the present disclosure. The electronic device provided by the present disclosure may be applied in an Internet system.
The functions of the image processing apparatus and/or electronic device involved in the present disclosure may be realized by the computer system provided in FIG. 11. Such a computer system may include a personal computer, a notebook computer, a tablet computer, a mobile phone, a personal digital assistant, smart glasses, a smart watch, a smart ring, a smart helmet, and any smart portable or wearable device. The specific system in this embodiment uses a functional block diagram to illustrate a hardware platform including a user interface. Such a computer device may be a general-purpose computer device or a special-purpose computer device, and both may be used to realize the image processing apparatus and/or electronic device in this embodiment. The computer system may include any component needed to implement the image processing described herein. For example, the computer system may be realized by a computer device through its hardware devices, software programs, firmware, and combinations thereof. For convenience, only one computer device is drawn in FIG. 11, but the computer functions related to the image processing described in this embodiment may be implemented in a distributed manner by a group of similar platforms, spreading the processing load of the computer system.
As shown in FIG. 11, the computer system may include a communication port 250, which is connected to a network for data communication; for example, the computer system may send and receive information and data through the communication port 250, that is, the communication port 250 allows the computer system to perform wireless or wired communication with other electronic devices to exchange data. The computer system may also include a processor group 220 (that is, the processor described above) for executing program instructions; the processor group 220 may consist of at least one processor (for example, a CPU). The computer system may include an internal communication bus 210. The computer system may include different forms of program storage units and data storage units (that is, the memory or storage medium described above), such as a hard disk 270, a read-only memory (ROM) 230 and a random access memory (RAM) 240, which can be used to store various data files used by the computer for processing and/or communication, as well as possible program instructions executed by the processor group 220. The computer system may also include an input/output component 260, which is used to realize the input/output data flow between the computer system and other components (for example, a user interface 280).
Generally, the following devices may be connected to the input/output component 260: input devices such as a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; output devices such as a display (for example, an LCD or OLED display), a speaker and a vibrator; storage devices including, for example, a magnetic tape and a hard disk; and a communication interface.
Although FIG. 11 shows a computer system having various devices, it should be understood that the computer system is not required to include all of the devices shown; alternatively, the computer system may include more or fewer devices.
For the present disclosure, the following points also need to be explained:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure, and other structures may refer to general designs.
(2) For the sake of clarity, in the drawings used to describe the embodiments of the present disclosure, the thickness and size of layers or structures are exaggerated. It will be understood that when an element such as a layer, film, region or substrate is referred to as being "on" or "under" another element, the element may be "directly on" or "directly under" the other element, or intervening elements may be present.
(3) In case of no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.
The above descriptions are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (22)

  1. A video processing method, comprising:
    acquiring at least one frame image and an audio clip, wherein each frame image comprises at least one object, and each object comprises a face region;
    preprocessing the at least one frame image to obtain mouth feature information of the face region;
    based on the mouth feature information and the audio clip, processing the at least one frame image by using a video processing network to obtain a target video,
    wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with a basic outline of the face region and the mouth of each object, and a positional relationship between the face region and the mouth of each object.
  2. The video processing method according to claim 1, wherein preprocessing the at least one frame image to obtain the mouth feature information of the face region comprises:
    blurring the mouth of the object in each frame image by using a mouth blur model to obtain a blurred mouth picture corresponding to each frame image,
    wherein the mouth feature information comprises at least one blurred mouth picture respectively corresponding to the at least one frame image.
  3. The video processing method according to claim 2, wherein blurring the mouth of the object in each frame image by using the mouth blur model to obtain the blurred mouth picture corresponding to each frame image comprises:
    performing a first color space conversion on the frame image to obtain a first converted image;
    extracting a mouth region in the first converted image, and performing a first filtering process on the mouth region to obtain the blurred mouth picture corresponding to the frame image.
  4. The video processing method according to claim 2, wherein blurring the mouth of the object in each frame image by using the mouth blur model to obtain the blurred mouth picture corresponding to each frame image comprises:
    performing a first color space conversion on the frame image to obtain a first converted image;
    extracting a mouth region in the first converted image, and performing a first filtering process on the mouth region to obtain a first intermediate blurred image;
    performing a second color space conversion on the frame image to obtain a second converted image;
    extracting a skin region in the second converted image, and selecting a preset region including the mouth from the skin region;
    performing a second filtering process on the preset region to obtain a second intermediate blurred image;
    synthesizing the first intermediate blurred image and the second intermediate blurred image to obtain the blurred mouth picture corresponding to the frame image.
  5. The video processing method according to claim 4, wherein the first color space is an HSI color space, and the second color space is a YCbCr color space.
  6. The video processing method according to any one of claims 2-5, wherein preprocessing the at least one frame image to obtain the mouth feature information of the face region further comprises:
    performing gradient feature extraction on the at least one blurred mouth picture to obtain a gradient feature map corresponding to each blurred mouth picture, wherein the mouth feature information further comprises at least one gradient feature map respectively corresponding to the at least one blurred mouth picture.
  7. The video processing method according to claim 6, wherein performing gradient feature extraction on the at least one blurred mouth picture to obtain the gradient feature map corresponding to each blurred mouth picture comprises:
    acquiring a grayscale image corresponding to each blurred mouth picture;
    acquiring a first convolution kernel and a second convolution kernel, wherein the size of the first convolution kernel is smaller than the size of the second convolution kernel, the sum of all elements in the first convolution kernel is 0, and the sum of all elements in the second convolution kernel is 0;
    convolving the grayscale image with the first convolution kernel and the second convolution kernel to obtain the gradient map corresponding to each blurred mouth picture.
  8. The video processing method according to any one of claims 2-7, wherein preprocessing the at least one frame image to obtain the mouth feature information of the face region further comprises:
    processing each frame image by using a facial key point detection model to obtain a plurality of facial key points;
    extracting a plurality of mouth key points related to the mouth from the plurality of facial key points, wherein the mouth feature information further comprises the plurality of mouth key points.
  9. The video processing method according to any one of claims 2-8, wherein the video processing network comprises a feature extraction sub-network and a decoding generation sub-network,
    and processing the at least one frame image by using the video processing network based on the mouth feature information and the audio clip comprises:
    performing spectral conversion processing on the audio clip to obtain a feature spectrum;
    performing feature extraction processing on the at least one blurred mouth picture and the feature spectrum by using the feature extraction sub-network to obtain M visual feature vectors, wherein the M visual feature vectors match the audio clip, and M is a positive integer less than or equal to the number of the at least one blurred mouth picture;
    processing the M visual feature vectors by using the decoding generation sub-network to obtain M target frames, wherein the M target frames correspond one-to-one to M time points in the audio clip, and each of the M target frames has a mouth shape corresponding to the corresponding time point in the audio clip;
    obtaining the target video according to the M target frames.
  10. The video processing method according to claim 9, wherein performing feature extraction processing on the at least one blurred mouth picture and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors comprises:
    dividing the at least one blurred mouth picture into M groups in sequence, and extracting a visual feature vector corresponding to each group by using the feature extraction sub-network, so as to obtain the M visual feature vectors.
  11. The video processing method according to claim 9 or 10, wherein the mouth feature information further comprises at least one gradient feature map respectively corresponding to the at least one blurred mouth picture,
    and performing feature extraction processing on the at least one blurred mouth picture and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors comprises:
    performing feature extraction processing on the at least one blurred mouth picture, the at least one gradient feature map and the feature spectrum by using the feature extraction sub-network to obtain the M visual feature vectors, wherein the at least one gradient feature map is used to provide the feature extraction sub-network with the extent of the blurred region and the non-blurred region in the corresponding blurred mouth picture.
  12. The video processing method according to any one of claims 9-11, wherein the mouth feature information further comprises a plurality of mouth key points,
    and processing the M visual feature vectors by using the decoding generation sub-network to obtain the M target frames comprises:
    processing each visual feature vector by using the decoding generation sub-network to generate an intermediate frame with a mouth region;
    correcting the position and image information of the mouth region of the intermediate frame by using the plurality of mouth key points to obtain the target frame corresponding to the visual feature vector.
  13. A training method for a neural network, wherein the neural network comprises a video processing network, and the training method comprises:
    acquiring a training video and a training audio clip matched with the training video, wherein the training video comprises at least one training frame image, each training frame image comprises at least one object, and each object comprises a face region;
    preprocessing the training video to obtain mouth feature information corresponding to the training video;
    training the video processing network based on the mouth feature information and the training audio clip,
    wherein the mouth feature information is at least used to provide the video processing network with a basic outline of the face region and the mouth of each object, and a positional relationship between the face region and the mouth of each object.
  14. The training method according to claim 13, wherein the video processing network comprises a feature extraction sub-network,
    and training the video processing network based on the mouth feature information and the training audio clip comprises:
    performing spectral conversion processing on the training audio clip to obtain a training feature spectrum;
    training the feature extraction sub-network to be trained by using the training feature spectrum and the mouth feature information, so as to obtain the trained feature extraction sub-network.
  15. The training method according to claim 14, wherein the mouth feature information comprises at least one blurred mouth picture,
    and training the feature extraction sub-network to be trained by using the training feature spectrum and the mouth feature information to obtain the trained feature extraction sub-network comprises:
    processing the training feature spectrum and the at least one blurred mouth picture by using the feature extraction sub-network to be trained to obtain a training visual feature vector and a training audio feature vector;
    calculating a loss value of the feature extraction sub-network through a loss function corresponding to the feature extraction sub-network according to the training visual feature vector and the training audio feature vector;
    correcting parameters of the feature extraction sub-network to be trained based on the loss value; and
    when the loss value corresponding to the feature extraction sub-network to be trained does not meet a predetermined accuracy condition, continuing to input the training feature spectrum and the at least one blurred mouth picture to repeat the above training process.
  16. The training method according to claim 15, wherein the mouth feature information comprises at least one blurred mouth picture,
    the video processing network further comprises a decoding generation sub-network,
    and training the video processing network based on the mouth feature information and the training audio clip further comprises:
    processing the training feature spectrum and the at least one blurred mouth picture by using the trained feature extraction sub-network to obtain at least one target visual feature vector;
    training the decoding generation sub-network according to the at least one target visual feature vector and the training video.
  17. The training method according to claim 16, wherein the mouth feature information further comprises a plurality of mouth key points,
    and training the decoding generation sub-network according to the at least one target visual feature vector and the training video comprises:
    training the decoding generation sub-network by using the mouth position information provided by the plurality of mouth key points in combination with the at least one target visual feature vector.
  18. The training method according to claim 16 or 17, wherein the neural network further comprises a discrimination sub-network, and the discrimination sub-network and the decoding generation sub-network constitute a generative adversarial network,
    and in the process of training the decoding generation sub-network, the generative adversarial network is trained alternately and iteratively to obtain the trained decoding generation sub-network.
  19. A video processing apparatus, comprising:
    an acquisition unit, configured to acquire at least one frame image and an audio clip, wherein each frame image comprises at least one object, and each object comprises a face region;
    a preprocessing unit, configured to preprocess the at least one frame image to obtain mouth feature information of the face region;
    a video processing unit, configured to process the at least one frame image by using a video processing network based on the mouth feature information and the audio clip to obtain a target video, wherein the object in the target video has mouth shape changes synchronized with the audio clip, and the mouth feature information is at least used to provide the video processing network with a basic outline of the face region and the mouth of each object, and a positional relationship between the face region and the mouth of each object.
  20. A training apparatus for a neural network, wherein the neural network comprises a video processing network,
    and the training apparatus comprises:
    a training data acquisition unit, configured to acquire a training video and a training audio clip matched with the training video, wherein the training video comprises at least one training frame image, each training frame image comprises at least one object, and each object comprises a face region;
    a preprocessing unit, configured to preprocess the training video to obtain mouth feature information of the face region;
    a training unit, configured to train the video processing network based on the mouth feature information and the training audio clip,
    wherein the mouth feature information is at least used to provide the video processing network with a basic outline of the face region and the mouth of each object, and a positional relationship between the face region and the mouth of each object.
  21. An electronic device, comprising:
    a memory, non-transitorily storing computer-executable instructions;
    a processor, configured to run the computer-executable instructions,
    wherein, when the computer-executable instructions are run by the processor, the video processing method according to any one of claims 1-12 or the training method for a neural network according to any one of claims 13-18 is implemented.
  22. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions,
    and when the computer-executable instructions are executed by a processor, the video processing method according to any one of claims 1-12 or the training method for a neural network according to any one of claims 13-18 is implemented.
PCT/CN2022/088965 2021-11-04 2022-04-25 Video processing method and apparatus, and neural network training method and apparatus WO2023077742A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111296799.X 2021-11-04
CN202111296799.XA CN113723385B (en) 2021-11-04 2021-11-04 Video processing method and device and neural network training method and device

Publications (1)

Publication Number Publication Date
WO2023077742A1 true WO2023077742A1 (en) 2023-05-11

Family

ID=78686675

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/088965 WO2023077742A1 (en) 2021-11-04 2022-04-25 Video processing method and apparatus, and neural network training method and apparatus

Country Status (2)

Country Link
CN (1) CN113723385B (en)
WO (1) WO2023077742A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723385B (en) * 2021-11-04 2022-05-17 新东方教育科技集团有限公司 Video processing method and device and neural network training method and device
CN114419702B (en) * 2021-12-31 2023-12-01 南京硅基智能科技有限公司 Digital person generation model, training method of model, and digital person generation method
CN116668611A (en) * 2023-07-27 2023-08-29 小哆智能科技(北京)有限公司 Virtual digital human lip synchronization method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024156B (en) * 2010-11-16 2012-07-04 中国人民解放军国防科学技术大学 Method for positioning lip region in color face image
KR20210048441A (en) * 2018-05-24 2021-05-03 워너 브로스. 엔터테인먼트 인크. Matching mouth shape and movement in digital video to alternative audio
CN111212245B (en) * 2020-01-15 2022-03-25 北京猿力未来科技有限公司 Method and device for synthesizing video
CN111783566B (en) * 2020-06-15 2023-10-31 神思电子技术股份有限公司 Video synthesis method based on lip synchronization and enhancement of mental adaptation effect

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102271241A (en) * 2011-09-02 2011-12-07 北京邮电大学 Image communication method and system based on facial expression/action recognition
US20160343389A1 (en) * 2015-05-19 2016-11-24 Bxb Electronics Co., Ltd. Voice Control System, Voice Control Method, Computer Program Product, and Computer Readable Medium
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN113378697A (en) * 2021-06-08 2021-09-10 安徽大学 Method and device for generating speaking face video based on convolutional neural network
CN113723385A (en) * 2021-11-04 2021-11-30 新东方教育科技集团有限公司 Video processing method and device and neural network training method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FUDONG NIAN, WANG WENTAO, WANG YAN, ZHANG JINGJING, HU GUIHENG, LI TENG: "Speech Driven Talking Face Video Generation via Landmarks Representation", PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, vol. 34, no. 6, 15 June 2021 (2021-06-15), XP093063128 *
QIU XIAOXIN; ZHANG WENQIANG: "Adaptive Facial Skin Region Extraction Based on Lip Color and Complexion", WEIXING DIANNAO YINGYONG - MICROCOMPUTER APPLICATIONS, SHANGHAI SHI WEIXING DIANNAO YINGYONG XUEHUI, CN, vol. 31, no. 8, 20 August 2015 (2015-08-20), CN , pages 1 - 4, XP009545346, ISSN: 1007-757X *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557626A (en) * 2024-01-12 2024-02-13 泰安大陆医疗器械有限公司 Auxiliary positioning method for spray head installation of aerosol sprayer
CN117557626B (en) * 2024-01-12 2024-04-05 泰安大陆医疗器械有限公司 Auxiliary positioning method for spray head installation of aerosol sprayer

Also Published As

Publication number Publication date
CN113723385B (en) 2022-05-17
CN113723385A (en) 2021-11-30

Similar Documents

Publication Publication Date Title
WO2023077742A1 (en) Video processing method and apparatus, and neural network training method and apparatus
US9811894B2 (en) Image processing method and apparatus
Grishchenko et al. Attention mesh: High-fidelity face mesh prediction in real-time
US11410457B2 (en) Face reenactment
JP6636154B2 (en) Face image processing method and apparatus, and storage medium
CN107993216B (en) Image fusion method and equipment, storage medium and terminal thereof
WO2017035966A1 (en) Method and device for processing facial image
US11900557B2 (en) Three-dimensional face model generation method and apparatus, device, and medium
CN107771336A (en) Feature detection and mask in image based on distribution of color
KR102045575B1 (en) Smart mirror display device
JP2006163871A (en) Image processor and processing method, and program
CN115699114A (en) Image augmentation for analysis
WO2022151655A1 (en) Data set generation method and apparatus, forgery detection method and apparatus, device, medium and program
CN112995534B (en) Video generation method, device, equipment and readable storage medium
WO2023066120A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN113052783A (en) Face image fusion method based on face key points
Chen et al. Sound to visual: Hierarchical cross-modal talking face video generation
US20200082609A1 (en) Image processing method and image processing device
CN110059739B (en) Image synthesis method, image synthesis device, electronic equipment and computer-readable storage medium
CN110675438A (en) Lightweight rapid face exchange algorithm
KR100422470B1 (en) Method and apparatus for replacing a model face of moving image
WO2021155666A1 (en) Method and apparatus for generating image
WO2022266878A1 (en) Scene determining method and apparatus, and computer readable storage medium
WO2022022260A1 (en) Image style transfer method and apparatus therefor
US9563940B2 (en) Smart image enhancements