CN115116468A - Video generation method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN115116468A
Authority
CN
China
Prior art keywords
frame
preselected
dimension
target
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210688868.XA
Other languages
Chinese (zh)
Inventor
杨红庄
甄海洋
王超
周维
王磊
王进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rainbow Software Co ltd
Original Assignee
Rainbow Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rainbow Software Co ltd filed Critical Rainbow Software Co ltd
Priority to CN202210688868.XA priority Critical patent/CN115116468A/en
Publication of CN115116468A publication Critical patent/CN115116468A/en
Priority to PCT/CN2023/094868 priority patent/WO2023241298A1/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention discloses a video generation method, which comprises: acquiring a frame sequence to be selected; determining a target frame from the frame sequence to be selected according to a frame selection dimension; and performing voice driving on the target frame based on a current voice signal to obtain a target video, wherein the frame selection dimension comprises at least one of a first frame selection dimension and a second frame selection dimension. Through this scheme, a target frame meeting the voice driving requirements is obtained, which improves the effect of subsequent voice driving.

Description

Video generation method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to video generation technologies, and in particular, to a video generation method, an apparatus, a storage medium, and an electronic device.
Background
Methods for generating video through voice driving have been widely used in various fields. In the prior art, video is typically generated by voice driving using a single, unscreened static frame as the input frame. However, voice driving places many requirements on the input frame: for example, the input frame needs to be clear, the face centered and the expression neutral, and these requirements are difficult to meet with an unscreened single static frame.
Disclosure of Invention
Compared with the related art, the technical solution disclosed in the present application obtains a target frame meeting the voice driving requirements and improves the effect of subsequent voice driving; it also solves the problem of facial detail loss that may be caused by expression coefficient changes during voice driving, so that the generated video is more vivid and natural.
To achieve the object of the embodiment of the present invention, an embodiment of the present invention provides a video generating method, where the method may include:
acquiring a frame sequence to be selected;
determining a target frame from the frame sequence to be selected according to the frame selection dimension;
performing voice driving on the target frame based on the current voice signal to acquire a target video;
wherein the frame selection dimension comprises at least one of a first frame selection dimension and a second frame selection dimension.
In an exemplary embodiment of the present invention, the determining, according to a frame selection dimension, a target frame from the frame sequence to be selected includes:
according to the frame selection dimension, obtaining a preselected frame meeting a frame selection condition from the frame sequence to be selected, wherein the preselected frame is one or more frames;
when the preselected frame is a single frame, the preselected frame is the target frame;
when the preselected frame is a plurality of frames, fusing the plurality of preselected frames to obtain the target frame;
the frame selection condition comprises at least one of a first frame selection condition and a second frame selection condition.
In an exemplary embodiment of the invention, the fusing includes at least one of a first fusing or a second fusing.
In an exemplary embodiment of the present invention, the acquiring, according to the frame selection dimension, a preselected frame that meets a frame selection condition from the frame sequence to be selected includes:
calculating a first dimension value of each frame in the frame sequence to be selected according to the first frame selection dimension; acquiring a first preselected frame of which the first dimension value meets the first frame selection condition from the frame sequence to be selected;
wherein the first preselected frame is one or more frames.
In an exemplary embodiment of the present invention, the first frame selection condition is that the first dimension value is within a first frame selection range.
In an exemplary embodiment of the present invention, when the first preselected frame is one frame, the first preselected frame is the target frame;
and when the first preselected frame is a plurality of frames, performing first fusion on the plurality of frames of the first preselected frame to obtain the target frame.
In an exemplary embodiment of the present invention, the acquiring, according to the frame selection dimension, a preselected frame that meets a frame selection condition from the frame sequence to be selected includes:
calculating a second dimension value of each frame in the frame sequence to be selected according to the second frame selection dimension;
acquiring a second preselected frame of which the second dimension value meets the second frame selection condition from the frame sequence to be selected;
wherein the second preselected frame is one or more frames.
In an exemplary embodiment of the present invention, when the second preselected frame is a single frame, the second preselected frame is the target frame;
and when the second preselected frame comprises a plurality of frames, second fusion is performed on the multiple second preselected frames to obtain the target frame.
In an exemplary embodiment of the present invention, the second frame selection condition is that the second dimension value or the second dimension comprehensive value is lowest or highest.
In an exemplary embodiment of the present invention, the acquiring, according to the frame selection dimension, a preselected frame that meets a frame selection condition from the frame sequence to be selected includes:
calculating a second dimension value of each frame in the first preselected frame according to the second frame selecting dimension;
acquiring a second preselected frame with the second dimension value meeting the second frame selecting condition from the first preselected frame;
wherein the second preselected frame is one or more frames.
In an exemplary embodiment of the present invention, when the second preselected frame is a single frame, the second preselected frame is the target frame;
and when the second preselected frame comprises a plurality of frames, second fusion is performed on the multiple second preselected frames to obtain the target frame.
In an exemplary embodiment of the present invention, when the first preselected frame comprises a plurality of frames, the second frame selection condition is that the second dimension value or the second dimension comprehensive value is lowest or highest.
In an exemplary embodiment of the present invention, when the first preselected frame is a single frame, the second frame selection condition is that the second dimension value is within a second frame selection range.
In an exemplary embodiment of the present invention, the performing voice driving on the target frame based on the current voice signal to obtain the target video includes:
generating a corresponding driving expression coefficient through the trained voice driving model according to the current voice signal;
matching the target frame with the driving expression coefficient to generate a key frame;
performing expression matching on the key frame based on the frame sequence to be selected and the target frame to obtain a driving frame;
successive driving frames constitute the target video.
An embodiment of the present invention further provides a video generating apparatus, which may include:
the acquisition unit is configured to acquire a frame sequence to be selected;
the frame selection unit is configured to determine a target frame from the frame sequence to be selected according to a frame selection dimension;
a driving unit configured to perform voice driving on the target frame based on the current voice signal to obtain a target video.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the video generation method described in any one of the above.
An embodiment of the present invention further provides an electronic device, which may include:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any of the video generation methods described above via execution of the executable instructions.
By the scheme of the embodiment of the invention, the target frame meeting the voice driving requirement is obtained, and the effect of subsequent voice driving is improved; meanwhile, the problem of facial detail loss possibly caused by expression coefficient change in the voice driving process is solved, and the generated video is more vivid and natural.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the present application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification and the drawings.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
Fig. 1 is a flow chart of a video generation method according to an embodiment of the present application;
FIG. 2 is a flow diagram of determining a target frame from a candidate frame sequence according to an embodiment of the present application;
FIG. 3 is a flow diagram of determining a target frame from a candidate frame sequence according to another embodiment of the present application;
FIG. 4a is a schematic view of an eye feature point according to an embodiment of the present application;
FIG. 4b is a schematic view of a feature point of a mouth according to an embodiment of the present application;
FIG. 5 is a flow diagram of determining a target frame from a candidate frame sequence according to yet another embodiment of the present application;
FIG. 6 is a flowchart of performing voice driving on a target frame to obtain a target video according to an embodiment of the present application;
fig. 7 is a flowchart of a video generation method in a video call according to an embodiment of the present application;
fig. 8 is a block diagram of a video generation apparatus according to an embodiment of the present application.
Detailed Description
The description herein describes embodiments, but is intended to be exemplary rather than limiting, and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
An embodiment of the present application provides a video generation method, as shown in fig. 1, the method includes:
s101: acquiring a frame sequence to be selected;
the frame sequence to be selected may include at least one of real-time video cache frames and pre-stored frames pre-shot by the user, and comprises at least two frames to be selected; a pre-stored frame pre-shot by the user can be a frame shot by the user according to a system prompt and based on different frame selection dimensions;
the frame to be selected needs to contain face information;
s102: determining a target frame from the frame sequence to be selected according to the frame selection dimension;
based on the requirements of the voice-driven model on the target frame, the frame selection dimension may include at least one of a first frame selection dimension and a second frame selection dimension, where the first frame selection dimension may be a picture dimension including at least one of face position, face orientation, body posture and light, and the second frame selection dimension may be at least one of an image quality dimension and a five sense organs dimension; the image quality dimension may include blur degree, shadow, noise and the like, and the five sense organs dimension may include at least one of an eye dimension and a mouth dimension;
the frame selection dimension can be preset and can also be automatically generated according to the requirements of the voice driving model;
the requirements of the voice-driven model on the target frame can comprise at least one of a picture requirement, an image quality requirement and a five sense organs requirement, where the picture requirement comprises at least one of the face position being centered, the face facing forward, the human body posture being neutral and the light being moderate; the image quality requirement can comprise the image being clear; and the five sense organs requirement can comprise at least one of the eyes being open and the mouth being closed;
the target frame is one or more frames meeting all frame selection conditions in the frame sequence to be selected;
specifically, step S102 may include:
s1021: according to the frame selection dimension, obtaining a preselected frame meeting the frame selection condition from a frame sequence to be selected, wherein the preselected frame is one or more frames;
the frame selection conditions meet the requirements of the voice-driven model on the target frame;
the frame selection condition can comprise at least one of a first frame selection condition and a second frame selection condition;
s1022: judging whether the preselected frame is a single frame;
s1023: when the preselected frame is a single frame, this preselected frame is the target frame;
s1024: when the preselected frame comprises a plurality of frames, fusing the multiple preselected frames to obtain the target frame;
the fusing comprises at least one of a first fusing or a second fusing;
s103: and performing voice driving on the target frame based on the current voice signal to generate a target video.
By the method in the embodiment, the target frame meeting the voice driving requirement can be obtained, and the effect of subsequent voice driving is improved.
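For orientation only, the overall flow of steps S101 to S103 can be summarized in the following minimal Python sketch. The helper names (compute_dimension_values, meets_conditions, fuse_frames, voice_drive) are hypothetical stand-ins for the dimension calculations, frame selection conditions, fusion and voice driving described in the embodiments below; this is not the claimed implementation.

```python
def generate_video(candidate_frames, voice_signal):
    # S101: candidate frames may come from a real-time video cache or from
    # pre-stored frames shot by the user (at least two frames, containing faces).
    # S102: keep the frames whose selection-dimension values satisfy the
    # frame selection conditions (first and/or second frame selection dimension).
    preselected = [f for f in candidate_frames
                   if meets_conditions(compute_dimension_values(f))]
    # A single qualifying frame is used directly; several qualifying frames are fused.
    target_frame = preselected[0] if len(preselected) == 1 else fuse_frames(preselected)
    # S103: drive the target frame with the current voice signal to obtain the video.
    return voice_drive(target_frame, voice_signal)
```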
The embodiment of the present application provides a method for acquiring a preselected frame meeting a frame selection condition from a frame sequence to be selected, as shown in fig. 2, the method includes:
s201: calculating a first dimension value of each frame in the frame sequence to be selected according to the first frame selection dimension;
in this embodiment, the first frame selection dimension is a picture dimension, and may include at least one of a face position, a face orientation, a body posture, and a light ray;
accordingly, the first dimension value may include at least one of a face position value, a face orientation value, a body pose value, and a light value;
in an exemplary embodiment, a method of calculating a face position value includes:
based on the face feature points, obtaining the central point bbox_center of the face bounding box in the frame to be selected, and calculating the horizontal-to-vertical coordinate ratio bbox_center_u/v of the central point bbox_center in the frame to be selected, where this ratio bbox_center_u/v is the face position value;
in an exemplary embodiment, the method of calculating a face orientation value includes:
acquiring a face orientation angle (roll, yaw, pitch) of a frame to be selected based on the face feature point, wherein the face orientation angle (roll, yaw, pitch) is a face orientation value;
in an exemplary embodiment, a method of calculating a human pose value includes:
obtaining a relative relation value T_val of the human body joint points by comparing reference (posture-corrected) human body joint points with the human body joint points of the frame to be selected, where the relative relation value T_val is the human body posture value;
in an exemplary embodiment, a method of calculating a light value includes:
counting the proportions of pixels whose brightness is smaller than the underexposure brightness threshold and larger than the overexposure brightness threshold to obtain an underexposure ratio and an overexposure ratio, where the underexposure ratio and the overexposure ratio are the light values;
The underexposure brightness threshold and the overexposure brightness threshold can be preset as required or generated automatically by the system;
Alternatively, the dark-region proportion of the frame to be selected can be obtained by counting the brightness distribution of its pixels, and this proportion is the light value;
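For illustration, a minimal Python sketch of two of the first-dimension calculations above (face position value and exposure-based light values) follows. The reading of the u/v ratio, the bounding-box input format, and the brightness thresholds are assumptions, not values fixed by this application.

```python
import numpy as np

def face_position_value(bbox):
    """bbox = (x, y, w, h) of the face bounding box obtained from face feature points."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0        # bbox_center
    return cx / cy                            # one reading of the ratio bbox_center_u/v

def light_values(gray, under_thr=40, over_thr=220):
    """Under-/over-exposure ratios of a grayscale frame (thresholds are illustrative)."""
    under_ratio = float(np.mean(gray < under_thr))   # proportion below the underexposure threshold
    over_ratio = float(np.mean(gray > over_thr))     # proportion above the overexposure threshold
    return under_ratio, over_ratio
```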
s202: acquiring a first preselected frame of which the first dimension value meets a first frame selection condition from a frame sequence to be selected;
the first frame selection condition is that the first dimension value is within a first frame selection range;
corresponding to the first dimension, the first frame selection range may include at least one of a face position range, a face orientation range, a body posture range, and a light range;
in an exemplary embodiment, the face position range is
TMin_u/v < bbox_center_u/v < TMax_u/v
where bbox_center_u/v is the face position value, TMin_u/v is the minimum threshold of the horizontal-to-vertical coordinate ratio, and TMax_u/v is the maximum threshold; TMin_u/v and TMax_u/v can be preset as required or generated automatically by the system;
the frame with the face position value within the face position range meets the requirement of face position centering;
in an exemplary embodiment, the face orientation range is
roll < T_roll, yaw < T_yaw, pitch < T_pitch
where (roll, yaw, pitch) is the face orientation value and (T_roll, T_yaw, T_pitch) is the face orientation threshold; (T_roll, T_yaw, T_pitch) can be preset as required or generated automatically by the system;
the frame with the face orientation value within the face orientation range meets the requirement that the face is oriented forwards;
in an exemplary embodiment, the human body posture range is
T_val < ε
where T_val is the human body posture value and ε is the human body posture threshold; the reference (posture-corrected) human body joint points and the human body posture threshold can be preset as required or generated automatically by the system;
the frame of the human body posture value in the human body posture range meets the neutral requirement of the human body posture;
in an exemplary embodiment, the range of light rays is,
the overexposure ratio is less than or equal to an overexposure threshold value, and the underexposure ratio is less than or equal to an underexposure threshold value;
the overexposure ratio and the underexposure ratio are the light values, and the overexposure threshold and the underexposure threshold can be preset as required or generated automatically by the system;
the frame with the light value in the light range meets the requirement of moderate light;
the first preselected frame may be one or more frames;
if no frame in the frame sequence to be selected meets the first frame selection condition, the user is prompted to shoot or upload an image into the frame sequence to be selected according to the first frame selection condition, until a frame meeting the first frame selection condition exists in the frame sequence to be selected, and this frame is the target frame;
s203: judging whether the first preselected frame is a single frame;
s204: when the first preselected frame is a single frame, this first preselected frame is the target frame;
s205: when the first preselected frame comprises a plurality of frames, first fusion is performed on the multiple first preselected frames to obtain the target frame;
specifically, the first fusing includes:
taking any frame in a first preselected frame of a plurality of frames as a reference frame, and taking other frames as matching frames;
acquiring Harris angular points of a reference frame, and marking the Harris angular points as reference points;
calculating a feature descriptor of the reference point;
acquiring a matching range of the matching frame, wherein the matching range can be a range of a circle obtained by taking a point corresponding to a reference point of the reference frame in the matching frame as a circle center and a matching distance as a radius, and optionally, the matching distance is 5-15 pixels;
calculating the feature descriptors of the points in the matching range, and selecting the point closest to the feature descriptors of the reference points in the reference frame as a matching point;
obtaining a homography matrix through projective transformation based on the reference points of the reference frame and the matching points of the matching frame; optionally, the homography matrix has 8 degrees of freedom, in which case it can be obtained from at least 4 pairs of reference points and matching points;
based on the homography matrix, obtaining the pixel corresponding relation between the reference frame and the matching frame through matrix transformation and pixel interpolation;
correspondingly subtracting pixels of the reference frame and the matched frame to obtain an absolute value of a pixel difference value;
comparing the absolute value of the pixel difference value with a pixel noise threshold value to obtain a pixel weight;
according to the pixel weight, carrying out weighted average on corresponding pixels in the reference frame and the matched frame to obtain a target frame;
through the first fusion, the multiple first preselected frames are not only fused into a single target frame; the fused target frame also has higher spatial resolution, more distinct information and lower noise;
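A compact Python/OpenCV sketch of the first fusion is given below for illustration. It follows the steps above but simplifies them: Harris-style corners come from cv2.goodFeaturesToTrack, ORB descriptors stand in for the unspecified feature descriptor, the radius-limited matching is replaced by a global descriptor match, and the noise threshold and corner counts are illustrative assumptions.

```python
import cv2
import numpy as np

def first_fusion(reference, matching, noise_thr=10.0):
    """Fuse one matching frame onto a reference frame (both BGR, same size)."""
    ref_gray = cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY)
    mat_gray = cv2.cvtColor(matching, cv2.COLOR_BGR2GRAY)

    # Harris-style corners of the reference frame serve as reference points.
    corners = cv2.goodFeaturesToTrack(ref_gray, maxCorners=500, qualityLevel=0.01,
                                      minDistance=7, useHarrisDetector=True)
    kp_ref = [cv2.KeyPoint(float(x), float(y), 7.0) for x, y in corners.reshape(-1, 2)]

    # ORB descriptors stand in for the unspecified feature descriptor.
    orb = cv2.ORB_create()
    kp_ref, des_ref = orb.compute(ref_gray, kp_ref)
    kp_mat, des_mat = orb.detectAndCompute(mat_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_ref, des_mat)

    pts_ref = np.float32([kp_ref[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    pts_mat = np.float32([kp_mat[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Homography (8 degrees of freedom, at least 4 point pairs), then warp to align pixels.
    H, _ = cv2.findHomography(pts_mat, pts_ref, cv2.RANSAC, 5.0)
    aligned = cv2.warpPerspective(matching, H, (reference.shape[1], reference.shape[0]))

    # Pixel weights from the absolute pixel difference against a noise threshold,
    # followed by a weighted average of the two aligned frames.
    ref_f, al_f = reference.astype(np.float32), aligned.astype(np.float32)
    diff = np.abs(ref_f - al_f)
    w = np.clip(1.0 - diff / noise_thr, 0.0, 1.0)   # small difference -> high weight
    fused = (ref_f + w * al_f) / (1.0 + w)
    return fused.astype(np.uint8)
```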
by the method in the embodiment, the target frame meeting the voice driving requirement can be obtained, and the effect of subsequent voice driving is improved.
The embodiment of the present application provides a method for acquiring a preselected frame meeting a frame selection condition from a frame sequence to be selected, as shown in fig. 3, the method includes:
s301: calculating a second dimension value of each frame in the frame sequence to be selected according to the second frame selection dimension;
in this embodiment, the second frame selection dimension is at least one of an image quality dimension and a five-sense organ dimension, the image quality dimension may include a blur degree, and the five-sense organ dimension may include at least one of an eye dimension and a mouth dimension;
accordingly, the second dimension value may include at least one of an ambiguity (blur degree) value and a five sense organs dimension value (an eye dimension value, a mouth dimension value);
in an exemplary embodiment, the method of calculating the ambiguity value may include:
performing Gaussian blur on each frame in the frame sequence to be selected to obtain a Gaussian blur image of each frame;
performing horizontal gradient calculation and vertical gradient calculation on each frame in the frame sequence to be selected and the Gaussian blurred image thereof to obtain a horizontal gradient value and a vertical gradient value of each frame;
calculating the horizontal gradient difference and the vertical gradient difference of each frame in the frame sequence to be selected and the Gaussian blur image thereof based on the horizontal gradient value and the vertical gradient value;
summing the horizontal gradient difference and the vertical gradient difference to obtain a ambiguity value;
with this calculation, the higher the ambiguity value, the more blurred the frame; the lower the ambiguity value, the clearer the frame;
the present application does not limit the method for calculating the ambiguity value; other methods may be selected, in which a lower ambiguity value may indicate a more blurred frame and a higher ambiguity value a clearer frame;
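The gradient-difference computation above can be sketched as follows (Python/NumPy with OpenCV for the Gaussian blur). The kernel size is an illustrative assumption, and how the raw sum maps to "more blurred" versus "clearer" depends on normalization that is not specified here; the code only reproduces the described steps.

```python
import cv2
import numpy as np

def ambiguity_value(gray):
    """Ambiguity (blur degree) of a grayscale frame from gradient differences
    between the frame and its Gaussian-blurred version (one possible reading)."""
    frame = gray.astype(np.float32)
    blurred = cv2.GaussianBlur(frame, (9, 9), 0)   # kernel size is an assumption

    def grads(img):
        # Horizontal and vertical absolute gradients.
        return np.abs(np.diff(img, axis=1)), np.abs(np.diff(img, axis=0))

    gx, gy = grads(frame)
    bx, by = grads(blurred)

    # Gradient differences, summed into a single ambiguity value.
    return float(np.sum(np.abs(gx - bx)) + np.sum(np.abs(gy - by)))
```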
in an exemplary embodiment, as shown in fig. 4a, the method for calculating the eye dimension value may be:
eye_val = 1 - len(pt_42 - pt_48) / len(pt_39 - pt_45)
where pt_42, pt_48, pt_39 and pt_45 are eye feature points obtained from the face feature points, len(pt_42 - pt_48) is the distance between pt_42 and pt_48, and len(pt_39 - pt_45) is the distance between pt_39 and pt_45;
with this calculation, the lower the eye dimension value eye_val, the higher the degree of eye opening;
the present application does not limit the method for calculating the eye dimension value; other methods may be selected, in which a lower eye dimension value may indicate a lower degree of eye opening and a higher value a higher degree of eye opening;
in an exemplary embodiment, as shown in fig. 4b, the method of calculating the mouth dimension value may include:
mouth_val = len(pt_89 - pt_93) / len(pt_87 - pt_91)
where pt_89, pt_93, pt_87 and pt_91 are mouth feature points obtained from the face feature points, len(pt_89 - pt_93) is the distance between pt_89 and pt_93, and len(pt_87 - pt_91) is the distance between pt_87 and pt_91;
with this calculation, the lower the mouth dimension value, the higher the degree of mouth closure;
the present application does not limit the method for calculating the mouth dimension value; other methods may be selected, in which a lower mouth dimension value may indicate a lower degree of mouth closure and a higher value a higher degree of mouth closure;
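Both facial-feature values reduce to ratios of landmark distances. A minimal sketch follows, assuming a landmark array indexed as in Figs. 4a and 4b; the exact landmark layout is an assumption of this sketch.

```python
import numpy as np

def eye_value(pts):
    """eye_val = 1 - len(pt42 - pt48) / len(pt39 - pt45); lower means more open."""
    return 1.0 - np.linalg.norm(pts[42] - pts[48]) / np.linalg.norm(pts[39] - pts[45])

def mouth_value(pts):
    """mouth_val = len(pt89 - pt93) / len(pt87 - pt91); lower means more closed."""
    return np.linalg.norm(pts[89] - pts[93]) / np.linalg.norm(pts[87] - pts[91])
```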
corresponding to the second frame selection dimension, the frame selection condition comprises at least one of clear image quality, eye opening and mouth closing;
when the first preselected frame is a multiframe, the method comprises:
s302: acquiring a second preselected frame with a second dimension value meeting a second frame selection condition from the frame sequence to be selected;
when the first preselected frame is a plurality of frames, the second frame selecting condition is that the second dimension value or the second dimension comprehensive value is lowest or highest;
when the second frame selection condition is that the second dimension value is lowest or highest;
specifically, the method comprises the following steps:
s3021: acquiring a frame with the lowest or the highest second dimension value in the frame sequence to be selected as a second preselected frame;
in an exemplary embodiment, when the second frame selection dimension comprises the ambiguity, acquiring a frame with the lowest ambiguity value in the frame sequence to be selected as a second preselected frame;
in an exemplary embodiment, when the second frame selection dimension comprises an eye dimension, acquiring a frame with the lowest eye dimension value in the frame sequence to be selected as a second preselected frame;
in an exemplary embodiment, when the second frame selection dimension includes a mouth dimension, acquiring a frame with the lowest mouth dimension value in the frame sequence to be selected as a second preselected frame;
in an exemplary embodiment, when the second frame selection dimension includes a ambiguity and an eye dimension, acquiring a frame with a lowest ambiguity value and a frame with a lowest eye dimension value in the frame sequence to be selected as the second preselected frame, where the frame with the lowest ambiguity value and the frame with the lowest eye dimension value may be the same frame or different frames;
in an exemplary embodiment, when the second frame selection dimension includes a ambiguity and a mouth dimension, acquiring a frame with a lowest ambiguity value and a frame with a lowest mouth dimension value in the frame sequence to be selected as the second preselected frame, where the frame with the lowest ambiguity value and the frame with the lowest mouth dimension value may be the same frame or different frames;
in an exemplary embodiment, when the second frame selection dimension includes an eye dimension and a mouth dimension, a frame with a lowest eye dimension value and a frame with a lowest mouth dimension value in the frame sequence to be selected are obtained as the second preselected frame, and the frame with the lowest eye dimension value and the frame with the lowest mouth dimension value may be the same frame or different frames;
in an exemplary embodiment, when the second selected frame dimension includes a ambiguity, an eye dimension and a mouth dimension, acquiring a frame with a lowest ambiguity value, a frame with a lowest eye dimension value and a frame with a lowest mouth dimension value in the frame sequence to be selected as the second preselected frame, where the frame with the lowest ambiguity value, the frame with the lowest eye dimension value and the frame with the lowest mouth dimension value may be the same frame or different frames;
in some embodiments, a higher ambiguity value may indicate a sharper frame, a higher eye dimension value a higher degree of eye opening, and a higher mouth dimension value a higher degree of mouth closure; in this case, at least one of the frame with the highest ambiguity value, the frame with the highest eye dimension value, and the frame with the highest mouth dimension value in the frame sequence to be selected is acquired as the second preselected frame;
since the second frame selection dimension covers multiple frame selection conditions, and different evaluation indexes often have different scales and units, which would affect the analysis result, a second dimension comprehensive value is introduced to eliminate the influence of scale differences among the indexes;
when the second frame selection condition is that the second dimension comprehensive value is lowest or highest;
specifically, the method comprises the following steps:
s3022: calculating a second dimension comprehensive value of each frame in the frame sequence to be selected;
the second dimension comprehensive value may be a weighted value of the second dimension values;
s3023: acquiring a frame with the lowest or highest second-dimension comprehensive value in a frame sequence to be selected as a second preselected frame;
in an exemplary embodiment, when the second frame selection dimension comprises the ambiguity, the eye dimension and the mouth dimension, calculating a weighted value of the second dimension of each frame in the frame sequence to be selected to obtain a second dimension comprehensive value, and acquiring a frame with the lowest or highest second dimension comprehensive value in the frame sequence to be selected as a second preselected frame;
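One way to realize such a comprehensive value is to normalize each second-dimension value over the candidate sequence and combine the results with weights; the normalization and the weights in the sketch below are assumptions for illustration only.

```python
import numpy as np

def comprehensive_values(blur_vals, eye_vals, mouth_vals, weights=(0.4, 0.3, 0.3)):
    """Weighted second dimension comprehensive value per frame (lower is better here)."""
    def normalize(v):
        v = np.asarray(v, dtype=np.float32)
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)

    scores = (weights[0] * normalize(blur_vals)
              + weights[1] * normalize(eye_vals)
              + weights[2] * normalize(mouth_vals))
    # e.g. frames[np.argmin(scores)] would be taken as the second preselected frame.
    return scores
```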
the second preselected frame may be one or more frames;
s303: judging whether the second preselected frame is a single frame;
s304: when the second preselected frame is a single frame, this second preselected frame is the target frame;
s305: when the second preselected frame comprises a plurality of frames, second fusion is performed on the multiple second preselected frames to obtain the target frame;
specifically, the second fusing includes:
acquiring a face deviation value of a second preselected frame of a plurality of frames based on the face characteristic points;
comparing the face deviation value with a fusion threshold value;
when the face deviation value is smaller than the fusion threshold value, acquiring an optimal fusion boundary based on the face characteristic points;
performing five-organ fusion on the multi-frame second preselected frame according to the optimal fusion boundary to obtain a target frame;
when the face deviation value is not less than the fusion threshold value, acquiring the corresponding relation of the five sense organs of a plurality of frames of second preselected frames through affine transformation;
performing five sense organs fusion on the multi-frame second preselected frame based on the corresponding relation of the five sense organs to obtain a target frame;
the fusion threshold value can be preset according to requirements, and can also be automatically generated by the system;
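The branch logic of the second fusion can be sketched at a high level as follows; optimal_fusion_boundary, blend_along_boundary and warp_and_blend_features are hypothetical helpers standing in for the boundary search, the boundary-guided blending and the affine-based fusion of the five sense organs described above.

```python
import numpy as np

def second_fusion(frames, landmarks, fusion_threshold):
    """Fuse multiple second preselected frames; landmarks[i] are the face feature
    points of frames[i]. Returns the fused target frame."""
    base, base_pts = frames[0], landmarks[0]
    for frame, pts in zip(frames[1:], landmarks[1:]):
        # Face deviation between the two frames, from their face feature points.
        deviation = np.mean(np.linalg.norm(pts - base_pts, axis=1))
        if deviation < fusion_threshold:
            # Small deviation: blend along an optimal fusion boundary.
            boundary = optimal_fusion_boundary(base_pts, pts)
            base = blend_along_boundary(base, frame, boundary)
        else:
            # Large deviation: map the facial features by affine transform, then fuse.
            base = warp_and_blend_features(base, frame, base_pts, pts)
    return base
```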
by the method in the embodiment, the target frame meeting the voice driving requirement can be obtained, and the subsequent voice driving effect is improved.
The embodiment of the present application provides a method for acquiring a preselected frame meeting a frame selection condition from a frame sequence to be selected according to a frame selection dimension, as shown in fig. 5, the method includes:
s401: calculating a second dimension value of each frame in the first preselected frame according to the second frame selection dimension;
the first preselected frame may be one or more frames;
s402: acquiring a second preselected frame with a second dimension value meeting a second frame selecting condition from the first preselected frame;
specifically, S402 includes:
s4021: judging whether the first preselected frame comprises a plurality of frames;
when the first preselected frame comprises a plurality of frames, the second frame selection condition is that the second dimension value or the second dimension comprehensive value is lowest or highest;
s4022: acquiring, from the multiple first preselected frames, the frame whose second dimension value or second dimension comprehensive value is lowest or highest as the second preselected frame;
when the first preselected frame is a single frame, the second frame selection condition is that the second dimension value is within the second frame selection range;
s4023: judging whether a second dimension value of the first preselected frame is in a second frame selection range or not;
the second frame selection range can comprise at least one of an ambiguity range and a five-sense organ range, and the five-sense organ range can comprise at least one of an eye range and a mouth range;
in an exemplary embodiment, the ambiguity range is: ambiguity value < ambiguity threshold;
in an exemplary embodiment, the ambiguity range is: ambiguity value > ambiguity threshold;
in an exemplary embodiment, the five sense organs range is: the dimension value of the five sense organs is less than the threshold value of the five sense organs;
in an exemplary embodiment, the five sense organs range is: a facial feature dimension value > facial feature threshold;
the facial feature threshold may include at least one of an eye threshold, a mouth threshold;
the ambiguity threshold and the facial feature threshold can be preset according to requirements, and can also be automatically generated by a system;
s4024: when the second dimension value of the first preselected frame is within the second frame selecting range, the first preselected frame is a second preselected frame;
in an exemplary embodiment, the frame selection range includes an ambiguity range and a five sense organ range, if the ambiguity value of the first preselected frame is within the ambiguity range and the five sense organ dimension value is within the five sense organ range, the first preselected frame satisfies the second frame selection condition, and the first preselected frame is the second preselected frame;
s4025: when the second dimension value of the first preselected frame is not in the second frame selection range, acquiring a third preselected frame from the frame sequence to be selected;
the third preselected frame is a frame with the lowest or the highest second dimension value or second dimension comprehensive value in the frame sequence to be selected;
in an exemplary embodiment, the frame selection range includes an ambiguity range and a five sense organ range, if the ambiguity value of the first preselected frame is not within the ambiguity range, but the five sense organ dimension value is within the five sense organ range, the second dimension value of the first preselected frame is not within the second frame selection range, and the frame with the lowest or highest ambiguity value in the frame sequence to be selected is obtained, and the frame is the third preselected frame;
in an exemplary embodiment, the frame selection range includes an ambiguity range and a five-sense organ range, if the ambiguity value of the first preselected frame is not in the ambiguity range and the five-sense organ dimension value is not in the five-sense organ range, the second dimension value of the first preselected frame is not in the second frame selection range, the frame with the lowest or highest ambiguity value and the frame with the lowest or highest five-sense organ dimension value in the frame sequence to be selected are obtained, or the frame with the lowest or highest comprehensive ambiguity value and the five-sense organ dimension value in the frame sequence to be selected is obtained, and the frame is the third preselected frame;
the third preselected frame may be one or more frames;
s4026: fusing the first preselected frame and the third preselected frame to obtain a preselected fused frame;
s4027: judging whether a second dimension value of the preselected fusion frame is in a second frame selection range or not;
if the second dimension value of the preselected fusion frame is within the second frame selection range, the preselected fusion frame is a second preselected frame;
if the second dimension value of the preselected fusion frame is not in the second frame selection range, prompting a user to shoot or upload an image according to the second frame selection condition, and fusing the image serving as a third preselected frame with the first preselected frame until the obtained second dimension value of the preselected fusion frame is in the second frame selection range to obtain a second preselected frame;
the fusion comprises at least one of the first fusion and the second fusion;
s403: judging whether the second preselected frame is a single frame;
s404: when the second preselected frame is a single frame, this second preselected frame is the target frame;
s405: when the second preselected frame comprises a plurality of frames, second fusion is performed on the multiple second preselected frames to obtain the target frame;
by the method in the embodiment, the target frame meeting the requirement of the voice driving model can be obtained, and the subsequent voice driving effect is improved.
The embodiment of the present application provides a method for performing voice driving on a target frame based on a current voice signal to obtain a target video, as shown in fig. 6, the method includes:
s501: training a voice driving model;
specifically, step S501 includes:
s5011: acquiring a training material;
the training material needs to include voice information and corresponding expression coefficient information;
the training material can be a video material which needs to contain voice information and image information, wherein the image information needs to include expression information of a human face;
the video material can be a video recorded in advance or a video crawled on the internet;
s5012: collecting a voice signal sample in a training material and an expression coefficient sample corresponding to the voice signal sample;
the speech signal samples are time-series signals, which may be speech signals, and may also be spectral features of speech signals, e.g., mel-frequency features;
when the training material is a video material, specifically, the step S5012 may include:
extracting a voice signal sample and corresponding expression information thereof from the training material according to the frame rate of the training material;
acquiring an expression coefficient corresponding to the voice signal sample based on the expression information;
carrying out filtering synthesis on the expression coefficients to obtain expression coefficient samples;
s5013: training a voice driving model based on the voice signal sample and the expression coefficient sample;
specifically, 1D convolutional network training may be performed on the speech signal samples and the expression coefficient samples; alternatively, the speech signal samples may be converted into 2D images and 2D convolutional network training performed on them and the expression coefficient samples; an LSTM (Long Short-Term Memory) network may also be used for auxiliary training, or a Transformer network may be used for training;
the loss function can be calculated directly from the expression coefficient samples, or the expression coefficient samples can be restored to a mesh for loss training;
s502: generating a corresponding driving expression coefficient through the trained voice driving model according to the current voice signal;
s503: matching the target frame with the continuous driving expression coefficient to generate a key frame;
specifically, step S503 may include:
s5031: preprocessing a target frame;
the preprocessing comprises: foreground character segmentation, character depth estimation and 3D face reconstruction, where the foreground character segmentation is used to obtain a foreground mask image, the character depth estimation is used to obtain a character depth map, and the 3D face reconstruction is used to obtain a 3D face model;
s5032: obtaining a face driving model according to the driving expression coefficient;
s5033: obtaining a key frame based on the target frame and the face driving model;
specifically, step S5033 includes:
extracting the outline of a foreground region in the target frame according to the foreground mask image;
sampling the depth corresponding to the character in the target frame according to the character depth map;
performing Delaunay triangulation on the foreground region of the target frame, with the contour of the foreground region as the boundary, to obtain a character 3D mesh B_s in projection space;
removing the face region on the character 3D mesh B_s to obtain a mesh B'_s;
based on the 3D face reconstruction, obtaining a projection matrix P and transforming the face deformation source mesh into projection space to obtain a 3D face model F_s;
merging the 3D face model F_s with the mesh B'_s, and triangulating the junction of the two boundaries to obtain a deformation source mesh M_s;
transforming the face driving model into projection space through the projection matrix P to obtain the face driving model F_t in projection space;
applying all vertex positions of the face driving model F_t to the 3D face model F_s in the deformation source mesh M_s to obtain a face mesh M_t;
Let the non-face region in the face mesh M_t be U_t = M_t \ F_t, with the corresponding region on the deformation source mesh M_s being U_s = M_s \ F_s;
Take the boundaries of F_s and F_t respectively; U_s and U_t then each have an inner boundary and an outer boundary (the exact boundary expressions are given as equation images in the original publication), the inner boundaries being shared with F_s and F_t;
The vertex positions in U_t are adjusted based on an optimized mesh-weighted Laplacian energy, so that F_t transitions smoothly and continuously in the face region; boundary vertices whose corresponding positions are the same are taken as fixed anchor points, and boundary vertices whose corresponding positions differ are taken as moving anchor points;
The geodesic distance d from U_s to the anchor points is computed, the point weights are estimated with 1/d^2 as the coefficient, and a smooth non-face region mesh U'_t is obtained through iterative optimization, giving a smooth deformation target mesh M'_t = U'_t ∪ F_t;
M'_t is rendered to target pixels in image space; during rasterization, each pixel obtains the corresponding coordinates on the mesh M'_t, and applying these coordinates to M_s yields a point p'_s on the surface of M_s;
p'_s is projected onto the preprocessed target frame to obtain the corresponding source pixel;
A coordinate key frame is obtained by inverse interpolation, in image space, of the offsets between the target pixel coordinates and the source pixel coordinates;
Based on the coordinate key frame, the key frame is obtained through a least-squares image warping algorithm;
s504: performing expression matching on the key frame based on the frame sequence to be selected and the target frame to obtain a driving frame;
when the expression match includes a mouth match, step S504 includes:
s5041: acquiring a facial expression coefficient of each frame in a frame sequence to be selected;
s5042: obtaining a face model corresponding to each frame in the frame sequence to be selected based on the face expression coefficient;
s5043: calculating mouth deviation of a face model and a face driving model corresponding to each frame in the frame sequence to be selected;
s5044: acquiring a frame corresponding to the face model with the minimum mouth deviation as a rendering frame;
s5045: rendering the key frame by using the rendering frame to obtain a driving frame;
in an exemplary embodiment, rendering the key frame using the rendering frame includes: extracting structural information z_geo and style information z_style of the mouth in the key frame, and simultaneously extracting the real style information of the mouth in the rendering frame; a driving frame with realistic mouth texture and tooth structure is then obtained from the real style information and the structural information z_geo;
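Steps S5041 to S5044 amount to a nearest-neighbour search over the face models of the candidate frames. A schematic Python sketch follows, where expression_to_face_model and mouth_vertices are hypothetical helpers for the expression-coefficient-to-mesh mapping and mouth-region vertex selection; they are not defined by this application.

```python
import numpy as np

def select_rendering_frame(candidate_frames, candidate_expr_coeffs, driving_face_model):
    """Pick the candidate frame whose face model has the smallest mouth deviation
    from the face driving model (steps S5041-S5044)."""
    best_idx, best_dev = None, np.inf
    drive_mouth = mouth_vertices(driving_face_model)
    for i, coeffs in enumerate(candidate_expr_coeffs):
        face_model = expression_to_face_model(coeffs)                  # S5042
        mouth_dev = np.mean(np.linalg.norm(mouth_vertices(face_model) - drive_mouth, axis=1))
        if mouth_dev < best_dev:                                       # S5043
            best_idx, best_dev = i, mouth_dev
    return candidate_frames[best_idx]                                  # S5044: the rendering frame
```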
when the expression match includes an eye match, step S504 includes:
s5046: obtaining the opening amplitude of the eyes based on the driving expression coefficient;
s5047: inputting the eye opening amplitude and the target frame into a cGAN network, and outputting an eye image corresponding to the eye opening amplitude;
s5048: matching the eye image with the key frame to obtain a driving frame;
s505: the continuous driving frames form a target video;
by the method in the embodiment, the problem that details of the mouth (such as the inside of a cavity and teeth) are lost due to the change of the expression coefficients is solved, and the generated video is more vivid and natural.
An embodiment of the present application provides a video generation method in a video call, as shown in fig. 7, the method includes:
s601: monitoring the real-time network bandwidth of the video call;
s602: judging whether the real-time network bandwidth is smaller than a network threshold value;
the network threshold value can be preset according to the requirement, and can also be automatically generated by the system;
when the real-time network bandwidth is smaller than the network threshold, the video call stutters, and the video generation method comprises the following steps:
s603: acquiring a frame sequence to be selected;
the candidate frame sequence may include: at least one of a video cache frame before the pause and a pre-stored frame pre-shot by a user, wherein the frame sequence to be selected comprises at least two frames to be selected;
s604: determining a target frame from a frame sequence to be selected according to the frame selection dimension;
s605: performing voice driving on a target frame based on a current voice signal to obtain a target video;
the current voice signal is the voice signal of the user after the video call starts to stutter;
s606: switching the picture of the video call to a target video;
s607: when the real-time network bandwidth is not less than the network threshold value, switching back to the video call;
by the method in the embodiment, when the network bandwidth is insufficient, the picture of the video call is still natural and smooth.
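A schematic control loop for this call-fallback behaviour is sketched below; measure_bandwidth, call_is_active, select_target_frame, voice_drive, current_voice, show_frame and next_live_frame are hypothetical placeholders for the monitoring, frame selection, voice driving and display components described above.

```python
def call_loop(network_threshold, candidate_frames):
    """Switch between the live video call and a voice-driven target video
    depending on the measured bandwidth (steps S601-S607)."""
    while call_is_active():
        bandwidth = measure_bandwidth()                                # S601
        if bandwidth < network_threshold:                              # S602: the call stutters
            target_frame = select_target_frame(candidate_frames)      # S603-S604
            for frame in voice_drive(target_frame, current_voice()):  # S605
                show_frame(frame)                                      # S606: show the generated video
                if measure_bandwidth() >= network_threshold:
                    break                                              # S607: switch back to the live call
        else:
            show_frame(next_live_frame())
```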
An embodiment of the present application provides a video generating apparatus 10, as shown in fig. 8, the apparatus including:
the acquisition unit 100 is configured to acquire a frame sequence to be selected;
a frame selection unit 200 configured to determine a target frame from a frame sequence to be selected according to a frame selection dimension;
and the driving unit 300 is configured to perform voice driving on the target frame based on the current voice signal to acquire the target video.
The present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the video generation method according to any of the foregoing embodiments.
The embodiment of the application also provides an electronic device, which comprises a processor and a memory, wherein the memory is used for storing the executable instructions of the processor; wherein the processor is configured to perform the video generation method as described in any of the previous embodiments via execution of the executable instructions.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (17)

1. A method of video generation, comprising:
acquiring a frame sequence to be selected;
determining a target frame from the frame sequence to be selected according to the frame selection dimension;
performing voice driving on the target frame based on the current voice signal to acquire a target video;
wherein the frame selection dimension comprises at least one of a first frame selection dimension and a second frame selection dimension.
2. The video generation method of claim 1, wherein the determining a target frame from the sequence of candidate frames according to a frame selection dimension comprises:
according to the frame selection dimension, obtaining a preselected frame meeting a frame selection condition from the frame sequence to be selected, wherein the preselected frame is one or more frames;
when the preselected frame is a single frame, the preselected frame is the target frame;
when the preselected frame is a plurality of frames, fusing the plurality of preselected frames to obtain the target frame;
the frame selection condition comprises at least one of a first frame selection condition and a second frame selection condition.
3. The video generation method of claim 2, wherein the fusing comprises at least one of a first fusion or a second fusion.
4. The video generation method according to claim 3, wherein the obtaining, according to the frame selection dimension, a preselected frame satisfying a frame selection condition from the frame sequence to be selected comprises:
calculating a first dimension value of each frame in the frame sequence to be selected according to the first frame selection dimension;
acquiring a first preselected frame of which the first dimension value meets the first frame selection condition from the frame sequence to be selected;
wherein the first preselected frame is one or more frames.
5. The video generation method of claim 2,
the first frame selection condition is that the first dimension value is within a first frame selection range.
6. The video generation method of claim 4,
when the first preselected frame is a single frame, the first preselected frame is the target frame;
and when the first preselected frame is a plurality of frames, carrying out first fusion on the plurality of frames of the first preselected frame to obtain the target frame.
7. The video generation method according to claim 3, wherein the obtaining, according to the frame selection dimension, a preselected frame satisfying a frame selection condition from the frame sequence to be selected comprises:
calculating a second dimension value of each frame in the frame sequence to be selected according to the second frame selection dimension;
acquiring a second preselected frame of which the second dimension value meets the second frame selection condition from the frame sequence to be selected;
wherein the second preselected frame is one or more frames.
8. The video generation method according to claim 7, characterized in that:
when the second preselected frame is a single frame, the second preselected frame is the target frame;
and when the second preselected frame is a plurality of frames, carrying out second fusion on the plurality of frames of the second preselected frame to obtain the target frame.
9. The video generation method of claim 2,
and the second frame selection condition is that the second dimension value or the second dimension comprehensive value is the lowest or the highest.
10. The method according to claim 4, wherein the obtaining, according to the frame selection dimension, a preselected frame that satisfies a frame selection condition from the frame sequence to be selected comprises:
calculating a second dimension value of each frame in the first preselected frame according to the second frame selection dimension;
acquiring a second preselected frame of which the second dimension value meets the second frame selection condition from the first preselected frame;
wherein the second preselected frame is one or more frames.
11. The video generation method of claim 10,
when the second preselected frame is a single frame, the second preselected frame is the target frame;
and when the second preselected frame is a plurality of frames, carrying out second fusion on the plurality of frames of the second preselected frame to obtain the target frame.
12. The video generation method of claim 10,
when the first preselected frame is a plurality of frames, the second frame selection condition is that the second dimension value or the second dimension comprehensive value is the lowest or the highest.
13. The video generation method of claim 10,
when the first preselected frame is a single frame, the second frame selection condition is that the second dimension value is within a second frame selection range.
14. The video generation method according to claim 1, wherein the performing voice driving on the target frame based on the current voice signal to obtain the target video comprises:
generating a corresponding driving expression coefficient through the trained voice driving model according to the current voice signal;
matching the target frame with the driving expression coefficient to generate a key frame;
performing expression matching on the key frame based on the frame sequence to be selected and the target frame to obtain a driving frame;
successive ones of the driving frames constitute the target video.
15. A video generation apparatus, comprising:
the acquisition unit is configured to acquire a frame sequence to be selected;
the frame selection unit is configured to determine a target frame from the frame sequence to be selected according to a frame selection dimension;
and the driving unit is configured to perform voice driving on the target frame based on the current voice signal to acquire a target video.
16. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 14.
17. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the video generation method of any of claims 1 to 14 via execution of the executable instructions.
CN202210688868.XA 2022-06-16 2022-06-16 Video generation method and device, storage medium and electronic equipment Pending CN115116468A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210688868.XA CN115116468A (en) 2022-06-16 2022-06-16 Video generation method and device, storage medium and electronic equipment
PCT/CN2023/094868 WO2023241298A1 (en) 2022-06-16 2023-05-17 Video generation method and apparatus, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210688868.XA CN115116468A (en) 2022-06-16 2022-06-16 Video generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115116468A true CN115116468A (en) 2022-09-27

Family

ID=83328086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210688868.XA Pending CN115116468A (en) 2022-06-16 2022-06-16 Video generation method and device, storage medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN115116468A (en)
WO (1) WO2023241298A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023241298A1 (en) * 2022-06-16 2023-12-21 虹软科技股份有限公司 Video generation method and apparatus, storage medium and electronic device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993025B (en) * 2017-12-29 2021-07-06 中移(杭州)信息技术有限公司 Key frame extraction method and device
CN110390263A (en) * 2019-06-17 2019-10-29 宁波江丰智能科技有限公司 A kind of method of video image processing and system
CN113689538B (en) * 2020-05-18 2024-05-21 北京达佳互联信息技术有限公司 Video generation method and device, electronic equipment and storage medium
CN111833418B (en) * 2020-07-14 2024-03-29 北京百度网讯科技有限公司 Animation interaction method, device, equipment and storage medium
CN112215927B (en) * 2020-09-18 2023-06-23 腾讯科技(深圳)有限公司 Face video synthesis method, device, equipment and medium
CN113507627B (en) * 2021-07-08 2022-03-25 北京的卢深视科技有限公司 Video generation method and device, electronic equipment and storage medium
CN114202604A (en) * 2021-11-30 2022-03-18 长城信息股份有限公司 Voice-driven target person video generation method and device and storage medium
CN115116468A (en) * 2022-06-16 2022-09-27 虹软科技股份有限公司 Video generation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2023241298A1 (en) 2023-12-21

Similar Documents

Publication Publication Date Title
US11276231B2 (en) Semantic deep face models
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
CN109285215B (en) Human body three-dimensional model reconstruction method and device and storage medium
CN108921782B (en) Image processing method, device and storage medium
CN106910247B (en) Method and apparatus for generating three-dimensional avatar model
CN113793408B (en) Real-time audio driving face generation method, device and server
CN110969124B (en) Two-dimensional human body posture estimation method and system based on lightweight multi-branch network
CN108932536B (en) Face posture reconstruction method based on deep neural network
US10846836B2 (en) View synthesis using deep convolutional neural networks
Xie et al. Joint super resolution and denoising from a single depth image
CN110363116B (en) Irregular human face correction method, system and medium based on GLD-GAN
CN109462747B (en) DIBR system cavity filling method based on generation countermeasure network
CN112330574A (en) Portrait restoration method and device, electronic equipment and computer storage medium
US20230051960A1 (en) Coding scheme for video data using down-sampling/up-sampling and non-linear filter for depth map
CN112733797A (en) Method, device and equipment for correcting sight of face image and storage medium
CN113033442B (en) StyleGAN-based high-freedom face driving method and device
WO2023066173A1 (en) Image processing method and apparatus, and storage medium and electronic device
CN112184549B (en) Super-resolution image reconstruction method based on space-time transformation technology
CN114049434A (en) 3D modeling method and system based on full convolution neural network
WO2023241298A1 (en) Video generation method and apparatus, storage medium and electronic device
CN114782864A (en) Information processing method and device, computer equipment and storage medium
CN112509144A (en) Face image processing method and device, electronic equipment and storage medium
CN114049464A (en) Reconstruction method and device of three-dimensional model
GB2612881A (en) Techniques for re-aging faces in images and video frames
CN113516755B (en) Image processing method, image processing apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination