WO2024022065A1 - Virtual expression generation method and apparatus, electronic device, and storage medium - Google Patents
Virtual expression generation method and apparatus, electronic device, and storage medium
- Publication number
- WO2024022065A1 (PCT application PCT/CN2023/105870, CN2023105870W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- coefficient
- image
- face
- expression
- target
- Prior art date
Links
- 230000014509 gene expression Effects 0.000 title claims abstract description 210
- 238000000034 method Methods 0.000 title claims abstract description 67
- 238000012545 processing Methods 0.000 claims abstract description 48
- 238000012937 correction Methods 0.000 claims abstract description 32
- 230000009466 transformation Effects 0.000 claims description 29
- 230000006978 adaptation Effects 0.000 claims description 28
- 239000011159 matrix material Substances 0.000 claims description 28
- 238000001514 detection method Methods 0.000 claims description 13
- 230000008921 facial expression Effects 0.000 claims description 12
- 230000007246 mechanism Effects 0.000 claims description 10
- 239000000284 extract Substances 0.000 claims description 8
- 238000000605 extraction Methods 0.000 claims description 8
- 230000002123 temporal effect Effects 0.000 claims description 6
- 238000006243 chemical reaction Methods 0.000 claims description 5
- 230000000694 effects Effects 0.000 abstract description 6
- 238000009877 rendering Methods 0.000 abstract description 5
- 230000005540 biological transmission Effects 0.000 abstract description 3
- 230000001815 facial effect Effects 0.000 abstract 9
- 230000008569 process Effects 0.000 description 14
- 230000033001 locomotion Effects 0.000 description 6
- 238000004422 calculation algorithm Methods 0.000 description 5
- 230000008859 change Effects 0.000 description 5
- 239000013598 vector Substances 0.000 description 5
- 238000004590 computer program Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000000513 principal component analysis Methods 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 210000001097 facial muscle Anatomy 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000001131 transforming effect Effects 0.000 description 2
- 230000003321 amplification Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 230000005012 migration Effects 0.000 description 1
- 238000013508 migration Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/04—Texture mapping
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
Definitions
- the present disclosure relates to the field of data processing technology, and in particular, to a virtual expression generation method, device, electronic device and storage medium.
- Three-dimensional (3D) modeling is a key issue in the field of machine vision.
- 3D expression modeling is widely used in entertainment fields such as games, film and television special effects, and VR.
- the existing mainstream methods of 3D virtual expression modeling are based on pictures to generate 3D virtual expressions.
- the expression change process is a complex non-rigid body movement.
- there are high requirements on the collection equipment, collection environment, modeling equipment and modeling process, making it difficult to meet real-time requirements; moreover, when processing each frame of the video, the correlation and continuity of expressions are ignored.
- the present disclosure provides a virtual expression generation method, device, electronic device and storage medium to solve the deficiencies of related technologies.
- a virtual expression generation method including: obtaining a face area in an original image to obtain a target face image; obtaining a first face coefficient corresponding to the target face image;
- the first face coefficient includes a template expression coefficient and a pose coefficient.
- the template expression coefficient is used to represent the matching degree of the facial expression with each template.
- the pose coefficient represents the rotation angle of the virtual image in three dimensions; performing time domain correction processing on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient; and rendering the expression of the virtual image according to the target face coefficient to obtain a virtual expression.
- obtaining the face area in the original image to obtain the target face image includes: performing face detection on the original image to obtain at least one face area contained in the original image; selecting a target face area from the at least one face area; and correcting the target face area to obtain the target face image.
- selecting a target face area from the at least one face area includes: when the number of face areas is one, determining that the face area is the target face area; when the number of face areas is multiple, calculating a score value of each face area based on the regional parameter data of each face area, where the score value is used to represent the distance of each face area from the central axis of the original image; and determining the face area corresponding to the maximum score value as the target face area.
- the regional parameter data includes length, width, face area and position data.
- Calculating the score value of each face region based on the regional parameter data of each face region includes: obtaining the difference between the abscissa of the middle position of each face region and half of the width, and the absolute value of the difference; obtaining the ratio of the absolute value of the difference to the width, and the product of the ratio and the constant 2; obtaining the difference between the constant 1 and the product, and obtaining the product of this difference and the preset distance weight; obtaining the ratio of the face area in each face region to the product of the length and the width, and the square root of this ratio; obtaining the product of the square root and the preset area weight, where the sum of the area weight and the distance weight is 1; and calculating the sum of the product corresponding to the preset area weight and the product corresponding to the preset distance weight to obtain the score value of each face region.
- performing correction processing on the target face area to obtain a target face image includes: determining a candidate square area corresponding to the target face area, and obtaining vertex coordinate data of the candidate square area;
- the vertex coordinate data of the square area and the vertex coordinate data of the preset square are subjected to affine transformation to obtain an affine transformation coefficient;
- the vertex coordinate data of the preset square includes a designated origin;
- the affine transformation coefficient is used to perform affine transformation on the original image to obtain an affine transformed image; a square area with a preset side length is intercepted from the affine transformed image based on the specified origin, and the image within the intercepted square area is used as the target face image.
- obtaining the first face coefficient corresponding to the target face image includes: performing blur processing and sharpening processing on the target face image respectively to obtain at least one blurred image and at least one sharpened image; respectively extracting feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image; splicing the original feature image, the blurred feature image and the sharpened feature image to obtain an initial feature image; obtaining the importance coefficient of each feature image in the initial feature image to the expression of the virtual image, and adjusting the initial feature image according to the importance coefficient to obtain a target feature image; and determining the template expression coefficient and pose coefficient according to the target feature image to obtain the first face coefficient.
- obtaining the first face coefficient corresponding to the target face image includes: inputting the target face image into a preset face coefficient recognition network, and obtaining the first face coefficient corresponding to the target face image output by the preset face coefficient recognition network.
- the preset face coefficient recognition network includes: a blur and sharpening module, a feature extraction module, an attention module and a coefficient learning module. The blur and sharpening module respectively performs blur processing and sharpening processing on the target face image to obtain at least one blurred image and at least one sharpened image; the feature extraction module respectively extracts feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image, and splices the original feature image, the blurred feature image and the sharpened feature image to obtain an initial feature image; the attention module obtains the importance coefficient of each feature image in the initial feature image to the expression of the virtual image, and adjusts the initial feature image according to the importance coefficient to obtain a target feature image; the coefficient learning module determines the template expression coefficient and pose coefficient according to the target feature image, and obtains the first face coefficient.
- the attention module is implemented using a network model of a temporal attention mechanism or a spatial attention mechanism.
- the coefficient learning module is implemented using at least one network model among Resnet50, Resnet18, Resnet100, DenseNet and YoloV5.
- performing time domain correction processing on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain the target face coefficient includes: obtaining the first face coefficient and the preset weight coefficient of the previous frame image before the original image, where the sum of the weight coefficient of the previous frame image and the weight coefficient of the original image is 1; and performing a weighted summation of the first face coefficient of the original image and the first face coefficient of the previous frame image to obtain the target face coefficient corresponding to the original image.
- the method further includes: obtaining a preset expression adaptation matrix, where the expression adaptation matrix refers to the conversion relationship between two face coefficients containing different numbers of templates; the target face coefficient is obtained by calculating the product of the face coefficient after time domain correction processing and the expression adaptation matrix.
- the preset expression adaptation matrix is obtained through the following steps: obtaining the first preset coefficient corresponding to the sample image, where the first preset coefficient includes coefficients of a first number of templates; obtaining the second preset coefficient corresponding to the sample image, where the second preset coefficient includes coefficients of a second number of templates; and obtaining the preset expression adaptation matrix according to the first preset coefficient, the second preset coefficient and the least squares method.
- the method further includes: when no face area is detected in the original image, continuing to detect the next frame of original image, and obtaining the virtual expression according to the target face coefficient of the previous frame of original image; or, when no face area is detected in the original image and the duration exceeds a set duration threshold, obtaining the virtual expression according to a preset expression coefficient.
- a virtual expression generation device including: a target image acquisition module, used to acquire the face area in the original image to obtain the target face image; a first coefficient acquisition module, used to obtain the first face coefficient corresponding to the target face image, where the first face coefficient includes a template expression coefficient and a pose coefficient, the template expression coefficient is used to represent the matching degree of the facial expression with each template, and the pose coefficient represents the rotation angle of the virtual image in three dimensions; a target coefficient acquisition module, used to perform time domain correction processing on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient; and an expression animation acquisition module, used to render the expression of the virtual image according to the target face coefficient to obtain the virtual expression.
- an electronic device including: a processor and a memory for storing executable instructions; the processor reads the executable instructions from the memory to implement the steps of any one of the methods described in the first aspect.
- a chip including: a processor and a memory for storing an executable program; the processor reads the executable program from the memory to implement the steps of any one of the methods described in the first aspect.
- a non-transitory computer-readable storage medium on which a computer executable program is stored; when the executable program is executed, the steps of any one of the methods described in the first aspect are implemented.
- the face area in the original image can be obtained to obtain the target face image; then, the first face coefficient corresponding to the target face image is obtained; then, the template expression coefficient and/or the pose coefficient in the first face coefficient are subjected to time domain correction processing to obtain the target face coefficient; finally, the expression of the virtual image is rendered according to the target face coefficient to obtain the virtual expression.
- the expressions of adjacent original images in the video can be made relevant and continuous, making the reconstructed expressions more natural and improving the viewing experience; and,
- the virtual expression is obtained by transmitting the target face coefficients and rendering the expression of the virtual image based on them; compared with transmitting image data, this reduces the amount of data transmission and achieves the effect of reconstructing the virtual expression in real time.
- Figure 1 is a flow chart of a virtual expression generation method according to an exemplary embodiment.
- Figure 2 is a flowchart of obtaining a target face image according to an exemplary embodiment.
- Figure 3 is a flowchart of obtaining a target face area according to an exemplary embodiment.
- Figure 4 is a flowchart illustrating a method of obtaining a score value of a human face area according to an exemplary embodiment.
- Figure 5 is a flowchart illustrating a method of obtaining a target face image according to an exemplary embodiment.
- Figure 6 is a flowchart of obtaining the first face coefficient according to an exemplary embodiment.
- Figure 7 is a block diagram of a face coefficient recognition network according to an exemplary embodiment.
- Figure 8 is a flowchart illustrating a method of obtaining target face coefficients according to an exemplary embodiment.
- Figure 9 is another flowchart of obtaining target face coefficients according to an exemplary embodiment.
- Figure 10 is a flowchart of obtaining an expression adaptation matrix according to an exemplary embodiment.
- Figure 11 is another flowchart of obtaining an expression adaptation matrix according to an exemplary embodiment.
- Figure 12 is a flow chart of another virtual expression generation method according to an exemplary embodiment.
- Figure 13 is a block diagram of a virtual expression generating device according to an exemplary embodiment.
- Figure 14 is a block diagram of a server according to an exemplary embodiment.
- 3D modeling is a key issue in the field of machine vision.
- 3D expression modeling is widely used in entertainment fields such as games, film and television special effects, and VR.
- the existing mainstream methods of 3D virtual expression modeling are based on pictures to generate 3D virtual expressions.
- the expression change process is a complex non-rigid body movement.
- there are high requirements on the collection equipment, collection environment, modeling equipment and modeling process, making it difficult to meet real-time requirements; moreover, when processing each frame of the video, the correlation and continuity of expressions are ignored.
- Figure 1 is a flow chart of a virtual expression generation method according to an exemplary embodiment.
- a virtual expression generation method includes steps 11 to 14.
- In step 11, the face area in the original image is obtained to obtain the target face image.
- the electronic device can communicate with the camera to obtain images and/or videos collected by the camera, and the camera's collection frame rate does not exceed 60fps; it can also read images and/or videos from a designated location.
- Since the electronic device processes one image or one video frame at a time, the subsequent processing of a single image is taken as an example to describe the solutions of each embodiment, and the image being processed is called the original image to distinguish it from other images.
- the electronic device can obtain the face area in the original image; referring to Figure 2, this includes steps 21 to 23.
- In step 21, the electronic device can perform face detection on the original image to obtain at least one face area contained in the original image.
- the electronic device can use a preset face detection model to perform face detection on the original image.
- the above preset face detection models can include but are not limited to yolov5 model, resnet18 model, R-CNN model, mobilenet model, etc.
- Any model that can realize the target detection function may be used; those skilled in the art can select an appropriate model according to specific scenarios, and the corresponding solutions fall within the protection scope of the present disclosure.
- the above-mentioned preset face detection model can output at least one face area contained in the above-mentioned original image.
- The electronic device can use a flag to record whether a face area is detected. When no face area is detected, the flag can be set to -1; when one or more face areas are detected, the flag can be set to the number of face areas, and the regional parameter data of each face area is recorded at the same time.
- the above-mentioned regional parameter data includes length, width, face area and position data.
- the area parameter data of a face area is [x, y, w, h, s], where x and y respectively represent the horizontal and vertical coordinates of a designated point of the face area (such as the center point, upper left vertex, lower left vertex, upper right vertex or lower right vertex), w and h represent the width and height of the face area respectively, and s represents the area of the face in the face area.
- the area parameter data of n face areas is represented by a list, that is, [[x1, y1, w1, h1, s1], [x2, y2, w2, h2, s2], ..., [xn, yn, wn, hn, sn]].
- In step 22, the electronic device may select a target face area from the at least one face area. When the number of face areas is one, the electronic device can determine that the face area is the target face area; when the number of face areas is multiple, the electronic device can select one of the multiple face areas as the target face area.
- the electronic device can calculate the score value of each face area according to the regional parameter data of each face area.
- the above score value is used to represent the distance of each face area from the central axis of the original image.
- the electronic device obtains the score value of each face area, including steps 41 to 46.
- the electronic device may obtain the difference between the abscissa of the middle position of each face area and half of the width, as well as the absolute value of the difference. The absolute value of the difference is |x_n1 - w/2|, where x_n1 represents the abscissa of the middle position of the n1-th face area, w represents the width, and |·| denotes the absolute value.
- the electronic device may obtain the ratio of the absolute value of the difference to the width, and the product of the ratio and the constant 2. The product of the ratio and the constant 2 is 2·|x_n1 - w/2|/w.
- the electronic device may obtain the difference between the constant 1 and the product, and obtain the product of this difference and the preset distance weight. The product of the difference and the preset distance weight is α·(1 - 2·|x_n1 - w/2|/w), where α represents the preset distance weight, which weights the normalized distance between the center of the face area and the central axis; this α is affected by the camera acquisition distance. In one example, the value of α is 0.2.
- the electronic device may obtain the ratio of the face area to the product of the length and the width in each face area, and the square root of this ratio. For example, the square root is √(s_n1/(w·h)), where s_n1 represents the area of the face in the n1-th face area, h represents the height of the n1-th face area, and w represents the width of the n1-th face area.
- the electronic device may obtain the product of the square root and the preset area weight, where the sum of the area weight and the distance weight is 1. For example, the product of the square root and the preset area weight is (1 - α)·√(s_n1/(w·h)), where 1 - α represents the preset area weight, which weights the normalized proportion of the face area within the original image area.
- the electronic device may calculate the sum of the product corresponding to the preset area weight and the product corresponding to the preset distance weight to obtain the score value of each face area. The score value of each face area is shown in the following formula (1):
- score_n1 = α·(1 - 2·|x_n1 - w/2|/w) + (1 - α)·√(s_n1/(w·h))    (1)
- the electronic device may determine that the face area corresponding to the maximum value of the score value is the target face area.
- In this way, the face area that is closest to the central axis of the original image and has a larger face area can be determined. Such a face area is more likely to correspond to the object of interest in the actual image collection scene, which helps to improve the accuracy of obtaining the target face area.
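- For illustration only, the following sketch shows how the face-area scoring and selection described above could be implemented; the variable names, the use of the original-image width for the distance term, and the helper functions are assumptions rather than the claimed formula.

```python
# Illustrative sketch: score detected face regions and pick the best one.
from typing import List, Tuple
import math

def face_score(x: float, w: float, h: float, s: float,
               image_width: float, alpha: float = 0.2) -> float:
    """x: abscissa of the region's middle position; w, h: region width/height;
    s: face area inside the region; alpha: preset distance weight."""
    # Distance term: 1 when the face sits on the image's central axis, 0 at the border.
    distance_term = 1.0 - 2.0 * abs(x - image_width / 2.0) / image_width
    # Area term: square root of the face area normalized by the bounding box area.
    area_term = math.sqrt(s / (w * h))
    return alpha * distance_term + (1.0 - alpha) * area_term

def select_target_face(regions: List[Tuple[float, float, float, float, float]],
                       image_width: float) -> Tuple[float, float, float, float, float]:
    """regions: list of (x, y, w, h, s) tuples; returns the highest-scoring region."""
    if len(regions) == 1:
        return regions[0]
    return max(regions, key=lambda r: face_score(r[0], r[2], r[3], r[4], image_width))
```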
- In step 23, the electronic device may perform correction processing on the target face area to obtain a target face image.
- the electronic device corrects the target face area including steps 51 to 54.
- the electronic device may determine a candidate square area corresponding to the target face area and obtain vertex coordinate data of the candidate square area.
- For example, the electronic device can obtain the center point (x_n1, y_n1) of the target face area and determine a square area centered on (x_n1, y_n1). The side length of the square area is determined from the width w_n1 and the height h_n1 of the target face area together with an amplification coefficient scale, where scale is greater than 1; in one example, the scale value is 1.25.
- the electronic device can obtain the vertex coordinate data of each vertex of the square area.
- the above square area will be called the candidate square area in the following.
- the electronic device can perform affine transformation on the vertex coordinate data of the candidate square area and the vertex coordinate data of the preset square to obtain an affine transformation coefficient; the vertex coordinate data of the preset square includes a specified origin.
- a preset square can be stored in the electronic device.
- the vertex coordinate data of the preset square includes a specified origin (0, 0) and the side length is the preset side length (such as 224 pixels).
- the vertex coordinate data of the four vertices of the preset square are the upper left corner (0, 0), the lower left corner (0, 224), the upper right corner (224, 0) and the lower right corner ( 224, 224).
- the electronic device can perform affine transformation between the candidate square area and the preset square, that is, establish an affine transformation relationship between the vertices of the candidate square area and the corresponding vertices of the preset square, and obtain the affine transformation coefficient. For example, the electronic device can scale, translate and rotate the candidate square area to obtain the preset square. It is understandable that, to obtain the affine transformation relationship between two squares, reference can be made to solutions in related technologies, which will not be described again here.
- the electronic device may use the affine transformation coefficient to perform affine transformation on the original image to obtain an affine transformed image.
- the electronic device may intercept a square area with a preset side length from the affine transformation image based on the specified origin, and use the image within the intercepted square area as the target face image. For example, the electronic device intercepts a square with a length and width of 224 from the (0, 0) position in the affine transformation image to obtain the target face image.
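- A minimal sketch of the correction step, assuming OpenCV and NumPy are available, is shown below; the choice of three corner points, the function names and the 224-pixel side length (taken from the example above) are illustrative assumptions.

```python
# Sketch: map a candidate square around the face onto a 224x224 canvas via an
# affine transform, then use the warped square at the origin as the target face image.
import cv2
import numpy as np

def crop_corrected_face(original: np.ndarray, cx: float, cy: float,
                        side: float, out_size: int = 224) -> np.ndarray:
    half = side / 2.0
    # Three vertices of the candidate square (upper-left, upper-right, lower-left).
    src = np.float32([[cx - half, cy - half],
                      [cx + half, cy - half],
                      [cx - half, cy + half]])
    # Corresponding vertices of the preset square whose designated origin is (0, 0).
    dst = np.float32([[0, 0], [out_size, 0], [0, out_size]])
    # Affine transformation coefficients between the two squares.
    m = cv2.getAffineTransform(src, dst)
    # Apply the transform to the original image and keep the square at the origin.
    warped = cv2.warpAffine(original, m, (out_size, out_size))
    return warped  # 224x224 target face image
```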
- Through the above correction processing, the face area can maintain better fidelity, that is, the facial expression can be preserved with higher fidelity, which helps to improve the accuracy of the subsequently generated virtual expression. In other words, processing the original image into a high-fidelity, normalized target face image can improve the accuracy of the first face coefficient obtained in the subsequent step 12 as well as the authenticity and fidelity of the virtual expression generated in step 14, which is conducive to improving the interactive experience.
- In step 12, the first face coefficient corresponding to the target face image is obtained; the first face coefficient includes a template expression coefficient and a pose coefficient, where the template expression coefficient is used to represent the degree of matching between the facial expression and each template, and the pose coefficient represents the rotation angle of the virtual image in three dimensions.
- the electronic device can obtain the first face coefficient corresponding to the target face image. See Figure 6 , including steps 61 to 65.
- the electronic device may perform blur processing and sharpening processing on the target face image, respectively, to obtain at least one blurred image and at least one sharpened image.
- Since the target face image is only part of the original image and its features may not be prominent, the overall features and/or detailed features of the target face image are first refined in this step.
- the electronic device can blur the target face image.
- the blur algorithms used include but are not limited to Gaussian blur, box blur, Kawase blur, dual blur, bokeh blur, tilt-shift blur, iris blur, grainy blur, radial blur and directional blur, etc. In one example, the Gaussian blur algorithm is used to process the target face image, thereby obtaining at least one blurred image corresponding to the target face image.
- the electronic device can sharpen the target face image.
- the sharpening algorithms used include but are not limited to the Robert operator, Prewitt operator, Sobel operator, Laplacian operator, Kirsch operator, etc. In one example, the Robert operator is used to process the target face image, thereby obtaining at least one sharpened image corresponding to the target face image.
- the above blurring algorithm and/or sharpening algorithm can also be implemented using neural networks in the field of machine vision (such as convolutional neural networks, etc.) to obtain the blurred images and/or sharpened images; the corresponding solutions also fall within the protection scope of the present disclosure.
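- As a hedged illustration of the blur-and-sharpen step, the sketch below uses Gaussian blur and Roberts cross kernels; the kernel sizes and the specific operators are example choices among the options listed above, and the function name is an assumption.

```python
# Sketch: produce blurred and sharpened variants of the 224x224 target face image.
import cv2
import numpy as np

def blur_and_sharpen(face_bgr: np.ndarray):
    # Blurred variant(s) via Gaussian blur.
    blurred = [cv2.GaussianBlur(face_bgr, (5, 5), sigmaX=1.5)]
    # Sharpened/edge variant via Roberts cross kernels on the grayscale image.
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    roberts_x = np.array([[1, 0], [0, -1]], dtype=np.float32)
    roberts_y = np.array([[0, 1], [-1, 0]], dtype=np.float32)
    gx = cv2.filter2D(gray, -1, roberts_x)
    gy = cv2.filter2D(gray, -1, roberts_y)
    sharpened = [cv2.convertScaleAbs(cv2.magnitude(gx, gy))]
    return blurred, sharpened
```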
- the electronic device can respectively extract feature data from the target face image, each blurred image, and each sharpened image to obtain the original feature image, the blurred feature image, and the sharpened feature image.
- the electronic device can perform at least one layer of convolution operation on the target face image, each blurred image, and each sharpened image respectively, thereby obtaining the original feature image, the blurred feature image, and the sharpened feature image.
- the electronic device may splice the original feature image, the blurred feature image, and the sharpened feature image to obtain an initial feature image.
- For example, the electronic device can splice the blurred feature image after the original feature image; after the blurred feature image is spliced, the sharpened feature image can be spliced after the blurred feature image, until all the feature images are spliced together. The spliced feature image containing the original features, blurred features and sharpened features is hereinafter called the initial feature image.
- the electronic device may obtain the importance coefficient of each characteristic image in the initial characteristic image to the expression of the virtual image, and adjust the initial characteristic image according to the importance coefficient to obtain a target characteristic image.
- For example, the electronic device can obtain the importance coefficient of each feature image in the initial feature image to the expression of the virtual image through a temporal attention mechanism and/or a spatial attention mechanism. Then, the electronic device can calculate the product of the importance coefficients and the initial feature image to obtain the target feature image.
- Adjusting the initial feature image through the importance coefficients can highlight the relatively important feature images and weaken the relatively unimportant ones, which improves the accuracy of the target feature image and, in turn, the accuracy of the first face coefficient obtained in step 65.
- the electronic device may determine a template expression coefficient and a pose coefficient according to the target feature image to obtain the first face coefficient.
- a preset set of expression templates can be stored in the electronic device, and each expression template is called an expression base.
- the electronic device can match the fitness between the target feature image and each expression base, thereby determining the template expression coefficient and pose coefficient, and obtaining the above-mentioned first face coefficient.
- In this way, the expression presented in the target face image can be restored.
- a preset face coefficient recognition network may be stored in the electronic device.
- the electronic device can input the above target face image into a preset face coefficient recognition network, and the preset face coefficient recognition network outputs the first face coefficient corresponding to the target face image.
- the above-mentioned preset face coefficient recognition network includes: blur sharpening module 71 , feature extraction module 72 , attention module 73 and coefficient learning module 74 .
- the blurring and sharpening module 71 respectively performs blurring processing and sharpening processing on the target face image to obtain at least one blurred image and at least one sharpened image
- the feature extraction module 72 respectively extracts feature data from the target face image, each blurred image and each sharpened image to obtain the original feature image, the blurred feature image and the sharpened feature image; and splices the original feature image, the blurred feature image and the sharpened feature image to obtain an initial feature image.
- the attention module 73 obtains the importance coefficient of each feature image in the initial feature image to the expression of the virtual image, and adjusts the initial feature image according to the importance coefficient to obtain the target feature image; the attention module uses Network model implementation of temporal attention mechanism or spatial attention mechanism.
- the coefficient learning module 74 determines the template expression coefficient and pose coefficient according to the target feature image, and obtains the first face coefficient.
- the coefficient learning module is implemented using at least one network model among Resnet50, Resnet18, Resnet100, DenseNet and YoloV5. Technical personnel can choose according to specific scenarios, and the corresponding solution falls within the protection scope of this disclosure.
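- Purely as an illustrative sketch of how the four modules might be wired together (layer sizes, channel counts, the ResNet-18 backbone and the 52+3 output split are assumptions, not the disclosed network), a PyTorch outline could look as follows:

```python
# Illustrative wiring: blur/sharpen stack -> feature extraction -> spatial attention
# -> coefficient learning (expression + pose coefficients).
import torch
import torch.nn as nn
import torchvision.models as models

class FaceCoefficientNet(nn.Module):
    def __init__(self, n_expression_templates: int = 52, n_pose: int = 3):
        super().__init__()
        # Shallow feature extractor applied to the original/blurred/sharpened stack.
        self.feature = nn.Sequential(
            nn.Conv2d(9, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Simple spatial attention: per-pixel importance coefficients in [0, 1].
        self.attention = nn.Sequential(nn.Conv2d(3, 1, 1), nn.Sigmoid())
        # Coefficient learning module (ResNet-18 backbone as one of the listed options).
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features,
                                n_expression_templates + n_pose)
        self.coef_head = backbone

    def forward(self, stacked: torch.Tensor) -> torch.Tensor:
        # stacked: (B, 9, 224, 224) = original + blurred + sharpened images concatenated.
        feats = self.feature(stacked)
        feats = feats * self.attention(feats)   # weight features by importance
        return self.coef_head(feats)            # (B, 52 + 3) template + pose coefficients
```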
- In step 13, time domain correction processing is performed on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient; the target face coefficient is associated with the face coefficient of the previous frame of original image before the original image.
- the electronic device can perform time domain correction processing on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain the target face coefficient, including steps 81 to 82.
- the electronic device may obtain the first face coefficient and the preset weight coefficient of the previous frame image before the original image; the sum of the weight coefficient of the previous frame image and the weight coefficient of the original image is 1.
- the electronic device may perform a weighted summation of the first face coefficient of the original image and the first face coefficient of the previous frame image to obtain the target face coefficient corresponding to the original image.
- the target face coefficient is obtained by using a weighted summation value, so that the current original image and the face coefficient of the previous frame image have a correlation relationship.
- The greater the preset weight coefficient of the previous frame image, the greater the proportion of the face coefficient of the previous frame image in the target face coefficient and the smoother the transition between the parameters of the previous frame image and the current original image, so that the virtual expression corresponding to the current original image changes more slowly relative to the virtual expression corresponding to the previous frame image; the smaller the preset weight coefficient of the previous frame image, the faster the parameters change between the previous frame image and the current original image, so that the virtual expression corresponding to the current original image changes faster relative to the virtual expression corresponding to the previous frame image.
- Technicians can select appropriate preset weight coefficients according to specific scenarios so that the changes in expressions of two adjacent frames of original images meet the needs of the scene.
- the preset weight coefficient value of the previous frame of images is 0.4.
- the weight coefficient corresponding to the current original image is 0.6.
- For the first frame of image in the video, the electronic device can directly use the first face coefficient of the first frame image as the target face coefficient; that is, no time domain correction is performed on the first face coefficient, thereby ensuring the accuracy of the expression of the first frame image.
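- The weighted time-domain correction described above can be illustrated with a few lines of code; the 0.4 weight for the previous frame follows the example above, while the function name and array representation are assumptions.

```python
# Sketch: weighted sum of the previous frame's coefficients and the current frame's
# first face coefficients (the two weights sum to 1).
from typing import Optional
import numpy as np

def temporal_correction(current: np.ndarray,
                        previous: Optional[np.ndarray],
                        prev_weight: float = 0.4) -> np.ndarray:
    if previous is None:
        # First frame: use the first face coefficient directly, no correction.
        return current
    return prev_weight * previous + (1.0 - prev_weight) * current
```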
- The expression template set used in the virtual expression generation method provided by the present disclosure is fixed, where "fixed" means that each template in the expression template set is fixed and the number of templates is fixed. Considering that the expression template sets used by different electronic devices may differ, it is necessary to adapt the first face coefficients obtained by different electronic devices, for example, to adapt coefficients for 64 expression templates to coefficients for 52 expression templates. Referring to Figure 9, the electronic device adapts the first face coefficient, including steps 91 to 92.
- the electronic device can obtain a preset expression adaptation matrix; the expression adaptation matrix refers to the conversion relationship between two face coefficients containing different numbers of templates.
- a preset expression adaptation matrix can be stored in the electronic device.
- the preset expression adaptation matrix can be obtained through the following steps, see Figure 10 and Figure 11, including steps 101 to 103.
- the electronic device can obtain the first preset coefficient corresponding to the sample image.
- the above-mentioned first preset coefficient includes coefficients of a first number (for example, 64) of templates, which represent the degree of fitness between the target feature image and each template (or expression base) in the first number of templates.
- the electronic device may obtain a second preset coefficient corresponding to the sample image. The above-mentioned second preset coefficient includes coefficients of a second number (such as 52) of templates, which represent the degree of fitness between the sample image and each template (or expression base) in the second number of templates.
- the electronic device may obtain the preset expression adaptation matrix according to the first preset coefficient, the second preset coefficient and the least squares method.
- For example, denote the first preset coefficient as e_b ∈ R^j and the second preset coefficient as e_n ∈ R^k, and assume the two satisfy a linear relationship e_n = S·[e_b; 1]. The adaptation matrix is obtained by minimizing the sum-of-squares loss J = ||e_n - S·[e_b; 1]||², where J represents the sum-of-squares loss, S ∈ R^(k×(j+1)) is the expression adaptation matrix, k is the number of new expression bases (the second number), and j is the number of basic expression bases (the first number).
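- As a sketch under the notation reconstructed above (the symbol names, the appended constant 1 and the use of an ordinary least-squares solve are assumptions), the expression adaptation matrix could be fitted as follows:

```python
# Sketch: fit the expression adaptation matrix S in R^{k x (j+1)} by least squares,
# given paired coefficients for the same sample images under two template sets.
import numpy as np

def fit_adaptation_matrix(coeffs_j: np.ndarray, coeffs_k: np.ndarray) -> np.ndarray:
    """coeffs_j: (N, j) first preset coefficients; coeffs_k: (N, k) second preset coefficients."""
    n = coeffs_j.shape[0]
    # Append a constant 1 so S can absorb an offset term: A has shape (N, j+1).
    a = np.hstack([coeffs_j, np.ones((n, 1))])
    # Solve A @ S.T ~= coeffs_k in the least-squares sense.
    s_t, *_ = np.linalg.lstsq(a, coeffs_k, rcond=None)
    return s_t.T                      # S with shape (k, j+1)

def adapt(coeff_j: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Map a single j-dimensional coefficient vector to the k-dimensional template set."""
    return s @ np.append(coeff_j, 1.0)
```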
- the linear relationship between the first preset coefficient and the second preset coefficient is obtained in the following way:
- the analysis is as follows:
- The adjustment of the first preset coefficient can be divided into template expression coefficient adjustment and posture coefficient adjustment. Since the posture coefficient has a clear spatial physical meaning, its adjustment is only a transformation between different spatial dimensions or coordinate systems, for example, conversion between radians and degrees, or between clockwise and counterclockwise directions. Therefore, adjusting the first preset coefficient in this step refers to adjusting the template expression coefficient.
- A human face can be represented by its discrete vertices, for example as F = [(x_1, y_1, z_1), ..., (x_m1, y_m1, z_m1)], where m1 represents the number of discrete vertices constituting the human face and (x_i, y_i, z_i) represents the spatial coordinate data of the i-th vertex.
- The electronic device can use principal component analysis (PCA) for dimensionality reduction, so that the motion of low-dimensional discrete vertices can be used to drive the high-dimensional model.
- After PCA, a matrix of feature vectors can be obtained, that is, a principal component set, in which the principal components are orthogonal to each other and each principal component serves as an expression base. Therefore, the 3D expression of the human face is a linear combination of the natural expression and the expression base set, as shown in Equation (6):
- F = F̄ + P·e    (6)
- In Equation (6), F̄ represents the natural expression, that is, a face without any expression (the initial face); P ∈ R^(n×m) is a matrix composed of m feature vectors, where in the application process one feature vector corresponds to one blend shape (Blendshape); and e denotes the expression coefficients.
- Further, the expression space, that is, a human facial expression, can be represented by a different natural expression and a different set of feature vectors, as shown in Equation (7):
- F = F̄′ + P′·e′    (7)
- where F̄′ and P′ denote the new natural expression and the matrix of new expression bases, e′ denotes the coefficients of the new expression bases, and C ∈ R^(k×j) is the mapping function between the basic expression bases and the new expression bases, that is, e′ = C·e.
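- A toy numerical sketch of the linear blendshape model in Equation (6) is given below; the array shapes and random initialization are purely illustrative assumptions.

```python
# Sketch of Equation (6): a 3D face as the neutral face plus a linear combination
# of expression bases (blendshapes).
import numpy as np

n_vertices, m_bases = 5000, 64                        # illustrative sizes
neutral_face = np.zeros(3 * n_vertices)               # flattened (x, y, z) coordinates
P = np.random.randn(3 * n_vertices, m_bases) * 0.01   # columns = expression bases
e = np.zeros(m_bases)
e[0] = 0.8                                             # activate one expression template

face = neutral_face + P @ e                            # Equation (6): F = F_bar + P * e
```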
- the electronic device may calculate the product of the face coefficient after time domain correction processing and the expression adaptation matrix to obtain the target face coefficient.
- the target face coefficient obtained in this step is an adapted coefficient, which realizes the transformation from one set of expression bases to another, so that the target face coefficient matches the corresponding expression bases and achieves the effect of expression transfer.
- In step 14, the expression of the virtual image is rendered according to the target face coefficient to obtain a virtual expression.
- the electronic device can use the target face coefficients to render the expression of the virtual image.
- For example, the electronic device can transmit the above target face coefficients in the form of a UDP (User Datagram Protocol) broadcast; a preset rendering program (such as a Unity program) renders the image when it receives the UDP data, and finally a 3D display is used to display the virtual expression of the avatar in real time.
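- A hedged sketch of broadcasting the target face coefficients over UDP is shown below; the port number and the JSON payload layout are assumptions, since the disclosure only specifies UDP transmission to a rendering program such as Unity.

```python
# Sketch: send target face coefficients over UDP broadcast to a rendering listener.
import json
import socket

def broadcast_coefficients(expression, pose, port: int = 5005) -> None:
    payload = json.dumps({"expression": list(expression),
                          "pose": list(pose)}).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.sendto(payload, ("255.255.255.255", port))
    sock.close()
```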
- When no face area is detected in the original image, the electronic device can render the expression of the avatar according to the target face coefficient of the previous frame of original image to obtain the virtual expression, so that the avatar's expressions in two adjacent frames of original images remain correlated and continuous.
- the electronic device can continue to detect the next frame of original image, that is, perform step 11 again.
- The electronic device can also start timing (or counting) when no face area is detected in the original image. When the timing duration exceeds a set duration threshold (such as 3 to 5 seconds) and the electronic device still has not detected a face area, the virtual expression is obtained according to the preset expression coefficient so as to display the initial expression of the avatar.
- In this case, the electronic device can also reduce the face detection frequency to save processing resources. For example, face detection is performed once every 3 to 5 frames of original images until a face area is detected again, after which face detection is performed once per frame of original image.
- It can be seen that the face area in the original image can be obtained to obtain the target face image; then, the first face coefficient corresponding to the target face image is obtained; then, the template expression coefficient and/or the pose coefficient in the first face coefficient are subjected to time domain correction processing to obtain the target face coefficient; finally, the expression of the virtual image is rendered according to the target face coefficient to obtain the virtual expression.
- the expressions of adjacent original images in the video can be made relevant and continuous, making the reconstructed expressions more natural and improving the viewing experience; and,
- the virtual expression is obtained by transmitting the target face coefficients and rendering the expression of the virtual image based on them; compared with transmitting image data, this reduces the amount of data transmission and achieves the effect of reconstructing the virtual expression in real time.
- Embodiments of the present disclosure provide a virtual expression generation method, see Figure 12 , including steps 121 to 128.
- In step 121, the model is initialized, and the model structure and parameters are loaded.
- In step 122, the camera collects video, and its collection frame rate is not greater than 60 fps.
- In step 123, face detection and correction are performed: the preset face detection model is used to obtain all face areas in the video frame (i.e., the original image); the best face is selected according to a weighted value of the face size and the position of the face relative to the center, and is corrected at the same time into a face image with a size of 224×224 pixels to meet the input requirements of the face coefficient recognition network.
- In step 124, the template expression coefficients are generated: the 224×224 pixel face image obtained in step 123 is sent to the face coefficient recognition network to obtain the first face coefficient, which is used to describe the expression and posture of the face.
- In step 125, adaptation correction is performed, which mainly involves mapping the basic expression base coefficients to the new expression base coefficients and transforming the pose coefficients. Since the new expression base coefficients can be regarded as a linear combination of the basic expression base coefficients, the whole mapping process is implemented by a single matrix multiplication; the pose coefficient has a clear physical meaning, and the template pose only needs to be transformed in a fixed way according to its actual physical meaning.
- In step 126, time domain correction is performed, which considers the temporal correlation of facial expressions rather than reconstructing the expression of each frame independently. Time domain correction of the expression coefficients and pose coefficients is therefore introduced to smooth the facial expression transformation process and improve the continuity and stability of the 3D virtual expression.
- In step 127, the Unity program is used to render the virtual expression: the processed expression coefficients and pose coefficients, that is, the target face coefficients, are transmitted to the Unity program through a UDP port to drive the movement of the created virtual expression.
- In step 128, a 3D display device is provided so that the 3D virtual expression can be viewed on the 3D display device; steps 122 to 127 are then repeated to realize real-time interaction of the 3D virtual expression.
- the embodiment of the disclosure also provides a virtual expression generation device.
- the device includes: a target image acquisition module 131, used to acquire the face area in the original image to obtain the target face image; a first coefficient acquisition module 132, used to obtain the first face coefficient corresponding to the target face image, where the first face coefficient includes a template expression coefficient and a pose coefficient, the template expression coefficient is used to represent the matching degree of the facial expression with each template, and the pose coefficient represents the rotation angle of the virtual image in three dimensions; a target coefficient acquisition module 133, used to perform time domain correction processing on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient, where the target face coefficient is associated with the face coefficient of the previous frame of original image before the original image; and an expression animation acquisition module 134, used to render the expression of the virtual image according to the target face coefficient to obtain the virtual expression.
- the target image acquisition module includes: a face area acquisition sub-module, used to perform face detection on the original image and obtain at least one face area contained in the original image; the target area acquisition sub-module A module is used to select a target face area from the at least one face area; a target image acquisition sub-module is used to perform correction processing on the target face area to obtain a target face image.
- the target area acquisition sub-module includes: a first determination unit, configured to determine that the face area is the target face area when the number of face areas is one; and a second determination unit, configured to calculate a score value of each face area based on the regional parameter data of each face area when the number of face areas is multiple, where the score value is used to represent the distance of each face area from the central axis of the original image, and to determine the face area corresponding to the maximum score value as the target face area.
- the area parameter data includes length, width, face area and position data, and the second determination unit includes: an absolute value acquisition subunit, used to obtain the difference between the abscissa of the middle position of each face area and half of the width, and the absolute value of the difference; a ratio acquisition subunit, used to obtain the ratio of the absolute value of the difference to the width, and the product of the ratio and the constant 2; a product acquisition subunit, used to obtain the difference between the constant 1 and the product, and to obtain the product of this difference and the preset distance weight; a square root acquisition subunit, used to obtain the ratio of the face area in each face area to the product of the length and the width, and the square root of this ratio; another product acquisition subunit, used to obtain the product of the square root and the preset area weight, where the sum of the area weight and the distance weight is 1; and a score acquisition subunit, used to calculate the sum of the product corresponding to the preset area weight and the product corresponding to the preset distance weight to obtain the score value of each face area.
- the target image acquisition sub-module includes: a candidate area acquisition unit, used to determine the candidate square area corresponding to the target face area and obtain the vertex coordinate data of the candidate square area; an affine coefficient acquisition unit, used to perform affine transformation on the vertex coordinate data of the candidate square area and the vertex coordinate data of the preset square to obtain an affine transformation coefficient, where the vertex coordinate data of the preset square includes a designated origin; an affine image acquisition unit, used to perform affine transformation on the original image using the affine transformation coefficient to obtain an affine transformed image; and a target image acquisition unit, used to intercept a square area with a preset side length from the affine transformed image based on the designated origin, and to use the image within the intercepted square area as the target face image.
- the first coefficient acquisition module includes: an image processing submodule, used to perform blur processing and sharpening processing on the target face image respectively to obtain at least one blurred image and at least one sharpened image;
- the feature image acquisition submodule is used to respectively extract the feature data in the target face image, each blurred image and each sharpened image, and obtain the original feature image, the blurred feature image and the sharpened feature image;
- an initial image acquisition submodule, used to splice the original feature image, the blurred feature image and the sharpened feature image to obtain an initial feature image;
- a target image acquisition submodule, used to obtain the importance coefficient of each feature image in the initial feature image to the expression of the virtual image, and to adjust the initial feature image according to the importance coefficient to obtain the target feature image;
- a face coefficient acquisition submodule, used to determine the template expression coefficient and pose coefficient according to the target feature image, and to obtain the first face coefficient.
- the first coefficient acquisition module includes: a first coefficient acquisition sub-module, used to input the target face image into a preset face coefficient recognition network and obtain the first face coefficient corresponding to the target face image output by the preset face coefficient recognition network.
- the preset face coefficient recognition network includes: a blur and sharpening module, a feature extraction module, an attention module and a coefficient learning module. The blur and sharpening module respectively performs blur processing and sharpening processing on the target face image to obtain at least one blurred image and at least one sharpened image; the feature extraction module respectively extracts feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image, and splices the original feature image, the blurred feature image and the sharpened feature image to obtain an initial feature image; the attention module obtains the importance coefficient of each feature image in the initial feature image to the expression of the virtual image, and adjusts the initial feature image according to the importance coefficient to obtain the target feature image; the coefficient learning module determines the template expression coefficient and pose coefficient according to the target feature image, and obtains the first face coefficient.
- the attention module is implemented using a network model of a temporal attention mechanism or a spatial attention mechanism.
- the coefficient learning module is implemented using at least one network model among Resnet50, Resnet18, Resnet100, DenseNet and YoloV5.
- the target coefficient acquisition module includes: a weight coefficient acquisition sub-module, used to obtain the first face coefficient and the preset weight coefficient of the previous frame image before the original image, where the sum of the weight coefficient of the previous frame image and the weight coefficient of the original image is 1; and a target coefficient acquisition submodule, used to perform weighted summation of the first face coefficient of the original image and the first face coefficient of the previous frame image to obtain the target face coefficient corresponding to the original image.
- the device further includes: an adaptation matrix acquisition module, used to obtain a preset expression adaptation matrix, where the expression adaptation matrix refers to the conversion relationship between two face coefficients containing different numbers of templates; and the target coefficient acquisition module is used to calculate the product of the face coefficient after time domain correction and the expression adaptation matrix to obtain the target face coefficient.
- the preset expression adaptation matrix is obtained through the following steps: obtaining the first preset coefficient corresponding to the sample image, where the first preset coefficient includes coefficients of a first number of templates; obtaining the second preset coefficient corresponding to the sample image, where the second preset coefficient includes coefficients of a second number of templates; and obtaining the preset expression adaptation matrix according to the first preset coefficient, the second preset coefficient and the least squares method.
- the expression animation acquisition module is also used to continue to detect the next frame of the original image when no face area is detected in the original image, and based on the target face coefficient of the previous frame of the original image Obtain virtual expressions; or, the expression animation acquisition module is also used to obtain virtual expressions according to the preset expression coefficient when no face area is detected in the original image and the duration exceeds the set duration threshold.
- an electronic device including: a processor 141; a memory 142 for storing a computer program executable by the processor; wherein the processor is configured to The computer program in the memory is executed to implement the methods described in Figures 1 to 12.
- a non-transitory computer-readable storage medium such as a memory including an executable computer program.
- the above-mentioned executable computer program can be executed by a processor to implement Figures 1 to 12 The method of the illustrated embodiment.
- the readable storage medium can be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Abstract
The present disclosure relates to a virtual expression generation method and apparatus, an electronic device and a storage medium. The method includes: acquiring a face region in an original image to obtain a target face image; acquiring a first face coefficient corresponding to the target face image; performing temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient; and rendering the expression of the avatar according to the target face coefficient to obtain a virtual expression. By applying temporal correction to the first face coefficient, the expressions of adjacent original images in a video become correlated and continuous, so the reconstructed expression is more natural and the viewing experience is improved; by transmitting the target face coefficient rather than image data to render the avatar's expression, the amount of transmitted data is reduced and the virtual expression can be reconstructed in real time.
Description
The present disclosure relates to the field of data processing, and in particular to a virtual expression generation method and apparatus, an electronic device and a storage medium.
Three-dimensional (3D) modeling is a key problem in machine vision, and 3D expression modeling is widely used in entertainment fields such as games, film and television special effects, and VR. Mainstream 3D virtual expression modeling methods all generate 3D virtual expressions from pictures. However, because the structure of the human face is complex and expression changes involve coordinated facial-muscle motion, the expression change process is a complex non-rigid motion; it places high demands on the capture device, the capture environment, the modeling device and the modeling process, making real-time operation difficult. Moreover, when the frames of a video are processed, the correlation and continuity of expressions between frames are ignored.
SUMMARY
The present disclosure provides a virtual expression generation method and apparatus, an electronic device and a storage medium to overcome the deficiencies of the related art.
According to a first aspect of the embodiments of the present disclosure, a virtual expression generation method is provided, including: acquiring a face region in an original image to obtain a target face image; acquiring a first face coefficient corresponding to the target face image, the first face coefficient including a template expression coefficient and a pose coefficient, the template expression coefficient characterizing how well the facial expression matches each template and the pose coefficient representing the rotation angles of the avatar in three dimensions; performing temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient; and rendering the expression of the avatar according to the target face coefficient to obtain a virtual expression.
Optionally, acquiring the face region in the original image to obtain the target face image includes: performing face detection on the original image to obtain at least one face region contained in the original image; selecting a target face region from the at least one face region; and correcting the target face region to obtain the target face image.
Optionally, selecting the target face region from the at least one face region includes: when there is one face region, determining that face region to be the target face region; when there are multiple face regions, calculating a score for each face region from its region parameter data, the score representing how close the face region is to the central axis of the original image, and determining the face region with the largest score to be the target face region.
Optionally, the region parameter data include length, width, face area and position data, and calculating the score of each face region from its region parameter data includes: obtaining the difference between the horizontal coordinate of the middle of each face region and half of the width, and the absolute value of that difference; obtaining the ratio of that absolute value to the width, and the product of that ratio and the constant 2; obtaining the difference between the constant 1 and that product, and the product of this difference and a preset distance weight; obtaining the ratio of the face area in each face region to the product of the length and the width, and the square root of that ratio; obtaining the product of that square root and a preset area weight, the area weight and the distance weight summing to 1; and summing the product corresponding to the preset area weight and the product corresponding to the preset distance weight to obtain the score of each face region.
Optionally, correcting the target face region to obtain the target face image includes: determining a candidate square region corresponding to the target face region and obtaining the vertex coordinate data of the candidate square region; performing an affine transformation between the vertex coordinate data of the candidate square region and the vertex coordinate data of a preset square to obtain affine transformation coefficients, the vertex coordinate data of the preset square including a designated origin; applying the affine transformation coefficients to the original image to obtain an affine-transformed image; and cutting a square region of a preset side length from the affine-transformed image, taking the designated origin as the reference, the image within the cut square region being the target face image.
Optionally, acquiring the first face coefficient corresponding to the target face image includes: blurring and sharpening the target face image to obtain at least one blurred image and at least one sharpened image; extracting feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image; splicing the original feature image, the blurred feature image and the sharpened feature image to obtain an initial feature image; obtaining the importance coefficient of each feature image in the initial feature image to the expression of the avatar and adjusting the initial feature image according to the importance coefficients to obtain a target feature image; and determining the template expression coefficient and the pose coefficient from the target feature image to obtain the first face coefficient.
Optionally, acquiring the first face coefficient corresponding to the target face image includes: inputting the target face image into a preset face coefficient recognition network to obtain the first face coefficient, corresponding to the target face image, output by the preset face coefficient recognition network.
Optionally, the preset face coefficient recognition network includes a blur-and-sharpen module, a feature extraction module, an attention module and a coefficient learning module. The blur-and-sharpen module blurs and sharpens the target face image to obtain at least one blurred image and at least one sharpened image; the feature extraction module extracts feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image, and splices them into an initial feature image; the attention module obtains the importance coefficient of each feature image in the initial feature image to the expression of the avatar and adjusts the initial feature image according to the importance coefficients to obtain a target feature image; and the coefficient learning module determines the template expression coefficient and the pose coefficient from the target feature image to obtain the first face coefficient.
Optionally, the attention module is implemented with a network model using a temporal attention mechanism or a spatial attention mechanism.
Optionally, the coefficient learning module is implemented with at least one of the Resnet50, Resnet18, Resnet100, DenseNet and YoloV5 network models.
Optionally, performing temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain the target face coefficient includes: obtaining the first face coefficient of the frame immediately preceding the original image and a preset weight coefficient, the weight coefficient of the previous frame and the weight coefficient of the original image summing to 1; and computing the weighted sum of the first face coefficient of the original image and the first face coefficient of the previous frame to obtain the target face coefficient corresponding to the original image.
Optionally, after the temporal correction of the template expression coefficient and/or the pose coefficient in the first face coefficient, the method further includes: obtaining a preset expression adaptation matrix, the expression adaptation matrix being the conversion relationship between two face coefficients containing different numbers of templates; and computing the product of the temporally corrected face coefficient and the expression adaptation matrix to obtain the target face coefficient.
Optionally, the preset expression adaptation matrix is obtained by: obtaining first preset coefficients corresponding to sample images, the first preset coefficients including coefficients of a first number of templates; obtaining second preset coefficients corresponding to the sample images, the second preset coefficients including coefficients of a second number of templates; and obtaining the preset expression adaptation matrix from the first preset coefficients, the second preset coefficients and the least-squares method.
Optionally, the method further includes: when no face region is detected in the original image, continuing to detect the next original image frame and obtaining the virtual expression from the target face coefficient of the previous original image frame; or, when no face region is detected in the original image and this lasts longer than a set duration threshold, obtaining the virtual expression from a preset expression coefficient.
According to a second aspect of the embodiments of the present disclosure, a virtual expression generation apparatus is provided, including: a target image acquisition module configured to acquire a face region in an original image to obtain a target face image; a first coefficient acquisition module configured to acquire a first face coefficient corresponding to the target face image, the first face coefficient including a template expression coefficient and a pose coefficient, the template expression coefficient characterizing how well the facial expression matches each template and the pose coefficient representing the rotation angles of the avatar in three dimensions; a target coefficient acquisition module configured to perform temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient; and an expression animation acquisition module configured to render the expression of the avatar according to the target face coefficient to obtain a virtual expression.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including a processor and a memory for storing executable instructions; the processor reads the executable instructions from the memory to implement the steps of the method of any item of the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, a chip is provided, including a processor and a memory for storing an executable program; the processor reads the executable program from the memory to implement the steps of the method of any item of the first aspect.
According to a fifth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium storing a computer-executable program is provided; when the executable program is executed, the steps of the method of any item of the first aspect are implemented.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects.
As can be seen from the above, the solution provided by the embodiments of the present disclosure can acquire the face region in the original image to obtain the target face image; then acquire the first face coefficient corresponding to the target face image; then perform temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain the target face coefficient; and finally render the expression of the avatar according to the target face coefficient to obtain the virtual expression. By applying temporal correction to the first face coefficient, the expressions of adjacent original images in a video become correlated and continuous, so the reconstructed expression is more natural and the viewing experience is improved; and because the avatar's expression is rendered by transmitting the target face coefficient rather than image data, the amount of transmitted data is reduced and the virtual expression can be reconstructed in real time.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present disclosure.
The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the present disclosure and together with the description serve to explain the principles of the present disclosure.
FIG. 1 is a flowchart of a virtual expression generation method according to an exemplary embodiment.
FIG. 2 is a flowchart of acquiring a target face image according to an exemplary embodiment.
FIG. 3 is a flowchart of acquiring a target face region according to an exemplary embodiment.
FIG. 4 is a flowchart of obtaining the score of a face region according to an exemplary embodiment.
FIG. 5 is a flowchart of acquiring a target face image according to an exemplary embodiment.
FIG. 6 is a flowchart of acquiring a first face coefficient according to an exemplary embodiment.
FIG. 7 is a block diagram of a face coefficient recognition network according to an exemplary embodiment.
FIG. 8 is a flowchart of acquiring a target face coefficient according to an exemplary embodiment.
FIG. 9 is another flowchart of acquiring a target face coefficient according to an exemplary embodiment.
FIG. 10 is a flowchart of obtaining an expression adaptation matrix according to an exemplary embodiment.
FIG. 11 is another flowchart of obtaining an expression adaptation matrix according to an exemplary embodiment.
FIG. 12 is a flowchart of another virtual expression generation method according to an exemplary embodiment.
FIG. 13 is a block diagram of a virtual expression generation apparatus according to an exemplary embodiment.
FIG. 14 is a block diagram of a server according to an exemplary embodiment.
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses consistent with some aspects of the present disclosure as detailed in the appended claims. It should be noted that, where there is no conflict, the features of the following embodiments and implementations may be combined with one another.
3D modeling is a key problem in machine vision, and 3D expression modeling is widely used in entertainment fields such as games, film and television special effects, and VR. Mainstream 3D virtual expression modeling methods all generate 3D virtual expressions from pictures. However, because the structure of the human face is complex and expression changes involve coordinated facial-muscle motion, the expression change process is a complex non-rigid motion; it places high demands on the capture device, the capture environment, the modeling device and the modeling process, making real-time operation difficult. Moreover, when the frames of a video are processed, the correlation and continuity of expressions between frames are ignored.
To solve the above technical problem, an embodiment of the present disclosure provides a virtual expression generation method applicable to an electronic device. FIG. 1 is a flowchart of a virtual expression generation method according to an exemplary embodiment.
Referring to FIG. 1, the virtual expression generation method includes steps 11 to 14.
In step 11, a face region in an original image is acquired to obtain a target face image.
In this embodiment, the electronic device may communicate with a camera to acquire the images and/or video captured by the camera, the capture frame rate of the camera not exceeding 60 fps; it may also read images and/or video from a designated location. Since the electronic device processes one image, or one frame of a video, at a time, the following embodiments are described taking the processing of a single image as an example, and the image being processed is called the original image to distinguish it from other, already processed images.
In this embodiment, after acquiring the original image, the electronic device may acquire the face region in the original image; referring to FIG. 2, this includes steps 21 to 23.
In step 21, the electronic device may perform face detection on the original image to obtain at least one face region contained in the original image.
In this step, the electronic device may use a preset face detection model to perform face detection on the original image. The preset face detection model may include, but is not limited to, models capable of object detection such as the yolov5 model, the resnet18 model, the R-CNN model and the mobilenet model; those skilled in the art may choose a suitable model for the specific scenario, and such solutions fall within the scope of the present disclosure. The preset face detection model thus outputs the at least one face region contained in the original image.
It should be noted that during face detection the electronic device may record whether a face region was detected: when no face region exists, a flag may be set to -1; when face regions exist, the flag may be set to the number of face regions, and the region parameter data of each face region are recorded at the same time. The region parameter data include length, width, face area and position data.
For example, when there is one face region, its region parameter data are [x, y, w, h, s], where x and y are the horizontal and vertical coordinates of a designated point of the face region (such as the center point or the top-left, bottom-left, top-right or bottom-right vertex), w and h are the width and height of the face region, and s is the area of the face region. As another example, when the number of face regions is n1 (n1 being an integer greater than 1), the region parameter data of the n1 face regions are expressed as a list, i.e. [[x1, y1, w1, h1, s1], [x2, y2, w2, h2, s2], ..., [xn1, yn1, wn1, hn1, sn1]].
In step 22, the electronic device may select a target face region from the at least one face region.
For example, when there is one face region, the electronic device may determine that face region to be the target face region.
As another example, when there are multiple face regions (e.g. n1 of them, n1 being an integer greater than 1), the electronic device may select one of them as the target face region. Referring to FIG. 3, in step 31 the electronic device may calculate a score for each face region from its region parameter data, the score representing how close the face region is to the central axis of the original image. The central axis of the original image is the vertical line through its center point; for example, if the original image is 1920*1080, the line x = 960 may serve as its central axis.
In this example, referring to FIG. 4, the electronic device obtains the score of each face region through steps 41 to 46.
In step 41, the electronic device may obtain the difference between the horizontal coordinate of the middle of each face region and half of the width, as well as the absolute value of that difference, i.e. |xn1 - w/2|, where xn1 is the horizontal coordinate of the n1-th face region, w is the width of the n1-th face region, and |·| denotes the absolute value.
In step 42, the electronic device may obtain the ratio of the absolute value of the difference to the width, and the product of that ratio and the constant 2, i.e. 2·|xn1 - w/2|/w.
In step 43, the electronic device may obtain the difference between the constant 1 and that product, and the product of this difference and a preset distance weight, i.e. α·(1 - 2·|xn1 - w/2|/w), where α is the preset distance weight, that is, a normalized value of the distance between the center of the face region and the central axis; α is affected by the camera's capture distance and takes the value 0.2 in one example.
In step 44, the electronic device may obtain the ratio of the face area in each face region to the product of the length and the width, and the square root of that ratio, i.e. sqrt(sn1/(w·h)), where sn1 is the face area in the n1-th face region, h is the height of the n1-th face region, and w is the width of the n1-th face region.
In step 45, the electronic device may obtain the product of that square root and a preset area weight, the area weight and the distance weight summing to 1, i.e. (1 - α)·sqrt(sn1/(w·h)), where 1 - α is a normalized value of the proportion of the face area in the area of the original image.
In step 46, the electronic device may sum the product corresponding to the preset area weight and the product corresponding to the preset distance weight to obtain the score of each face region, as shown in equation (1):
score = α·(1 - 2·|xn1 - w/2|/w) + (1 - α)·sqrt(sn1/(w·h))   (1)
In step 32, the electronic device may determine the face region with the largest score to be the target face region.
In this example, by taking the face region with the largest score as the target face region, the face region that is closest to the central axis of the original image and has a relatively large face area is selected. This matches scenes where, during actual image capture, the subject of interest is within the shooting area, and helps improve the accuracy of the acquired target face region.
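The following is a minimal Python sketch of the scoring rule in equation (1), reading the textual description above literally; the record layout [x, y, w, h, s] and the variable names are taken from the example above, and alpha defaults to the example value 0.2.

```python
import math

def score_region(region, alpha=0.2):
    """region: [x, y, w, h, s] as described above (x: horizontal coordinate,
    w/h: width and height of the face region, s: face area)."""
    x, y, w, h, s = region
    distance_term = alpha * (1.0 - 2.0 * abs(x - w / 2.0) / w)   # closeness to the axis
    area_term = (1.0 - alpha) * math.sqrt(s / (w * h))           # relative face area
    return distance_term + area_term                              # equation (1)

def pick_target_region(regions, alpha=0.2):
    # The face region with the largest score is taken as the target face region.
    return max(regions, key=lambda r: score_region(r, alpha))
```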
In step 23, the electronic device may correct the target face region to obtain the target face image.
In this step, referring to FIG. 5, correcting the target face region includes steps 51 to 54.
In step 51, the electronic device may determine the candidate square region corresponding to the target face region and obtain its vertex coordinate data. The electronic device may obtain the center point (xn1, yn1) of the target face region and determine a square region centered on (xn1, yn1), whose side length is computed from the width wn1 and height hn1 of the target face region and a magnification factor scale, where scale is the magnification factor of the target face region, takes a value greater than 1, and is 1.25 in one example. The electronic device may obtain the vertex coordinate data of each vertex of this square region; for convenience, this square region is subsequently called the candidate square region.
In step 52, the electronic device may perform an affine transformation between the vertex coordinate data of the candidate square region and the vertex coordinate data of a preset square to obtain affine transformation coefficients; the vertex coordinate data of the preset square include a designated origin.
In this step, the electronic device may store a preset square whose vertex coordinate data include a designated origin (0, 0) and whose side length is a preset value (e.g. 224 pixels). Taking the case where the top-left vertex is the designated origin as an example, the vertex coordinates of the four vertices of the preset square are top-left (0, 0), bottom-left (0, 224), top-right (224, 0) and bottom-right (224, 224).
In this step, the electronic device may perform an affine transformation between the candidate square region and the preset square, that is, establish an affine relationship between the vertices of the candidate square region and those of the preset square to obtain the affine transformation coefficients. In other words, the electronic device may scale, translate and rotate the candidate square region into the preset square. It can be understood that obtaining the affine relationship between two squares may follow related-art solutions, which are not repeated here.
In step 53, the electronic device may apply the affine transformation coefficients to the original image to obtain an affine-transformed image.
In step 54, the electronic device may cut a square region of the preset side length from the affine-transformed image, taking the designated origin as the reference, and use the image within the cut square region as the target face image. For example, the electronic device cuts, from position (0, 0) of the affine-transformed image, a square whose length and width are both 224, obtaining the target face image.
In this example, correcting the target face region through an affine transformation, compared with stretching or squeezing the face region, preserves the face region, and therefore the facial expression, with higher fidelity, which helps improve the accuracy of the subsequently generated virtual expression. In other words, processing the original image into a high-fidelity, normalized target face image improves the accuracy of the first face coefficient in the subsequent step 12, as well as the realism and fidelity of the virtual expression generated in step 14, which benefits the interactive experience.
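A minimal OpenCV sketch of steps 51 to 54 is shown below. The side-length rule `scale * max(w, h)` is an assumption (the original formula only states that the side length is derived from the region size and the magnification factor), and warping directly to a 224x224 canvas plays the role of cutting the preset square at the designated origin.

```python
import cv2
import numpy as np

def correct_face(original, cx, cy, w, h, scale=1.25, out_size=224):
    """original: BGR image; (cx, cy): center of the target face region;
    w, h: width and height of the target face region."""
    side = scale * max(w, h)                       # assumed side-length rule
    half = side / 2.0
    src = np.float32([[cx - half, cy - half],      # top-left of candidate square
                      [cx + half, cy - half],      # top-right
                      [cx - half, cy + half]])     # bottom-left
    dst = np.float32([[0, 0], [out_size, 0], [0, out_size]])  # preset square
    m = cv2.getAffineTransform(src, dst)           # affine transformation coefficients
    warped = cv2.warpAffine(original, m, (out_size, out_size))
    return warped                                   # 224x224 target face image
```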
In step 12, the first face coefficient corresponding to the target face image is acquired; the first face coefficient includes a template expression coefficient and a pose coefficient, the template expression coefficient characterizing how well the facial expression matches each template and the pose coefficient representing the rotation angles of the avatar in three dimensions.
In this step, the electronic device may acquire the first face coefficient corresponding to the target face image; referring to FIG. 6, this includes steps 61 to 65.
In step 61, the electronic device may blur and sharpen the target face image to obtain at least one blurred image and at least one sharpened image.
Since the target face image is only part of the original image and its features are not prominent, this step first distills the overall features and/or detail features of the target face image.
Taking the extraction of overall contour features as an example, the electronic device may blur the target face image. The blurring algorithm may include, but is not limited to, Gaussian Blur, Box Blur, Kawase Blur, Dual Blur, Bokeh Blur, Tilt Shift Blur, Iris Blur, Grainy Blur, Radial Blur and Directional Blur; in one example a Gaussian blur is applied to the target face image to obtain at least one blurred image corresponding to it.
Taking the extraction of detail contour features as an example, the electronic device may sharpen the target face image. The sharpening algorithm may include, but is not limited to, the Robert operator, the Prewitt operator, the Sobel operator, the Laplacian operator and the Kirsch operator; in one example the Robert operator is applied to the target face image to obtain at least one sharpened image corresponding to it.
In some examples, the blurring and/or sharpening algorithms may also be implemented with neural networks from the machine vision field (such as convolutional neural networks), which likewise yield blurred and/or sharpened images; such solutions fall within the scope of the present disclosure.
Thus, blurring and sharpening the target face image makes the overall contour features, detail contour features and original features of the target face image available, enriching the number and variety of features and helping improve the accuracy of the first face coefficient obtained subsequently.
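A minimal sketch of step 61 under the example choices above (Gaussian blur for the blurred image, Roberts operator for the sharpened image); the kernel size and the way the edge response is added back into the image are illustrative assumptions.

```python
import cv2
import numpy as np

def blur_and_sharpen(face):
    """face: 224x224 BGR target face image (uint8)."""
    blurred = cv2.GaussianBlur(face, (5, 5), 0)                    # blurred image

    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY).astype(np.float32)
    roberts_x = np.array([[1, 0], [0, -1]], dtype=np.float32)      # Roberts kernels
    roberts_y = np.array([[0, 1], [-1, 0]], dtype=np.float32)
    edges = np.abs(cv2.filter2D(gray, -1, roberts_x)) + \
            np.abs(cv2.filter2D(gray, -1, roberts_y))
    sharpened = np.clip(gray + edges, 0, 255).astype(np.uint8)     # sharpened image
    return blurred, sharpened
```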
在步骤62中,电子设备可以分别提取所述目标人脸图像、各张模糊图像和各张锐化图像中的特征数据,得到原始特征图像、模糊特征图像和锐化特征图像。例如,电子设备可以分别对目标人脸图像、各张模糊图像和各张锐化图像进行至少一层卷积操作,从而得到原始特征图像、模糊特征图像和锐化特征图像。
在步骤63中,电子设备可以拼接所述原始特征图像、所述模糊特征图像和所述锐化特征图像,得到初始特征图像。例如,电子设备可以在原始特征图像的后方拼接模糊特征图像;在模糊特征图像拼接完成后,将锐化特征图像拼接到模糊特征图像之后,直至拼接完所有的特征图像,得到一张模糊特征、原始特征和锐化特征的特征图像,后续称之为初始特征图像。
在步骤64中,电子设备可以获取所述初始特征图像中各特征图像对所述虚拟形象表情表达的重要性系数,并根据所述重要性系数调整所述初始特征图像得到目标特征图像。例如,电子设备可以通过时间注意力机制和/空间注意力机制来获取初始特征图像中各特征图像对虚拟形象表情表达的重要性系数。然后,电子设备可以计算上述重要性系数和初始特征图像的乘积得到目标特征图像。
这样,本步骤中通过重要性系数对初始特征图像进行调整,可以突出相对重要的特征图像而弱化相对不重要的特征图像,提升目标特征图像的准确度,进而提升步骤65中所得第一人脸系数的准确度。
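The PyTorch sketch below illustrates steps 63 and 64 under one possible reading: the three feature maps are spliced along the channel axis and a channel-attention block produces one importance coefficient per feature map. The squeeze-and-excitation layout, channel counts and spatial size are illustrative assumptions, not the network defined by this disclosure.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        weights = self.fc(x).view(x.size(0), x.size(1), 1, 1)  # importance coefficients
        return x * weights                                      # adjusted feature image

orig_f = torch.randn(1, 64, 56, 56)      # original feature image
blur_f = torch.randn(1, 64, 56, 56)      # blurred feature image
sharp_f = torch.randn(1, 64, 56, 56)     # sharpened feature image
initial = torch.cat([orig_f, blur_f, sharp_f], dim=1)           # spliced initial feature image
target_feature = ChannelAttention(initial.size(1))(initial)     # target feature image
```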
In step 65, the electronic device may determine the template expression coefficient and the pose coefficient from the target feature image to obtain the first face coefficient.
In this step, the electronic device may store a preset set of expression templates, each expression template being called an expression basis. The electronic device may match the target feature image against each expression basis to determine how well they fit, thereby determining the template expression coefficient and the pose coefficient and obtaining the first face coefficient. In other words, adjusting each expression basis with the template expression coefficient in the first face coefficient, and adjusting the spatial pose of each expression basis with the pose coefficient, can reconstruct the target feature image.
In another embodiment, the electronic device may store a preset face coefficient recognition network. The electronic device may input the target face image into the preset face coefficient recognition network, which outputs the first face coefficient corresponding to the target face image.
Referring to FIG. 7, the preset face coefficient recognition network includes a blur-and-sharpen module 71, a feature extraction module 72, an attention module 73 and a coefficient learning module 74. The blur-and-sharpen module 71 blurs and sharpens the target face image to obtain at least one blurred image and at least one sharpened image; the feature extraction module 72 extracts feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image, and splices them into an initial feature image; the attention module 73 obtains the importance coefficient of each feature image in the initial feature image to the expression of the avatar and adjusts the initial feature image according to the importance coefficients to obtain the target feature image. The attention module is implemented with a network model using a temporal attention mechanism or a spatial attention mechanism.
The coefficient learning module 74 determines the template expression coefficient and the pose coefficient from the target feature image to obtain the first face coefficient. The coefficient learning module is implemented with at least one of the Resnet50, Resnet18, Resnet100, DenseNet and YoloV5 network models; practitioners may choose according to the specific scenario, and such solutions fall within the scope of the present disclosure.
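As an illustration of one of the permitted backbones, the sketch below builds a coefficient learning module around torchvision's Resnet18, regressing the template expression coefficients and the three pose angles from the target feature image. The head layout, the number of templates, the input channel count and the sigmoid range on the expression coefficients are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CoefficientLearner(nn.Module):
    def __init__(self, in_channels=192, num_templates=52):
        super().__init__()
        backbone = resnet18(weights=None)
        # Adapt the first convolution to the multi-channel target feature image.
        backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                                   stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()                   # keep the 512-d feature vector
        self.backbone = backbone
        self.expression_head = nn.Linear(512, num_templates)
        self.pose_head = nn.Linear(512, 3)

    def forward(self, x):                             # x: (N, in_channels, H, W)
        feat = self.backbone(x)
        expr = torch.sigmoid(self.expression_head(feat))  # template expression coefficients, assumed in [0, 1]
        pose = self.pose_head(feat)                       # rotation angles on three axes
        return expr, pose
```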
In step 13, temporal correction is performed on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient; the target face coefficient is associated with the face coefficient of the original image frame preceding the original image.
In this step, referring to FIG. 8, the electronic device may perform the temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain the target face coefficient through steps 81 and 82.
In step 81, the electronic device may obtain the first face coefficient of the frame immediately preceding the original image and a preset weight coefficient; the weight coefficient of the previous frame and the weight coefficient of the original image sum to 1.
In step 82, the electronic device may compute the weighted sum of the first face coefficient of the original image and the first face coefficient of the previous frame to obtain the target face coefficient corresponding to the original image.
Obtaining the target face coefficient as a weighted sum in this way associates the face coefficients of the current original image and the previous frame. The larger the preset weight coefficient of the previous frame, the larger the proportion of the previous frame's face coefficient in the target face coefficient, the smoother the parameters between the previous frame and the current original image, and the more slowly the virtual expression of the current original image changes relative to that of the previous frame; the smaller the preset weight coefficient of the previous frame, the faster the parameters change between the two frames and the faster the virtual expression changes. Practitioners may choose suitable preset weight coefficients so that the expression change between adjacent original image frames meets the needs of the scenario; in one example the preset weight coefficient of the previous frame is 0.4 and the weight coefficient of the current original image is 0.6.
It should be noted that when the current original image is the first frame of the video there is no previous frame; the electronic device may directly use the first face coefficient of this first frame as the target face coefficient, i.e. no temporal correction is applied, so as to guarantee the accuracy of the expression of the first frame.
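A minimal sketch of steps 81 and 82, including the first-frame case described above; the 0.4/0.6 weights follow the example values.

```python
def temporal_correction(current_coeff, previous_coeff=None, prev_weight=0.4):
    """Weighted sum of the previous frame's and the current frame's first face
    coefficients; the two weights sum to 1."""
    if previous_coeff is None:                     # first frame of the video
        return list(current_coeff)                 # use the first face coefficient directly
    cur_weight = 1.0 - prev_weight
    return [prev_weight * p + cur_weight * c
            for p, c in zip(previous_coeff, current_coeff)]
```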
Consider the templates used when obtaining the first face coefficient in step 65 and/or in the coefficient learning module 74.
The set of expression templates used in the virtual expression generation method provided by the present disclosure is fixed, where "fixed" means that both the individual templates in the set and the number of templates are fixed. Since different electronic devices may use different sets of expression templates, the first face coefficients obtained by different electronic devices need to be adapted, for example adapting 64 expression templates onto 52 expression templates. Referring to FIG. 9, the electronic device adapts the first face coefficient through steps 91 and 92.
In step 91, the electronic device may obtain a preset expression adaptation matrix; the expression adaptation matrix is the conversion relationship between two face coefficients containing different numbers of templates.
In this step, the electronic device may store the preset expression adaptation matrix, which may be obtained through the following steps; referring to FIG. 10 and FIG. 11, these include steps 101 to 103.
In step 101, the electronic device may obtain the first preset coefficient corresponding to a sample image; the way it is obtained may follow the embodiments shown in FIG. 6 or FIG. 7 and is not repeated here. The first preset coefficient includes the coefficients of a first number of templates (e.g. 64), i.e. how well the target feature image fits each of the first number of templates (expression bases).
In step 102, the electronic device may obtain the second preset coefficient corresponding to the sample image, which includes the coefficients of a second number of templates (e.g. 52), i.e. how well the sample image fits each of the second number of templates (expression bases); again the way it is obtained may follow the embodiments shown in FIG. 6 or FIG. 7 and is not repeated here.
In step 103, the electronic device may obtain the preset expression adaptation matrix from the first preset coefficients, the second preset coefficients and the least-squares method.
In this step, the first preset coefficient is denoted θ_basic (augmented with a constant term, giving a vector of dimension j + 1) and the second preset coefficient is denoted θ_new, and θ_new and θ_basic are in a linear relationship, as shown in equation (2):
θ_new = S·θ_basic   (2)
The sum of squared differences between the second preset coefficients and the first preset coefficients mapped through S is made as small as possible, as shown in equation (3):
J = Σ ||θ_new - S·θ_basic||²   (3)
In equation (3), J is the sum-of-squares loss, S ∈ R^(k×(j+1)) is the expression adaptation matrix, k is the number of new expression bases, i.e. the second number, and j is the number of basic expression bases, i.e. the first number.
Solving equation (3) gives S, as shown in equation (4):
S = Θ_new·Θ_basic^T·(Θ_basic·Θ_basic^T)^(-1)   (4)
where Θ_basic and Θ_new stack the first and second preset coefficients of the sample images column by column, and equation (4) is the least-squares solution of equation (3).
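A NumPy sketch of equations (3) and (4): the adaptation matrix is fitted by least squares from paired sample coefficients and then applied to a temporally corrected coefficient. The augmented "1" column (giving the j + 1 dimension) and the array shapes are assumptions consistent with the reconstruction above.

```python
import numpy as np

def fit_adaptation_matrix(first_coeffs, second_coeffs):
    """first_coeffs: (num_samples, j); second_coeffs: (num_samples, k)."""
    ones = np.ones((first_coeffs.shape[0], 1))
    a = np.hstack([first_coeffs, ones])              # augmented coefficients, (num_samples, j + 1)
    # Solve a @ S.T ~= second_coeffs in the least-squares sense (equation (3)).
    s_t, *_ = np.linalg.lstsq(a, second_coeffs, rcond=None)
    return s_t.T                                      # S with shape (k, j + 1)

def adapt(corrected_coeff, s):
    theta = np.append(corrected_coeff, 1.0)           # temporally corrected coefficient + constant term
    return s @ theta                                   # target face coefficient (equation (2))
```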
It should be noted that the linear relationship between the first preset coefficient and the second preset coefficient is obtained as follows.
Analysis: in this step, adjusting the first preset coefficient can be split into adjusting the template expression coefficient and adjusting the pose coefficient. Since the pose coefficient has a spatial physical meaning and only involves transformations between different spatial dimensions or coordinate systems, such as conversion between radians and degrees or adaptation between clockwise and counter-clockwise directions, that part may follow the transformation solutions in the related art and is not repeated here. Therefore, adjusting the first preset coefficient in this step refers to adjusting the template expression coefficient.
It can be understood that the spatial representation of a facial expression can be regarded as the shape attribute of a spatial geometry enclosed by a number of discrete vertices, as shown in equation (5):
F = ((x1, y1, z1), (x2, y2, z2), ..., (xi, yi, zi), ..., (xm1, ym1, zm1))   (5)
In equation (5), m1 is the number of discrete vertices constituting the face, and (xi, yi, zi) are the spatial coordinates of the i-th vertex.
When too many discrete vertices are needed to depict a facial expression, the computational load on the electronic device is large, which is unfavourable for generating animation. In this step, the electronic device may use principal component analysis (PCA) for dimensionality reduction, so that the motion of low-dimensional discrete vertices drives the high-dimensional model. PCA yields a matrix of eigenvectors, i.e. a set of principal components that are mutually orthogonal, each principal component serving as an expression basis. The 3D expression of a face is therefore a linear combination of a neutral expression and the set of expression bases, as shown in equation (6):
F = F̄ + P·θ   (6)
In equation (6), F̄ is the neutral expression, i.e. the face without any expression, or the initial face; P ∈ R^(n×m) is a matrix composed of m eigenvectors, each eigenvector being a blendshape as used in applications, so P represents a set of blendshapes; and θ is the coefficient vector of the expression eigenvectors, such as the first preset coefficient or the first face coefficient.
The expression space, i.e. a facial expression, can be expressed with different neutral expressions and different eigenvectors, as shown in equation (7):
F = F̄_basic + P_basic·θ_basic = F̄_new + P_new·θ_new   (7)
In equation (7), basic and new denote the basic expression-basis space and the new expression-basis space, P_basic ∈ R^(n×j) and P_new ∈ R^(n×k).
Transforming equation (7) gives equation (8):
P_basic = P_new·C   (8)
In equation (8), C ∈ R^(k×j) is the mapping between the basic expression bases and the new expression bases.
Transforming equation (8) gives equation (9):
F̄_basic - F̄_new = P_new·θ_Δ   (9)
In equation (9), θ_Δ is the difference feature-vector coefficient. From equations (8) and (9), equations (10) and (11) are obtained:
P_new·θ_new = P_new·(C·θ_basic + θ_Δ)   (10)
θ_new = C·θ_basic + θ_Δ   (11)
Combining equations (7) and (11), and writing C and θ_Δ together as the single matrix S acting on the first preset coefficient augmented with a constant 1, gives equation (2).
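The short NumPy sketch below illustrates equation (6), the blendshape reconstruction that underlies the derivation above: a 3D expression is the neutral face plus a linear combination of expression bases. The vertex and basis counts are illustrative assumptions.

```python
import numpy as np

n_vertices = 468                                        # m1, number of discrete face vertices (assumed)
n_blendshapes = 52                                      # number of expression bases (assumed)

f_bar = np.zeros(3 * n_vertices)                        # neutral (expressionless) face
p = np.random.randn(3 * n_vertices, n_blendshapes)      # blendshape matrix P (placeholder values)
theta = np.zeros(n_blendshapes)
theta[0] = 0.8                                          # drive one expression basis

f = f_bar + p @ theta                                   # equation (6)
vertices = f.reshape(n_vertices, 3)                     # back to (x, y, z) per vertex
```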
In step 92, the electronic device may compute the product of the temporally corrected face coefficient and the expression adaptation matrix to obtain the target face coefficient. The target face coefficient obtained in this step is thus an adapted coefficient, realizing the transformation from one set of expression bases to another, so that the target face coefficient matches the corresponding expression bases and the effect of expression migration is achieved.
In step 14, the expression of the avatar is rendered according to the target face coefficient to obtain the virtual expression.
In this step, the electronic device may use the target face coefficient to render the expression of the avatar. For example, the electronic device may transmit the target face coefficient by UDP (User Datagram Protocol) broadcast; a preset rendering program (such as a Unity program) renders the image upon receiving the UDP data, and a 3D display then shows the avatar's virtual expression in real time.
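A minimal sketch of the UDP broadcast described above, on the sending side; the port number and the JSON payload layout are illustrative assumptions, and the receiving Unity program is assumed to listen on the same port.

```python
import json
import socket

def broadcast_coefficients(expression_coeffs, pose_coeffs, port=5065):
    """Broadcast the target face coefficient so a renderer can drive the avatar."""
    payload = json.dumps({"expression": list(expression_coeffs),
                          "pose": list(pose_coeffs)}).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.sendto(payload, ("<broadcast>", port))
    sock.close()
```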
In one embodiment, when no face region is detected in the original image, the electronic device may render the avatar's expression using the target face coefficient of the previous original image frame to obtain the virtual expression, so that the avatar's expressions in adjacent original image frames remain correlated and continuous. The electronic device may then continue detecting the next original image frame, i.e. return to step 11.
In another embodiment, the electronic device may start a timer (or counter) when no face region is detected in the original image. If the timer exceeds a set duration threshold (e.g. 3 to 5 seconds) and still no face region is detected, the electronic device obtains the virtual expression from a preset expression coefficient so as to display the avatar's initial expression. The electronic device may also lower the face detection frequency to save processing resources, for example detecting a face region once every 3 to 5 original image frames, and restoring per-frame detection once a face region is detected again.
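A small sketch of this fallback behaviour: while no face is detected, the previous frame's target face coefficient is reused; once the no-face condition exceeds the threshold, the preset (initial) expression coefficient is used instead. Class and parameter names are illustrative.

```python
import time

class FallbackState:
    def __init__(self, preset_coeff, timeout_s=4.0):
        self.preset_coeff = preset_coeff     # preset expression coefficient (initial expression)
        self.timeout_s = timeout_s           # set duration threshold, e.g. 3-5 s
        self.last_coeff = preset_coeff
        self.no_face_since = None

    def update(self, detected_coeff):
        """detected_coeff: target face coefficient for this frame, or None if no face."""
        now = time.monotonic()
        if detected_coeff is not None:       # face found in this frame
            self.last_coeff = detected_coeff
            self.no_face_since = None
            return detected_coeff
        if self.no_face_since is None:
            self.no_face_since = now
        if now - self.no_face_since > self.timeout_s:
            return self.preset_coeff         # show the initial expression
        return self.last_coeff               # reuse the previous frame's coefficient
```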
In summary, the solution provided by the embodiments of the present disclosure acquires the face region in the original image to obtain the target face image; then acquires the first face coefficient corresponding to the target face image; then performs temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain the target face coefficient; and finally renders the expression of the avatar according to the target face coefficient to obtain the virtual expression. By applying temporal correction to the first face coefficient, the expressions of adjacent original images in the video become correlated and continuous, making the reconstructed expression more natural and improving the viewing experience; and rendering the avatar's expression by transmitting the target face coefficient, rather than image data, reduces the amount of transmitted data and allows the virtual expression to be reconstructed in real time.
An embodiment of the present disclosure provides a virtual expression generation method which, referring to FIG. 12, includes steps 121 to 128.
In step 121, the model is initialized, loading the model structure and parameters.
In step 122, the camera captures video at a frame rate of no more than 60 fps.
In step 123, face detection and correction use the preset face detection model to obtain all face regions in the video frame (i.e. the original image); the best face is selected according to a weighted value of face size and face-center position, and is corrected into a 224x224-pixel face image to meet the input requirement of the face coefficient recognition network.
In step 124, template expression coefficient generation feeds the 224x224-pixel face image obtained in step 123 into the face coefficient recognition network to obtain the first face coefficient, which describes the expression and pose of the face.
In step 125, adaptation correction mainly maps the basic expression-basis coefficients to the new expression-basis coefficients and transforms the pose coefficient. The new expression-basis coefficients can be regarded as a linear combination of the basic expression-basis coefficients, so in the overall pipeline this step is only a single matrix multiplication; the pose coefficient has a clear physical meaning and only needs a fixed transformation of the template pose according to its actual physical meaning.
In step 126, temporal correction takes into account that facial expressions are temporally correlated rather than reconstructed independently frame by frame; temporal correction of the expression coefficient and the pose coefficient is therefore introduced to smooth the change of the facial expression and improve the continuity and stability of the 3D virtual expression.
In step 127, the virtual expression is rendered with a Unity program: the processed expression and pose coefficients, i.e. the target face coefficient, are passed into the Unity program via a UDP port to drive the motion of the constructed virtual expression.
In step 128, the result is sent to a 3D display device, on which the 3D virtual expression is viewed; steps 122 to 127 are then repeated to achieve real-time interaction with the 3D virtual expression.
On the basis of the virtual expression generation method provided by the embodiments of the present disclosure, an embodiment of the present disclosure further provides a virtual expression generation apparatus. Referring to FIG. 13, the apparatus includes: a target image acquisition module 131 configured to acquire a face region in an original image to obtain a target face image; a first coefficient acquisition module 132 configured to acquire a first face coefficient corresponding to the target face image, the first face coefficient including a template expression coefficient and a pose coefficient, the template expression coefficient characterizing how well the facial expression matches each template and the pose coefficient representing the rotation angles of the avatar in three dimensions; a target coefficient acquisition module 133 configured to perform temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient, the target face coefficient being associated with the face coefficient of the original image frame preceding the original image; and an expression animation acquisition module 134 configured to render the expression of the avatar according to the target face coefficient to obtain a virtual expression.
In one embodiment, the target image acquisition module includes: a face region acquisition submodule configured to perform face detection on the original image to obtain at least one face region contained in the original image; a target region acquisition submodule configured to select a target face region from the at least one face region; and a target image acquisition submodule configured to correct the target face region to obtain the target face image.
In one embodiment, the target region acquisition submodule includes: a first determination unit configured to, when there is one face region, determine that face region to be the target face region; and a second determination unit configured to, when there are multiple face regions, calculate a score for each face region from its region parameter data, the score representing how close the face region is to the central axis of the original image, and determine the face region with the largest score to be the target face region.
In one embodiment, the region parameter data include length, width, face area and position data, and the second determination unit includes: an absolute value acquisition subunit configured to obtain the difference between the horizontal coordinate of the middle of each face region and half of the width, and the absolute value of that difference; a ratio acquisition subunit configured to obtain the ratio of that absolute value to the width, and the product of that ratio and the constant 2; a product acquisition subunit configured to obtain the difference between the constant 1 and that product, and the product of this difference and a preset distance weight; a square root acquisition subunit configured to obtain the ratio of the face area in each face region to the product of the length and the width, and the square root of that ratio; a product acquisition subunit configured to obtain the product of that square root and a preset area weight, the area weight and the distance weight summing to 1; and a score acquisition subunit configured to sum the product corresponding to the preset area weight and the product corresponding to the preset distance weight to obtain the score of each face region.
In one embodiment, the target image acquisition submodule includes: a candidate region acquisition unit configured to determine the candidate square region corresponding to the target face region and obtain its vertex coordinate data; an affine coefficient acquisition unit configured to perform an affine transformation between the vertex coordinate data of the candidate square region and the vertex coordinate data of a preset square to obtain affine transformation coefficients, the vertex coordinate data of the preset square including a designated origin; an affine image acquisition unit configured to apply the affine transformation coefficients to the original image to obtain an affine-transformed image; and a target image acquisition unit configured to cut a square region of a preset side length from the affine-transformed image, taking the designated origin as the reference, the image within the cut square region being the target face image.
In one embodiment, the first coefficient acquisition module includes: an image processing submodule configured to blur and sharpen the target face image to obtain at least one blurred image and at least one sharpened image; a feature image acquisition submodule configured to extract feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image; an initial image acquisition submodule configured to splice the original feature image, the blurred feature image and the sharpened feature image to obtain an initial feature image; a target image acquisition submodule configured to obtain the importance coefficient of each feature image in the initial feature image to the expression of the avatar and adjust the initial feature image according to the importance coefficients to obtain a target feature image; and a face coefficient acquisition submodule configured to determine the template expression coefficient and the pose coefficient from the target feature image to obtain the first face coefficient.
In one embodiment, the first coefficient acquisition module includes: a first coefficient acquisition submodule configured to input the target face image into a preset face coefficient recognition network to obtain the first face coefficient, corresponding to the target face image, output by the preset face coefficient recognition network.
In one embodiment, the preset face coefficient recognition network includes a blur-and-sharpen module, a feature extraction module, an attention module and a coefficient learning module. The blur-and-sharpen module blurs and sharpens the target face image to obtain at least one blurred image and at least one sharpened image; the feature extraction module extracts feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image, and splices them into an initial feature image; the attention module obtains the importance coefficient of each feature image in the initial feature image to the expression of the avatar and adjusts the initial feature image according to the importance coefficients to obtain a target feature image; and the coefficient learning module determines the template expression coefficient and the pose coefficient from the target feature image to obtain the first face coefficient.
In one embodiment, the attention module is implemented with a network model using a temporal attention mechanism or a spatial attention mechanism.
In one embodiment, the coefficient learning module is implemented with at least one of the Resnet50, Resnet18, Resnet100, DenseNet and YoloV5 network models.
In one embodiment, the target coefficient acquisition module includes: a weight coefficient acquisition submodule configured to obtain the first face coefficient of the frame immediately preceding the original image and a preset weight coefficient, the weight coefficient of the previous frame and the weight coefficient of the original image summing to 1; and a target coefficient acquisition submodule configured to compute the weighted sum of the first face coefficient of the original image and the first face coefficient of the previous frame to obtain the target face coefficient corresponding to the original image.
In one embodiment, the apparatus further includes: an adaptation matrix acquisition module configured to obtain a preset expression adaptation matrix, the expression adaptation matrix being the conversion relationship between two face coefficients containing different numbers of templates; the target coefficient acquisition module is configured to compute the product of the temporally corrected face coefficient and the expression adaptation matrix to obtain the target face coefficient.
In one embodiment, the preset expression adaptation matrix is obtained by: obtaining first preset coefficients corresponding to sample images, the first preset coefficients including coefficients of a first number of templates; obtaining second preset coefficients corresponding to the sample images, the second preset coefficients including coefficients of a second number of templates; and obtaining the preset expression adaptation matrix from the first preset coefficients, the second preset coefficients and the least-squares method.
In one embodiment, the expression animation acquisition module is further configured to, when no face region is detected in the original image, continue detecting the next original image frame and obtain the virtual expression from the target face coefficient of the previous original image frame; or the expression animation acquisition module is further configured to, when no face region is detected in the original image and this lasts longer than a set duration threshold, obtain the virtual expression from a preset expression coefficient.
It should be noted that the apparatus shown in this embodiment matches the content of the method embodiments; reference may be made to the method embodiments above, which is not repeated here.
In an exemplary embodiment, an electronic device is also provided. Referring to FIG. 14, it includes: a processor 141; and a memory 142 for storing a computer program executable by the processor; the processor is configured to execute the computer program in the memory to implement the methods described in FIG. 1 to FIG. 12.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, for example a memory including an executable computer program; the executable computer program can be executed by a processor to implement the methods of the embodiments shown in FIG. 1 to FIG. 12. The readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure disclosed herein. The present disclosure is intended to cover any variations, uses or adaptations that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
Claims (18)
- A virtual expression generation method, characterized by comprising: acquiring a face region in an original image to obtain a target face image; acquiring a first face coefficient corresponding to the target face image, the first face coefficient comprising a template expression coefficient and a pose coefficient, the template expression coefficient characterizing how well the facial expression matches each template and the pose coefficient representing the rotation angles of the avatar in three dimensions; performing temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient, the target face coefficient being associated with the face coefficient of the original image frame preceding the original image; and rendering the expression of the avatar according to the target face coefficient to obtain a virtual expression.
- The method according to claim 1, characterized in that acquiring the face region in the original image to obtain the target face image comprises: performing face detection on the original image to obtain at least one face region contained in the original image; selecting a target face region from the at least one face region; and correcting the target face region to obtain the target face image.
- The method according to claim 2, characterized in that selecting the target face region from the at least one face region comprises: when there is one face region, determining that face region to be the target face region; when there are multiple face regions, calculating a score for each face region from its region parameter data, the score representing how close the face region is to the central axis of the original image; and determining the face region with the largest score to be the target face region.
- The method according to claim 3, characterized in that the region parameter data comprise length, width, face area and position data, and calculating the score of each face region from its region parameter data comprises: obtaining the difference between the horizontal coordinate of the middle of each face region and half of the width, and the absolute value of that difference; obtaining the ratio of that absolute value to the width, and the product of that ratio and the constant 2; obtaining the difference between the constant 1 and that product, and the product of this difference and a preset distance weight; obtaining the ratio of the face area in each face region to the product of the length and the width, and the square root of that ratio; obtaining the product of that square root and a preset area weight, the area weight and the distance weight summing to 1; and summing the product corresponding to the preset area weight and the product corresponding to the preset distance weight to obtain the score of each face region.
- The method according to claim 2, characterized in that correcting the target face region to obtain the target face image comprises: determining a candidate square region corresponding to the target face region and obtaining the vertex coordinate data of the candidate square region; performing an affine transformation between the vertex coordinate data of the candidate square region and the vertex coordinate data of a preset square to obtain affine transformation coefficients, the vertex coordinate data of the preset square including a designated origin; applying the affine transformation coefficients to the original image to obtain an affine-transformed image; and cutting a square region of a preset side length from the affine-transformed image, taking the designated origin as the reference, the image within the cut square region being the target face image.
- The method according to claim 1, characterized in that acquiring the first face coefficient corresponding to the target face image comprises: blurring and sharpening the target face image to obtain at least one blurred image and at least one sharpened image; extracting feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image; splicing the original feature image, the blurred feature image and the sharpened feature image to obtain an initial feature image; obtaining the importance coefficient of each feature image in the initial feature image to the expression of the avatar and adjusting the initial feature image according to the importance coefficients to obtain a target feature image; and determining the template expression coefficient and the pose coefficient from the target feature image to obtain the first face coefficient.
- The method according to claim 1, characterized in that acquiring the first face coefficient corresponding to the target face image comprises: inputting the target face image into a preset face coefficient recognition network to obtain the first face coefficient, corresponding to the target face image, output by the preset face coefficient recognition network.
- The method according to claim 7, characterized in that the preset face coefficient recognition network comprises a blur-and-sharpen module, a feature extraction module, an attention module and a coefficient learning module; the blur-and-sharpen module blurs and sharpens the target face image to obtain at least one blurred image and at least one sharpened image; the feature extraction module extracts feature data from the target face image, each blurred image and each sharpened image to obtain an original feature image, a blurred feature image and a sharpened feature image, and splices them into an initial feature image; the attention module obtains the importance coefficient of each feature image in the initial feature image to the expression of the avatar and adjusts the initial feature image according to the importance coefficients to obtain a target feature image; and the coefficient learning module determines the template expression coefficient and the pose coefficient from the target feature image to obtain the first face coefficient.
- The method according to claim 8, characterized in that the attention module is implemented with a network model using a temporal attention mechanism or a spatial attention mechanism.
- The method according to claim 8, characterized in that the coefficient learning module is implemented with at least one of the Resnet50, Resnet18, Resnet100, DenseNet and YoloV5 network models.
- The method according to claim 8, characterized in that performing temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain the target face coefficient comprises: obtaining the first face coefficient of the frame immediately preceding the original image and a preset weight coefficient, the weight coefficient of the previous frame and the weight coefficient of the original image summing to 1; and computing the weighted sum of the first face coefficient of the original image and the first face coefficient of the previous frame to obtain the target face coefficient corresponding to the original image.
- The method according to claim 8, characterized in that, after the temporal correction of the template expression coefficient and/or the pose coefficient in the first face coefficient, the method further comprises: obtaining a preset expression adaptation matrix, the expression adaptation matrix being the conversion relationship between two face coefficients containing different numbers of templates; and computing the product of the temporally corrected face coefficient and the expression adaptation matrix to obtain the target face coefficient.
- The method according to claim 8, characterized in that the preset expression adaptation matrix is obtained by: obtaining first preset coefficients corresponding to sample images, the first preset coefficients including coefficients of a first number of templates; obtaining second preset coefficients corresponding to the sample images, the second preset coefficients including coefficients of a second number of templates; and obtaining the preset expression adaptation matrix from the first preset coefficients, the second preset coefficients and the least-squares method.
- The method according to claim 1, characterized in that the method further comprises: when no face region is detected in the original image, continuing to detect the next original image frame and obtaining the virtual expression from the target face coefficient of the previous original image frame; or, when no face region is detected in the original image and this lasts longer than a set duration threshold, obtaining the virtual expression from a preset expression coefficient.
- A virtual expression generation apparatus, characterized by comprising: a target image acquisition module configured to acquire a face region in an original image to obtain a target face image; a first coefficient acquisition module configured to acquire a first face coefficient corresponding to the target face image, the first face coefficient comprising a template expression coefficient and a pose coefficient, the template expression coefficient characterizing how well the facial expression matches each template and the pose coefficient representing the rotation angles of the avatar in three dimensions; a target coefficient acquisition module configured to perform temporal correction on the template expression coefficient and/or the pose coefficient in the first face coefficient to obtain a target face coefficient; and an expression animation acquisition module configured to render the expression of the avatar according to the target face coefficient to obtain a virtual expression.
- An electronic device, characterized by comprising a processor and a memory for storing executable instructions; the processor reads the executable instructions from the memory to implement the steps of the method of any one of claims 1 to 14.
- A chip, characterized by comprising a processor and a memory for storing an executable program; the processor reads the executable program from the memory to implement the steps of the method of any one of claims 1 to 14.
- A non-transitory computer-readable storage medium storing a computer-executable program, characterized in that when the executable program is executed, the steps of the method of any one of claims 1 to 14 are implemented.
Applications Claiming Priority (2)
- CN202210878271.1, filed 2022-07-25
- CN202210878271.1A (虚拟表情生成方法、装置、电子设备和存储介质), filed 2022-07-25
Publications (1)
- WO2024022065A1, published 2024-02-01
Family ID: 83768545
Family Applications (1)
- PCT/CN2023/105870 (虚拟表情生成方法、装置、电子设备和存储介质), filed 2023-07-05
Also Published As
- CN115272570A, published 2022-11-01
Legal Events
- 121: EP — the EPO has been informed by WIPO that EP was designated in this application (ref document number 23845278; country of ref document: EP; kind code of ref document: A1)