CN116012218A - Virtual anchor expression control method, device, equipment and medium - Google Patents

Virtual anchor expression control method, device, equipment and medium

Info

Publication number
CN116012218A
Authority
CN
China
Prior art keywords
face
virtual
key point
face image
anchor
Prior art date
Legal status
Pending
Application number
CN202211352075.7A
Other languages
Chinese (zh)
Inventor
王东鸿
Current Assignee
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Cubesili Information Technology Co Ltd filed Critical Guangzhou Cubesili Information Technology Co Ltd
Priority to CN202211352075.7A
Publication of CN116012218A
Legal status: Pending

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application relates to the technical field of live streaming and provides a virtual anchor expression control method, device, equipment, and medium. The method comprises the following steps: constructing a virtual anchor model and acquiring the positions of first key points on the face of the virtual anchor model; acquiring a set of face images containing an anchor face from a video stream, extracting the gray-scale image of each face image, and filtering the gray-scale images with a Gaussian filter to obtain smoothed images; calculating edge points of each gray-scale image according to the smoothed images, labeling second key points of each face image according to the edge points, and obtaining the average positions of the second key points; transforming each face image containing the anchor face and obtaining the predicted positions of the second key points on the face images containing the anchor face according to the transformed face images and the average positions of the second key points; obtaining the predicted positions of the first key points according to the predicted positions of the second key points; and controlling the expression of the virtual anchor according to the predicted positions of the first key points. In this way, the facial expression of the virtual anchor can be kept consistent with the facial expression of the real anchor.

Description

Virtual anchor expression control method, device, equipment and medium
Technical Field
The present invention relates to the field of network live broadcasting technologies, and in particular, to a method and apparatus for controlling a virtual anchor expression, an electronic device, and a computer readable storage medium.
Background
With the rise of the Internet and the arrival of the 5G era, network live broadcasting has developed rapidly, and virtual anchors have ridden this wave of development. A virtual anchor is an anchor who broadcasts live using an avatar. The existing avatar is typically a cartoon character whose actions can be controlled by the real anchor. At present, the actions, expressions, and so on of such cartoon characters are preset, so their expressions are monotonous and stiff; with the continuous progress of network technology, viewers' demands on the interactivity of virtual anchors are growing ever higher.
Current virtual anchors cannot display corresponding expressions according to the needs of the audience, so the interaction between the virtual anchor and the audience is poor and the interactivity of the virtual anchor is weak. To achieve a better live broadcast effect and user experience, it is necessary to solve the problem of making the expression of the virtual anchor the same as, or nearly the same as, that of the real anchor.
Disclosure of Invention
Based on this, and in view of the above technical problems, it is necessary to provide a virtual anchor expression control method, apparatus, electronic device, and computer-readable storage medium.
In a first aspect, the present application provides a method for controlling a virtual anchor expression, including: constructing a virtual anchor model, and acquiring the position of a first key point on the face of the virtual anchor model;
acquiring a set of face images containing an anchor face in a video stream, extracting the gray-scale image of each face image, and filtering the gray-scale images with a Gaussian filter to obtain smoothed images;
calculating edge points of each gray-scale image according to the smoothed images, labeling second key points of each face image according to the edge points, and obtaining the average positions of the second key points according to the labeled positions of the second key points in each face image;
transforming each face image containing the anchor face to obtain transformed face images, and obtaining the predicted positions of the second key points on the face images containing the anchor face according to the transformed face images and the average positions of the second key points;
obtaining the predicted positions of the first key points according to the predicted positions of the second key points; and controlling the expression of the virtual anchor according to the predicted positions of the first key points.
In one embodiment, calculating the edge points of each gray-scale image according to the smoothed image includes: calculating the gradient magnitude M(i, j) and gradient direction θ(i, j) of each pixel of the smoothed image; and acquiring the edge points of the gray-scale image according to the gradient magnitude M(i, j) and the gradient direction θ(i, j), wherein
M(i, j) = √(P(i, j)² + Q(i, j)²)
θ(i, j) = arctan[Q(i, j)/P(i, j)]
(i, j) denotes the pixel coordinates of the smoothed image, P and Q denote the horizontal and vertical first-difference filters, and P(i, j), Q(i, j) are the responses of the smoothed image to filters P and Q, respectively.
In one embodiment, acquiring the edge points of the gray-scale image according to the gradient magnitude M(i, j) and the gradient direction θ(i, j) includes: selecting a pixel point; constructing a K neighborhood centered on the selected pixel point according to its gradient direction, and calculating the gradient magnitude of the selected pixel point and of its neighboring pixel points; judging, according to the calculated results, whether the gradient magnitude of the selected pixel point is larger than that of its neighboring pixel points in the neighborhood; and if so, regarding the selected pixel point as an edge point, wherein K is a positive integer.
In one embodiment, obtaining the predicted position of the first key point according to the predicted position of the second key point includes: respectively defining each first key point and each second key point, establishing a correspondence between nodes with the same definition, and predicting the positions of the first key points with the same definition according to the predicted positions of the second key points and the correspondence, so as to obtain the predicted positions of the first key points.
In one embodiment, constructing the virtual anchor model and obtaining the positions of the first key points on the face of the virtual anchor model includes: forming a tree model structure comprising the first key points of the virtual anchor face, wherein the first key points are respectively combined to form a plurality of model nodes, the model nodes comprising virtual face nodes, virtual eye nodes, virtual mouth nodes, virtual nose nodes, and virtual eyebrow nodes; setting skin meshes for the virtual face nodes, virtual eye nodes, virtual mouth nodes, virtual nose nodes, and virtual eyebrow nodes to form a virtual anchor model comprising a virtual face, virtual eyes, a virtual mouth, a virtual nose, and virtual eyebrows; and acquiring each first key point in the tree model structure.
In one embodiment, the transforming each face image including the anchor face to obtain a transformed face image, and obtaining the predicted position of the second key point on the face image including the anchor face according to the average positions of the transformed face image and the second key point includes: inputting the data set of the face image containing the anchor face into a second key point detection model to be trained, and transforming the face image containing the anchor face by the second key point detection model to be trained by using a first space transformation network to obtain a transformed face image; obtaining a prediction fitting coefficient corresponding to each principal component according to the transformed face image by using a coefficient prediction network; obtaining the predicted position of the second key point of each face on the face image after transformation according to the predicted fitting coefficient, the principal components and the average position; and obtaining the predicted positions of the second key points of the faces on the face image containing the anchor face according to the predicted positions of the second key points of the faces on the face image after the transformation through a second space transformation network.
In one embodiment, the obtaining the predicted position of the second key point on the transformed face image according to the predicted fitting coefficient and the principal components and the average position includes: obtaining the predicted position change of the second key point on the transformed face image according to the predicted fitting coefficient corresponding to each principal component of the transformed face image and each principal component; and obtaining the predicted position of the second key point on the face image after transformation according to the predicted position change and the average position.
In a second aspect, the present application provides a virtual anchor expression control apparatus, the apparatus including:
the model construction module is used for constructing a virtual anchor model and acquiring the position of a first key point on the face of the virtual anchor model;
the acquisition module is used for acquiring a set of face images containing an anchor face in the video stream, extracting the gray-scale image of each face image, and filtering the gray-scale images with a Gaussian filter to obtain smoothed images;
the labeling module is used for calculating edge points of each gray level image according to the smooth images, labeling second key points of each face image according to the edge points, and obtaining average positions of the second key points according to labeling positions of the second key points in each face image;
the transformation module is used for transforming each face image containing the anchor face to obtain transformed face images, and for obtaining the predicted positions of the second key points on the face images containing the anchor face according to the transformed face images and the average positions of the second key points;
the control module is used for obtaining the predicted positions of the first key points according to the predicted positions of the second key points, and for controlling the expression of the virtual anchor according to the predicted positions of the first key points.
In a third aspect, the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, causes the electronic device to execute the steps of the above-mentioned virtual anchor expression control method.
In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the processor is caused to perform the steps of the above virtual anchor expression control method.
With the virtual anchor expression control method, apparatus, electronic device, and computer-readable storage medium described above, the positions of the first key points on the face of the virtual anchor model are obtained by constructing the virtual anchor model; a set of face images containing the anchor face is acquired from a video stream, the gray-scale image of each face image is extracted, and the gray-scale images are filtered with a Gaussian filter to obtain smoothed images; the edge points of each gray-scale image are calculated from the smoothed images, the second key points of each face image are labeled according to the edge points, and the average positions of the second key points are obtained from their labeled positions on each face image; each face image containing the anchor face is transformed to obtain transformed face images, and the predicted positions of the second key points on the face images containing the anchor face are obtained from the transformed face images and the average positions of the second key points; the predicted positions of the first key points are obtained from the predicted positions of the second key points; and the expression of the virtual anchor is controlled according to the predicted positions of the first key points. In this scheme, the key point positions of the virtual anchor's face are obtained by constructing the virtual anchor, the predicted positions of the first key points are then obtained from the transformed face images containing the anchor face and the predicted positions of the second key points, and the expression of the virtual anchor is controlled using the predicted positions of the first key points. Because the predicted positions of the first key points are related to the positions and predicted positions of the second key points, controlling the facial expression through the first key points allows the facial expression of the virtual anchor to stay consistent with the facial expression of the real anchor, achieving a better live broadcast effect and user experience. In addition, the method enables the virtual anchor to display corresponding expressions according to the needs of the audience, improving the interaction between the virtual anchor and the audience and satisfying the need for interaction between them.
Drawings
FIG. 1 is an application scenario diagram of the related method in an embodiment of the present application;
FIG. 2 is a schematic flow chart of virtual anchor expression control in an embodiment of the present application;
FIG. 3 is a schematic diagram of first key point positions of a virtual anchor face in an embodiment of the present application;
FIG. 4 is a schematic diagram of second key point positions of an anchor face in an embodiment of the present application;
FIG. 5 is a schematic diagram of the average positions of the second key points in an embodiment of the present application;
FIG. 6 is a diagram of a tree model including the first key points in an embodiment of the present application;
FIG. 7 is a diagram of a second key point tree model including the second key points in an embodiment of the present application;
FIG. 8 is a schematic diagram of the relationship between the principal components and the second key points in an embodiment of the present application;
FIG. 9 is a schematic flow chart of processing a face image with the second key point detection model in an embodiment of the present application;
FIG. 10 is a block diagram of a virtual anchor expression control device in an embodiment of the present application;
FIG. 11 is an internal block diagram of an electronic device in an embodiment of the present application;
FIG. 12 is an internal structural diagram of an electronic device in another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The virtual anchor expression control method provided by the embodiment of the present application may be applied to an application scenario shown in fig. 1, where the application scenario may include a terminal 110 and a server 120, and the terminal 110 communicates with the server 120 through a network. The terminal 110 may be, but not limited to, various personal computers, notebook computers, smart phones, and tablet computers, and the server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
The following sections sequentially describe the virtual anchor expression control method of the present application with reference to various embodiments and related drawings on the basis of the application scenario shown in fig. 1.
In one embodiment, as shown in fig. 2, the present application provides a virtual anchor expression control method, which may be performed by the server 120, and may include the steps of:
step S201, a virtual anchor model is built, and the position of a first key point on the face of the virtual anchor model is obtained.
Specifically, referring to fig. 3 and 5, the virtual anchor model generally consists of a key-node tree model i and a skin. The tree model i containing the first key points is generally a tree model structure formed according to the connection relationships between some or all parts of the human body. The tree model structure may be built in the server 120. In a live broadcast scenario, usually only the anchor's upper body appears in the camera's capture space, or only the upper body needs to be displayed in the video frame. Thus, the first key points of the virtual anchor model in this embodiment may include virtual character parts such as virtual face nodes, virtual neck nodes, virtual shoulder nodes, and virtual arm nodes. Preferably, this example obtains the first key points on the face of the virtual anchor model, namely the facial nodes. The skin is a skin mesh that can be bound to one or more specific first key points and is driven to move by the movement of those first key points. Skinning is essentially a technique that establishes the correspondence between the skin mesh and the first key points, reproduces complex motion information from simple first-key-point motion, and drives the skin to move synchronously, so that the virtual anchor can imitate a person's actions or expressions. The skin includes vertex coordinate information, texture information, and the like of the skin mesh.
In one embodiment, constructing the virtual anchor model and obtaining the positions of the first key points on the face of the virtual anchor model includes:
forming a tree model structure i according to the connection relationships among the nodes of a human face, the tree model structure i comprising the first key points (1'-200') of the virtual anchor face, wherein the first key points are respectively combined to form a plurality of model nodes, including a virtual face node i1, a virtual eyebrow node i2, a virtual nose node i3, a virtual eye node i4, and a virtual mouth node i5; setting skin meshes for the virtual face node i1, the virtual eyebrow node i2, the virtual nose node i3, the virtual eye node i4, and the virtual mouth node i5 to form a virtual anchor model comprising a virtual face, virtual eyes, a virtual mouth, a virtual nose, and virtual eyebrows; and acquiring each first key point in the tree model structure.
Further, in some embodiments, constructing the virtual anchor model and obtaining the positions of the first key points on the face of the virtual anchor model in step S201 may include: forming a tree model structure according to the connection relationships between the key points of a human face, the tree model structure comprising the first key points of the virtual anchor face, wherein the first key points are respectively combined to form a plurality of model nodes, the model nodes comprising virtual face nodes, virtual eye nodes, virtual mouth nodes, virtual nose nodes, and virtual eyebrow nodes; setting skin meshes for the virtual face nodes, virtual eye nodes, virtual mouth nodes, virtual nose nodes, and virtual eyebrow nodes to form a virtual anchor model comprising a virtual face, virtual eyes, a virtual mouth, a virtual nose, and virtual eyebrows; and acquiring each first key point in the tree model structure. Preferably, in this embodiment the tree model may include three layers of nodes: the first layer is the root node i, i.e. the head node; the second layer includes the virtual face node, virtual eye node, virtual mouth node, virtual nose node, and virtual eyebrow node; and the third layer consists of the leaf nodes, i.e. the node set containing the first key points.
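To make the three-layer structure concrete, the following is a minimal sketch of such a tree (root head node, part nodes, leaf key points bound to a skin mesh); the class and function names are illustrative assumptions and are not part of the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class KeypointNode:
    name: str
    children: List["KeypointNode"] = field(default_factory=list)
    # leaf-level data: first-keypoint label -> (x, y) position, e.g. {"1'": (0.12, 0.34)}
    keypoints: Dict[str, Tuple[float, float]] = field(default_factory=dict)

def build_virtual_anchor_tree() -> KeypointNode:
    root = KeypointNode("head")                                    # first layer: root node i
    for part in ("face", "eyebrow", "nose", "eye", "mouth"):       # second layer: nodes i1..i5
        root.children.append(KeypointNode(f"virtual_{part}_node"))
    # third-layer leaf nodes holding the first keypoints would be appended to each part node,
    # and a skin mesh bound to those keypoints drives the rendered virtual anchor.
    return root
```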
Step S202: a set of face images containing an anchor face in a video stream is acquired, the gray-scale image of each face image is extracted, and the gray-scale images are filtered with a Gaussian filter to obtain smoothed images.
Further, in some embodiments, in step S202, acquiring the data set of face images containing the anchor face includes: acquiring the set of face images containing the anchor face in the video stream and extracting the gray-scale image of each face image; calculating the edge points of each gray-scale image; and acquiring the data set of face images containing the anchor face according to the edge points.
Further, in some embodiments, in step S202, calculating the edge points of each gray-scale image includes: filtering the gray-scale image with a Gaussian filter to obtain a smoothed image; computing the gradient magnitude M(i, j) and gradient direction θ(i, j) of each pixel of the smoothed image; and acquiring the edge points of the gray-scale image according to the gradient magnitude and the gradient direction. The gradient magnitude is calculated as
M(i, j) = √(P(i, j)² + Q(i, j)²)
and the gradient direction as
θ(i, j) = arctan[Q(i, j)/P(i, j)]
where (i, j) denotes the pixel coordinates of the smoothed image, P and Q denote the horizontal and vertical first-difference filters, and P(i, j), Q(i, j) are the responses of the smoothed image to filters P and Q, respectively.
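As an illustration only, the following numpy sketch computes M(i, j) and θ(i, j) using standard 2×2 first-difference templates for P and Q; the patent's own filter templates are defined in its formulas and may differ.

```python
import numpy as np

def gradient_magnitude_direction(I: np.ndarray):
    """I: smoothed gray-scale image as a float array; returns M and theta on an (H-1, W-1) grid."""
    P = (I[:-1, 1:] - I[:-1, :-1] + I[1:, 1:] - I[1:, :-1]) / 2.0   # horizontal first difference
    Q = (I[:-1, :-1] - I[1:, :-1] + I[:-1, 1:] - I[1:, 1:]) / 2.0   # vertical first difference
    M = np.sqrt(P ** 2 + Q ** 2)                                    # gradient magnitude M(i, j)
    theta = np.arctan2(Q, P)                                        # gradient direction theta(i, j)
    return M, theta
```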
In some embodiments, filtering the gray-scale image with the Gaussian filter to obtain a smoothed image includes: constructing a Gaussian filter
G(i, j) = (1/(2πσ²)) · exp(−(i² + j²)/(2σ²))
and filtering the gray-scale image with the Gaussian filter to remove noise, yielding the smoothed image I(i, j); equivalently, the gray-scale image may be convolved with the Gaussian filter to obtain the smoothed image I(i, j). Here σ is the standard deviation of the Gaussian function and controls the degree of smoothing.
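A minimal sketch, assuming OpenCV (which the patent does not mention), of extracting the gray-scale image and smoothing it with a Gaussian filter whose σ controls the degree of smoothing:

```python
import cv2

def smooth_gray(face_image_bgr, sigma: float = 1.4):
    gray = cv2.cvtColor(face_image_bgr, cv2.COLOR_BGR2GRAY)   # gray-scale image of the face image
    # Gaussian filtering; ksize=(0, 0) lets OpenCV derive the kernel size from sigma
    smoothed = cv2.GaussianBlur(gray, (0, 0), sigmaX=sigma)
    return gray, smoothed
```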
Step S203: edge points of each gray-scale image are calculated according to the smoothed images, second key points of each face image are labeled according to the edge points, and the average positions of the second key points are obtained according to the labeled positions of the second key points in each face image.
In one embodiment, in step S203, acquiring the edge points of the gray-scale image according to the gradient magnitude and the gradient direction includes: selecting a pixel point, constructing a K neighborhood centered on it according to its gradient direction, and calculating the gradient magnitude of the selected pixel point and of its neighboring pixel points; judging, according to the calculated results, whether the gradient magnitude of the selected pixel point is larger than that of its neighbors; and if so, regarding the selected pixel point as an edge point, wherein K is a positive integer. Specifically, K may be equal to 3: referring to Table 1, a 3×3 neighborhood is constructed with the selected pixel point (i, j) as its center, the gradient magnitudes M(i, j) of all pixel points in the neighborhood are calculated, and it is judged whether the gradient magnitude of the neighborhood center point (i.e. the selected pixel point (i, j)) is greater than that of the adjacent points along the gradient direction θ(i, j) = arctan[Q(i, j)/P(i, j)]; if so, the current neighborhood center point is identified as an edge point and the corresponding M(i, j) is assigned 1, otherwise the corresponding M(i, j) is assigned 0 (a sketch of this non-maximum suppression follows the table below).
Table 1: 3×3 neighborhood
[i-1,j-1] [i-1,j] [i-1,j+1]
[i,j-1] [i,j] [i,j+1]
[i+1,j-1] [i+1,j] [i+1,j+1]
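The non-maximum suppression described above can be sketched as follows (illustrative code, not taken from the patent): the gradient direction θ(i, j) is quantized into four directions, the center magnitude of each 3×3 neighborhood is compared with the two neighbors lying along the gradient direction, and edge points are assigned 1 while all other pixels are assigned 0.

```python
import numpy as np

def edge_points(M: np.ndarray, theta: np.ndarray) -> np.ndarray:
    edges = np.zeros_like(M)
    for i in range(1, M.shape[0] - 1):
        for j in range(1, M.shape[1] - 1):
            angle = np.rad2deg(theta[i, j]) % 180.0        # quantize the gradient direction
            if angle < 22.5 or angle >= 157.5:             # roughly horizontal gradient
                n1, n2 = M[i, j - 1], M[i, j + 1]
            elif angle < 67.5:                             # diagonal
                n1, n2 = M[i - 1, j + 1], M[i + 1, j - 1]
            elif angle < 112.5:                            # roughly vertical gradient
                n1, n2 = M[i - 1, j], M[i + 1, j]
            else:                                          # other diagonal
                n1, n2 = M[i - 1, j - 1], M[i + 1, j + 1]
            # keep the neighborhood center as an edge point only if its magnitude is the larger
            edges[i, j] = 1.0 if (M[i, j] > n1 and M[i, j] > n2) else 0.0
    return edges
```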
After the edge points of all images in the set are calculated, they are combined to form the data set of face images containing the anchor face. Because the data set of face images containing the anchor face can be built from the edge points of the images, the labeled positions of the second key points on each face image in the data set are more accurate and can be obtained more efficiently, and the average positions of the second key points can then be obtained from these labeled positions. In addition, the data set of face images containing the anchor face can be processed more quickly and accurately in the other steps, improving the processing effect; as in step S203, the principal component analysis of the data set of face images containing the anchor face can be performed quickly, and the principal component set of the face images in the data set can be determined quickly.
In one embodiment, labeling the second key points of each face image according to the edge points and obtaining the average positions of the second key points according to the labeled positions of the second key points in each face image includes the following. The face image data set containing the anchor face may include a plurality of face images together with the positions labeled on them for the second key points, recorded as labeled positions. The labeled positions lie on the edge points; for example, the second key points may coincide with certain edge points. The plurality of face images are images containing the anchor face, and the labeled positions of the second key points on them can be obtained by manual labeling; once labeling is completed, the face image data set is obtained. A labeled position can be represented by position coordinates on the face image. In an actual scenario, the relevant person may complete the labeling of the second key point positions on the terminal 110 to obtain the face image data set, the terminal 110 then transmits the data set to the server 120 over the network, and the server 120 obtains the face image data set. In the face image data set obtained by the server 120, each face image has the same number of second key points: the data set contains N face images in total, each face image has L second key points, and L is the total number of second key points corresponding to a face. In this step, for each of the L second key points, the server 120 may calculate its average position from the labeled positions of that second key point on each face image in the data set.
Specifically, the effect of marking the set M on a blank image in the form of dots is shown in fig. 3; that is, the set M contains the position information of the L second key points, where the average position of the i-th second key point in the set M can be expressed as
x̄ᵢ = (1/N) · Σⱼ₌₁..ᴺ xᵢ⁽ʲ⁾
wherein xᵢ⁽ʲ⁾ represents the labeled position of the i-th second key point on the j-th face image.
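A minimal sketch (with an assumed array layout) of this computation: average each of the L labeled second-keypoint positions over the N face images.

```python
import numpy as np

def average_positions(labeled: np.ndarray) -> np.ndarray:
    """labeled: (N, L, 2) labeled (x, y) positions of L second keypoints on N face images."""
    return labeled.mean(axis=0)   # (L, 2) set M of average positions
```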
Step S204: each face image containing the anchor face is transformed to obtain transformed face images, and the predicted positions of the second key points on the face images containing the anchor face are obtained according to the transformed face images and the average positions of the second key points.
further, in some embodiments, step S204, the transforming each of the face images including the anchor face to obtain a transformed face image respectively includes: inputting the data set of the face image containing the anchor face into a second key point detection model to be trained, and transforming the face image containing the anchor face by using a first space transformation network by the second key point detection model to be trained to respectively obtain transformed face images; and obtaining a prediction fitting coefficient corresponding to each principal component according to the transformed face image by using a coefficient prediction network.
Specifically, in this step the server 120 inputs a face image containing the anchor face into the second key point detection model to be trained, and the model performs the relevant processing on the input face image. Referring to fig. 9, the second key point detection model may include two spatial transformer networks (STN, Spatial Transformer Network), denoted as the first spatial transformation network and the second spatial transformation network: the first spatial transformation network may be used to perform the alignment-related transformation of the face image, and the second spatial transformation network may be used to perform the inverse transformation corresponding to that alignment on the predicted second key point positions of the face, so that the predicted second key point positions can be mapped to the first key points to obtain the predicted positions of the first key points. The second key point detection model also includes a coefficient prediction network, which can be implemented based on the ResNet-18 model structure: it extracts image features and finally uses a fully connected layer to predict the fitting coefficient of the transformed face image corresponding to each principal component. Based on this, after the face image is input into the second key point detection model to be trained, the server 120 obtains a transformed face image from the face image through the first spatial transformation network, transmits the transformed face image to the coefficient prediction network, and obtains from it the predicted fitting coefficients corresponding to the principal components.
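For illustration, a minimal PyTorch sketch of such a model is given below; the localization network, layer sizes, and the use of torchvision's ResNet-18 are assumptions rather than the patent's actual network, and the inverse (second) spatial transformation is simplified to applying the predicted affine matrix to the aligned keypoints in normalized coordinates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class SecondKeypointDetector(nn.Module):
    def __init__(self, num_components: int, mean_shape: torch.Tensor, components: torch.Tensor):
        super().__init__()
        self.register_buffer("mean_shape", mean_shape)   # (2L,) average second-keypoint positions
        self.register_buffer("components", components)   # (P, 2L) principal components
        # localization part of the first spatial transformer: predicts a 2x3 affine matrix
        self.loc_net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 6),
        )
        backbone = resnet18(weights=None)                # coefficient prediction network
        backbone.fc = nn.Linear(backbone.fc.in_features, num_components)
        self.coef_net = backbone

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        theta = self.loc_net(img).view(-1, 2, 3)         # first STN: alignment transform
        grid = F.affine_grid(theta, img.size(), align_corners=False)
        aligned = F.grid_sample(img, grid, align_corners=False)
        coefs = self.coef_net(aligned)                   # (B, P) predicted fitting coefficients
        # predicted positions on the aligned image = average positions + weighted principal components
        shape_aligned = self.mean_shape + coefs @ self.components       # (B, 2L)
        pts = shape_aligned.view(shape_aligned.size(0), -1, 2)          # (B, L, 2), normalized coords
        ones = torch.ones(pts.size(0), pts.size(1), 1, device=pts.device)
        # second STN (inverse mapping): with affine_grid conventions, an aligned-image point maps
        # back to the original image through the same theta applied to homogeneous coordinates
        pts_orig = torch.einsum("bij,blj->bli", theta, torch.cat([pts, ones], dim=-1))
        return pts_orig                                  # predicted second-keypoint positions
```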
In some embodiments, obtaining the predicted positions of the second key points on the face image containing the anchor face according to the transformed face image and the average positions of the second key points includes: obtaining the predicted positions of the second key points of the face on the transformed face image according to the predicted fitting coefficients, the principal components, and the average positions; and obtaining, through the second spatial transformation network, the predicted positions of the second key points of the face on the face image containing the anchor face according to the predicted positions of the second key points of the face on the transformed face image.
Specifically, in this step, the server 120 obtains the predicted positions of the second key points of the face on the transformed face image from the predicted fitting coefficients, the principal components, and the average positions by using the second key point detection model, inputs the predicted positions of the second key points on the transformed face image into the second spatial transformation network, and obtains, through the second spatial transformation network, the predicted positions of the second key points on the original face image. The alignment-related transformation may further include obtaining the center position of the display frame and judging whether the distance between the average position of each second key point and the center position is at a predetermined distance; if not, the positions are adjusted so that the distance between the average position of each second key point and the center position is at the predetermined distance, thereby achieving alignment. After the face image containing the anchor face is aligned, the second key points can be corresponded accurately to the first key points, improving the effect of synchronizing the virtual anchor's expression with the anchor's expression.
In one embodiment, in step S204, the processing by which the server 120 transforms each face image containing the anchor face to obtain transformed face images may include: determining a set of principal components of the face images in the data set of face images containing the anchor face based on principal component analysis and the labeled positions, wherein different principal components in the set correspond to different shape-change dimensions of the face.
In this step, the server 120 may further obtain, based on principal component analysis and the labeled positions, the fitting coefficient of each face image corresponding to each principal component.
In this step, the server 120 determines the principal component set of the face images in the face image data set based on principal component analysis and the labeled positions of the second key points on each face image, and obtains the fitting coefficients of each face image corresponding to each principal component. Different principal components in the principal component set correspond to different shape-change dimensions of the face. Specifically, principal component analysis (PCA, Principal Component Analysis) may be performed on the coordinates of the labeled positions of the second key points on each face image in the data set to obtain the principal component set. The principal component set comprises a plurality of principal components, and different principal components correspond to different shape-change dimensions of the face. If the total number of second key points is L, the position of each second key point can be represented by two coordinate values (x, y), so each principal component obtained by principal component analysis is a 2L-dimensional vector, and the extracted principal component set contains at most 2L principal components, each corresponding to a different shape-change dimension of the face, or in other words controlling a different shape-change property of the face. In this regard, referring to fig. 4, when the first five principal components in the principal component set are given different coefficients (+0.5, -0.5), the positions of the second key points change (overall, a shape change of the face): the first principal component may correspond to the shape-change dimension of the face turning left and right, the second principal component to the head lowering, the third principal component to the face becoming fatter or thinner, and so on; hence each principal component can be used to control a different shape-change property of the face. Then, in this step, the server 120 may further fit the obtained principal component set using the labeled positions of the second key points on each face image in the data set, so as to obtain the fitting coefficient of each face image corresponding to each principal component. In a specific implementation, the principal component set can be fitted to the labeled positions of the second key points on the face images through a standard PCA process to obtain the fitting coefficients of each face image corresponding to all principal components in the set. If the total number of face images in the data set is N and the number of principal components in the obtained set is P, a PCA coefficient matrix of dimension N×P (N rows and P columns) is obtained by fitting, and the matrix elements are the fitting coefficients corresponding to the respective principal components.
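As an illustrative sketch (assuming scikit-learn, which the patent does not name), the principal-component set and the N×P fitting-coefficient matrix can be obtained from the flattened keypoint coordinates as follows:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_shape_pca(X: np.ndarray, num_components: int):
    """X: (N, 2L) rows of flattened labeled second-keypoint coordinates, one row per face image."""
    pca = PCA(n_components=num_components)
    coefficients = pca.fit_transform(X)                 # (N, P) PCA coefficient matrix
    return pca.mean_, pca.components_, coefficients     # (2L,) mean shape, (P, 2L) principal components

# A labeled shape is then approximated as: X[j] ≈ mean + coefficients[j] @ components
```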
Further, in some embodiments, the determining the principal component set of the face image in the face image dataset may include:
performing similar transformation on the labeling positions of the second key points on each face image based on the average positions of the second key points to obtain transformation positions of the second key points on each face image; and carrying out principal component analysis according to the transformation positions of the second key points on each face image to obtain a principal component set.
In this embodiment, the server 120 may perform a similarity transformation on the labeled positions of the second key points on each face image in the face image data set based on the average positions of the second key points, obtain the positions of the second key points on each face image after the similarity transformation, record them as transformation positions, and then perform principal component analysis on the coordinates of the transformation positions of the second key points on each face image to obtain the principal component set, so as to improve the accuracy and reliability of the trained model in detecting the second key points. For the similarity transformation, in a specific implementation, for all face images in the data set the server 120 may calculate a similarity transformation matrix from the manually labeled positions of the second key points, i.e. from the set M of labeled positions and the average positions of the second key points, as shown in fig. 5; according to the similarity transformation matrix, the server 120 may map each face image in the data set to the virtual anchor face image data one by one, so that the actions and positions of the virtual anchor image and the face image are consistent.
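A minimal numpy sketch (an assumed least-squares similarity alignment, not necessarily the patent's exact computation) of transforming one image's labeled second-keypoint positions toward the average positions:

```python
import numpy as np

def similarity_align(points: np.ndarray, mean_shape: np.ndarray) -> np.ndarray:
    """points, mean_shape: (L, 2) keypoint coordinates; returns the transformed positions."""
    mu_p, mu_m = points.mean(axis=0), mean_shape.mean(axis=0)
    p, m = points - mu_p, mean_shape - mu_m
    U, S, Vt = np.linalg.svd(p.T @ m)                  # cross-covariance of the centered point sets
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    D = np.diag([1.0, d])
    R = Vt.T @ D @ U.T                                 # optimal rotation
    scale = np.trace(np.diag(S) @ D) / (p ** 2).sum()  # optimal uniform scale
    return scale * p @ R.T + mu_m                      # similarity-transformed keypoint positions
```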
Based on this, in one embodiment, performing principal component analysis according to the transformation positions of the second key points on each face image to obtain the principal component set, as in the foregoing embodiment, may further include:
normalizing the transformation positions of the second key points on each face image relative to the center of the face image to obtain normalized transformation positions of the second key points on each face image; and carrying out principal component analysis according to the normalized transformation positions of the second key points on each face image to obtain a principal component set.
In this embodiment, the server 120 may perform the similarity transformation on the labeled positions of the second key points on each face image as described above to obtain the transformation positions of the second key points on each face image, normalize these transformation positions, obtain the normalized transformation positions of the second key points on each face image, and then perform principal component analysis on the normalized transformation positions to obtain the principal component set, further improving the accuracy and reliability of the trained model in detecting the second key points. For the normalization, in a specific implementation the server 120 may normalize the transformation positions of the second key points on each face image relative to the center of the face image so that the coordinate range is -1 to 1, obtaining the normalized transformation positions of the second key points on each face image. As an example, assume the width and height of a face image are 100 and 200 respectively and the transformation position of a second key point on it is (50, 200); normalizing relative to the center of the face image gives the normalized transformation position (0, 1). In this way, when the actual position of the anchor's face changes, the virtual anchor can still follow the anchor's expression changes under normal conditions.
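A small illustrative sketch of this normalization, reproducing the 100×200 example from the text:

```python
import numpy as np

def normalize_to_center(points: np.ndarray, width: float, height: float) -> np.ndarray:
    """points: (L, 2) pixel (x, y) coordinates; result lies in [-1, 1] relative to the image center."""
    half = np.array([width / 2.0, height / 2.0])
    return (points - half) / half

# normalize_to_center(np.array([[50.0, 200.0]]), 100, 200) -> array([[0., 1.]])
```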
Based on the above embodiments, in one embodiment, obtaining the fitting coefficient of each face image corresponding to each principal component in step S204 may further include: fitting the principal component set with the normalized transformation positions of the second key points on each face image to obtain the fitting coefficient of each face image corresponding to each principal component. In this embodiment, the server 120 may specifically use the coordinates of the normalized transformation positions of the second key points on each face image, obtained after the normalization, to fit the principal component set obtained in the previous step, thereby obtaining the fitting coefficient of each face image in the data set corresponding to each principal component. This improves the accuracy and reliability of the trained model in detecting the second key points, so that the anchor's expression data can be obtained accurately and the requirement that the virtual anchor's expression follows and stays consistent with the real anchor's expression can be met.
Step S204: each face image containing the anchor face is transformed to obtain transformed face images, and the predicted positions of the second key points on the face images containing the anchor face are obtained according to the transformed face images and the average positions of the second key points.
Preferably, in an embodiment, the obtaining the predicted position of the second key point of each face on the transformed face image may further be the obtaining the predicted position of the second key point on the transformed face image according to the predicted fitting coefficient, each principal component and the average position, which specifically may include:
obtaining the predicted position change of the second key point on the transformed face image according to the predicted fitting coefficient corresponding to each principal component of the transformed face image and each principal component; and obtaining the predicted position of the second key point on the face image after transformation according to the predicted position change and the average position.
In this embodiment, after the face image is input into the second key point detection model to be trained, it is processed in turn by the first spatial transformation network and the coefficient prediction network, and the coefficient prediction network outputs the predicted fitting coefficients of the transformed face image corresponding to the principal components. Then, the server 120 may obtain the predicted position change of the second key points on the transformed face image from these predicted fitting coefficients and the principal components. The predicted position change may be the position change information, or position offset, of the second key points on the transformed face image predicted by the model relative to the average positions of the second key points; the server 120 can therefore obtain the predicted positions of the second key points on the transformed face image from the predicted position change and the average positions, achieving efficient and accurate prediction of the second key point positions on the transformed face image. Specifically, with reference to fig. 8, after obtaining the predicted fitting coefficients output by the coefficient prediction network, the model may multiply each principal component by the corresponding predicted fitting coefficient and sum the results to obtain the predicted position change of the second key points on the transformed face image, and then add this predicted position change to the average positions of the second key points to obtain the predicted positions of the second key points on the transformed face image.
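A minimal sketch (with assumed array shapes) of this step: weight each principal component by its predicted fitting coefficient, sum to get the predicted position change, and add the average positions.

```python
import numpy as np

def predict_positions(pred_coefs: np.ndarray, components: np.ndarray, mean_shape: np.ndarray) -> np.ndarray:
    """pred_coefs: (P,), components: (P, 2L), mean_shape: (2L,)."""
    position_change = pred_coefs @ components   # predicted position change of the second keypoints
    return mean_shape + position_change         # predicted positions on the transformed face image
```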
Step S205, obtaining the predicted position of the first key point according to the predicted position of the second key point; and controlling the virtual main broadcasting expression according to the predicted position of the first key point.
Preferably, in one embodiment, obtaining the predicted position of the first key point according to the predicted position of the second key point in step S205 includes: defining each first key point and each second key point respectively, establishing a correspondence between nodes with the same definition, and predicting the positions of the first key points with the same definition according to the predicted positions of the second key points and the correspondence, so as to obtain the predicted positions of the first key points. The second key points may be defined with reference to the first key points, so that the first key points and the second key points correspond to each other, and each first key point may be defined when the model is built. Referring to fig. 6, when mapping the first key points to the second key points, a second key point tree model may be built for the second key points. The second key point model includes a face node I1, an eye node I2, a mouth node I3, a nose node I4, and an eyebrow node I5. For example, as shown in fig. 3 to 7, the first key points defined as the face node i1 may correspond to the leaf nodes with sequence numbers 1'-100' of the virtual anchor tree model. Referring to fig. 6 and 7, the second key points defined as the face node I1 may correspond to the leaf nodes numbered 1-33 of the second key point tree model I. In this way, nodes with the same semantics or the same definition can be mapped: the face node I1, the eyebrow node I2, the nose node I3, the eye node I4, and the mouth node I5 are mapped in one-to-one correspondence with the virtual face node i1, the virtual eyebrow node i2, the virtual nose node i3, the virtual eye node i4, and the virtual mouth node i5.
Preferably, in one embodiment, in connection with step S205, the number of first key points is greater than or equal to the number of second key points, so that every second key point can find one or more corresponding first key points. For example, the first key points with sequence numbers 1'-2' correspond to the second key point with sequence number 1; that is, the face node is defined with twice as many first key points as second key points. Preferably, the tree model structure can thus define first key points that are richer than the second key points, so that in the process of controlling the virtual anchor's expression with the anchor's expression data, the anchor's expression data can be fully mapped to the corresponding first key points, avoiding the situation in which second key point data cannot find a corresponding first key point during conversion and data is lost, and further improving the effect of synchronizing the virtual anchor's expression with the real anchor's expression.
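A minimal sketch (all names and index pairs below are illustrative assumptions) of driving first keypoints from second keypoints through a correspondence table built on shared definitions:

```python
from typing import Dict, List, Tuple

# hypothetical correspondence: second-keypoint index -> first-keypoint labels with the same definition
CORRESPONDENCE: Dict[int, List[str]] = {
    1: ["1'", "2'"],   # e.g. one face-contour second keypoint drives two virtual face-contour points
    # ... remaining second keypoints of the face, eyebrow, nose, eye and mouth nodes
}

def drive_first_keypoints(second_pred: Dict[int, Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    first_pred: Dict[str, Tuple[float, float]] = {}
    for sk, pos in second_pred.items():
        for fk in CORRESPONDENCE.get(sk, []):
            first_pred[fk] = pos   # predicted position of the corresponding first keypoint
    return first_pred
```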
In one embodiment, controlling the virtual anchor's expression according to the predicted positions of the first key points and the principal component set may further include: after the predicted positions of the first key points are obtained, judging the principal component content in the principal component set, for example judging the shape-change dimensions of the face in the face images containing the anchor face, or judging which shape-change property of the face each principal component controls, and then adjusting the positions of the first key points according to the principal component content and the predicted positions of the first key points to control the virtual anchor's expression. When the principal component content includes the shape-change dimension of the face turning left and right, the corresponding first key points move or change at their predicted positions according to that dimension, so that the face of the virtual anchor can also turn left and right. When the principal component content includes the shape-change dimension of the face becoming fatter or thinner, the face shape of the virtual anchor can be adjusted according to the predicted positions of the first key points and that dimension, so that the virtual anchor is more vivid when displaying expressions and the requirement for an immersive interaction effect is met.
According to the above virtual anchor expression control method, a virtual anchor model is built and the positions of the first key points on its face are obtained; a data set of face images containing the anchor face in a video stream is acquired, and the positions of the second key points of the face are labeled on each face image in the data set to obtain the average positions of the second key points; a principal component set of the face images in the data set is determined based on principal component analysis and the labeled positions, different principal components corresponding to different shape-change dimensions of the face; each face image containing the anchor face is transformed to obtain transformed face images, and the predicted positions of the second key points on the face images containing the anchor face are obtained from the transformed face images and the average positions of the second key points; the predicted positions of the first key points are obtained from the predicted positions of the second key points; and the virtual anchor's expression is controlled according to the predicted positions of the first key points and the principal component set. Because the predicted positions of the first key points are derived from the transformed face images and the predicted positions of the second key points, controlling the facial expression through the first key points allows the expression of the virtual anchor to be the same as, or almost the same as, the expression of the real anchor, improving the interactivity of the virtual anchor.
In addition, the second key point detection model can be trained based on the principal component analysis of the face image data set, the fitting coefficients of each face image corresponding to each principal component, and the labeled positions and corresponding average positions of the second key points on each face image in the data set. The trained model can accurately predict the fitting coefficients of a face image and, from them, accurately predict the positions of the second key points on the face image. The model structure is simple and efficient, taking into account both the accuracy and the efficiency of second key point detection, so that the real-time requirements on the virtual anchor's actions or expressions during real interaction in network live broadcasting can be met while the second key points are detected accurately.
The method of this embodiment is particularly applicable to the facial expression control of a virtual anchor in network live broadcasting. While synchronizing the virtual anchor's expression with the real anchor, it can meet the live broadcast scenario's demand for richer and more diverse virtual anchor expressions. By acquiring the predicted positions of the second key points of the real anchor's face and finding the predicted positions of the first key points through the mapping relationship, the correspondence between the first key points of the virtual anchor's face and the second key points of the real anchor's face is made more accurate, so that the expression of the virtual anchor can stay consistent with that of the real anchor and the interaction between the virtual anchor and the audience remains strong even in a virtual anchor scenario.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiments of the present application also provide a related apparatus for implementing the related method referred to above. The implementation of the solution provided by the apparatus is similar to that described in the above method, so the specific limitation of one or more embodiments of the related apparatus provided below may be referred to the limitation of the related method hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 10, a virtual anchor expression control device is provided, and the virtual anchor expression control device may perform the virtual anchor expression control method described in any of the above embodiments, so as to control a virtual anchor expression. The apparatus 100 includes:
the model construction module 101 is configured to construct a virtual anchor model, and obtain a position of a first key point on a face of the virtual anchor model;
the acquisition module 102 is used for acquiring a face image set containing an anchor face in a video stream, extracting a gray level image of each face image, and filtering the gray level images by using a Gaussian filter to obtain a smoothed image;
the labeling module 103 is configured to calculate edge points of each gray level image according to the smoothed image, label second key points of each face image according to the edge points, and obtain an average position of the second key points according to the labeled positions of the second key points in each face image;
the transformation module 104 is configured to transform each of the face images containing the anchor face to obtain transformed face images, and obtain a predicted position of the second key point on the face image containing the anchor face according to the transformed face image and the average position of the second key point;
the control module 105 is configured to obtain the predicted position of the first key point according to the predicted position of the second key point, and control the expression of the virtual anchor according to the predicted position of the first key point.
In one embodiment, the model building module 101 is further configured to form a tree model structure according to a connection relationship between nodes of a human face, where the tree model structure includes first key points of a virtual anchor face, where the first key points are respectively combined to form a plurality of model nodes, and the model nodes include a virtual face node, a virtual eye node, a virtual mouth node, a virtual nose node, and a virtual eyebrow node; setting skin grids for the virtual face nodes, the virtual eye nodes, the virtual mouth nodes, the virtual nose nodes and the virtual eyebrow nodes to form a virtual anchor model comprising a virtual face, a virtual eye, a virtual mouth, a virtual nose and a virtual eyebrow; and acquiring each first key point in the tree model structure.
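As an illustration of the tree model structure described above, the following Python sketch groups first key points under virtual face, eyebrow, nose, eye and mouth nodes and collects them by traversing the tree. The node names and key point index ranges are hypothetical assumptions used only for demonstration; an actual virtual anchor model defines its own layout and attaches skin meshes to each node so that moving the key points deforms the corresponding part of the virtual face.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelNode:
    """A node of the tree model; holds the indices of the first key points it groups."""
    name: str
    keypoint_ids: List[int] = field(default_factory=list)
    children: List["ModelNode"] = field(default_factory=list)

def build_virtual_face_tree() -> ModelNode:
    # Hypothetical key point index ranges; a real virtual anchor model defines its own layout.
    face = ModelNode("virtual_face", keypoint_ids=list(range(0, 17)))
    face.children = [
        ModelNode("virtual_eyebrow", list(range(17, 27))),
        ModelNode("virtual_nose", list(range(27, 36))),
        ModelNode("virtual_eye", list(range(36, 48))),
        ModelNode("virtual_mouth", list(range(48, 68))),
    ]
    return face

def collect_first_keypoints(node: ModelNode) -> List[int]:
    """Walk the tree and gather every first key point it contains."""
    ids = list(node.keypoint_ids)
    for child in node.children:
        ids.extend(collect_first_keypoints(child))
    return ids
```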
In one embodiment, the acquisition module 102 is further configured to acquire a face image set containing an anchor face in the video stream and extract the gray level map of each image in the image set; calculate the edge points of each gray level map; and acquire a dataset of face images containing the anchor face according to the edge points.
In one embodiment, the acquisition module 102 is further configured to calculate the edge points of each gray level map, including: filtering the gray level map with a Gaussian filter to obtain a smoothed image; calculating the gradient amplitude M(i, j) and the gradient direction θ(i, j) of each pixel of the smoothed image; and acquiring the edge points of the gray level map according to the gradient amplitude M(i, j) and the gradient direction θ(i, j); wherein
M(i, j) = √(P(i, j)² + Q(i, j)²),
θ(i, j) = arctan[Q(i, j)/P(i, j)],
(i, j) represents the pixel coordinates of the smoothed image; P and Q denote the horizontal and vertical difference filters, respectively; and P(i, j), Q(i, j) are the results of filtering the smoothed image with the filters P and Q, respectively.
In one embodiment, the acquisition module 102 is further configured to acquire the edge points of the gray level map according to the gradient amplitude M(i, j) and the gradient direction θ(i, j), including: selecting a pixel point, constructing a K neighborhood with the selected pixel point as the center point according to the gradient direction of the pixel point, and calculating the gradient amplitude of the selected pixel point and the gradient amplitudes of the pixel points adjacent to it; determining, according to the calculation result, whether the gradient amplitude of the selected pixel point is larger than the gradient amplitudes of the adjacent pixel points; and if so, taking the selected pixel point as an edge point; wherein K is a positive integer.
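The following Python sketch illustrates one possible way to compute the gradient amplitude and direction of the smoothed image and to select edge points by comparing each pixel with its neighbours along the gradient direction (a K = 3 neighbourhood). It is a simplified, non-optimized illustration assuming numpy and scipy are available; the actual filters and neighbourhood size used by the embodiment may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detect_edge_points(gray: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Return a boolean mask marking the edge points of a gray level image."""
    smooth = gaussian_filter(gray.astype(np.float64), sigma)

    # First-order difference filters P (horizontal) and Q (vertical) on the smoothed image.
    p = np.zeros_like(smooth)
    q = np.zeros_like(smooth)
    p[:, :-1] = smooth[:, 1:] - smooth[:, :-1]
    q[:-1, :] = smooth[1:, :] - smooth[:-1, :]

    magnitude = np.hypot(p, q)        # gradient amplitude M(i, j)
    direction = np.arctan2(q, p)      # gradient direction theta(i, j)

    # Keep a pixel only if its amplitude exceeds that of its two neighbours along
    # the gradient direction (a K = 3 neighbourhood centred on the pixel).
    di = np.rint(np.sin(direction)).astype(int)
    dj = np.rint(np.cos(direction)).astype(int)
    edges = np.zeros(smooth.shape, dtype=bool)
    h, w = smooth.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            m = magnitude[i, j]
            if (m > magnitude[i + di[i, j], j + dj[i, j]]
                    and m > magnitude[i - di[i, j], j - dj[i, j]]):
                edges[i, j] = True
    return edges
```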
In one embodiment, the labeling module 103 is further configured to perform normalization processing on the transformed positions of the second key points on each face image with respect to the center of the face image, so as to obtain normalized transformed positions of the second key points on each face image; and carrying out principal component analysis according to the normalized transformation positions of the second key points on each face image to obtain the principal component set.
In one embodiment, the labeling module 103 is further configured to fit the principal component set by using the normalized transformation position of the second key point on each face image, so as to obtain a fitting coefficient corresponding to each principal component of each face image.
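A minimal sketch of the principal component analysis and fitting step might look as follows, assuming the normalized second key point positions of all face images have already been stacked into a numpy array; the number of principal components retained is an illustrative choice, not a value fixed by this embodiment.

```python
import numpy as np

def fit_principal_components(keypoints: np.ndarray, n_components: int = 10):
    """keypoints: array of shape (num_images, num_points, 2) holding the normalized
    second key point positions of each face image. Returns the average position,
    the principal components, and the fitting coefficient of each image for each component."""
    num_images = keypoints.shape[0]
    flat = keypoints.reshape(num_images, -1)      # one row per face image
    mean = flat.mean(axis=0)                      # average position of the second key points
    centered = flat - mean

    # Principal components from the SVD of the centered data matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]                # (n_components, num_points * 2)

    # Fitting coefficients: projection of each image's key points onto the components.
    coefficients = centered @ components.T        # (num_images, n_components)
    return mean, components, coefficients
```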
In one embodiment, the transforming module 104 is further configured to input the face image into a second keypoint detection model to be trained, so that the second keypoint detection model to be trained obtains a transformed face image from the face image through a first spatial transformation network; obtains, through a coefficient prediction network, the prediction fitting coefficient corresponding to each principal component from the transformed face image; obtains the predicted position of each second keypoint on the transformed face image according to the prediction fitting coefficients, the principal components and the average position; and obtains, through a second spatial transformation network, the predicted position of each second keypoint on the face image from the predicted positions of the second keypoints on the transformed face image.
In one embodiment, the transforming module 104 is further configured to obtain a predicted position change of each second key point on the transformed face image according to the predicted fitting coefficient corresponding to each principal component of the transformed face image and each principal component;
and obtaining the predicted position of each second key point on the face image after transformation according to the predicted position change and the average position.
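The reconstruction of the predicted positions from the prediction fitting coefficients, the principal components and the average position can be illustrated by the short sketch below, which simply adds the weighted sum of the principal components (the predicted position change) to the average position; it assumes the data layout of the previous sketch.

```python
import numpy as np

def reconstruct_keypoints(pred_coeffs: np.ndarray,
                          components: np.ndarray,
                          mean: np.ndarray) -> np.ndarray:
    """Predicted position change = weighted sum of the principal components;
    predicted position = average position + predicted position change."""
    position_change = pred_coeffs @ components    # shape (num_points * 2,)
    positions = mean + position_change
    return positions.reshape(-1, 2)               # back to (num_points, 2)
```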
In one embodiment, the transforming module 104 is further configured to perform a similar transformation on the labeling positions of the second key points on each face image based on the average positions of the second key points, so as to obtain transformed positions of the second key points on each face image; and carrying out principal component analysis according to the transformation positions of the second key points on each face image to obtain the principal component set.
In one embodiment, the transforming module 104 is further configured to normalize the transformed positions of the second key points on each face image with respect to the center of the face image, so as to obtain normalized transformed positions of the second key points on each face image; and carrying out principal component analysis according to the normalized transformation positions of the second key points on each face image to obtain the principal component set.
In one embodiment, the transforming module 104 is further configured to fit the principal component set by using the normalized transformation positions of the second key points on each face image, so as to obtain a fitting coefficient corresponding to each principal component of each face image.
In one embodiment, the control module 105 is further configured to define each first key point and each second key point and establish a correspondence between nodes having the same definition; and to predict the position of the first key point having the same definition according to the predicted position of the second key point and the correspondence, so as to obtain the predicted position of the first key point. The number of first key points is greater than or equal to the number of second key points.
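A simple way to realize this correspondence is to key both sets of points by a shared semantic definition, as in the hypothetical sketch below; the definition names and indices are illustrative only, and in practice the first key points may outnumber the second key points.

```python
# Hypothetical semantic definitions; the actual sets of definitions depend on the models used.
SECOND_KEYPOINT_DEFS = {"left_eye_outer": 36, "left_eye_inner": 39,
                        "mouth_left": 48, "mouth_right": 54}
FIRST_KEYPOINT_DEFS = {"left_eye_outer": 102, "left_eye_inner": 105,
                       "mouth_left": 130, "mouth_right": 136,
                       "mouth_center": 133}  # first key points may be more numerous

def predict_first_keypoints(second_positions):
    """Map predicted second key point positions onto the first key points sharing the same definition.
    second_positions: dict mapping a second key point index to its predicted (x, y) position."""
    predicted = {}
    for name, second_id in SECOND_KEYPOINT_DEFS.items():
        first_id = FIRST_KEYPOINT_DEFS.get(name)
        if first_id is not None and second_id in second_positions:
            predicted[first_id] = second_positions[second_id]
    return predicted
```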
In one embodiment, the control module 105 is further configured to form a tree model structure according to a connection relationship between nodes of the human face, where the tree model structure includes first key points of the virtual anchor face, where the first key points are respectively combined to form a plurality of model nodes, and the model nodes include a virtual face node, a virtual eye node, a virtual mouth node, a virtual nose node, and a virtual eyebrow node; setting skin grids for the virtual face nodes, the virtual eye nodes, the virtual mouth nodes, the virtual nose nodes and the virtual eyebrow nodes to form a virtual anchor model comprising a virtual face, a virtual eye, a virtual mouth, a virtual nose and a virtual eyebrow; and acquiring each first key point in the tree model structure.
In one embodiment, the control module 105 is further configured to, after obtaining the predicted position of the first key point, determine the principal component content in the principal component set, adjust the position of the first key point according to the principal component content and the predicted position of the first key point, and thereby control the expression of the virtual anchor. For example, the different morphological change dimensions of the face in the face images containing the anchor face, or the morphological change property of each principal component, can be determined and used to control the face separately. When the principal component content includes a morphological change dimension in which the face rotates left and right, the first key point is moved or changed at its predicted position according to that dimension, so that the face of the virtual anchor also rotates left and right. When the principal component content includes a morphological change dimension corresponding to the fullness or thinness of the face, the face shape of the virtual anchor face can be adjusted according to the predicted position of the first key point and that dimension. In this way, the virtual anchor appears more lifelike when displaying expressions, meeting the requirements of an immersive interaction effect.
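As a rough illustration of adjusting the first key points along one morphological change dimension (for example, left-right rotation or a fuller/thinner face), the following sketch shifts the predicted positions along a direction vector associated with that principal component; the strength value is an assumed control parameter, not one defined by this embodiment.

```python
import numpy as np

def apply_morph_dimension(first_positions: np.ndarray,
                          morph_component: np.ndarray,
                          strength: float) -> np.ndarray:
    """Shift the predicted first key point positions along one morphological change dimension.
    first_positions: (num_points, 2); morph_component: flat vector of the same total size."""
    return first_positions + strength * morph_component.reshape(first_positions.shape)
```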
In one embodiment, an electronic device is provided, which may be a server, and its internal structure may be as shown in fig. 11. The electronic device includes a processor, a memory and a network interface connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the electronic device is used for storing data such as the face image dataset. The network interface of the electronic device is used for communicating with an external device through a network connection. The computer program, when executed by the processor, implements the virtual anchor expression control method and the live network face image processing method.
In one embodiment, an electronic device is provided, which may be a terminal, and its internal structure may be as shown in fig. 12. The electronic device includes a processor, a memory, a communication interface, a display screen and an input device connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface of the electronic device is used for wired or wireless communication with an external device, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements the live network face image processing method. The display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic device may be a touch layer covering the display screen, keys, a track ball or a touch pad arranged on the housing of the electronic device, or an external keyboard, touch pad or mouse.
It will be appreciated by those skilled in the art that the structures shown in fig. 11 and 12 are merely block diagrams of portions of structures related to the aspects of the present application and do not constitute a limitation of the electronic device to which the aspects of the present application apply, and that a particular electronic device may include more or fewer components than shown in the drawings, or may combine certain components, or may have different arrangements of components.
In an embodiment, there is also provided an electronic device including a memory and a processor, the memory storing a computer program, the processor implementing the steps of the method embodiments described above when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the methods described above may be accomplished by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, database or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. The volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational databases may include, but are not limited to, blockchain-based distributed databases, and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, and the like.
It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above examples represent only a few embodiments of the present application, which are described in relative detail but are not to be construed as limiting the scope of the patent application. It should be noted that various modifications and improvements can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

1. A method for controlling a virtual anchor expression, the method comprising:
Constructing a virtual anchor model, and acquiring the position of a first key point on the face of the virtual anchor model;
acquiring a face image set containing an anchor face in a video stream, extracting a gray level image of each face image, and filtering the gray level image by using a Gaussian filter to obtain a smoothed image;
calculating edge points of each gray level image according to the smooth images, marking second key points of each face image according to the edge points, and obtaining average positions of the second key points according to marked positions of the second key points in each face image;
transforming each face image containing the anchor face to obtain a transformed face image, and obtaining a predicted position of the second key point on the face image containing the anchor face according to the transformed face image and the average position of the second key point;
obtaining a predicted position of the first key point according to the predicted position of the second key point; and controlling the expression of the virtual anchor according to the predicted position of the first key point.
2. The method of claim 1, wherein said computing edge points of each gray scale map from said smoothed image comprises:
calculating the gradient amplitude M (i, j) and gradient direction theta (i, j) of each pixel of the smooth image;
Acquiring edge points of the gray scale image according to the gradient amplitude M (i, j) and the gradient direction theta (i, j);
wherein
M(i, j) = √(P(i, j)² + Q(i, j)²),
θ(i, j) = arctan[Q(i, j)/P(i, j)],
(i, j) represents the pixel coordinates of the smoothed image;
P(i, j), Q(i, j) are the results of filtering the smoothed image with the filters P and Q, respectively; and
P and Q denote the horizontal and vertical difference filters, respectively.
3. the method according to claim 2, wherein the acquiring the edge points of the gray scale map according to the gradient magnitude M (i, j) and the gradient direction θ (i, j) includes:
selecting a pixel point;
constructing a K neighborhood according to the gradient direction of the pixel point by taking the selected pixel point as a central point,
calculating the gradient amplitude of the selected pixel point and the gradient amplitude of the neighborhood pixel point corresponding to the selected pixel point;
judging whether the gradient amplitude of the selected pixel point is larger than the gradient amplitude of the pixel point adjacent to the selected pixel point in the neighborhood according to the calculated result;
if yes, the selected pixel point is regarded as an edge point;
wherein K is a positive integer.
4. A method according to any one of claims 1-3, wherein said deriving the predicted position of the first keypoint from the predicted position of the second keypoint comprises:
defining each first key point and each second key point, establishing corresponding relation between nodes with the same definition,
And predicting the position of the first key point with the same definition according to the predicted position of the second key point and the corresponding relation to obtain the predicted position of the first key point.
5. A method according to any one of claims 1 to 3, wherein said constructing a virtual anchor model to obtain the location of the first key point on the face of the virtual anchor model comprises:
forming a tree model structure according to a connection relationship between nodes of a human face, wherein the tree model structure comprises first key points of the virtual anchor face, the first key points are respectively combined to form a plurality of model nodes, and the model nodes comprise a virtual face node, a virtual eye node, a virtual mouth node, a virtual nose node and a virtual eyebrow node;
setting skin grids for the virtual face nodes, the virtual eye nodes, the virtual mouth nodes, the virtual nose nodes and the virtual eyebrow nodes to form a virtual anchor model comprising a virtual face, a virtual eye, a virtual mouth, a virtual nose and a virtual eyebrow;
and acquiring each first key point in the tree model structure.
6. A method according to any one of claims 1 to 3, wherein transforming each of the face images containing the anchor face to obtain a transformed face image, and obtaining the predicted position of the second key point on the face image containing the anchor face according to the transformed face image and the average position of the second key point, comprises:
inputting the dataset of face images containing the anchor face into a second key point detection model to be trained, the second key point detection model to be trained transforming the face image containing the anchor face by using a first spatial transformation network to obtain a transformed face image;
obtaining a prediction fitting coefficient corresponding to each principal component according to the transformed face image by using a coefficient prediction network;
obtaining the predicted position of the second key point of each face on the face image after transformation according to the predicted fitting coefficient, each principal component and the average position;
and obtaining, through a second spatial transformation network, the predicted positions of the second key points of the faces on the face image containing the anchor face according to the predicted positions of the second key points of the faces on the transformed face image.
7. The method of claim 6, wherein the obtaining the predicted location of the second keypoint on the transformed face image based on the prediction fit coefficients and the principal components and average locations comprises:
obtaining the predicted position change of the second key point on the transformed face image according to the predicted fitting coefficient corresponding to each principal component of the transformed face image and each principal component;
And obtaining the predicted position of the second key point on the face image after transformation according to the predicted position change and the average position.
8. A virtual anchor expression control apparatus, the apparatus comprising:
the model construction module is used for constructing a virtual anchor model and acquiring the position of a first key point on the face of the virtual anchor model;
the acquisition module is used for acquiring a face image set containing an anchor face in the video stream, extracting the gray level image of each face image, and filtering the gray level image by using a Gaussian filter to obtain a smoothed image;
the labeling module is used for calculating edge points of each gray level image according to the smooth images, labeling second key points of each face image according to the edge points, and obtaining average positions of the second key points according to labeling positions of the second key points in each face image;
the transformation module is used for transforming each face image containing the anchor face to obtain a transformed face image, and obtaining the predicted position of the second key point on the face image containing the anchor face according to the transformed face image and the average position of the second key point;
the control module is used for obtaining the predicted position of the first key point according to the predicted position of the second key point, and controlling the expression of the virtual anchor according to the predicted position of the first key point.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202211352075.7A 2022-10-31 2022-10-31 Virtual anchor expression control method, device, equipment and medium Pending CN116012218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211352075.7A CN116012218A (en) 2022-10-31 2022-10-31 Virtual anchor expression control method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211352075.7A CN116012218A (en) 2022-10-31 2022-10-31 Virtual anchor expression control method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116012218A true CN116012218A (en) 2023-04-25

Family

ID=86018153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211352075.7A Pending CN116012218A (en) 2022-10-31 2022-10-31 Virtual anchor expression control method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116012218A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination