CN115966194A - Voice mouth shape synchronous generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115966194A
Authority
CN
China
Prior art keywords
mouth shape
characteristic point
point sequence
shape characteristic
phoneme
Prior art date
Legal status
Pending
Application number
CN202211296169.7A
Other languages
Chinese (zh)
Inventor
Yu Guojun (余国军)
Current Assignee
Xiaoduo Intelligent Technology Beijing Co ltd
Original Assignee
Xiaoduo Intelligent Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Xiaoduo Intelligent Technology Beijing Co ltd
Priority to CN202211296169.7A
Publication of CN115966194A

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 — Reducing energy consumption in communication networks
    • Y02D30/70 — Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application discloses a voice mouth shape synchronous generation method and apparatus, an electronic device and a storage medium. The method comprises: obtaining a phoneme text of a virtual human, performing face tracking and registration on the virtual human, and extracting facial expression coefficients; extracting a mouth shape feature point sequence of the facial expression coefficients and a mouth shape feature point sequence of the phoneme text, based on the facial expression coefficients and the phoneme text respectively; obtaining, from the two extracted groups of mouth shape feature point sequences, a migration function that migrates the mouth shape feature point sequence of the phoneme text into a mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients; obtaining, from the migration function and any audio, the migrated mouth shape feature point sequence of that audio; and selecting, from the virtual human, face images consistent with the mouth shape set space according to the migrated mouth shape feature point sequence, thereby generating a real-person voice mouth shape animation sequence. The method and apparatus address the problems in the prior art that a large number of mouth shapes must be re-collected for each person and that extensibility is poor.

Description

Voice mouth shape synchronous generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech and image technology, and in particular to a voice mouth shape synchronous generation method and apparatus, an electronic device, and a storage medium.
Background
At present, amid the wave of enthusiasm for the metaverse, AI digital humans have begun to reach many fields, including entertainment, services, education and marketing. The AI digital humans on the market include functional AI digital humans, such as virtual assistants, virtual guides and virtual customer service agents; companion-type AI digital humans, such as virtual companions and virtual family members; and social AI digital humans, such as virtual anchors, virtual idols, virtual teachers, virtual doctors and virtual shopping guides.
Most current methods rely on deep neural networks and therefore demand large amounts of data. To synchronize a digital mouth shape with synthesized real-person speech, a large amount of data usually has to be collected for a single person. For example, Audio2Face comes with a pre-loaded 3D character model named "Digital Mark" that can be animated from an audio track; the workflow is very simple, requiring only that an audio signal be selected and uploaded. The audio is then fed into a pre-trained deep neural network whose output drives the 3D vertices of the character mesh to build the facial animation in real time. Similarly, the paper "A Deep Learning Approach for Generalized Speech Animation" uses about 8 hours of carefully prepared mouth-shape data to train its neural network model. The advantage of this class of methods is the high quality of the results; the disadvantage is that a large number of mouth shapes must be re-collected for each individual person, so extensibility is poor.
Disclosure of Invention
Based on this, embodiments of the present application provide a voice mouth shape synchronous generation method and apparatus, an electronic device and a storage medium, which can solve the problems in the prior art that a large number of mouth shapes must be re-collected and that mouth-shape extensibility is poor.
In a first aspect, a method for synchronously generating a voice mouth shape is provided, the method comprising:
acquiring a phoneme text of a virtual human, performing face tracking and registration on the virtual human, and extracting facial expression coefficients; wherein the phoneme text is a speech file with a .wav suffix;
extracting a mouth shape feature point sequence of the facial expression coefficients and a mouth shape feature point sequence of the phoneme text, based on the facial expression coefficients and the phoneme text respectively; wherein each mouth shape feature point sequence is acquired according to a preset extraction model;
obtaining, from the two extracted groups of mouth shape feature point sequences, a migration function that migrates the mouth shape feature point sequence of the phoneme text into a mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients;
obtaining, from the migration function and any audio, the migrated mouth shape feature point sequence of that audio;
and selecting, from the virtual human, face images consistent with the mouth shape set space according to the migrated mouth shape feature point sequence, and generating a real-person voice mouth shape animation sequence, wherein blendshape (BS) data of each frame are generated and values are assigned frame by frame in Unity according to the BS data to generate the real-person voice mouth shape animation sequence.
Optionally, performing face tracking and registration on the virtual human and extracting facial expression coefficients comprises:
performing face tracking and registration on the virtual human, and fitting a three-dimensional face model to each face frame;
and extracting three-dimensional face pose information and the expression coefficients from the three-dimensional face model.
Optionally, extracting the mouth shape feature point sequence of the facial expression coefficients and the mouth shape feature point sequence of the phoneme text, based on the facial expression coefficients and the phoneme text respectively, comprises:
inputting the expression coefficients of the virtual human and the phoneme text respectively into a face-animation driving system based on visemes and Blendshape interpolation, and respectively extracting the mouth shape feature point sequence of the expression coefficients and the mouth shape feature point sequence of the phoneme text of the virtual human.
Optionally, obtaining, from the two extracted groups of mouth shape feature point sequences, the migration function that migrates the mouth shape feature point sequence of the phoneme text into the mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients comprises:
obtaining a transformation function for the mouth shape feature points of each frame according to the histogram-matching principle and a discrete approximation estimation method; and recording the transformation functions of all the mouth shape feature points as the migration function.
Optionally, obtaining a mouth shape feature point sequence after the migration of any audio according to the migration function and any audio, including:
T(M) = { T(M_k) | 1 ≤ k ≤ N; M_k ∈ R^(18×3); T(M_k) ∈ R^(18×3) }
where T is the migration function; M is the mouth shape feature point sequence of any audio; T(M) is the migrated mouth shape feature point sequence; k is a natural number; and M_k, T(M_k) are the k-th frame mouth shape feature points of M and T(M), respectively.
Optionally, selecting, from the virtual human, face images consistent with the mouth shape set space according to the migrated mouth shape feature point sequence, and generating the real-person voice mouth shape animation sequence, comprises:
calculating the Euclidean distance between the mouth shape feature points of each frame in the migrated mouth shape feature point sequence and the mouth shape feature point sequence of the expression coefficients;
screening out the mouth shape feature points of the expression coefficients whose Euclidean distance is smaller than a threshold, and executing the Viterbi algorithm to obtain the shortest-path mouth shape feature point sequence;
and arranging the face images corresponding to the shortest-path mouth shape feature point sequence to obtain the real-person voice mouth shape animation sequence.
In a second aspect, an apparatus for synchronously generating a voice mouth shape is provided, the apparatus comprising:
an acquisition module, used for acquiring a phoneme text of a virtual human, performing face tracking and registration on the virtual human, and extracting facial expression coefficients; wherein the phoneme text is a speech file with a .wav suffix;
an extraction module, used for respectively extracting a mouth shape feature point sequence of the facial expression coefficients and a mouth shape feature point sequence of the phoneme text based on the facial expression coefficients and the phoneme text;
a first calculation module, used for obtaining, from the two extracted groups of mouth shape feature point sequences, a migration function that migrates the mouth shape feature point sequence of the phoneme text into a mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients;
a second calculation module, used for obtaining, from the migration function and any audio, the migrated mouth shape feature point sequence of that audio;
and a generation module, used for selecting, from the virtual human, face images consistent with the mouth shape set space according to the migrated mouth shape feature point sequence and generating a real-person voice mouth shape animation sequence, wherein blendshape (BS) data of each frame are generated and values are assigned frame by frame in Unity according to the BS data to generate the real-person voice mouth shape animation sequence.
In a third aspect, an electronic device is provided, comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the voice mouth shape synchronous generation method according to any one of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the voice mouth shape synchronous generation method according to any one of the first aspect.
In the technical solution provided by the embodiments of the application, a phoneme text of a virtual human is obtained, face tracking and registration are performed on the virtual human, and facial expression coefficients are extracted; a mouth shape feature point sequence of the facial expression coefficients and a mouth shape feature point sequence of the phoneme text are extracted based on the facial expression coefficients and the phoneme text respectively; from the two extracted groups of mouth shape feature point sequences, a migration function is obtained that migrates the mouth shape feature point sequence of the phoneme text into a mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients; from the migration function and any audio, the migrated mouth shape feature point sequence of that audio is obtained; and face images consistent with the mouth shape set space are selected from the virtual human according to the migrated mouth shape feature point sequence, generating a real-person voice mouth shape animation sequence. The beneficial effect of this technical solution is that voice and mouth shape can be synchronized with only a piece of text content and a single virtual portrait; it is mainly applicable to virtual anchoring, virtual-human teaching and large-screen virtual-human interaction.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely exemplary, and other implementation drawings can be derived from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a flowchart illustrating steps of a method for generating a speech mouth shape in synchronization according to an embodiment of the present application;
fig. 2 is a block diagram of a speech mouth shape synchronization generating apparatus according to an embodiment of the present application;
fig. 3 is a schematic view of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In the description of the present invention, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus, or additional steps or elements based on further optimization of the inventive concept.
For the understanding of the present embodiment, a method for generating a speech mouth shape synchronization disclosed in the embodiments of the present application will be described in detail first.
Referring to fig. 1, a flowchart of a method for generating a voice mouth shape synchronously provided by an embodiment of the present application is shown, where the method may include the following steps:
s1, acquiring a phoneme text of a virtual human, tracking and registering the face of the virtual human, and extracting a facial expression coefficient.
Here, the phoneme text is a speech file with a .wav suffix.
In this embodiment, the face of the virtual human is tracked and registered, a three-dimensional face model is fitted to each face frame, and the three-dimensional face pose information and the expression coefficients are extracted from the fitted three-dimensional face model.
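A minimal sketch of this step is given below, assuming hypothetical landmark-tracking and 3D-face-model fitting helpers; the names and interfaces are illustrative, not the implementation actually used in this application.

import numpy as np

def extract_expression_coefficients(frames, landmark_tracker, fit_3dmm):
    """frames: iterable of HxWx3 video frames of the virtual human.
    landmark_tracker(frame) -> (L, 2) 2D facial landmarks   (assumed interface)
    fit_3dmm(landmarks)     -> (pose, expr), where expr is an expression/blendshape
                               coefficient vector            (assumed interface)
    Returns per-frame poses and an (N, n_expr) array of expression coefficients."""
    poses, exprs = [], []
    for frame in frames:
        landmarks = landmark_tracker(frame)   # face tracking / registration
        pose, expr = fit_3dmm(landmarks)      # fit a 3D face model to this frame
        poses.append(pose)
        exprs.append(expr)
    return poses, np.asarray(exprs)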
And S2, respectively extracting a mouth shape feature point sequence of the facial expression coefficients and a mouth shape feature point sequence of the phoneme text, based on the facial expression coefficients and the phoneme text.
The mouth shape feature point sequences are acquired according to a preset extraction model.
In this embodiment, the expression coefficients of the virtual human and the phoneme text are respectively input into a face-animation driving system based on visemes and Blendshape interpolation, and the mouth shape feature point sequence of the expression coefficients and the mouth shape feature point sequence of the phoneme text of the virtual human are respectively extracted.
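The following sketch illustrates one way such a viseme/Blendshape-interpolation driving step can produce an (N, 18, 3) mouth shape feature point sequence from a phoneme timeline; the phoneme-to-viseme mapping and the per-viseme mouth templates are assumed placeholders, not the system actually used here.

import numpy as np

def mouth_points_from_phonemes(phoneme_timeline, viseme_of, mouth_template, fps=30):
    """phoneme_timeline: list of (phoneme, start_sec, end_sec)
    viseme_of:       dict phoneme -> viseme id               (assumed mapping)
    mouth_template:  dict viseme id -> (18, 3) mouth points  (assumed templates)
    Returns an (N, 18, 3) mouth shape feature point sequence sampled at `fps`."""
    duration = max(end for _, _, end in phoneme_timeline)
    n_frames = int(np.ceil(duration * fps))
    # keyframes: one viseme placed at the centre of each phoneme
    keys = sorted(((start + end) / 2.0, viseme_of[ph]) for ph, start, end in phoneme_timeline)

    seq = np.zeros((n_frames, 18, 3))
    for k in range(n_frames):
        t = k / fps
        prev = max((kt for kt in keys if kt[0] <= t), default=keys[0])
        nxt = min((kt for kt in keys if kt[0] > t), default=keys[-1])
        w = 0.0 if nxt[0] == prev[0] else (t - prev[0]) / (nxt[0] - prev[0])
        # linear Blendshape-style interpolation between the two viseme mouth templates
        seq[k] = (1 - w) * np.asarray(mouth_template[prev[1]]) + w * np.asarray(mouth_template[nxt[1]])
    return seq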
The extraction model in the present application is trained as follows. Training target: data are acquired with a DI4D capture system using 9 HD cameras at 30 Hz, finally obtaining head 4D data with a consistent topology.
Training data: for each actor the data contain two parts, about 3 to 5 minutes in total: pangrams and in-character material.
> > Pangrams: this set is intended to cover the facial movements that can occur in normal speech of the given target language.
> > In-character material: personalized data, i.e. special performances related to games or films.
Loss function: consists of three parts, namely a position term, a motion term and a regularization term.
> > Position term: the primary error indicator is the least-squares error between the desired output and the output produced by the network:
[position-term formula, provided as an image in the original filing]
> > Motion term: to ensure motion consistency between adjacent animation frames, the operator m[·] is defined as the finite difference between paired frames:
[motion-term formula, provided as an image in the original filing]
> > Regularization term: ensures that the network attributes short-term effects to the audio signal and long-term effects to the described emotional state (while ruling out the trivial solution):
[regularization-term formula, provided as an image in the original filing]
wherein: normalization to balance the three loss functions, a similar Normalization is performed separately for each loss term during Adam optimization.
Data augmentation: to improve temporal stability and reduce overfitting, random time shifts are applied to the training samples. When a minibatch is presented to the network, the input audio window is randomly shifted in either direction by up to 0.5 frame at 30 frames per second (about 16.6 ms). To compensate, the same shift is applied to the desired output pose by linear interpolation.
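A minimal sketch of this augmentation is given below; the sample rate and the indexing of the desired-output pose sequence are illustrative assumptions.

import numpy as np

def augment_time_shift(audio_window, poses, k, sample_rate=16000, fps=30.0):
    """audio_window: 1-D array of audio samples centred on frame k.
    poses: (T, D) desired output poses; k: index of the current frame."""
    shift = np.random.uniform(-0.5, 0.5)                    # random shift in frames
    shift_samples = int(round(shift / fps * sample_rate))   # up to ~16.6 ms either way
    shifted_audio = np.roll(audio_window, -shift_samples)   # move the audio window
    # interpolate the desired pose at the fractional frame position k + shift
    lo = int(np.clip(np.floor(k + shift), 0, len(poses) - 1))
    hi = min(lo + 1, len(poses) - 1)
    frac = (k + shift) - lo
    shifted_pose = (1.0 - frac) * poses[lo] + frac * poses[hi]
    return shifted_audio, shifted_pose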
Training setup: the Theano and Lasagne frameworks were used. The network was trained with Adam (default parameters) for 500 epochs. All samples were used in every epoch, presented in random order, with 50 data pairs per batch.
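The setup above can be summarized with the short training-loop sketch below; PyTorch is used purely for illustration (the cited work used Theano/Lasagne), and the model, dataset and loss function are assumed to be supplied by the caller.

import torch
from torch.utils.data import DataLoader

def train(model, dataset, loss_fn, epochs=500, batch_size=50, device="cpu"):
    optimizer = torch.optim.Adam(model.parameters())                   # Adam, default parameters
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)  # random order each epoch
    model.to(device).train()
    for _ in range(epochs):                                            # 500 epochs
        for audio, target in loader:                                   # 50 data pairs per batch
            audio, target = audio.to(device), target.to(device)
            pred = model(audio)
            loss = loss_fn(pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model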
And S3, according to the two extracted groups of mouth shape feature point sequences, obtaining a migration function that migrates the mouth shape feature point sequence of the phoneme text into a mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients.
That is, from the two groups of mouth shape feature point sequences, a migration function is obtained that migrates the mouth shape feature point sequence of the phoneme text of the virtual human into a mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients.
Specifically, the mouth shape feature point sequence of the phoneme text of the virtual human is represented as:
M^(src) = { M_k^(src) | 1 ≤ k ≤ N^(src); M_k^(src) ∈ R^(18×3) }
and the mouth shape feature point sequence of the expression coefficients is recorded as:
M^(tgt) = { M_k^(tgt) | 1 ≤ k ≤ N^(tgt); M_k^(tgt) ∈ R^(18×3) }
where M_k^(tgt) is a mouth shape feature point of one frame in the mouth shape feature point sequence of the expression coefficients; M_k^(src) is a mouth shape feature point of one frame in the mouth shape feature point sequence of the phoneme text of the virtual human; R^(18×3) denotes the matrix formed by the 18 three-dimensional mouth shape feature points; and N^(tgt), N^(src) respectively denote the number of mouth shapes in the mouth shape feature point sequence of the expression coefficients and in the mouth shape feature point sequence of the phoneme text of the virtual human.
obtaining a transformation function of the mouth shape feature point of each frame according to a histogram matching principle and a discrete approximation estimation method;
the transformation functions of all the mouth shape feature points are denoted as migration functions.
In the present embodiment, one mouth shape is composed of 18 three-dimensional feature points and therefore of 54 variables in total. For each of the 54 variables, a unit nonlinear mapping function is constructed, thereby completing the migration mapping of the mouth shape feature points. To guarantee the topological consistency of the mouth-shape motion, each unit mapping function must be monotonic and continuous. At the same time, the function must approximately transform the mouth shape from the M^(src) probability space to the M^(tgt) probability space. Histogram matching is used to construct such a unit mapping function, as detailed below. Assume that X is a unit continuous probability distribution defined on [a, b] with probability density function f_X(x), and that Y is a unit continuous probability distribution defined on [c, d] with probability density function f_Y(y). The objective of histogram matching is to construct a monotonically non-decreasing transformation function t of the unit variable X that transforms the X probability distribution into the Y probability distribution, i.e. satisfies
t(x) ~ Y
where ~ indicates that t(x) obeys the Y probability distribution.
According to the histogram matching principle, variable-upper-limit integrals (cumulative distribution functions) are used to construct the following two unit transformation functions t1 and t2:
t1(x) = ∫_a^x f_X(u) du
t2(y) = ∫_c^y f_Y(v) dv
It is easy to show that the results of both transformation functions obey a uniform distribution on [0, 1]:
t1(x), t2(y) ~ U(0, 1)
where ~ indicates that t1(x) and t2(y) obey the U(0, 1) probability distribution, and U(0, 1) denotes the uniform distribution on [0, 1].
Since t1(x) and t2(y) both follow U(0, 1), the desired per-variable mapping is obtained by composition, t(x) = t2^(-1)(t1(x)), which is monotonically non-decreasing and transforms the source distribution into the target distribution. For each of the 54 variables of the mouth shape feature points M^(tgt) and M^(src), such a transformation function can be computed by discrete approximation estimation, yielding 54 transformation functions. For simplicity of notation, the 54 constructed transformation functions are collectively abbreviated as T, which completes the migration of the mouth shape feature points.
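A minimal sketch of this discrete construction is given below, assuming the source and target mouth shape sets are available as arrays: for each of the 54 variables (18 points × 3 coordinates), an empirical CDF is estimated for both sets, and the per-variable mapping t = F_tgt^(-1)(F_src(x)) is evaluated by piecewise-linear interpolation.

import numpy as np

def build_migration_function(M_src, M_tgt):
    """M_src: (N_src, 18, 3) mouth feature points from the phoneme text.
    M_tgt: (N_tgt, 18, 3) mouth feature points from the expression coefficients.
    Returns a function T mapping an (N, 18, 3) sequence into the target space."""
    src = M_src.reshape(len(M_src), -1)          # (N_src, 54)
    tgt = M_tgt.reshape(len(M_tgt), -1)          # (N_tgt, 54)

    src_sorted = np.sort(src, axis=0)            # per-variable empirical quantiles (source)
    tgt_sorted = np.sort(tgt, axis=0)            # per-variable empirical quantiles (target)
    src_cdf = np.linspace(0.0, 1.0, len(src))    # discrete approximation of F_src
    tgt_cdf = np.linspace(0.0, 1.0, len(tgt))    # discrete approximation of F_tgt

    def T(M):
        flat = M.reshape(len(M), -1)
        out = np.empty_like(flat)
        for v in range(flat.shape[1]):                              # each of the 54 variables
            u = np.interp(flat[:, v], src_sorted[:, v], src_cdf)    # u = F_src(x), monotone
            out[:, v] = np.interp(u, tgt_cdf, tgt_sorted[:, v])     # x' = F_tgt^(-1)(u)
        return out.reshape(M.shape)

    return T

Because each per-variable mapping is built from sorted quantiles, it is monotonically non-decreasing, which preserves the topological consistency required above.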
And S4, obtaining a mouth shape characteristic point sequence after the migration of any audio according to the migration function and any audio.
For a mouth shape feature point sequence M obtained from any audio input, mouth-shape migration yields a sequence T(M) that is consistent with the M^(tgt) mouth shape set space:
T(M) = { T(M_k) | 1 ≤ k ≤ N; M_k ∈ R^(18×3); T(M_k) ∈ R^(18×3) }
where T is the migration function; M is the mouth shape feature point sequence of any audio; T(M) is the migrated mouth shape feature point sequence; k is a natural number; and M_k, T(M_k) are the k-th frame mouth shape feature points of M and T(M), respectively.
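A usage sketch of this step is given below; the stand-in arrays are illustrative (in practice M_src and M_tgt come from step S2, and M from the same viseme/Blendshape driving step applied to the arbitrary audio).

import numpy as np

# illustrative stand-in data
M_src = np.random.rand(400, 18, 3)    # phoneme-text mouth shapes (step S2)
M_tgt = np.random.rand(350, 18, 3)    # expression-coefficient mouth shapes (step S2)
M     = np.random.rand(200, 18, 3)    # mouth shapes extracted from an arbitrary audio clip

T = build_migration_function(M_src, M_tgt)   # migration function from the sketch above
T_M = T(M)                                   # migrated sequence in the M_tgt mouth shape set space
assert T_M.shape == (200, 18, 3)             # each T(M_k) remains in R^(18x3)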
And S5, selecting a human face image which is consistent with the mouth shape set space from the virtual human according to the transferred mouth shape characteristic point sequence, and generating a real human voice mouth shape animation sequence.
The blendshape (BS) data of each frame are generated, and values are assigned frame by frame in Unity according to the BS data to produce the real-person voice mouth shape animation sequence.
Calculating the Euclidean distance between the mouth shape feature points of each frame in the migrated mouth shape feature point sequence and the mouth shape feature point sequence of the expression coefficients;
screening out the mouth shape feature points of the expression coefficients whose Euclidean distance is smaller than a threshold, and executing the Viterbi algorithm to obtain the shortest-path mouth shape feature point sequence;
and arranging the face images corresponding to the shortest-path mouth shape feature point sequence to obtain the real-person voice mouth shape animation sequence.
Assume that the shortest-path mouth shape feature point sequence is J, recorded as:
J = { j_k | 1 ≤ k ≤ N, 1 ≤ j_k ≤ N^(tgt) }
where j_k is the mouth shape index to be solved for the k-th frame; N is the length of J, matching the length of the input audio; and N^(tgt) denotes the number of original mouth-shape frames.
The objective function of the joint optimization is as follows:
[joint-optimization objective, provided as an image in the original filing: a combination of the mouth-shape consistency loss ε_shape and the temporal-consistency loss ε_temporal weighted by the constant ε]
where ε_shape denotes the shape-consistency loss term of the mouth shape, ε_temporal denotes the temporal-consistency loss term of the mouth shape, and ε is a weighting constant; the formulas of ε_shape and ε_temporal are developed in detail below.
First, the shape-consistency loss term ε_shape is calculated as follows:
[shape-consistency loss formula, provided as an image in the original filing]
where ε_shape denotes the shape-consistency loss term of the mouth shape, e is the natural constant, ρ is a fixed weighting constant, and ||·|| denotes a norm; k denotes the k-th frame of the generated sequence, and j_k indicates that the k-th frame of the generated sequence is taken from the j_k-th frame of the original mouth shapes; M_{j_k}^(tgt) denotes the mouth shape feature points of the j_k-th original frame. This loss term constrains the shape consistency between the mouth shape of the finally selected j_k-th original frame and the driving mouth shape T(M_k) of the k-th input frame.
Then, the time sequence consistency loss term is calculated as follows
[temporal-consistency loss formula, provided as an image in the original filing]
where ε_temporal denotes the temporal-consistency loss term of the mouth shape, k-1 and k denote the (k-1)-th and k-th frames of the generated sequence, and j_{k-1}, j_k indicate that these frames are taken from the j_{k-1}-th and j_k-th frames of the original mouth shapes, respectively. Furthermore, C(j_{k-1}, j_k) is a temporal-continuity measure representing the temporal continuity of the j_{k-1}-th and j_k-th original frames; this measure is defined as follows:
C(m, n) = 0.5 + 0.25 × ( cos(v_m, v_{n-1}) + cos(v_{m+1}, v_n) )
where C(m, n) denotes the temporal continuity of the m-th and n-th frames of the original mouth shapes, v_i denotes the PCA feature vector of the mouth shape extracted from the i-th original frame, and cos denotes the cosine similarity between two vectors. When frames m and n are consecutive, C(m, n) equals 1; when they are not consecutive, C(m, n) is determined by the image similarity of the two frames and is larger when the similarity is higher.
In conclusion, solving the mouth-shape sequence optimization function yields the shortest-path mouth shape feature point sequence J; taking frames from the original frames according to J and rearranging them gives the real-person mouth shape animation sequence that matches the input audio. The optimization function is solved with the Viterbi algorithm (Viterbi search): for each frame to be solved, the Euclidean distance between mouth shape feature points is first used to find the 80 closest frames among the original mouth-shape frames as candidate frames, and the Viterbi algorithm is then executed to obtain the mouth-shape sequence with the shortest path as the final result.
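The candidate screening and Viterbi search can be sketched as follows. Because the exact shape- and temporal-loss formulas appear only as images in the original filing, the costs used here are stated assumptions: the shape cost is the Euclidean distance already used for screening, and the temporal cost penalizes 1 - C(m, n) with the continuity measure C defined above, weighted by a constant eps.

import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def continuity(m, n, pca_vecs):
    # C(m, n) from the text; pca_vecs[i] is the PCA mouth vector of original frame i
    m_next = min(m + 1, len(pca_vecs) - 1)
    n_prev = max(n - 1, 0)
    return 0.5 + 0.25 * (cos_sim(pca_vecs[m], pca_vecs[n_prev]) +
                         cos_sim(pca_vecs[m_next], pca_vecs[n]))

def select_frames(T_M, M_tgt, pca_vecs, n_candidates=80, eps=1.0):
    """T_M: (N, 18, 3) migrated driving mouths; M_tgt: (N_tgt, 18, 3) original mouths.
    Returns the index sequence J of original frames forming the minimum-cost path."""
    N, N_tgt = len(T_M), len(M_tgt)
    n_candidates = min(n_candidates, N_tgt)
    dists = np.linalg.norm(T_M.reshape(N, 1, -1) - M_tgt.reshape(1, N_tgt, -1), axis=2)
    cands = np.argsort(dists, axis=1)[:, :n_candidates]         # 80 nearest originals per frame

    cost = dists[0, cands[0]].copy()                             # Viterbi initialisation
    back = np.zeros((N, n_candidates), dtype=int)
    for k in range(1, N):
        new_cost = np.empty(n_candidates)
        for c, j in enumerate(cands[k]):
            trans = [cost[p] + eps * (1.0 - continuity(cands[k - 1][p], j, pca_vecs))
                     for p in range(n_candidates)]               # assumed temporal cost
            best = int(np.argmin(trans))
            back[k, c] = best
            new_cost[c] = trans[best] + dists[k, j]              # assumed shape cost
        cost = new_cost

    J = np.empty(N, dtype=int)                                   # backtrack the best path
    c = int(np.argmin(cost))
    for k in range(N - 1, -1, -1):
        J[k] = cands[k][c]
        c = back[k, c]
    return J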
Referring to fig. 2, a block diagram of a speech mouth shape synchronization generating apparatus 200 according to an embodiment of the present application is shown. As shown in fig. 2, the apparatus 200 may include:
the acquisition module 201 is configured to acquire a phoneme text of a virtual human, perform face tracking and registration on the virtual human, and extract a facial expression coefficient; wherein the phoneme text is a voice file of wav suffix;
the extraction module 202 is used for respectively extracting a mouth shape feature point sequence of the facial expression coefficient and a mouth shape feature point sequence of the phoneme text based on the facial expression coefficient and the phoneme text;
the first calculation module 203 is configured to obtain, from the two extracted groups of mouth shape feature point sequences, a migration function that migrates the mouth shape feature point sequence of the phoneme text into a mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients;
the second calculating module 204 is configured to obtain a mouth shape feature point sequence after any audio migration according to the migration function and any audio;
and the generating module 205 is configured to select a face image consistent with the mouth shape set space from the virtual person according to the transferred mouth shape feature point sequence, and generate a real person voice mouth shape animation sequence.
For specific limitations of the voice mouth shape synchronous generation apparatus, reference may be made to the above limitations of the voice mouth shape synchronous generation method, which are not repeated here. Each module in the voice mouth shape synchronous generation apparatus may be implemented wholly or partly in software, in hardware, or in a combination of the two. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, an electronic device is provided, which may be a computer whose internal structure may be as shown in fig. 3. The electronic device includes a processor, a memory and a network interface connected by a system bus. The processor of the device provides computing and control capabilities. The memory of the device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program on the non-volatile storage medium. The database of the computer device is used to store the data for voice mouth shape synchronous generation. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a voice mouth shape synchronous generation method.
Those skilled in the art will appreciate that the structure shown in fig. 3 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computing device to which the solution is applied; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above voice mouth shape synchronous generation method.
The implementation principle and technical effect of the computer-readable storage medium provided in this embodiment are similar to those of the above method embodiments, and are not described herein again.
Those skilled in the art will understand that all or part of the processes of the methods in the above embodiments may be implemented by instructing relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
All the technical features of the above embodiments may be combined arbitrarily, provided the combinations are not contradictory; for brevity of description, not all possible combinations of the technical features of the above embodiments are described, but such undescribed combinations should also be considered within the scope of this specification.
The present application has been described in considerable detail with reference to certain embodiments and examples thereof. It should be understood that several conventional adaptations or further innovations of these specific embodiments may also be made based on the technical idea of the present application; such conventional modifications and further innovations likewise fall within the scope of the claims of the present application as long as they do not depart from its technical idea.

Claims (9)

1. A method for synchronously generating a voice mouth shape, the method comprising:
acquiring a phoneme text of a virtual human, performing face tracking and registration on the virtual human, and extracting facial expression coefficients; wherein the phoneme text is a speech file with a .wav suffix;
extracting a mouth shape feature point sequence of the facial expression coefficients and a mouth shape feature point sequence of the phoneme text based on the facial expression coefficients and the phoneme text respectively; wherein each mouth shape feature point sequence is acquired according to a preset extraction model;
obtaining, from the two extracted groups of mouth shape feature point sequences, a migration function that migrates the mouth shape feature point sequence of the phoneme text into a mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients;
obtaining, from the migration function and any audio, the migrated mouth shape feature point sequence of that audio;
and selecting, from the virtual human, face images consistent with the mouth shape set space according to the migrated mouth shape feature point sequence, and generating a real-person voice mouth shape animation sequence; wherein blendshape (BS) data of each frame are generated and values are assigned in Unity according to the BS data of each frame to generate the real-person voice mouth shape animation sequence.
2. The method of claim 1, wherein performing face tracking and registration on the virtual human, and extracting facial expression coefficients comprises:
performing face tracking and registration on the virtual human, wherein each frame of human face is fitted with a human face three-dimensional model;
and extracting the three-dimensional face posture information and the expression coefficient according to the three-dimensional face model.
3. The method of claim 1, wherein extracting the mouth shape feature point sequence of the facial expression coefficients and the mouth shape feature point sequence of the phoneme text based on the facial expression coefficients and the phoneme text respectively comprises:
inputting the expression coefficients of the virtual human and the phoneme text respectively into a face-animation driving system based on visemes and Blendshape interpolation, and respectively extracting the mouth shape feature point sequence of the expression coefficients and the mouth shape feature point sequence of the phoneme text of the virtual human.
4. The method of claim 1, wherein obtaining a transfer function for transferring the mouth shape feature point sequence of the phoneme text to be consistent with a mouth shape set space in the mouth shape feature point sequence of the expression coefficients according to the two extracted mouth shape feature point sequences comprises:
obtaining a transformation function of the mouth shape feature point of each frame according to a histogram matching principle and a discrete approximation estimation method; the transformation functions of all the mouth shape feature points are recorded as migration functions.
5. The method according to claim 1, wherein obtaining a mouth shape feature point sequence after arbitrary audio migration according to the migration function and arbitrary audio comprises:
T(M) = { T(M_k) | 1 ≤ k ≤ N; M_k ∈ R^(18×3); T(M_k) ∈ R^(18×3) }
where T is the migration function; M is the mouth shape feature point sequence of any audio; T(M) is the migrated mouth shape feature point sequence; k is a natural number; and M_k, T(M_k) are the k-th frame mouth shape feature points of M and T(M), respectively.
6. The method according to claim 1, wherein the step of selecting a face image spatially consistent with the mouth shape set from the virtual person according to the migrated mouth shape feature point sequence and generating a real person voice mouth shape animation sequence comprises the steps of:
calculating the Euclidean distance between the mouth shape feature point of each frame in the mouth shape feature point sequence after the migration and the mouth shape feature point sequence of the expression coefficient;
screening out the mouth shape characteristic point sequence of the expression coefficient with the Euclidean distance smaller than a threshold value, and executing a Viterbi algorithm to obtain the mouth shape characteristic point sequence of the shortest path;
and arranging the face images corresponding to the mouth shape characteristic point sequence with the shortest path to obtain a real person voice mouth shape animation sequence.
7. A voice mouth shape synchronous generation apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring phoneme texts of the virtual human, performing face tracking and registration on the virtual human and extracting a face expression coefficient; wherein the phoneme text is a voice file of wav suffix;
the extraction module is used for respectively extracting a mouth shape characteristic point sequence of the facial expression coefficient and a mouth shape characteristic point sequence of the phoneme text based on the facial expression coefficient and the phoneme text;
the first calculation module is used for obtaining, from the two extracted groups of mouth shape feature point sequences, a migration function that migrates the mouth shape feature point sequence of the phoneme text into a mouth shape set space consistent with the mouth shape feature point sequence of the expression coefficients;
the second calculation module is used for obtaining a mouth shape characteristic point sequence after the migration of any audio frequency according to the migration function and any audio frequency;
and the generating module is used for selecting, from the virtual human, face images consistent with the mouth shape set space according to the migrated mouth shape feature point sequence and generating a real-person voice mouth shape animation sequence, wherein the generating module generates blendshape (BS) data of each frame and assigns values in Unity according to the BS data of each frame to generate the real-person voice mouth shape animation sequence.
8. An electronic device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the voice mouth shape synchronous generation method according to any one of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the voice mouth shape synchronous generation method according to any one of claims 1 to 6.
CN202211296169.7A 2022-10-21 2022-10-21 Voice mouth shape synchronous generation method and device, electronic equipment and storage medium Pending CN115966194A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211296169.7A CN115966194A (en) 2022-10-21 2022-10-21 Voice mouth shape synchronous generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211296169.7A CN115966194A (en) 2022-10-21 2022-10-21 Voice mouth shape synchronous generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115966194A true CN115966194A (en) 2023-04-14

Family

ID=87360761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211296169.7A Pending CN115966194A (en) 2022-10-21 2022-10-21 Voice mouth shape synchronous generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115966194A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665695A (en) * 2023-07-28 2023-08-29 腾讯科技(深圳)有限公司 Virtual object mouth shape driving method, related device and medium
CN116665695B (en) * 2023-07-28 2023-10-20 腾讯科技(深圳)有限公司 Virtual object mouth shape driving method, related device and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination