CN112668407A - Face key point generation method and device, storage medium and electronic equipment - Google Patents

Face key point generation method and device, storage medium and electronic equipment

Info

Publication number
CN112668407A
Authority
CN
China
Prior art keywords: features, face, sequence, key point, sound
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number
CN202011462207.2A
Other languages
Chinese (zh)
Inventor
赵明瑶
闫嵩
Current Assignee: Beijing Dami Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Dami Technology Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd
Priority to CN202011462207.2A
Publication of CN112668407A

Abstract

The application discloses a face key point generation method and device, a storage medium and electronic equipment, belonging to the field of computer technology. The face key point generation method comprises the following steps: extracting features from audio data to obtain sound domain features, extracting features from a template face to obtain face features, processing a face sequence to obtain sequence features, superposing the sound domain features, the face features and the sequence features to generate input features, and generating a face key point sequence according to the input features. The method and the device therefore generate, directly from the audio data, relevant features dominated by phoneme features, and then process these features to obtain the face key point information of a naturally changing virtual image. This addresses the problem that prior-art approaches rely entirely on the audio and perform unstably when generating lip change images from speech in noisy environments or from different speakers, and it improves the realism and fluency of the virtual image's mouth movements.

Description

Face key point generation method and device, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for generating key points of a human face, a storage medium and electronic equipment.
Background
Currently, avatar synthesis can be applied in many situations. In online education, for example, virtual teachers can provide teaching services, which greatly reduces the burden on teachers, lowers teaching costs, and offers a better learning experience than a simple recorded class. Beyond this, the avatar may be used in a wider range of scenarios, such as Artificial Intelligence (AI) news anchors, games and animation, and has great commercial value in practical business settings. In the prior art, avatar synthesis can generate a corresponding lip change image from input sound data to simulate mouth movements during speech. However, the existing synthesized avatar is not realistic enough, which degrades the interactive experience; moreover, the existing sound-to-key-point avatar technology relies entirely on the audio, and its performance in generating lip change images is not stable enough for speech from different speakers or speech in noisy environments. To address this problem, a method is desired that can directly process input audio data and generate a high-quality avatar with naturally changing mouth movements and facial expressions.
Disclosure of Invention
The embodiments of the application provide a face key point generation method, a face key point generation device, a storage medium and electronic equipment, which can directly generate a naturally changing virtual image from audio data and improve the realism of its mouth movements. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a method for generating a face key point, including:
carrying out feature extraction on the audio data to obtain sound domain features; wherein the sound domain features comprise phoneme mouth features and sound coding features;
extracting the features of the template face to obtain face features;
processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
superposing the sound domain feature, the face feature and the sequence feature to generate an input feature;
and generating a face key point sequence according to the input features.
In a second aspect, an embodiment of the present application provides a device for generating a face keypoint, where the device includes:
the first extraction module is used for extracting the characteristics of the audio data to obtain the sound domain characteristics; wherein the sound domain features comprise phoneme mouth features and sound coding features;
the second extraction module is used for extracting the characteristics of the template face to obtain face characteristics;
the processing module is used for processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
the superposition module is used for superposing the sound domain features, the face features and the sequence features to generate input features;
and the generating module is used for generating a face key point sequence according to the input features.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The beneficial effects brought by the technical scheme provided by some embodiments of the application at least comprise:
when the face key point generation method, the face key point generation device, the storage medium and the electronic equipment work, the audio data are subjected to feature extraction to obtain the sound domain features, the template face is subjected to feature extraction to obtain the face features, the face sequence is processed to obtain the sequence features, the sound domain features, the face features and the sequence features are overlapped to generate the input features, and the face key point sequence is generated according to the input features. According to the embodiment of the application, the relevant features mainly comprising phoneme features can be directly generated on the basis of audio data, and then the relevant information of the human face key points of the naturally-changed virtual image is obtained through processing, so that the reality degree and the fluency of the mouth action of the virtual image are improved.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a communication system architecture provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of a method for generating key points of a human face according to an embodiment of the present application;
fig. 3 is another schematic flow chart of a method for generating face key points according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a face keypoint generating device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following description refers to the accompanying drawings, in which like numerals refer to the same or similar elements throughout the different views unless otherwise specified. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present application; rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
In the description of the present application, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In order to solve the problems that the existing synthesized virtual image is not realistic enough and degrades the interactive experience, and that the existing sound-to-key-point avatar technology relies entirely on the audio and performs unstably when generating lip change images from the speech of different speakers or speech in noisy environments, a face key point generation method is provided. The method can run on a computer system such as a smartphone, notebook computer or tablet computer.
Fig. 1 is a schematic diagram of a communication system architecture provided in the present application.
Referring to fig. 1, a communication system 01 includes a terminal device 101, a network device 102 and a server 103; when the communication system 01 includes a core network, the network device 102 may also be connected to the core network. The network device 102 may also communicate with an Internet Protocol (IP) network, such as the Internet, a private IP network or another data network, and provides services for the terminal device 101 and the server 103 within its coverage area. A user may use the terminal device 101 to interact with the server 103 through the network device 102, for example to receive or send messages. The terminal device 101 may be installed with various communication client applications, such as a voice interaction application or an animation application. The server 103 may be a server that stores the face key point generation method provided in the embodiments of the present application and provides various services; it is configured to detect, store and process files such as the audio data and template faces uploaded by the terminal device 101, and to send the processing results back to the terminal device 101.
In the following method embodiments, for convenience of description, only the execution subject of each step is described as a computer.
The method for generating the face key points according to the embodiment of the present application will be described in detail below with reference to fig. 2 to 3.
Please refer to fig. 2, which is a flowchart illustrating a method for generating face key points according to an embodiment of the present application. The method may comprise the steps of:
s201, extracting the characteristics of the audio data to obtain the sound domain characteristics.
Generally, the sound domain features include phoneme mouth shape features and sound coding features. The computer first labels the phonemes and mouth shapes in the audio data, where the phonemes include Chinese phonemes and English phonemes and the mouth shape is represented by the mouth opening size; it then classifies the phonemes and mouth shapes with a clustering algorithm to obtain a classification result, and determines the mapping relation between phonemes and mouth shapes from that result. Next, the computer calculates center positions over the time interval of the audio data at a preset frame rate, traverses the time interval to extract Mel-frequency cepstral coefficient (MFCC) sound features in sub-intervals of preset length before and after each center position, and processes the MFCC sound features through a convolutional neural network (CNN) sound coding network and a fully connected (FC) network to obtain the sound coding features. The computer also processes the audio data to obtain a phoneme sequence, determines the mouth shape features corresponding to the phoneme sequence from the phoneme-to-mouth-shape mapping (the mouth shape features are aligned with the MFCC sound features), converts the dimensions of the mouth shape features to obtain the phoneme mouth shape features, and finally splices the sound coding features and the mouth shape features to generate the sound domain features.
S202, extracting the characteristics of the template face to obtain the face characteristics.
Generally, the face features refer to features of the face key point coordinate information; for example, the computer obtains 81 or 68 face key point coordinates through a common face recognition algorithm. The computer recognizes a template face in a data set to obtain face key point coordinate information, aggregates all face key point coordinate information in the data set to obtain average face key point coordinate information, determines target face key point coordinate information, obtains initial input features based on the target and average face key point coordinate information, and processes the initial input features to obtain the face features. The data set here refers to a set of template faces provided by the user, for example a real person's face, an animated face or an avatar.
And S203, processing the face sequence to obtain sequence characteristics.
Generally, the face sequence includes an angle sequence constraint feature and a boundary key point constraint feature, which the user may set manually in real time or select from stored templates. The angle sequence constraint feature contains parameters in the x and y directions, and the boundary key point constraint feature contains 3 boundary point parameters. The computer obtains the angle sequence constraint feature and the boundary key point constraint feature, processes the angle sequence constraint feature to obtain an angle sequence constraint sequence, processes the boundary key point constraint feature to obtain a boundary key point constraint sequence, and superposes the two constraint sequences to obtain the sequence features.
And S204, overlapping the sound domain feature, the face feature and the sequence feature to generate an input feature.
Generally, superposition refers to combining several vectors or arrays into one; it mainly includes Cat superposition and Stack superposition, and the computer can directly call the corresponding functions. The computer performs Cat superposition on the sound domain features, the face features and the sequence features to obtain first superposition features, and performs Stack superposition on the first superposition features to obtain the input features. Cat superposition can be understood as splicing without adding a dimension; Stack superposition can be understood as stacking, which adds a new dimension determined by the dimensions of the input set.
And S205, generating a face key point sequence according to the input features.
Generally, the face key point sequence includes a parameter relating the sequence size to the audio data length, the number of face key points, and the corresponding coordinates. The computer processes the input features to obtain face key point related features and processes these through a multilayer fully connected network to obtain the face key point sequence, where the input features are processed into the face key point related features by a long short-term memory (LSTM) neural network.
As described above, features are extracted from the audio data to obtain sound domain features, features are extracted from the template face to obtain face features, the face sequence is processed to obtain sequence features, the sound domain features, the face features and the sequence features are superposed to generate input features, and a face key point sequence is generated according to the input features. The embodiments of the application can generate relevant features dominated by phoneme features directly from audio data, and then process them to obtain the face key point information of a naturally changing virtual image, thereby improving the realism and fluency of the virtual image's mouth movements.
Referring to fig. 3, another flow chart of a method for generating face key points is provided in the present application. The face key point generating method can comprise the following steps:
s301, marking the phonemes and the mouth shapes in the audio data, and classifying the phonemes and the mouth shapes through a clustering algorithm to obtain a classification result.
Generally, a phoneme is the smallest unit of speech, divided according to the natural attributes of speech and analyzed in terms of the pronunciation actions within a syllable: one action constitutes one phoneme. The phonemes include Chinese phonemes and English phonemes, and the mouth shape is represented by the opening size of the mouth. For example, Chinese has 32 phonemes (b, p, m, f, ...), and English has 48 phonemes, of which 20 are vowel phonemes and 28 are consonant phonemes. The computer labels the phonemes and mouth shapes in the audio data. For example, Mandarin Chinese audio may be labeled with the 8 phonemes p, u, t, o, ng, h, u, a, whose corresponding mouth shapes are 2, 28, 9, 24, 22, 21, 28, 23, where mouth shape 2 corresponds to a mouth opening of 1 cm, mouth shape 28 corresponds to a mouth opening of 2 cm, and so on. The mouth opening size is divided into 10 groups, such as 0-0.5 cm, 0.5-1 cm, ..., 4.5-5 cm, according to the actual situation, so that mouth shape 2 falls into the 2nd group, mouth shape 28 falls into the 4th group, and so on. A clustering algorithm such as K-means then clusters the phonemes into different groups according to mouth shape size and obtains a classification result, for example: phonemes p, u, t are in group 1, phonemes o, ng, h are in group 2, and so on.
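The following is a minimal sketch of how such a clustering step could look, assuming labeled (phoneme, mouth-opening) pairs are available; the sample data, number of clusters and use of scikit-learn's K-means are illustrative assumptions, not the patented implementation.

```python
# Illustrative sketch: cluster phonemes into mouth-shape groups by their labeled
# mouth-opening sizes using K-means, then group phonemes by cluster label.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical labeled data: each phoneme paired with an observed mouth opening in cm.
labeled = [("p", 1.0), ("u", 2.0), ("t", 0.8), ("o", 2.4), ("ng", 2.1), ("h", 1.9)]

openings = np.array([[size] for _, size in labeled])          # shape (num_samples, 1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(openings)

# Group phonemes by cluster label to obtain the classification result.
groups = {}
for (phoneme, _), label in zip(labeled, kmeans.labels_):
    groups.setdefault(int(label), set()).add(phoneme)
print(groups)  # e.g. phonemes with small openings in one group, large openings in another
```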
S302, determining the mapping relation between the phonemes and the mouth shape according to the classification result.
Generally, after obtaining the classification result, the computer determines the mapping relation between phonemes and mouth shapes from it, for example: phoneme p corresponds to mouth shape 1, phoneme o corresponds to mouth shape 2, phoneme u corresponds to mouth shape 1, and so on.
And S303, calculating the center position of the audio data in the time interval based on the preset frame rate.
Generally, the frame rate refers to the frequency at which bitmap images appear on a display, measured in consecutive frames. The computer generates the face key point sequence from the audio data, and the sequence corresponds to the face key point coordinates of consecutive frames, so the frame rate of the generated avatar video must be determined first. For example, if the frame rate is 25 frames per second and the audio data is 3 minutes long, 4500 frames of video are generated. The audio data yields 100 MFCC features per second (the number of MFCC features is tied to the audio duration), while 25 faces are generated per second; the 100 MFCC features of each second are therefore divided into 25 parts, and the MFCC feature at the center of each part is taken as the center-position feature of the corresponding generated face.
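A short sketch of this bookkeeping is given below; the 100 MFCC frames per second and 25 video frames per second are the example values from the paragraph above, and the helper name is hypothetical.

```python
# Sketch (assumed parameters): map each generated video frame to the index of the
# MFCC frame at its temporal center, given 100 MFCC frames/s and 25 video frames/s.
MFCC_PER_SECOND = 100
VIDEO_FPS = 25
mfcc_per_video_frame = MFCC_PER_SECOND // VIDEO_FPS  # 4 MFCC frames per video frame

def center_mfcc_index(frame_idx: int) -> int:
    """Index of the MFCC frame at the center of video frame `frame_idx`."""
    return frame_idx * mfcc_per_video_frame + mfcc_per_video_frame // 2

# 3 minutes of audio at 25 fps -> 4500 video frames, as in the example above.
num_frames = 3 * 60 * VIDEO_FPS
centers = [center_mfcc_index(i) for i in range(num_frames)]
```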
S304, traversing the time interval to extract the Mel cepstrum coefficient MFCC sound features in the sub-time interval with the preset length before and after the central position, and processing the MFCC sound features through a convolutional neural CNN sound coding network and a full-connection network FC to obtain the sound coding features.
Generally, the sub-interval of preset length refers to the duration of audio data corresponding to each generated face frame, for example a sub-interval of 150 ms. After the computer determines the center positions over the time interval of the audio data, it traverses the interval to extract the Mel-frequency cepstral coefficient (MFCC) sound features in sub-intervals of preset length before and after each center position, and processes the MFCC sound features to obtain the sound coding features. For example: 13-dimensional MFCC features, first-order derivative features (12-dimensional) and second-order derivative features (11-dimensional) are extracted from the audio data in a sub-interval and Cat-combined into a (1, 36)-dimensional first sound feature covering the 36 MFCC-related dimensions; traversing the audio data in the 300 ms before and after the center position yields a (30, 36)-dimensional second sound feature; a convolutional network and a fully connected network then act as the sound feature encoder on the second sound feature to obtain the sound coding features. The first and second sound features are represented by arrays, and the sound coding features are represented by a one-dimensional vector.
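A minimal sketch of this step follows. The librosa feature extraction, the way the 13/12/11-dimensional split is produced, the layer sizes and the 256-dimensional output (consistent with the dimensions quoted in S311) are all assumptions; the patent does not specify the encoder architecture.

```python
# Sketch (assumed library calls and shapes): extract 36-dimensional MFCC-related
# features per 10 ms step and encode a (30, 36) window with a small CNN + FC encoder.
import librosa
import numpy as np
import torch
import torch.nn as nn

def mfcc_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)  # 10 ms hop
    d1 = librosa.feature.delta(mfcc, order=1)[:12]   # keep 12 first-order dims
    d2 = librosa.feature.delta(mfcc, order=2)[:11]   # keep 11 second-order dims
    return np.concatenate([mfcc, d1, d2], axis=0).T  # (time_steps, 36)

class SoundEncoder(nn.Module):
    """Maps a (30, 36) window around a center position to a 256-d sound coding feature."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(36, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(128, 256)

    def forward(self, window: torch.Tensor) -> torch.Tensor:  # (batch, 30, 36)
        x = self.conv(window.transpose(1, 2)).squeeze(-1)      # (batch, 128)
        return self.fc(x)                                      # (batch, 256)
```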
S305, processing the audio data to obtain a phoneme sequence, and determining mouth shape characteristics corresponding to the phoneme sequence according to a mapping relation between phonemes and mouth shapes.
Generally, the computer labels the audio data as a phoneme sequence using Automatic Speech Recognition (ASR), or labels the phonemes obtained by performing a read-alignment process on a text answer script generated by Natural Language Processing (NLP) in a man-machine conversation, or on a manually edited text script. The phoneme sequence includes information such as the phoneme category, start time and end time, and the mouth shape features corresponding to the phoneme sequence are determined from the phoneme-to-mouth-shape mapping. For example, to stay synchronized with the sound features, the computer marks the mouth shape of each phoneme every 0.01 second between its start time T0 and end time T1, forming a one-dimensional mouth shape feature array of size (T1-T0)/0.01, where the mouth shape features of the same phoneme share a uniform mouth shape number. Finally, the one-dimensional mouth shape feature array is converted into a two-dimensional one-hot mouth shape feature array and aligned with the MFCC sound features extracted from the audio data. The phoneme sequence and the mouth shape features are represented by arrays.
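A sketch of the expansion and one-hot encoding is shown below; the helper names, the 10 ms step and the 32 mouth-shape classes are assumptions chosen to match the examples elsewhere in the description.

```python
# Sketch (hypothetical helpers): expand a phoneme sequence with start/end times into a
# per-10 ms mouth-shape id array, then one-hot encode it to align with the MFCC frames.
import numpy as np

STEP = 0.01            # 10 ms, matching the 0.01 s marking interval above
NUM_MOUTH_CLASSES = 32

def mouth_shape_array(phoneme_seq, phoneme_to_mouth):
    """phoneme_seq: list of (phoneme, start_s, end_s); returns a (steps,) int array."""
    total_steps = int(round(phoneme_seq[-1][2] / STEP))
    mouth_ids = np.zeros(total_steps, dtype=np.int64)
    for phoneme, t0, t1 in phoneme_seq:
        mouth_ids[int(round(t0 / STEP)):int(round(t1 / STEP))] = phoneme_to_mouth[phoneme]
    return mouth_ids

def one_hot(mouth_ids: np.ndarray) -> np.ndarray:
    out = np.zeros((len(mouth_ids), NUM_MOUTH_CLASSES), dtype=np.float32)
    out[np.arange(len(mouth_ids)), mouth_ids] = 1.0
    return out  # (steps, 32): one row per 10 ms, aligned with the MFCC features
```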
S306, processing the mouth shape features through dimension conversion to obtain phoneme mouth shape features, and splicing the voice coding features and the mouth shape features to generate voice domain features.
Generally, after determining the mouth shape features corresponding to the phoneme sequence, the computer directly flattens the one-hot mouth shape features into a one-dimensional, sequentially arranged mouth shape feature without passing them through a fully connected neural network. For example, two consecutive one-hot mouth shape features [[0,1,0, ... 0,0], [0,0, ... 0,1]] become, after dimension conversion, [0,1,0, ..., 0,0,0,0, ... 0,1]. The sound coding branch outputs a sound coding feature vector by passing the MFCC sound features through a multilayer fully connected network, and this vector is then Cat-concatenated with the flattened one-hot mouth shape features to generate the sound domain features. For example, if the sound coding feature is [0,1,2,3] and the mouth shape feature is [4,5,6,7,8], the generated sound domain feature is [0,1,2,3,4,5,6,7,8]. The mouth shape features are represented by a two-dimensional one-hot array, and the phoneme mouth shape features and sound domain features are represented by one-dimensional vectors.
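The flatten-and-concatenate step can be sketched as follows; the 256-dimensional sound coding feature and the 30 x 32 one-hot window are assumed sizes consistent with the (1, 1216) sound domain feature quoted in S311.

```python
# Sketch: flatten the one-hot mouth-shape window and concatenate (Cat) it with the
# sound coding feature to form the sound domain feature for one generated frame.
import torch

sound_coding = torch.randn(1, 256)                 # from a sound encoder as sketched above
mouth_one_hot = torch.zeros(1, 30, 32)             # 30 steps x 32 mouth-shape classes
mouth_flat = mouth_one_hot.reshape(1, -1)          # (1, 960): the dimension conversion
sound_domain = torch.cat([sound_coding, mouth_flat], dim=1)  # (1, 1216)
```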
S307, recognizing a template face in the data set to obtain face key point coordinate information, and counting all face key point coordinate information in the data set to obtain average face key point coordinate information.
Generally, the data set is a collection of multiple template faces provided in advance by the user. Before the computer recognizes a template face in the data set to obtain face key point coordinate information, the method further includes: detecting face images in an original image data set with a face detection algorithm to obtain a detection result file, and parsing the detection result file to generate the template faces. The face detection algorithm may be the key point extraction algorithm of the dlib face recognition library, a face key point localization model preset by the user, or an algorithm called from an artificial intelligence open platform (such as the Baidu platform); the information in the detection result file includes face key point coordinates such as cheek, eyebrow, eye, mouth and nose coordinates. After obtaining the sound features, the computer recognizes a template face in the data set to obtain face key point coordinate information and aggregates all face key point coordinate information in the data set to obtain the average face key point coordinate information. For example, using Dlib feature recognition or a deep network, the 68 face key point coordinates of the first template face are identified as ((73,25), (85,30), (90,34), ...), those of the second template face as ((65,20), (87,32), (92,30), ...), and the average face key point coordinates are ((69,22.5), (86,31), (91,32), ...).
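The averaging itself reduces to a mean over detector outputs, as in the sketch below; the sample coordinates reuse the example above and are padded with placeholder points purely to reach 68 entries.

```python
# Sketch (assumed detector output format): average the key point coordinates over all
# template faces in the data set to obtain the mean face key points.
import numpy as np

# Hypothetical detector results: one (68, 2) array of key point coordinates per face.
dataset_keypoints = [
    np.array([[73, 25], [85, 30], [90, 34]] + [[0, 0]] * 65, dtype=np.float32),
    np.array([[65, 20], [87, 32], [92, 30]] + [[0, 0]] * 65, dtype=np.float32),
]
mean_keypoints = np.mean(np.stack(dataset_keypoints), axis=0)  # (68, 2)
print(mean_keypoints[:3])  # approximately [[69, 22.5], [86, 31], [91, 32]]
```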
S308, determining coordinate information of target face key points, obtaining initial input features based on the coordinate information of the target face key points and the coordinate information of the average face key points, and processing the initial input features to obtain face features.
Generally, after obtaining the average face key point coordinate information, the computer determines the target face key point coordinate information, obtains initial input features based on the target and average face key point coordinate information, and processes the initial input features to obtain the face features. For example: the computer takes the 68 face key point coordinates ((73,25), (85,30), (90,34), ...) of the first template face as the target face key point coordinate information and subtracts the average face key point coordinates ((69,22.5), (86,31), (91,32), ...) from them, giving initial input features that can be expressed as ((4,2.5), (-1,-1), (-1,2), ...). The initial input features are then processed by a face key point feature extraction module (composed of several fully connected layers) to obtain the face features, where the initial input features are represented by an array and the face features by a vector.
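A minimal sketch of such a face key point feature extraction module follows; the hidden-layer sizes are assumptions, and the 512-dimensional output is chosen only to match the face feature dimension quoted in S311.

```python
# Sketch (assumed layer sizes): subtract the mean key points from the target key points
# and encode the flattened offsets into a 512-d face feature with a small MLP.
import torch
import torch.nn as nn

class FaceFeatureExtractor(nn.Module):
    def __init__(self, num_points: int = 68, feature_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points * 2, 256), nn.ReLU(),
            nn.Linear(256, feature_dim), nn.ReLU(),
        )

    def forward(self, target_kp: torch.Tensor, mean_kp: torch.Tensor) -> torch.Tensor:
        offsets = (target_kp - mean_kp).flatten(start_dim=1)   # (batch, 136) initial input
        return self.net(offsets)                               # (batch, 512) face feature
```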
S309, obtaining the angle sequence constraint characteristics and the boundary key point constraint characteristics, and processing the angle sequence constraint characteristics to obtain an angle sequence constraint sequence.
Generally, the face sequence includes an angle sequence constraint feature and a boundary key point constraint feature, which the user may set manually or select from templates. The angle sequence constraint feature contains parameters in the x and y directions, and the boundary key point constraint feature contains 3 boundary point parameters. For example, an angle sequence constraint feature of (30, 60) means a rotation of 30 degrees about the x coordinate axis and 60 degrees about the y coordinate axis, and a boundary key point constraint feature of ((35,70), (55,120), (75,70)) gives the coordinates of the left, lower and right boundary points of the generated face. The computer processes the angle sequence constraint feature, of dimension (N, 2), through a sequence feature extraction module (composed of a single fully connected layer) to obtain an angle sequence constraint sequence of dimension (N, 12); the angle sequence constraint feature and the boundary key point constraint feature are represented by arrays, and the angle sequence constraint sequence by a vector. For example, for N consecutive generated faces the computer obtains an angle sequence constraint feature of dimension (N, 2), where 2 stands for the x- and y-direction angle parameters, and after the sequence feature extraction module the angle sequence constraint sequence has dimension (N, 12).
S310, processing the boundary key point constraint characteristics to obtain a boundary key point constraint sequence, and overlapping the angle sequence constraint sequence and the boundary key point constraint sequence to obtain sequence characteristics.
Generally, after obtaining the angle sequence constraint sequence, the computer processes the boundary key point constraint feature through a sequence feature extraction module (composed of a single fully connected layer) to obtain the boundary key point constraint sequence, which is represented by a vector, and superposes the angle sequence constraint sequence and the boundary key point constraint sequence to obtain the sequence features. For example, for N consecutive generated faces the computer obtains a boundary key point constraint feature of dimension (N, 6), and after the sequence feature extraction module the boundary key point constraint sequence has dimension (N, 36). Superposing this boundary key point constraint sequence with the angle sequence constraint sequence from S309 yields a sequence feature of dimension (N, 48), represented by a vector.
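A sketch of both projections and their combination is shown below; treating each single-layer module as a plain linear layer and the value of N are assumptions, with the dimensions taken from the examples above.

```python
# Sketch (assumed single-layer FC modules): expand the angle constraints (N, 2) to (N, 12)
# and the boundary constraints (N, 6) to (N, 36), then combine them into (N, 48) sequence features.
import torch
import torch.nn as nn

N = 4500                                   # number of generated frames (illustrative)
angle_feat = torch.randn(N, 2)             # x/y rotation per frame
boundary_feat = torch.randn(N, 6)          # 3 boundary points x 2 coordinates per frame

angle_fc = nn.Linear(2, 12)
boundary_fc = nn.Linear(6, 36)

angle_seq = angle_fc(angle_feat)                             # (N, 12)
boundary_seq = boundary_fc(boundary_feat)                    # (N, 36)
sequence_feat = torch.cat([angle_seq, boundary_seq], dim=1)  # (N, 48)
```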
S311, performing Cat superposition on the sound domain features, the face features and the sequence features to obtain first superposition features, and performing Stack superposition on the first superposition features to obtain input features.
Generally, each sound domain feature, face feature and sequence feature corresponds to one generated face frame. For example, assuming the number of mouth shape classes C is 32, the sound domain feature of one generated frame has dimension (1, 256 + 30 x 32), i.e. (1, 1216), the face feature has dimension (1, 512), and the sequence feature has dimension (1, 48). The computer performs Cat superposition on each of the sound domain features, face features and sequence features to obtain the first superposition features: with N an integer greater than 1, the sound domain features of N frames are superposed into a first sound domain superposition feature of dimension (N, 1216), the face features of N frames into a first face superposition feature of dimension (N, 512), and the sequence features of N frames into a first sequence superposition feature of dimension (N, 48). After obtaining the first superposition features, the computer performs Stack superposition on them to obtain the input features; for example, with first superposition features of dimensions (N, 1216), (N, 512) and (N, 48), Stack superposition yields an input feature of dimension (N, 1776).
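The sketch below reproduces these shapes. Note that because the three blocks have different widths, the final (N, 1776) combination is realized here as a feature-wise concatenation; this is an assumption about how the "Stack superposition" described above maps onto tensor operations, not a statement of the patented implementation.

```python
# Sketch: combine per-frame features into the model input. Frame-wise Cat gives
# (N, 1216), (N, 512) and (N, 48); joining them along the feature dimension yields (N, 1776).
import torch

N = 4500
per_frame_sound = [torch.randn(1, 1216) for _ in range(N)]
per_frame_face = [torch.randn(1, 512) for _ in range(N)]
per_frame_seq = [torch.randn(1, 48) for _ in range(N)]

sound_block = torch.cat(per_frame_sound, dim=0)   # (N, 1216) first sound domain superposition
face_block = torch.cat(per_frame_face, dim=0)     # (N, 512)  first face superposition
seq_block = torch.cat(per_frame_seq, dim=0)       # (N, 48)   first sequence superposition

input_features = torch.cat([sound_block, face_block, seq_block], dim=1)  # (N, 1776)
```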
And S312, processing the input features to obtain face key point related features, and processing the face key point related features through a multilayer full-connection network to obtain a face key point sequence.
Generally, after obtaining the input features, the computer processes them with a long short-term memory (LSTM) neural network to obtain the face key point related features, where the LSTM has 256 hidden nodes and 3 layers and the face key point related features are represented by vectors. The computer then processes the face key point related features through a multilayer fully connected network to obtain the face key point sequence, which includes a parameter S relating the sequence size to the audio data length, the number P of face key points and the coordinate dimension N, where N is always equal to 2, so the face key point sequence is represented by an array of shape (S, P, N). For example, a face key point sequence of (1, 50, 2) means that one face frame is generated with 50 face key points, each with x- and y-axis coordinates such as ((125,75), (130,80), (140,83), ...). The face key point related features are encoded as (S, 512)-dimensional vectors; for example, the (1, 512) representation of one frame is processed by the multilayer fully connected network into the 50 face key points of that generated frame.
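A minimal end-of-pipeline sketch is given below. The 3-layer, 256-unit LSTM matches the description above, while the fully connected head sizes, the 1776-dimensional input and the 50-point output are assumptions consistent with the examples in this section.

```python
# Sketch (assumed decoder sizes): a 3-layer LSTM with 256 hidden units followed by a
# multilayer fully connected head that maps each frame to 50 key points (x, y).
import torch
import torch.nn as nn

class KeypointGenerator(nn.Module):
    def __init__(self, input_dim: int = 1776, num_points: int = 50):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_size=256, num_layers=3, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, num_points * 2),
        )

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        # input_features: (batch, S, 1776) -> face key point sequence (batch, S, 50, 2)
        hidden, _ = self.lstm(input_features)
        coords = self.head(hidden)
        return coords.view(coords.shape[0], coords.shape[1], -1, 2)

model = KeypointGenerator()
out = model(torch.randn(1, 100, 1776))   # (1, 100, 50, 2): 100 generated frames
```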
When the scheme of the embodiments of the application is executed, features are extracted from the audio data to obtain sound domain features, features are extracted from the template face to obtain face features, the face sequence is processed to obtain sequence features, the sound domain features, the face features and the sequence features are superposed to generate input features, and a face key point sequence is generated according to the input features. The embodiments of the application can generate relevant features dominated by phoneme features directly from audio data, and then process them to obtain the face key point information of a naturally changing virtual image, thereby improving the realism and fluency of the virtual image's mouth movements.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Please refer to fig. 4, which shows a schematic structural diagram of a face key point generating apparatus according to an exemplary embodiment of the present application, hereinafter referred to as the generating apparatus 4 for short. The generating apparatus 4 may be implemented by software, hardware or a combination of both as all or part of a terminal, and comprises the following modules:
the first extraction module 401 is configured to perform feature extraction on the audio data to obtain a sound domain feature; wherein the sound domain features comprise phoneme mouth features and sound coding features;
a second extraction module 402, configured to perform feature extraction on the template face to obtain face features;
a processing module 403, configured to process the face sequence to obtain a sequence feature; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
a superposition module 404, configured to superpose the sound domain feature, the face feature, and the sequence feature to generate an input feature;
and a generating module 405, configured to generate a face key point sequence according to the input features.
Optionally, the first extraction module 401 further includes:
the splicing unit is used for extracting the characteristics of the audio data to obtain the voice coding characteristics; processing the audio data to obtain phoneme mouth shape characteristics; and splicing the voice coding features and the mouth shape features to generate voice domain features.
The traversal unit is used for calculating the central position of the audio data in the time interval based on a preset frame rate; traversing the time interval to extract the Mel cepstrum coefficient (MFCC) sound characteristics in sub time intervals with preset lengths before and after the central position; and processing the MFCC sound characteristics through a convolutional neural CNN sound coding network and a full-connection network FC to obtain the sound coding characteristics.
The alignment unit is used for processing the audio data to obtain a phoneme sequence; determining mouth shape characteristics corresponding to the phoneme sequence according to the mapping relation between the phonemes and the mouth shape; wherein the mouth-shaped feature is aligned with the MFCC sound feature; and processing the mouth shape characteristic through dimension conversion to obtain the phoneme mouth shape characteristic.
A classification unit for labeling phonemes and mouth shapes in the audio data; wherein the phonemes comprise Chinese phonemes and English phonemes, and the mouth shape is represented by the opening size of the mouth; classifying the phonemes and the mouth shape through a clustering algorithm to obtain a classification result; and determining the mapping relation between the phonemes and the mouth shape according to the classification result.
Optionally, the second extracting module 402 further includes:
the identification unit is used for identifying a template face in the data set to acquire the coordinate information of key points of the face; counting all the face key point coordinate information in the data set to obtain average face key point coordinate information; determining coordinate information of target face key points, and obtaining initial input features based on the coordinate information of the target face key points and the coordinate information of the average face key points; and processing the initial input features to obtain the human face features.
Optionally, the processing module 403 further includes:
the acquiring unit is used for acquiring the angle sequence constraint characteristic and the boundary key point constraint characteristic; processing the angle sequence constraint characteristics to obtain an angle sequence constraint sequence; processing the boundary key point constraint characteristics to obtain a boundary key point constraint sequence; and overlapping the angle sequence constraint sequence and the boundary key point constraint sequence to obtain sequence characteristics.
Optionally, the stacking module 404 further includes:
the merging unit is used for performing Cat superposition on the sound domain feature, the face feature and the sequence feature to obtain a first superposition feature; and performing Stack superposition on the first superposition characteristics to obtain input characteristics.
Optionally, the generating module 405 further includes:
the processing unit is used for processing the input features to obtain relevant features of key points of the human face; processing the relevant features of the face key points through a multilayer full-connection network to obtain a face key point sequence; the face key point sequence comprises sequence size and audio data length association parameters, face key point number and corresponding coordinates.
The embodiment of the present application and the method embodiments of fig. 2 to 3 are based on the same concept, and the technical effects brought by the embodiment are also the same, and the specific process may refer to the description of the method embodiments of fig. 2 to 3, and will not be described again here.
The apparatus 4 may be a field-programmable gate array (FPGA), an application-specific integrated chip, a system on chip (SoC), a central processing unit (CPU), a graphics processing unit (GPU), an embedded neural network processing unit (NPU), a tensor processing unit (TPU) or a similar server-side or mobile-side image processor, a neural network accelerator, a network processor (NP), a digital signal processing circuit, a microcontroller (MCU), a programmable logic device (PLD), or another integrated chip.
When the scheme of the embodiments of the application is executed, features are extracted from the audio data to obtain sound domain features, features are extracted from the template face to obtain face features, the face sequence is processed to obtain sequence features, the sound domain features, the face features and the sequence features are superposed to generate input features, and a face key point sequence is generated according to the input features. The embodiments of the application can generate relevant features dominated by phoneme features directly from audio data, and then process them to obtain the face key point information of a naturally changing virtual image, thereby improving the realism and fluency of the virtual image's mouth movements.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executing the above method steps, and a specific execution process may refer to specific descriptions of the embodiment shown in fig. 2 or fig. 3, which is not described herein again.
The present application further provides a computer program product that stores at least one instruction, which is loaded and executed by the processor to implement the face key point generation method according to the above embodiments.
Please refer to fig. 5, which is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 5 may include: at least one processor 501, at least one network interface 504, a user interface 503, memory 505, at least one communication bus 502.
Wherein a communication bus 502 is used to enable connective communication between these components.
The user interface 503 may include a Display (Display) and a Microphone (Microphone), and the optional user interface 503 may also include a standard wired interface and a wireless interface.
The network interface 504 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
Processor 501 may include one or more processing cores, among other things. The processor 501 connects various parts throughout the terminal 500 using various interfaces and lines, and performs various functions of the terminal 500 and processes data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 505, and calling data stored in the memory 505. Optionally, the processor 501 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 501 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. Wherein, the CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for performing all tensor operations in the deep learning network and rendering and drawing contents required to be displayed by the display screen; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 501, but may be implemented by a single chip.
The Memory 505 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 505 includes a non-transitory computer-readable medium. The memory 505 may be used to store instructions, programs, code sets, or instruction sets. The memory 505 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the storage data area may store data and the like referred to in the above respective method embodiments. The memory 505 may alternatively be at least one memory device located remotely from the processor 501. As shown in fig. 5, the memory 505, which is a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a face keypoint generation application.
In the electronic device 500 shown in fig. 5, the user interface 503 is mainly used as an interface for providing input for a user, and acquiring data input by the user; the processor 501 may be configured to call the face keypoint generation application stored in the memory 505, and specifically perform the following operations:
carrying out feature extraction on the audio data to obtain sound domain features; wherein the sound domain features comprise phoneme mouth features and sound coding features;
extracting the features of the template face to obtain face features;
processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
superposing the sound domain feature, the face feature and the sequence feature to generate an input feature;
and generating a face key point sequence according to the input features.
In one embodiment, the processor 501 performs the feature extraction on the audio data to obtain the sound domain features, including:
carrying out feature extraction on the audio data to obtain a sound coding feature;
processing the audio data to obtain phoneme mouth shape characteristics;
and splicing the voice coding features and the mouth shape features to generate voice domain features.
In one embodiment, the processor 501 performs the feature extraction on the audio data to obtain the vocoded features, including:
calculating a center position on a time interval of the audio data based on a preset frame rate;
traversing the time interval to extract the Mel cepstrum coefficient (MFCC) sound characteristics in sub time intervals with preset lengths before and after the central position;
and processing the MFCC sound characteristics through a convolutional neural CNN sound coding network and a full-connection network FC to obtain the sound coding characteristics.
In one embodiment, the processor 501 performs the processing on the audio data to obtain the phoneme mouth shape, including:
processing the audio data to obtain a phoneme sequence;
determining mouth shape characteristics corresponding to the phoneme sequence according to the mapping relation between the phonemes and the mouth shape; wherein the mouth-shaped feature is aligned with the MFCC sound feature;
and processing the mouth shape characteristic through dimension conversion to obtain the phoneme mouth shape characteristic.
In one embodiment, before the processor 501 performs the feature extraction on the audio data to obtain the sound domain feature, the method further includes:
tagging phonemes and mouth shapes in the audio data; wherein the phonemes comprise Chinese phonemes and English phonemes, and the mouth shape is represented by the opening size of the mouth;
classifying the phonemes and the mouth shape through a clustering algorithm to obtain a classification result;
and determining the mapping relation between the phonemes and the mouth shape according to the classification result.
In one embodiment, the processor 501 performs the feature extraction on the template face to obtain the face feature, including:
identifying a template face in a data set to acquire coordinate information of key points of the face;
counting all the face key point coordinate information in the data set to obtain average face key point coordinate information;
determining coordinate information of target face key points, and obtaining initial input features based on the coordinate information of the target face key points and the coordinate information of the average face key points;
and processing the initial input features to obtain the human face features.
In one embodiment, the processor 501 performs the processing on the face sequence to obtain sequence features, including:
acquiring an angle sequence constraint characteristic and a boundary key point constraint characteristic;
processing the angle sequence constraint characteristics to obtain an angle sequence constraint sequence;
processing the boundary key point constraint characteristics to obtain a boundary key point constraint sequence;
and overlapping the angle sequence constraint sequence and the boundary key point constraint sequence to obtain sequence characteristics.
In one embodiment, processor 501 performs the superimposing of the sound domain feature, the face feature, and the sequence feature to generate an input feature, including:
performing Cat superposition on the sound domain features, the face features and the sequence features to obtain first superposition features;
and performing Stack superposition on the first superposition characteristics to obtain input characteristics.
In one embodiment, the processor 501 performs the generating of the face key point sequence according to the input features, including:
processing the input features to obtain face key point related features;
processing the relevant features of the face key points through a multilayer full-connection network to obtain a face key point sequence; the face key point sequence comprises sequence size and audio data length association parameters, face key point number and corresponding coordinates.
The technical concept of the embodiment of the present application is the same as that of fig. 2 or fig. 3, and the specific process may refer to the method embodiment of fig. 2 or fig. 3, which is not described herein again.
In the embodiments of the application, features are extracted from the audio data to obtain sound domain features, features are extracted from the template face to obtain face features, the face sequence is processed to obtain sequence features, the sound domain features, the face features and the sequence features are superposed to generate input features, and a face key point sequence is generated according to the input features. The embodiments of the application can generate relevant features dominated by phoneme features directly from audio data, and then process them to obtain the face key point information of a naturally changing virtual image, thereby improving the realism and fluency of the virtual image's mouth movements.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit its scope; the present application is therefore not limited thereto, and equivalent variations and modifications remain within its scope.

Claims (13)

1. A method for generating key points of a human face is characterized by comprising the following steps:
carrying out feature extraction on the audio data to obtain sound domain features; wherein the sound domain features comprise phoneme mouth features and sound coding features;
extracting the features of the template face to obtain face features;
processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
superposing the sound domain feature, the face feature and the sequence feature to generate an input feature;
and generating a face key point sequence according to the input features.
2. The method of claim 1, wherein the performing feature extraction on the audio data to obtain sound domain features comprises:
carrying out feature extraction on the audio data to obtain a sound coding feature;
processing the audio data to obtain phoneme mouth shape characteristics;
and splicing the voice coding features and the mouth shape features to generate voice domain features.
3. The method of claim 2, wherein the extracting the features of the audio data to obtain the vocoded features comprises:
calculating a center position on a time interval of the audio data based on a preset frame rate;
traversing the time interval to extract the Mel cepstrum coefficient (MFCC) sound characteristics in sub time intervals with preset lengths before and after the central position;
and processing the MFCC sound characteristics through a convolutional neural CNN sound coding network and a full-connection network FC to obtain the sound coding characteristics.
4. The method of claim 2, wherein the processing the audio data to obtain the phoneme mouth shape features comprises:
processing the audio data to obtain a phoneme sequence;
determining mouth shape features corresponding to the phoneme sequence according to the mapping relation between phonemes and mouth shapes; wherein the mouth shape features are aligned with the MFCC sound features;
and processing the mouth shape features through dimension conversion to obtain the phoneme mouth shape features.
5. The method of claim 4, wherein before the extracting the features of the audio data to obtain the sound domain features, the method further comprises:
tagging phonemes and mouth shapes in the audio data; wherein the phonemes comprise Chinese phonemes and English phonemes, and the mouth shape is represented by the opening size of the mouth;
classifying the phonemes and the mouth shape through a clustering algorithm to obtain a classification result;
and determining the mapping relation between the phonemes and the mouth shape according to the classification result.
6. The method of claim 1, wherein the extracting the features of the template face to obtain the face features comprises:
identifying a template face in a data set to acquire coordinate information of key points of the face;
averaging all the face key point coordinate information in the data set to obtain average face key point coordinate information;
determining coordinate information of target face key points, and obtaining initial input features based on the coordinate information of the target face key points and the coordinate information of the average face key points;
and processing the initial input features to obtain the human face features.
7. The method of claim 1, wherein the processing the face sequence to obtain the sequence feature comprises:
acquiring an angle sequence constraint characteristic and a boundary key point constraint characteristic;
processing the angle sequence constraint characteristics to obtain an angle sequence constraint sequence;
processing the boundary key point constraint characteristics to obtain a boundary key point constraint sequence;
and overlapping the angle sequence constraint sequence and the boundary key point constraint sequence to obtain sequence characteristics.
8. The method of claim 1, wherein the superposing the sound domain features, the face features and the sequence features to generate input features comprises:
performing Cat superposition on the sound domain features, the face features and the sequence features to obtain first superposition features;
and performing Stack superposition on the first superposition characteristics to obtain input characteristics.
9. The method of claim 1, wherein the generating a face keypoint sequence from the input features comprises:
processing the input features to obtain face key point related features;
processing the face key point related features through a multilayer fully connected network to obtain a face key point sequence; wherein the size of the face key point sequence is associated with the length of the audio data, and the sequence contains the number of face key points and their corresponding coordinates.
10. The method of claim 9, wherein the processing the input features to obtain face key point related features comprises processing the input features through a long short-term memory (LSTM) neural network to obtain the face key point related features.
11. A face key point generation apparatus, characterized by comprising:
the first extraction module is used for performing feature extraction on the audio data to obtain sound domain features; wherein the sound domain features comprise phoneme mouth shape features and sound coding features;
the second extraction module is used for extracting the characteristics of the template face to obtain face characteristics;
the processing module is used for processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
the superposition module is used for superposing the sound domain features, the face features and the sequence features to generate input features;
and the generating module is used for generating a face key point sequence according to the input features.
12. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 9.
13. An electronic device, comprising: a memory and a processor; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 9.
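The sketches below are illustrative only and are not part of the claims; each shows one way the corresponding claimed step might be realised. Claim 3 extracts MFCC features in fixed-length windows centred at positions derived from a preset frame rate. In the sketch, the 25 fps frame rate, 16 kHz sample rate, ±140 ms window, 13 coefficients and the use of librosa are all assumptions, not parameters disclosed in the application.

```python
import librosa
import numpy as np

def windowed_mfcc(wav_path: str, fps: int = 25, half_window_s: float = 0.14,
                  n_mfcc: int = 13) -> np.ndarray:
    """Extract one MFCC block per frame, centred on that frame's time position."""
    audio, sr = librosa.load(wav_path, sr=16000)
    n_frames = int(len(audio) / sr * fps)          # frames implied by the preset frame rate
    half = int(half_window_s * sr)
    blocks = []
    for i in range(n_frames):
        center = int((i + 0.5) / fps * sr)         # center position for this frame
        lo, hi = center - half, center + half
        chunk = audio[max(0, lo):min(len(audio), hi)]
        # zero-pad edge windows so every block has the same time dimension
        chunk = np.pad(chunk, (max(0, -lo), max(0, hi - len(audio))))
        blocks.append(librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=n_mfcc))
    return np.stack(blocks)                        # (n_frames, n_mfcc, t)
```

In the claim, these per-frame blocks are then passed through a CNN sound coding network and an FC network; any small convolutional encoder over the (n_mfcc, t) blocks would fit that description.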
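Claims 4 and 5 build a phoneme-to-mouth-shape mapping by labelling phonemes with mouth opening sizes, classifying them with a clustering algorithm, and then looking the mapping up for a recognised phoneme sequence. The scikit-learn sketch below clusters the per-phoneme mean opening into four mouth shape classes; the clustered quantity, the number of classes and all names are assumptions the application does not fix.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def build_phoneme_mouth_mapping(labels, n_classes: int = 4) -> dict:
    """Cluster labelled (phoneme, mouth_opening) pairs into mouth shape classes.

    `labels` is an iterable such as [("a", 0.82), ("b", 0.05), ...], where the
    second value is the labelled opening size of the mouth. Illustrative only.
    """
    openings = defaultdict(list)
    for phoneme, opening in labels:
        openings[phoneme].append(opening)

    phonemes = sorted(openings)
    mean_opening = np.array([[np.mean(openings[p])] for p in phonemes])
    classes = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(mean_opening)
    return dict(zip(phonemes, classes))            # phoneme -> mouth shape class id

# Usage: turn a recognised phoneme sequence into per-frame mouth shape ids
mapping = build_phoneme_mouth_mapping([("a", 0.80), ("o", 0.60), ("e", 0.45),
                                       ("i", 0.30), ("u", 0.35), ("f", 0.15),
                                       ("b", 0.08), ("m", 0.05)])
mouth_shape_ids = [mapping[p] for p in ["m", "a", "o", "i"]]
```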
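Claim 6 derives face features by averaging face key point coordinates over a data set and combining a target face's key points with that average. Treating the initial input feature as the flattened offset from the average face is an assumption made only for illustration; a NumPy sketch:

```python
import numpy as np

def average_keypoints(dataset_keypoints: np.ndarray) -> np.ndarray:
    """dataset_keypoints: (n_faces, n_points, 2) key point coordinates
    detected on the template faces of the data set."""
    return dataset_keypoints.mean(axis=0)                    # (n_points, 2)

def initial_face_features(target_keypoints: np.ndarray,
                          mean_keypoints: np.ndarray) -> np.ndarray:
    """Offset of the target face from the average face, flattened to a vector.

    The exact combination of target and average coordinates is not specified
    in the application; this offset formulation is an assumption.
    """
    return (target_keypoints - mean_keypoints).reshape(-1)   # (n_points * 2,)

# Hypothetical data: 1000 template faces with 68 key points each
dataset = np.random.rand(1000, 68, 2)
mean_face = average_keypoints(dataset)
features = initial_face_features(dataset[0], mean_face)      # (136,)
```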
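Claims 7 and 8 superpose an angle sequence constraint with a boundary key point constraint, and then combine the feature streams by a Cat operation followed by a Stack operation. Reading "Cat" as per-frame channel concatenation and "Stack" as stacking frames along a new axis is an assumption, as are all dimensions below.

```python
import torch

T = 125                                     # hypothetical number of frames
angle_seq    = torch.randn(T, 3)            # head pose angles per frame (assumed)
boundary_seq = torch.randn(T, 17 * 2)       # boundary (face contour) key points (assumed)

# Claim 7: superpose the two constraint sequences into the sequence features
seq_feats = torch.cat([angle_seq, boundary_seq], dim=-1)                 # (T, 37)

sound_feats = torch.randn(T, 256)
face_feats  = torch.randn(T, 64)

# Claim 8: Cat superposition per frame ...
first = [torch.cat([sound_feats[t], face_feats[t], seq_feats[t]]) for t in range(T)]
# ... followed by Stack superposition over frames to form the input features
inputs = torch.stack(first, dim=0)                                       # (T, 357)
```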
CN202011462207.2A 2020-12-11 2020-12-11 Face key point generation method and device, storage medium and electronic equipment Pending CN112668407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011462207.2A CN112668407A (en) 2020-12-11 2020-12-11 Face key point generation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011462207.2A CN112668407A (en) 2020-12-11 2020-12-11 Face key point generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112668407A (en) 2021-04-16

Family

ID=75405389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011462207.2A Pending CN112668407A (en) 2020-12-11 2020-12-11 Face key point generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112668407A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096223A (en) * 2021-04-25 2021-07-09 北京大米科技有限公司 Image generation method, storage medium, and electronic device
CN113096242A (en) * 2021-04-29 2021-07-09 平安科技(深圳)有限公司 Virtual anchor generation method and device, electronic equipment and storage medium
CN113205797A (en) * 2021-04-30 2021-08-03 平安科技(深圳)有限公司 Virtual anchor generation method and device, computer equipment and readable storage medium
CN113205797B (en) * 2021-04-30 2024-03-05 平安科技(深圳)有限公司 Virtual anchor generation method, device, computer equipment and readable storage medium
WO2022242381A1 (en) * 2021-05-21 2022-11-24 上海商汤智能科技有限公司 Image generation method and apparatus, device, and storage medium
CN114267374A (en) * 2021-11-24 2022-04-01 北京百度网讯科技有限公司 Phoneme detection method and device, training method and device, equipment and medium
CN114267374B (en) * 2021-11-24 2022-10-18 北京百度网讯科技有限公司 Phoneme detection method and device, training method and device, equipment and medium

Similar Documents

Publication Publication Date Title
WO2022048403A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
CN106653052B (en) Virtual human face animation generation method and device
CN112668407A (en) Face key point generation method and device, storage medium and electronic equipment
Hong et al. Real-time speech-driven face animation with expressions using neural networks
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
JP6019108B2 (en) Video generation based on text
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
Cosatto et al. Lifelike talking faces for interactive services
CN112669417B (en) Virtual image generation method and device, storage medium and electronic equipment
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
US20110131041A1 (en) Systems And Methods For Synthesis Of Motion For Animation Of Virtual Heads/Characters Via Voice Processing In Portable Devices
JP2014519082A5 (en)
EP3912159B1 (en) Text and audio-based real-time face reenactment
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN112652041B (en) Virtual image generation method and device, storage medium and electronic equipment
CN114895817B (en) Interactive information processing method, network model training method and device
CN115700772A (en) Face animation generation method and device
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
CN114359517A (en) Avatar generation method, avatar generation system, and computing device
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
WO2023246163A9 (en) Virtual digital human driving method, apparatus, device, and medium
CN117115310A (en) Digital face generation method and system based on audio and image
CN115171673A (en) Role portrait based communication auxiliary method and device and storage medium
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination