WO2023284435A1 - Method and apparatus for generating animation - Google Patents
Method and apparatus for generating animation
- Publication number
- WO2023284435A1 (PCT/CN2022/096773)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- training
- data
- key points
- frames
- model
- Prior art date
Classifications
- G06N 3/0464 — Computing arrangements based on biological models; neural networks; architecture; convolutional networks [CNN, ConvNet]
- G06N 3/084 — Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
- G06T 13/40 — Animation; 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
- G06T 2200/04 — Indexing scheme for image data processing or generation, in general, involving 3D image data
Description
- the present application relates to the field of computer technology, and more specifically, to a method and device for generating animation.
- The present application provides a method and device for generating animation. By optimizing the conversion between speech and animation, the user experience of deaf-mute users is improved, and communication between deaf-mute people and ordinary people is further promoted.
- A method for generating an animation is provided, including: acquiring speech to be processed and video to be processed; obtaining first preprocessing data according to the speech to be processed and the video to be processed, the first preprocessing data including data of key points of a human face corresponding to a plurality of audio frames of the speech to be processed, where the key points of the human face include key points of a first feature, the position of at least one of the key points of the first feature corresponding to at least two of the plurality of audio frames is different, and the first feature includes at least one of eyes, head posture, and lip shape; obtaining multiple image frames according to the first preprocessing data and the video to be processed, where the first feature is different in at least two of the multiple image frames; and obtaining an animation according to the multiple image frames.
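- For orientation only, the following Python sketch outlines the inference flow this aspect describes (speech → per-audio-frame face key points → image frames → animation). The function names and signatures are hypothetical placeholders, not models or APIs defined by the application.

```python
from typing import Callable, List, Sequence

def generate_animation(
    speech_frames: Sequence,            # audio frames of the speech to be processed
    face_image,                         # face image extracted from the video to be processed
    speech_to_keypoints: Callable,      # stand-in for the "first model" (audio frame -> face key points)
    keypoints_to_frame: Callable,       # stand-in for the "first sub-model" (key points + face image -> image frame)
) -> List:
    """Assemble facial-animation frames from speech, following the flow described above."""
    animation_frames = []
    for audio_frame in speech_frames:
        keypoints = speech_to_keypoints(audio_frame)   # eyes, head posture, lip shape, ...
        animation_frames.append(keypoints_to_frame(keypoints, face_image))
    return animation_frames                            # played in sequence, these frames form the animation
```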
- The data of the key points of the human face includes key-point data for each part of the face; besides the features mentioned above, it may also include key-point data for the eyebrows, nose, chin, cheeks, and so on.
- video to be processed here may also be multiple pictures to be processed, which is not limited in this application.
- The above solution optimizes the facial animation generated based on speech, enriches the facial expressions in the facial animation, and displays the emotional information of the audio more vividly, so that deaf-mute users can understand the meaning expressed by the audio more accurately. The facial animation matches the audio more closely, and the natural swing of the head makes the face in the animation more realistic, which improves the user experience of deaf-mute users, promotes exchange and communication between deaf-mute people and ordinary people, and helps deaf-mute people better integrate into society.
- In some implementations, the content of the video to be processed further includes arm and hand movements, and the first preprocessing data further includes data of the key points of the arm and the hand corresponding to the plurality of audio frames of the speech to be processed. Obtaining the multiple image frames according to the first preprocessing data and the video to be processed further includes: extracting the video to be processed to obtain images of the arm and the hand; and inputting the data of the key points of the arm and the hand and the images of the arm and the hand into the second sub-model to obtain a plurality of second image frames, where the second image frames are image frames of hand and arm motion and the second sub-model is trained based on training data of the key points of the arm and the hand and training images of the arm and the hand. The multiple image frames include the plurality of first image frames and the plurality of second image frames.
- The above scheme generates sign language movements and facial animation based on speech; on the basis of translating the content of the speech through sign language movements, it vividly displays the emotional information of the speech through the facial animation, so that deaf-mute users can more accurately understand the meaning expressed by the audio. The facial animation matches the audio more closely, and the natural swing of the characters' heads as they speak makes the animation more realistic and the characters more lifelike, which improves the user experience of deaf-mute users, promotes exchange and communication between deaf-mute people and ordinary people, and helps deaf-mute people better integrate into society.
- In some implementations, the first sub-model is obtained by training for the purpose of reducing the second loss function value and the fourth loss function value, and the second sub-model is obtained by training for the purpose of reducing the third loss function value and the fourth loss function value. The second loss function value is calculated according to the difference between the faces in the multiple first training video frames and the faces in the multiple training video frames, where the multiple first training video frames are obtained by inputting the training data of the key points of the human face and the training image of the human face into the first sub-model. The third loss function value is calculated according to the difference between the hands and arms in the multiple second training video frames and the hands and arms in the multiple training video frames, where the multiple second training video frames are obtained by inputting the training data of the key points of the hands and arms and the training image of the arm and the hand into the second sub-model. The fourth loss function value is calculated according to the difference between the first similarity and the second similarity.
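- Read literally, the three loss terms above can be summarized as in the sketch below. The distance measures and the exact definitions of the two similarities are not fixed by this passage, so the symbols are illustrative assumptions only: V̂⁽¹⁾ and V̂⁽²⁾ denote the first and second training video frames produced by the two sub-models, V the ground-truth training video frames, d(·,·) a frame-wise distance, and s₁, s₂ the first and second similarities.

```latex
L_2 = d\!\left(\mathrm{face}(\hat{V}^{(1)}),\ \mathrm{face}(V)\right), \qquad
L_3 = d\!\left(\mathrm{hand}(\hat{V}^{(2)}),\ \mathrm{hand}(V)\right), \qquad
L_4 = \left| s_1 - s_2 \right|
```

- Under this reading, the first sub-model is trained to reduce L_2 + L_4 and the second sub-model to reduce L_3 + L_4.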
- The first sub-model is trained, for the purpose of reducing the loss function values, based on data of key points covering the eyes, head posture, lip shape, and other features, so that the generated facial movements can present eye changes, head posture movements, lip-shape changes, and other features, making the images of facial movements more vivid and realistic and improving the user experience.
- the synchronization degree of the generated facial movements and sign language movements is higher, and user experience is further improved.
- In some implementations, obtaining the first preprocessing data according to the speech to be processed and the video to be processed includes: inputting the speech to be processed and the video to be processed into the first model to obtain the first preprocessing data, where the first model is obtained based on training with first training data, the first training data includes a plurality of training audio frames and a plurality of training video frames, the playback error between training audio frames and training video frames corresponding to the same semantics is less than or equal to a first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to a second threshold. In other words, the training audio frames and the training video frames are synchronized, and the facial movements and sign language movements in the training video frames are also synchronized. The first threshold and the second threshold are used to limit the degree of synchronization; they may be configured by the system or set manually, which is not limited in this application.
- In some implementations, obtaining the multiple image frames according to the first preprocessing data and the video to be processed includes: extracting the video to be processed to obtain a face image; and inputting the data of the key points of the face and the face image into the first sub-model to obtain a plurality of first image frames, where the first image frames are image frames of the facial action and the first sub-model is trained based on the training data of the key points of the face and face training images.
- A method for generating speech is provided, including: obtaining data to be processed, the data to be processed including multiple image frames of facial movements and multiple image frames of sign language movements; obtaining, according to the data to be processed, the emotion information corresponding to the facial movements and the first text corresponding to the sign language movements; and obtaining speech data with emotion information according to the emotion information and the first text.
- The above scheme combines the emotional information expressed by the facial animation in the video with the speech generated from the sign language movements to produce emotional speech, so that the speech translated from sign language can more accurately express what deaf-mute people mean, which promotes exchange and communication between deaf-mute people and ordinary people and helps deaf-mute people better integrate into society.
- A method for training a neural network is provided, including: obtaining first training data, the first training data including a plurality of training audio frames and a plurality of training video frames, where the playback error between training audio frames and training video frames corresponding to the same semantics is less than or equal to a first threshold and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to a second threshold; extracting the first training data to obtain first training speech data; inputting the first training speech data into the first model to obtain training data of the key points of the human face, where the key points of the human face include key points of a first feature and the first feature includes at least one of eyes, head posture, and lip shape; calculating a first loss function value according to the difference between the training data of the key points of the human face and the data of the key points of the human face in the plurality of training video frames; and backpropagating the first loss function value to adjust the parameters that need to be trained.
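- A minimal sketch of such a training step is shown below, under illustrative assumptions that the passage does not fix: the "first model" is stood in for by a small fully connected network, speech features are 128-dimensional, the face key points are 68 two-dimensional landmarks, and the first loss is taken as a mean-squared error.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for the "first model": speech features -> face key points.
first_model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 68 * 2)
)
optimizer = torch.optim.Adam(first_model.parameters(), lr=1e-4)

speech_features = torch.randn(8, 128)   # stand-in for the first training speech data
gt_keypoints = torch.randn(8, 68 * 2)   # key points extracted from the training video frames

pred_keypoints = first_model(speech_features)
first_loss = F.mse_loss(pred_keypoints, gt_keypoints)  # first loss function value (MSE assumed)
optimizer.zero_grad()
first_loss.backward()   # backpropagate the first loss function value
optimizer.step()        # adjust the parameters that need to be trained
```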
- In the above scheme, the key-point data of the human face corresponding to the speech data is obtained from the speech data, in particular the key-point data of the eyes, head posture, and lip shape, so that the face key points generated from the speech carry richer information and better reflect the emotion of the speech.
- A method for training a neural network is provided, including: obtaining first training data, the first training data including a plurality of training audio frames and a plurality of training video frames, where the playback error between training audio frames and training video frames corresponding to the same semantics is less than or equal to a first threshold and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to a second threshold; processing the first training data to obtain training data of the key points of the human face, where the key points of the human face include key points of a first feature and the first feature includes at least one of eyes, head posture, and lip shape; extracting the first training data to obtain a training image of the human face; and, for the purpose of reducing the second loss function value, inputting the training data of the key points of the human face and the training image of the human face into the first sub-model to obtain a plurality of first training video frames, where the first training video frames are training video frames of facial action.
- The first sub-model is trained, for the purpose of reducing the loss function values, based on data of key points covering the eyes, head posture, lip shape, and other features, so that the generated facial movements can present eye changes, head posture movements, lip-shape changes, and other features, making the images of facial movements more vivid and realistic and improving the user experience.
- In some implementations, the method further includes: processing the first training data to obtain training data of the key points of the arm and hand; extracting the first training data to obtain a training image of the arm and hand; and, for the purpose of reducing the third loss function value, inputting the training data of the key points of the arm and hand and the training image of the arm and hand into the second sub-model to obtain a plurality of second training video frames, where the second training video frames are training video frames of hand and arm movements, and the third model includes the first sub-model and the second sub-model.
- The trained model generates video frames of facial movements and sign language movements separately, so that the model can generate sign language movements based on speech and, at the same time, generate facial movements that reflect the emotion of the speech, presenting the content expressed by the speech more vividly and accurately and improving the user experience.
- In some implementations, obtaining the multiple first training video frames includes: for the purpose of reducing the second loss function value and the fourth loss function value, inputting the training data of the key points of the face and the training image of the face into the first sub-model to obtain the plurality of first training video frames; and obtaining the multiple second training video frames includes: for the purpose of reducing the third loss function value and the fourth loss function value, inputting the training data of the key points of the arm and hand and the training image of the arm and hand into the second sub-model to obtain the plurality of second training video frames. The second loss function value is calculated according to the difference between the human faces in the multiple first training video frames and the human faces in the multiple training video frames, the third loss function value is calculated according to the difference between the hands and arms in the multiple second training video frames and the hands and arms in the multiple training video frames, and the fourth loss function value is calculated according to the difference between the first similarity and the second similarity.
- the degree of synchronization between the generated facial movements and sign language movements is higher, and the user experience is further improved.
- In some implementations, processing the first training data to obtain the data of the key points of the human face includes: extracting the first training data to obtain first training speech data; inputting the first training speech data into the first model to obtain the key points of the human face; calculating a first loss function value according to the difference between the key points of the human face and the key points of the human face in the plurality of training video frames; and backpropagating the first loss function value to adjust the parameters to be trained in the first model.
- In the above scheme, the key-point data of the human face corresponding to the speech data is obtained from the speech data, in particular the key-point data of the eyes, head posture, and lip shape, so that the face key points generated from the speech carry richer information and better reflect the emotion of the speech.
- A device for generating animation is provided, including: an acquisition unit, configured to acquire speech to be processed and video to be processed; and a processing unit, configured to obtain first preprocessing data according to the speech to be processed and the video to be processed, where the first preprocessing data includes data of the key points of the human face corresponding to the multiple audio frames of the speech to be processed, the key points of the human face include key points of a first feature, the position of at least one of the key points of the first feature corresponding to at least two of the multiple audio frames is different, and the first feature includes at least one of eyes, head posture, and lip shape. The processing unit is also configured to obtain multiple image frames according to the first preprocessing data and the video to be processed, where the first feature is different in at least two of the multiple image frames, and the processing unit is further configured to obtain an animation according to the multiple image frames.
- The above solution, by optimizing the facial animation generated based on the speech, vividly displays the emotional information of the audio, so that deaf-mute users can understand the meaning expressed by the audio more accurately. The facial animation matches the audio more closely, and the natural swing of the head makes the face in the animation more realistic, which improves the user experience of deaf-mute users, promotes exchange and communication between deaf-mute people and ordinary people, and helps deaf-mute people better integrate into society.
- In some implementations, the processing unit is further configured to input the speech to be processed and the video to be processed into a first model to obtain the first preprocessing data, where the first model is obtained based on training with first training data, the first training data includes a plurality of training audio frames and a plurality of training video frames, the playback error between training audio frames and training video frames corresponding to the same semantics is less than or equal to a first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to a second threshold.
- In some implementations, the acquisition unit is also configured to extract the video to be processed to obtain a face image, and the processing unit is also configured to input the data of the key points of the face and the face image into the first sub-model to obtain a plurality of first image frames, where the first image frames are image frames of the facial action and the first sub-model is trained based on the training data of the key points of the face and face training images.
- In some implementations, the content of the video to be processed further includes arm and hand movements, and the first preprocessing data further includes data of the key points of the arm and hand corresponding to the plurality of audio frames of the speech to be processed. The processing unit is also specifically configured to extract the video to be processed to obtain an image of the arm and hand, and to input the data of the key points of the arm and hand and the image of the arm and hand into the second sub-model to obtain a plurality of second image frames, where the second image frames are image frames of hand and arm movement and the second sub-model is trained based on the training data of the key points of the arm and hand and the training image of the arm and hand. The multiple image frames include the multiple first image frames and the multiple second image frames.
- In some implementations, the first sub-model is obtained by training for the purpose of reducing the second loss function value and the fourth loss function value, and the second sub-model is obtained by training for the purpose of reducing the third loss function value and the fourth loss function value. The second loss function value is calculated according to the difference between the faces in the multiple first training video frames and the faces in the multiple training video frames, where the multiple first training video frames are obtained by inputting the training data of the key points of the human face and the training image of the human face into the first sub-model. The third loss function value is calculated according to the difference between the hands and arms in the multiple second training video frames and the hands and arms in the multiple training video frames, where the multiple second training video frames are obtained by inputting the training data of the key points of the hands and arms and the training image of the arm and the hand into the second sub-model. The fourth loss function value is calculated according to the difference between the first similarity and the second similarity.
- a sixth aspect provides a device for generating speech, characterized in that it includes:
- An acquisition unit configured to acquire data to be processed, the data to be processed includes a plurality of image frames of facial movements and a plurality of image frames of sign language movements;
- a processing unit configured to obtain emotional information corresponding to the facial movement and first text corresponding to the sign language movement according to the data to be processed;
- the processing unit is further configured to obtain voice data with emotional information according to the emotional information and the first text.
- The above scheme combines the emotional information expressed by the facial animation in the video with the speech generated from the sign language movements to produce emotional speech, so that the speech translated from sign language can more accurately express what deaf-mute people mean, which promotes exchange and communication between deaf-mute people and ordinary people and helps deaf-mute people better integrate into society.
- A device for training a neural network is provided, including: an acquisition unit, configured to acquire first training data, the first training data including a plurality of training audio frames and a plurality of training video frames, where the playback error between training audio frames and training video frames corresponding to the same semantics is less than or equal to a first threshold and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to a second threshold; the acquisition unit is also configured to extract the first training data to obtain first training speech data; and a processing unit, configured to input the first training speech data into the first model to obtain training data of the key points of the human face, where the key points of the human face include key points of a first feature and the first feature includes at least one of eyes, head posture, and lip shape. The processing unit is further configured to calculate a first loss function value according to the difference between the training data of the key points of the human face and the data of the key points of the human face in each training video frame, and to backpropagate the first loss function value.
- In the above scheme, the key-point data of the human face corresponding to the speech data is obtained from the speech data, in particular the key-point data of the eyes, head posture, and lip shape, so that the face key points generated from the speech carry richer information and better reflect the emotion of the speech.
- A device for training a neural network is provided, including: an acquisition unit, configured to acquire first training data, the first training data including a plurality of training audio frames and a plurality of training video frames, where the playback error between training audio frames and training video frames corresponding to the same semantics is less than or equal to a first threshold and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to a second threshold; and a processing unit, configured to process the first training data to obtain training data of the key points of the human face, where the key points of the human face include key points of a first feature and the first feature includes at least one of eyes, head posture, and lip shape. The acquisition unit is also configured to extract the first training data to obtain a training image of the face, and the processing unit is also configured to, for the purpose of reducing the second loss function value, input the training data of the key points of the face and the training image of the human face into the first sub-model to obtain a plurality of first training video frames, where the first training video frames are training video frames of facial action.
- The first sub-model is trained, for the purpose of reducing the loss function values, based on data of key points covering the eyes, head posture, lip shape, and other features, so that the generated facial movements can present eye changes, head posture movements, lip-shape changes, and other features, making the images of facial movements more vivid and realistic and improving the user experience.
- In some implementations, the processing unit is also configured to process the first training data to obtain training data of the key points of the arm and hand; the acquisition unit is also configured to extract the first training data to obtain training images of the arm and hand; and the processing unit is also configured to, for the purpose of reducing the third loss function value, input the training data of the key points of the arm and hand and the training image of the arm and hand into the second sub-model to obtain a plurality of second training video frames, where the second training video frames are training video frames of hand and arm movements and the third model includes the first sub-model and the second sub-model.
- In some implementations, the processing unit is further configured to, for the purpose of reducing the second loss function value and the fourth loss function value, input the training data of the key points of the face and the training image of the human face into the first sub-model to obtain a plurality of first training video frames; the processing unit is also specifically configured to, for the purpose of reducing the third loss function value and the fourth loss function value, input the training data of the key points of the arm and hand and the training image of the arm and hand into the second sub-model to obtain a plurality of second training video frames. The second loss function value is calculated according to the difference between the human faces in the plurality of first training video frames and the human faces in the plurality of training video frames, the third loss function value is calculated according to the difference between the hands and arms in the plurality of second training video frames and the hands and arms in the plurality of training video frames, and the fourth loss function value is calculated according to the difference between the first similarity and the second similarity.
- In some implementations, the acquisition unit is further configured to extract the first training data to obtain first training speech data; the processing unit is also configured to input the first training speech data into the first model to obtain the key points of the face, to calculate a first loss function value according to the difference between the key points of the face and the key points of the face in the plurality of training video frames, and to backpropagate the first loss function value to adjust the parameters to be trained in the first model.
- A device for generating animation is provided, including a processor and a communication interface; the processor receives or sends data through the communication interface, and the processor is configured to invoke program instructions stored in a memory to execute the method of the first aspect.
- A device for generating speech is provided, including a processor and a communication interface; the processor receives or sends data through the communication interface, and the processor is configured to invoke program instructions stored in a memory to execute the method of the second aspect.
- A training device for a neural network model is provided, including a processor and a memory; the memory is used to store program instructions, and the processor is used to call the program instructions to implement the method of the third aspect or the fourth aspect, or the processor is used to call the program instructions to execute the method of claim 7.
- a computer-readable storage medium stores program code for device execution, and when the program code runs on a computer or a processor, the computer or the processor Execute the method as described in any one of the first aspect to the fourth aspect.
- a computer program product containing instructions is provided, and when the computer program product is run on a computer or a processor, the computer or the processor executes any one of the first to fourth aspects The method.
- In a fourteenth aspect, a chip is provided, including a processor and a data interface; the processor reads instructions stored in a memory through the data interface to execute the method of any one of the first to fourth aspects.
- FIG. 1 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
- FIG. 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application.
- FIG. 3 is a schematic diagram of a hardware structure of a chip provided by an embodiment of the present application.
- FIG. 4 is a schematic diagram of a system architecture provided by an embodiment of the present application.
- FIG. 5 is a schematic block diagram of a method 500 for generating animation provided by the present application.
- FIG. 6 is a schematic block diagram of a method 600 for generating speech provided by the present application.
- FIG. 7 is a schematic block diagram of neural network model training methods 710 and 720 provided by the present application.
- FIG. 8 is a schematic flowchart of a training method 800 for a neural network model provided by the present application.
- FIG. 9 is a schematic flowchart of a method 900 for generating animation provided by the present application.
- FIG. 10 is a schematic flowchart of a neural network model training method 1000 provided by the present application.
- FIG. 11 is a schematic flowchart of a method 1100 for generating speech provided by the present application.
- Fig. 12 is a schematic block diagram of a device provided by an embodiment of the present application.
- Fig. 13 is a schematic block diagram of another device provided by an embodiment of the present application.
- the present application presents various aspects, embodiments or features in terms of a system that can include a number of devices, components, modules and the like. It is to be understood and appreciated that the various systems may include additional devices, components, modules, etc. and/or may not include all of the devices, components, modules etc. discussed in connection with the figures. In addition, combinations of these schemes can also be used.
- In this application, a subscript such as W 1 may occasionally be written in a non-subscript form such as W1 due to a clerical error; the intended meaning is the same.
- the network architecture and business scenarios described in the embodiments of the present application are for more clearly illustrating the technical solutions of the embodiments of the present application, and do not constitute limitations on the technical solutions provided by the embodiments of the present application.
- the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
- references to "one embodiment” or “some embodiments” or the like in this specification means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
- Appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in various places in this specification do not necessarily all refer to the same embodiment, but rather mean "one or more but not all embodiments," unless specifically stated otherwise.
- the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
- At least one means one or more, and “multiple” means two or more.
- “And/or” describes the association relationship of associated objects, indicating that there can be three types of relationships, for example, A and/or B, which can mean: A exists alone, A and B exist at the same time, and B exists alone, where A, B can be singular or plural.
- the character “/” generally indicates that the contextual objects are an “or” relationship.
- “At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
- At least one item (piece) of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, c can be single or multiple .
- The sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation process of the embodiments of the present application.
- A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes x_s and an intercept 1 as input, and the output of the operation unit can be: h_{W,b}(x) = f(W^T x) = f(∑_s W_s·x_s + b).
- W_s is the weight of x_s.
- b is the bias of the neuron unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
- the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
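- As a purely illustrative example (the values and the choice of sigmoid are arbitrary), a single neural unit of this form can be written as follows.

```python
import numpy as np

def neural_unit(x, w, b):
    """One neural unit: f(sum_s W_s * x_s + b), with f taken here as the sigmoid."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([0.5, -1.2, 3.0])    # inputs x_s
w = np.array([0.1, 0.4, -0.2])    # weights W_s
print(neural_unit(x, w, b=0.3))   # output signal, usable as input to the next layer
```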
- a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
- the local receptive field can be an area composed of several neural units.
- Training a model in machine learning means learning (determining) good values for all the weights and biases from labeled samples. During training, a machine-learning algorithm examines multiple samples and tries to find a model that minimizes the loss; the goal is to minimize the loss.
- Model (prediction function): takes one or more features as input and returns a prediction y' as output.
- Calculate the loss: compute the loss under the current parameters (bias, weight) through the loss function.
- Parameter update: examine the value of the loss function and generate new values for parameters such as the bias and weight so as to reduce the loss, as illustrated in the sketch below.
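- A toy example of this predict/loss/update cycle, using a single-feature linear model y' = w·x + b and a squared-error loss; the data and learning rate are arbitrary.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])          # labeled samples
w, b, lr = 0.0, 0.0, 0.01                   # initial weight, bias, learning rate

for step in range(1000):
    y_pred = w * x + b                      # model (prediction function)
    loss = np.mean((y_pred - y) ** 2)       # calculate the loss
    grad_w = np.mean(2 * (y_pred - y) * x)  # gradient of the loss w.r.t. the weight
    grad_b = np.mean(2 * (y_pred - y))      # gradient of the loss w.r.t. the bias
    w -= lr * grad_w                        # parameter update to reduce the loss
    b -= lr * grad_b

print(w, b, loss)                           # w, b approach 2.0 and 0.0 as the loss shrinks
```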
- A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
- According to the positions of the different layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer.
- the first layer is the input layer
- the last layer is the output layer
- the layers in the middle are all hidden layers.
- the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
- DNN looks very complicated, it is actually not complicated in terms of the work of each layer.
- In simple terms, each layer computes the following linear relationship: y⃗ = α(W·x⃗ + b⃗), where x⃗ is the input vector, y⃗ is the output vector, b⃗ is the offset (bias) vector, W is the weight matrix (also called the coefficients), and α() is the activation function.
- Each layer simply performs this operation on the input vector x⃗ to obtain the output vector y⃗. Because a DNN has many layers, there are correspondingly many coefficient matrices W and offset vectors b⃗.
- These parameters are defined in the DNN as follows, taking the coefficient W as an example: in a three-layer DNN, the linear coefficient from the fourth neuron of the second layer to the second neuron of the third layer is defined as W^3_24, where the superscript 3 represents the layer number of the coefficient W and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
- In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as W^L_jk.
- the input layer has no W parameter.
- more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more parameters has a higher complexity and a larger "capacity", which means that it can complete more complex learning tasks.
- Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
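- A generic forward pass through such fully connected layers is sketched below; the layer sizes and the ReLU activation are illustrative choices, not the structure used by the application. Note that with weight matrix W of a layer, entry W[j, k] plays the role of W^L_jk above.

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass of a fully connected DNN: each layer computes y = activation(W @ x + b)."""
    a = x
    for W, b in zip(weights, biases):
        a = np.maximum(0.0, W @ a + b)   # ReLU activation, used here purely for illustration
    return a

# A tiny 3-4-2 network: the weight matrices of all layers are what training must learn.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(np.array([1.0, 0.5, -0.5]), weights, biases))
```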
- Convolutional neural network is a deep neural network with a convolutional structure.
- a convolutional neural network consists of a feature extractor consisting of a convolutional layer and a subsampling layer.
- the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolutional feature map.
- the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
- a neuron can only be connected to some adjacent neurons.
- a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
- Neural units of the same feature plane share weights, and the shared weights here are convolution kernels.
- Shared weights can be understood as a way to extract image information that is independent of location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. That means that the image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used.
- multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
- the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
- the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
- Loss function (loss function), also called objective function (objective function): an equation used to measure the difference between the predicted value and the target value.
- the training of the deep neural network becomes a process of reducing the loss as much as possible.
- the smaller the loss the higher the training quality of the deep neural network, and the larger the loss, the lower the training quality of the deep neural network.
- the smaller the loss fluctuation the more stable the training; the larger the loss fluctuation, the more unstable the training.
- During training, a neural network can use the error back propagation (BP) algorithm to correct the values of the parameters in the initial model, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, the input signal is passed forward until the output, which produces an error loss, and the parameters in the initial model are updated by backpropagating the error loss information, so that the error loss converges.
- The backpropagation algorithm is a backward pass dominated by the error loss, and aims to obtain the parameters of the optimal model, such as the weight matrices.
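- A minimal sketch of this forward pass / error loss / backpropagation / parameter update cycle, using an automatic-differentiation library; the network shape, data, and MSE loss are illustrative assumptions only.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x = torch.randn(16, 10)              # a batch of inputs
target = torch.randn(16, 1)          # the corresponding targets

prediction = model(x)                # forward: pass the input signal until the output
loss = loss_fn(prediction, target)   # error loss
optimizer.zero_grad()
loss.backward()                      # backpropagate the error loss information
optimizer.step()                     # update parameters so that the error loss converges
```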
- A generative adversarial network (GAN) is a machine learning architecture proposed by Ian Goodfellow of the University of Montreal in 2014, which allows artificial intelligence to exhibit creativity and imagination.
- GAN mainly includes two types of networks, a generator G (generator) and a discriminator D (discriminator).
- G is responsible for generating the picture, that is, after inputting a random code (random code) z, it will output a fake picture G(z) automatically generated by the neural network.
- D is responsible for accepting the image output by G as input, and then judging whether the image is true or false, output 1 if it is true, and output 0 if it is false.
- GAN can be simply regarded as a game process of two networks.
- D trains a binary classification neural network through the data of real and fake pictures.
- G can fabricate a "fake picture" based on a string of random numbers, and then G uses the fabricated fake picture to deceive D.
- D is responsible for distinguishing whether it is a real picture or a fake picture, and gives a score. For example, G generates a picture, and D has a high score, indicating that G's generation ability is very successful; if D's score is not high, the effect of G is not very good, and parameters need to be adjusted.
- Common GAN variants include pix2pix, CycleGAN, pix2pixHD, etc.
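- The adversarial game between G and D described above can be sketched as the following toy training loop on 1-D data; the network sizes, data distribution, and optimizer settings are arbitrary illustrative choices.

```python
import torch

# G maps random codes z to fake samples; D scores samples as real (1) or fake (0).
G = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
D = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(),
                        torch.nn.Linear(16, 1), torch.nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = torch.nn.BCELoss()

for step in range(200):
    real = torch.randn(32, 1) * 0.5 + 3.0    # stand-in for "real pictures"
    z = torch.randn(32, 8)                   # random code z
    fake = G(z)                              # G(z): the generated "fake picture"

    # Train D: output 1 for real samples, 0 for fake ones.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train G: try to fool D into scoring fakes as real.
    g_loss = bce(D(G(z)), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```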
- the "picture sequence” in this application is composed of multiple pictures, and each picture is a frame image in the video. Playing the multiple pictures in sequence can produce animation effects.
- the sequence pictures here can also be called animation sequence (animation sequence).
- A sequence of pictures can be multiple pictures with the same name but different serial numbers, for example name001 to name480; when the picture sequence is played, these 480 pictures are played continuously to produce an animation effect.
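- As a hypothetical example only, such a picture sequence could be assembled into a playable clip as follows; the file-naming pattern, frame rate, and codec are illustrative assumptions, not values specified by the application.

```python
import cv2

# Read the 480 numbered frames (assumed to exist as name001.png ... name480.png).
frames = [cv2.imread(f"name{i:03d}.png") for i in range(1, 481)]
height, width = frames[0].shape[:2]

# Write them out continuously at an assumed 24 frames per second to produce the animation effect.
writer = cv2.VideoWriter("animation.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 24, (width, height))
for frame in frames:
    writer.write(frame)
writer.release()
```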
- Sign language synthesis technology involves many disciplines such as natural language processing, computer animation, and pattern recognition. At present, it mainly focuses on the following three aspects of research: the analysis and processing from text to sign language, the realization method of computer-synthesized sign language, and the expression of synthetic sign language.
- sign language synthesis methods can be divided into text-driven sign language synthesis, speech-driven sign language synthesis, and speech-text-driven sign language synthesis.
- Research on sign language synthesis methods in China started relatively late, and most approaches are based on text-driven sign language synthesis: for a given sentence of natural-language text, after text analysis, natural language processing is used to convert it into an unambiguous standard text, which is segmented into sign language words; the pre-established sign language movement database is then consulted to find the corresponding gestures, which are displayed in the form of video or virtual-human animation.
- The voice-driven method first uses speech recognition technology to convert the speech into text and then performs the operations described above, or it extracts prosodic information from the speech to supplement the basic semantics provided by the text and enhance the realism of the sign language expression.
- the sign language animation synthesis method based on 3D virtual human first establishes a 3D virtual human model, and displays sign language by controlling the movement of the virtual human.
- several video clips of sign language words are recombined into a new sign language video according to text grammar rules.
- The 3D-virtual-human-based sign language animation synthesis method is applied in sign language applications; its purpose is to convert natural-language expression into sign language expression and display it through a virtual human, so that people with hearing and language impairments can receive and understand information more conveniently.
- the embodiment of the present application provides a system architecture 100 .
- the data collection device 160 is used to collect training data.
- the training data may include a plurality of training audio frames and a plurality of training video frames.
- After collecting the training data, the data collection device 160 stores the training data in the database 130, and the training device 120 obtains the target model/rule 101 by training based on the training data maintained in the database 130.
- The target model/rule 101 can be used to implement the animation generation method of the embodiments of the present application, that is, the data to be processed is input into the target model/rule 101 to obtain the corresponding processing result.
- the training data maintained in the database 130 may not all be collected by the data collection device 160, but may also be received from other devices.
- It should be noted that the training device 120 does not necessarily train the target model/rule 101 entirely based on the training data maintained by the database 130; it is also possible to obtain training data from the cloud or elsewhere for model training, and the above description should not be taken as a limitation on the embodiments of the present application.
- The target model/rule 101 obtained by training with the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1, which can be a terminal device, an augmented reality (AR)/virtual reality (VR) device, a vehicle-mounted terminal, or the like, and can also be a server, a cloud, or the like.
- The execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user can input data to the I/O interface 112 through the client device 140.
- When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call the data, code, and the like in the data storage system 150 for the corresponding processing, and the correspondingly processed data and instructions may also be stored in the data storage system 150.
- the I/O interface 112 returns the processing result, such as the detection result obtained above, to the client device 140, so as to provide it to the user.
- the client device 140 may be a planning control unit in an automatic driving system.
- the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above-mentioned goals or complete the above-mentioned task to provide the user with the desired result.
- the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 112 .
- the client device 140 can automatically send the input data to the I/O interface 112 . If the client device 140 is required to automatically send the input data to obtain the user's authorization, the user can set the corresponding authority in the client device 140 .
- the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be specific ways such as display, sound, and action.
- the client device 140 can also be used as a data collection terminal, collecting the input data input to the I/O interface 112 as shown in the figure and the output results of the output I/O interface 112 as new sample data, and storing them in the database 130 .
- In another case, the client device 140 may not be used for collection; instead, the I/O interface 112 directly stores, as new sample data, the input data input to the I/O interface 112 and the output results of the I/O interface 112 shown in the figure into the database 130.
- FIG. 1 is only a schematic diagram of a system architecture provided by the embodiment of the present application, and the positional relationship between devices, devices, modules, etc. shown in the figure does not constitute any limitation.
- For example, in FIG. 1, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
- a target model/rule 101 is obtained by training according to a training device 120 .
- CNN is a very common neural network
- a convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture.
- A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine-learning algorithms.
- CNN is a feed-forward artificial neural network in which individual neurons can respond to images input into it.
- a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully connected layer (fully connected layer) 230 .
- the input layer 210 can obtain the video to be processed and the speech to be processed, and hand the obtained video to be processed and speech to be processed over to the convolutional layer/pooling layer 220 and the subsequent fully connected layer 230 for processing, so as to obtain the processing result.
- the internal layer structure of the CNN 200 in FIG. 2 will be introduced in detail below.
- the convolutional layer/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
- the convolution layer 221 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter for extracting specific information from the input image matrix.
- a convolution operator is essentially a weight matrix, and this weight matrix is usually predefined. During the convolution operation on an image, the weight matrix is usually slid along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), so as to extract specific features from the image.
- the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
- during convolution, the weight matrix extends over the entire depth of the input image, so convolution with a single weight matrix produces a convolutional output with a single depth dimension; however, in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied.
- the outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image, where this depth can be understood as being determined by the number of weight matrices mentioned above.
- Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to filter unwanted noise in the image.
- the multiple weight matrices have the same size (rows × columns), so the convolutional feature maps extracted by these weight matrices also have the same size; the extracted feature maps of the same size are then combined to form the output of the convolution operation.
- in practical applications, the weight values in these weight matrices need to be obtained through extensive training; each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
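- as a minimal sketch of the stacking described above, the following Python snippet uses PyTorch (an assumption; the embodiment does not prescribe a framework), with illustrative channel counts and kernel size:

```python
import torch
import torch.nn as nn

# 8 kernels, each 3x3 and spanning the full input depth (3 channels);
# their outputs stack into the depth dimension of the feature map.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 64, 64)   # batch, depth, height, width
features = conv(image)
print(features.shape)               # torch.Size([1, 8, 64, 64]): 8 stacked feature maps
```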
- shallow convolutional layers such as 221 often extract more general features, which can also be referred to as low-level features;
- the features extracted by later convolutional layers (such as layer 226) become increasingly complex, for example high-level semantic features; features with higher-level semantics are more suitable for the problem to be solved.
- a convolutional layer may be followed by a single pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
- the sole purpose of pooling layers is to reduce the spatial size of the image.
- the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
- the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
- the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. Also, just like the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
- the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
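- the following sketch illustrates the two pooling operators with assumed 2×2 windows (the embodiment does not fix a window size):

```python
import torch
import torch.nn as nn

# Average and max pooling over 2x2 regions: each output pixel is the mean
# or the maximum of the corresponding sub-region of the input.
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 8, 64, 64)
print(avg_pool(x).shape)  # torch.Size([1, 8, 32, 32])
print(max_pool(x).shape)  # torch.Size([1, 8, 32, 32])
```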
- after processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the number of parameters introduced by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 uses the fully connected layer 230 to generate one output or a group of outputs whose number equals the required number of classes. The fully connected layer 230 may therefore include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240; the parameters contained in the hidden layers may be pre-trained based on training data related to the specific task type, where the task type can include, for example, image recognition, image classification, and image super-resolution reconstruction.
- the output layer 240 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error.
- backpropagation (in FIG. 2, propagation in the direction from 240 to 210) then starts to update the weights and biases of the aforementioned layers so as to reduce the loss of the convolutional neural network 200, that is, the error between the output produced by the output layer of the convolutional neural network 200 and the ideal result.
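- a minimal sketch of this forward pass, loss calculation, and backward update, assuming a PyTorch-style training step with illustrative layer sizes and optimizer:

```python
import torch
import torch.nn as nn

# One forward/backward pass: compute the prediction error at the output
# layer, then propagate it backward to update weights and biases.
model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()          # loss similar to classification cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

features = torch.randn(4, 8, 32, 32)       # stand-in for pooled feature maps
labels = torch.randint(0, 10, (4,))        # required class labels

loss = criterion(model(features), labels)  # prediction error
optimizer.zero_grad()
loss.backward()                            # backpropagation (output layer toward input)
optimizer.step()                           # update weights and biases
```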
- the convolutional neural network shown in FIG. 2 is only an example of a possible convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
- FIG. 3 is a hardware structure of a chip provided by an embodiment of the present application, and the chip includes a neural network processor 30 .
- the chip can be set in the execution device 110 shown in FIG. 1 to complete the computing work of the computing module 111 .
- the chip can also be set in the training device 120 shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
- the method in the embodiment of the present application can be implemented in the chip shown in FIG. 3 .
- the neural network processor (NPU) 30 is mounted on the host central processing unit (host CPU) as a coprocessor, and tasks are assigned by the host CPU.
- the core part of the NPU is the operation circuit 303, and the controller 304 controls the operation circuit 303 to extract data in the memory (weight memory or input memory) and perform operations.
- the operation circuit 303 includes multiple processing engines (PEs).
- in some implementations, the operation circuit 303 is a two-dimensional systolic array.
- the operation circuit 303 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
- in some implementations, the operation circuit 303 is a general-purpose matrix processor.
- the operation circuit fetches the data corresponding to the matrix B from the weight memory 302, and caches it in each PE in the operation circuit.
- the operation circuit fetches the data of matrix A from the input memory 301 and performs matrix operation with matrix B, and the obtained partial results or final results of the matrix are stored in the accumulator (accumulator) 308 .
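- purely as an illustration of accumulating partial matrix products in the spirit of the accumulator 308 (this is plain NumPy, not the NPU's actual dataflow):

```python
import numpy as np

# Matrix B is held fixed (as if cached in the PEs) while slices of matrix A
# stream past; partial products are summed in an accumulator.
def tiled_matmul(A, B, tile=4):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))                 # plays the role of the accumulator
    for k0 in range(0, K, tile):         # stream A tile by tile along K
        C += A[:, k0:k0 + tile] @ B[k0:k0 + tile, :]
    return C

A = np.random.randn(8, 16)
B = np.random.randn(16, 8)
assert np.allclose(tiled_matmul(A, B), A @ B)
```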
- the vector computing unit 307 can perform further processing on the output of the computing circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
- the vector calculation unit 307 can be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization (BN), local response normalization (LRN), and the like.
- the vector computation unit 307 can store the processed output vectors in the unified buffer 306.
- the vector computing unit 307 may apply a non-linear function to the output of the computing circuit 303, such as a vector of accumulated values, to generate activation values.
- vector computation unit 307 generates normalized values, binned values, or both.
- the vector of processed outputs can be used as an activation input to the arithmetic circuit 303, for example for use in a subsequent layer in a neural network.
- the operation of the neural network provided in the embodiment of the present application may be performed by the operation circuit 303 or the vector calculation unit 307 .
- the unified memory 306 is used to store input data and output data.
- the storage unit access controller (direct memory access controller, DMAC) 305 transfers the input data in the external memory to the input memory 301 and/or the unified memory 306, stores the weight data in the external memory into the weight memory 302, and stores the data in the unified memory 306 into the external memory.
- a bus interface unit (bus interface unit, BIU) 310 is configured to implement interaction between the main CPU, DMAC and instruction fetch memory 309 through the bus.
- An instruction fetch buffer (instruction fetch buffer) 309 connected to the controller 304 is used to store instructions used by the controller 304;
- the controller 304 is configured to invoke the instructions cached in the instruction fetch memory 309 to control the working process of the computing accelerator.
- the unified memory 306, the input memory 301, the weight memory 302, and the fetch memory 309 are all on-chip (On-Chip) memories
- the external memory is a memory outside the NPU
- the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
- the executing device 110 in FIG. 1 or the chip in FIG. 3 introduced above can execute each step of the method for generating animation in the embodiment of the present application.
- the training device 120 in FIG. 1 or the chip in FIG. 3 introduced above can execute each step of the neural network training method of the embodiment of the present application.
- the embodiment of the present application provides a system architecture 400 .
- the system architecture includes a local device 401, a local device 402, an execution device 410, and a data storage system 450, wherein the local device 401 and the local device 402 are connected to the execution device 410 through a communication network.
- Execution device 410 may be implemented by one or more servers.
- the execution device 410 may be used in cooperation with other computing devices, such as data storage, routers, load balancers and other devices.
- Execution device 410 may be arranged on one physical site, or distributed on multiple physical sites.
- the execution device 410 can use the data in the data storage system 450, or call the program code in the data storage system 450 to implement the neural network model training method of the embodiment of the present application.
- the execution device 410 may perform the following process:
- the first training data includes a plurality of training audio frames and a plurality of training video frames, wherein the playback error of the training audio frames and the training video frames corresponding to the same semantics is less than or equal to the first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to the second threshold;
- the first loss function value is backpropagated to adjust the training data of the key points of the face.
- a trained neural network, that is, the target neural network, is obtained, which can be used to generate the key points of the human face corresponding to the voice according to the voice.
- the execution device 410 may perform the following process:
- the first training data includes a plurality of training audio frames and a plurality of training video frames, wherein the playback error of the training audio frames and the training video frames corresponding to the same semantics is less than or equal to the first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to the second threshold;
- the first training data is extracted to obtain training images of faces and training images of arms and hands;
- the first training video frames are training video frames of human face actions;
- the training data of the key points of the arm and hand and the training images of the arm and hand are input into the second sub-model to obtain a plurality of second training video frames;
- the second training video frames are training video frames of hand and arm movements.
- a trained neural network, that is, the target neural network, is obtained, which can be used to generate image frames of human face movements and image frames of sign language actions according to the key points of the human face and the key points of the hands and arms corresponding to the voice.
- Each local device can represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, etc.
- Each user's local device can interact with the execution device 410 through any communication mechanism/communication standard communication network, and the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
- the local device 401 and the local device 402 obtain relevant parameters of the target neural network from the execution device 410, deploy the target neural network on the local device 401 and the local device 402, and use the target neural network to perform the above processing process.
- the target neural network can be directly deployed on the execution device 410.
- the execution device 410 obtains the video to be processed and the speech to be processed from the local device 401 and the local device 402, and uses the target neural network model to classify or otherwise process the video to be processed and the speech to be processed.
- the execution device 410 may also be implemented by a local device.
- the local device 401 implements the functions of the execution device 410 and provides services for its own users, or provides services for the users of the local device 402.
- the above execution device 410 may also be a cloud device. In this case, the execution device 410 may be deployed on the cloud; or, the above execution device 410 may also be a terminal device. In this case, the execution device 410 may be deployed on the user terminal side. This is not limited.
- the following method can be used: (1) obtain the sign language action video to be recognized and, at the same time, obtain a preset computer animation image; (2) process the sign language action video to be recognized to obtain the semantic text expressed by the sign language action video; (3) analyze the semantic text to obtain the response text corresponding to the semantic text; (4) combine the computer animation image to convert the response text into a sign language animation and display it to the user.
- This method requires professional computer animators to pre-set animation images, and the selected or designed preset animation images need to have human body structure parts for sign language expression, such as hands, faces, arms and other elements.
- this method can only translate selected sign language videos, and cannot achieve real-time translation from speech to sign language.
- the following method can be used when generating sign language animation based on speech: (1) acquire the audio to be translated; (2) recognize the audio signal to be translated and convert it into corresponding text information; (3) perform word segmentation processing on the text information; (4) extract the sign language action corresponding to each word segment from a sign language database, and finally generate the sign language animation sequence.
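- a hypothetical sketch of this dictionary-lookup pipeline (the database contents and function names are placeholders); the limitation discussed next, word segments missing from the database, shows up in the fallback branch:

```python
# Placeholder sign-language database: word -> pre-recorded sign action clip.
sign_database = {"hello": "clip_hello.anim", "today": "clip_today.anim"}

def speech_to_sign(word_segments):
    """Assemble a sign animation sequence by per-word lookup."""
    sequence = []
    for word in word_segments:
        clip = sign_database.get(word)
        if clip is None:
            # out-of-vocabulary word segment: no sign action can be retrieved
            continue
        sequence.append(clip)
    return sequence

# 'quantum' is silently dropped because it is not in the database
print(speech_to_sign(["hello", "today", "quantum"]))
```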
- this method requires a large amount of labor in advance to collect a large sign language database. For input word segments that are not in the database, the corresponding sign language actions cannot be retrieved, which seriously affects the deaf-mute person's understanding of the sign language video and may in turn cause a series of serious consequences.
- in addition, the sign language animation sequence generated based on speech only includes sign language movements and lacks information such as eye expressions and facial expressions, which may affect deaf-mute people's understanding of the speech; for example, deaf-mute people cannot understand the emotional information expressed by the speech.
- FIG. 5 is a schematic flowchart of a method 500 of the present application.
- the video to be processed here may also be a plurality of pictures to be processed, or a sequence of pictures to be processed.
- the faces here can be dynamic or static, or the faces here can be expressive or expressionless, shaking their heads or looking straight ahead.
- the first pre-processed data includes data of key points of the human face corresponding to multiple audio frames of the speech to be processed.
- the key points of the human face include key points of a first feature, wherein the position of at least one of the key points of the first feature differs between at least two audio frames among the plurality of audio frames, and the first feature includes at least one of eyes, head posture, and lip shape.
- the key points of the face are the key points (landmarks) used to identify the position, state, etc. of each feature on the face, or the key points used to identify the position, state, etc. of each part of the face.
- the parts or features here may be, for example, eyebrows, eyes, nose, mouth and the like.
- Each part or feature can be identified by one or more feature points. For a specific example, refer to S802 in the method 800.
- the "data of key points of human faces corresponding to multiple audio frames" here may be understood as data of key points of human faces corresponding to the semantics expressed by voice content in multiple audio frames.
- for example, if the voice content of m audio frames among the plurality of audio frames expresses the semantic meaning of "happy",
- the data of the key points of the face corresponding to the m audio frames can be the data of the position changes of the key points of each facial part.
- the obtained first pre-processed data includes face key point data corresponding to each semantic meaning expressed by the content of the to-be-processed voice.
- the data of the key points of the eyes may reflect that the gaze is erratic or steady and affectionate.
- for example, if the semantics expressed by the voice is "uneasy", erratic eyes can be expressed through an eye key point, the pupil, whose position changes left and right.
- the data of the key points of the head position can reflect the natural swing of the head or the action of nodding and shaking the head when the person speaks.
- some points on the horizontal central axis and longitudinal central axis can be taken as the key points of head movement, and the movement trajectory of these key points can be used to reflect the head movement posture .
- the lip language corresponding to the voice or the emotion corresponding to the semantics expressed by the voice can be embodied through the key point data of the mouth.
- the lip movements of ordinary people when speaking the corresponding voice can be presented by controlling the positions of the chin, the upper middle part of the inner lip, the lower middle part of the inner lip, the left inner lip corner, and the right inner lip corner.
- you can control the position of each key point to present actions such as raising the corners of the mouth, drooping the corners of the mouth, or grinning.
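- the following sketch shows one possible way to represent such key points and their frame-to-frame displacement; the key-point names and coordinates are illustrative, not the exact key-point set of this embodiment:

```python
# Each key point is an (x, y) position in normalized image coordinates;
# an action such as raising the mouth corners is a change of positions
# between frames (y decreases as a point moves up in image coordinates).
frame_t0 = {
    "left_mouth_corner":  (0.42, 0.70),
    "right_mouth_corner": (0.58, 0.70),
    "left_pupil":         (0.38, 0.40),
}
frame_t1 = {
    "left_mouth_corner":  (0.41, 0.68),   # corner moved up: raised mouth corner
    "right_mouth_corner": (0.59, 0.68),
    "left_pupil":         (0.36, 0.40),   # pupil shifted left: erratic gaze
}

displacement = {
    name: (frame_t1[name][0] - frame_t0[name][0],
           frame_t1[name][1] - frame_t0[name][1])
    for name in frame_t0
}
print(displacement)
```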
- the position of at least one of the key points of the first feature differs between at least two audio frames among the plurality of audio frames;
- this difference in position reflects that the key points of the first feature included in the first preprocessing data change over time, so that the data of the key points of the face included in the first preprocessing data can reflect the movements of the first feature, and
- the facial animation generated from the first preprocessing data after subsequent processing includes the action of the first feature corresponding to the speech to be processed.
- obtaining the first preprocessed data according to the voice to be processed and the video to be processed can be specifically implemented in the following manner:
- the first model is trained based on first training data, where the first training data includes a plurality of training audio frames and a plurality of training video frames; the playback error of the training audio frames and the training video frames corresponding to the same semantics is less than or equal to the first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to the second threshold.
- the multiple training audio frames here can also be a section of training voice data
- the training video frames here can also be a section of training video data, or it can also be a sequence of multiple pictures, or it can also be in other forms. There is no limit to this.
- the definition of "the playback error of the training audio frame and the training video frame corresponding to the same semantics is less than or equal to the first threshold” is to require the audio and video in the training data to be synchronized
- the semantic similarity between the facial movements and sign language movements in the video frame is greater than or equal to the second threshold” is to require that the facial movements and sign language movements in the training data be synchronized.
- S801 in the method 800 which will not be repeated here.
- obtaining a plurality of image frames according to the first preprocessing data and the video to be processed can be specifically implemented in the following manner:
- the video to be processed is extracted to obtain a face image; the data of the key points of the face and the face image are input into the first sub-model to obtain a plurality of first image frames, where the first image frames are image frames of human face actions, and the first sub-model is obtained through training based on training data of key points of a human face and training images of a human face.
- in this way, the emotional information of the audio is vividly displayed, so that deaf-mute people can understand the meaning expressed by the audio more accurately,
- the matching degree between the face animation and the audio is higher, and through the natural swing of the head in the animation, the face in the animation is more realistic.
- this improves the user experience of deaf-mute people, promotes communication between deaf-mute people and ordinary people, and facilitates their better integration into society.
- Method 500 also includes:
- the content of the to-be-processed video further includes arm and hand movements
- the first pre-processed data also includes data of key points of the arm and hand corresponding to the plurality of audio frames of the to-be-processed voice.
- the video to be processed here may also be a plurality of pictures to be processed, or a sequence of pictures to be processed.
- the "data of key points of arms and hands corresponding to multiple audio frames of the speech to be processed” is the data of key points when the arms and hands perform sign language movements corresponding to the semantics of speech expressions to be processed.
- for example, if the speech to be processed is the sentence "today's work is very difficult, I'm so tired",
- the data of the key points of the hand and arm when translating this sentence into sign language is the key point data of the arm and hand corresponding to the speech to be processed.
- the specific implementation of "obtaining multiple image frames according to the first preprocessing data and the video to be processed” may also include:
- the video to be processed is extracted to obtain an image of the arm and hand; the data of the key points of the arm and hand and the image of the arm and hand are input into the second sub-model to obtain a plurality of second image frames, where the second image frames are image frames of hand and arm movements, the second sub-model is obtained through training based on training data of key points of the arm and hand and training images of the arm and hand, and the plurality of image frames include the plurality of first image frames and the plurality of second image frames.
- sign language movements and facial animations are generated based on speech; on the basis of translating the content expressed by the speech through sign language movements, the emotional information of the speech is vividly displayed through facial animation, so that deaf-mute people can more accurately understand the meaning expressed by the audio.
- the matching degree between the facial animation and the audio is higher, and through the natural shaking of the head of the characters in the animation when they speak, the animation is more realistic and the characters are more real.
- the first sub-model is obtained by training for the purpose of reducing the value of the second loss function and the value of the fourth loss function
- the second sub-model is obtained by training for the purpose of reducing the value of the third loss function and the value of the fourth loss function
- the second loss function value is calculated according to the difference between the human faces in the plurality of first training video frames and the human faces in the plurality of training video frames,
- where the plurality of first training video frames are obtained by inputting the training data of the key points of the face and the training image of the face into the first sub-model.
- the third loss function value is calculated according to the difference between the hands and arms in the plurality of second training video frames and the hands and arms in the plurality of training video frames, where the plurality of second training video frames are obtained by inputting the training data of the key points of the hand and arm and the training images of the arm and hand into the second sub-model.
- the fourth loss function value is calculated according to the difference between a first similarity and a second similarity, where the first similarity is the similarity between the semantics corresponding to the training video frames, among the plurality of first training video frames, played within a first time period and the semantics corresponding to the training video frames, among the plurality of second training video frames, played within the first time period,
- and the second similarity is the similarity between the semantics corresponding to the face actions and the semantics corresponding to the sign language actions in the training video frames, among the plurality of training video frames, played within the first time period.
- in this way, the images of facial movements and sign language movements generated by the first sub-model and the second sub-model are more vivid and realistic, and the synchronization between the images of facial movements and sign language movements is higher.
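- a sketch of how the second, third, and fourth loss function values could be combined for the two sub-models, assuming L1 reconstruction terms and a similarity-gap synchronization term (the exact loss formulation is not fixed here):

```python
import torch
import torch.nn.functional as F

# loss2: difference between generated and real faces (first sub-model);
# loss3: difference between generated and real hands/arms (second sub-model);
# loss4: gap between the similarity of generated face/sign frames and the
#        similarity of the real training frames over the same time period.
def sub_model_objectives(gen_faces, real_faces, gen_hands, real_hands,
                         sim_generated, sim_real):
    loss2 = F.l1_loss(gen_faces, real_faces)
    loss3 = F.l1_loss(gen_hands, real_hands)
    loss4 = (sim_generated - sim_real).abs().mean()
    return loss2 + loss4, loss3 + loss4   # objectives for the first and second sub-model

# toy shapes only
faces_g, faces_r = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
hands_g, hands_r = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
obj1, obj2 = sub_model_objectives(faces_g, faces_r, hands_g, hands_r,
                                  torch.tensor(0.7), torch.tensor(0.9))
```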
- FIG. 6 is a schematic flowchart of a method 600 of the present application.
- the data to be processed here may be a plurality of first images and a plurality of second images, the plurality of first images showing human face movements; the plurality of second images showing sign language movements.
- the data to be processed here may be a plurality of third images, and the plurality of third images simultaneously present human face movements and sign language movements.
- the data to be processed here may also be a video showing facial movements and sign language movements, or may be in other forms, which is not limited in this application.
- the facial movements may include facial expressions, head movements, lip language, and the like.
- in this way, a voice with emotion is generated, so that the voice translated from the sign language can express the deaf-mute person's meaning more accurately, which promotes communication between deaf-mute people and ordinary people and facilitates their better integration into society.
- FIG. 7 is a schematic flowchart of a method 710 of the present application.
- the first training data includes a plurality of training audio frames and a plurality of training video frames, wherein the playback error of the training audio frames and training video frames corresponding to the same semantics is less than or equal to the first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to the second threshold.
- the definition of "the playback error of the training audio frame and the training video frame corresponding to the same semantics is less than or equal to the first threshold” is to require the audio and video in the training data to be synchronized
- the semantic similarity between the facial movements and sign language movements in the video frame is greater than or equal to the second threshold” is to require that the facial movements and sign language movements in the training data be synchronized.
- S801 in the method 800 which will not be repeated here.
- in this way, the key point data of the human face corresponding to the voice data, especially the key point data of the eyes, head posture, and lip shape corresponding to the voice data, is obtained based on the voice data, so that the information of the key points of the face generated from the voice is richer and can better reflect the emotion of the voice.
- FIG. 7 is a schematic flowchart of the method 720 of the present application.
- the first training data includes a plurality of training audio frames and a plurality of training video frames, wherein the playback error of the training audio frames and training video frames corresponding to the same semantics is less than or equal to the first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to the second threshold.
- the definition of "the playback error of the training audio frame and the training video frame corresponding to the same semantics is less than or equal to the first threshold” is to require the audio and video in the training data to be synchronized
- the semantic similarity between the facial movements and sign language movements in the video frame is greater than or equal to the second threshold” is to require that the facial movements and sign language movements in the training data be synchronized.
- S801 in the method 800 which will not be repeated here.
- S722: process the first training data to obtain the training data of the key points of the human face, where the key points of the human face include the key points of the first feature, and the first feature includes at least one of eyes, head posture, and lip shape.
- processing the first training data to obtain the training data of the key points of the face can be specifically implemented in the following manner:
- the second loss function value is obtained by calculating the difference between the human faces in the plurality of first training video frames and the human faces in the plurality of training video frames.
- in this way, the first sub-model is trained with the goal of reducing the loss function value based on key point data of features including the eyes, head posture, and lip shape, so that the generated facial movements can show features such as eye changes,
- head movement posture, and lip shape changes, making the images of human face movements more vivid and realistic and improving user experience.
- Method 720 also includes:
- processing the first training data to obtain training data of key points of the arm and hand;
- the training data of the key points of the arm and hand and the training images of the arm and hand are input into the second sub-model to obtain a plurality of second training video frames, where the second training video frames are training video frames of hand and arm actions,
- and the third model includes the first sub-model and the second sub-model.
- the third loss function value is calculated according to the difference between the hands and arms in the plurality of second training video frames and the hands and arms in the plurality of training video frames.
- in this way, the model is trained to generate video frames of facial movements and sign language movements separately, so that while generating sign language movements based on the voice, the model also generates facial movements that reflect the emotion of the voice, presenting the content expressed by the voice more vividly and accurately and improving user experience.
- obtaining a plurality of first training video frames includes: with the purpose of reducing the second loss function value and the fourth loss function value, inputting the training data of the key points of the face and the training image of the face into the first sub-model to obtain the plurality of first training video frames.
- Obtaining a plurality of second training video frames includes: for the purpose of reducing the third loss function value and the fourth loss function value, inputting the training data of the key points of the arm and the hand and the training images of the arm and the hand into the second The sub-model obtains a plurality of second training video frames.
- the fourth loss function value is calculated according to the difference between the first similarity and the second similarity,
- where the first similarity is the semantic similarity between the first training video frames and the second training video frames played within the first time period,
- and the second similarity is the semantic similarity between the face actions and the sign language actions in the plurality of training video frames within the first time period.
- the generated facial movements and sign language movements are more synchronized, further improving user experience.
- processing the first training data to obtain the training data of the key points of the arm and hand can be specifically implemented in the following ways:
- the key points of the arm and hand need to reflect the three-dimensional human body structure of the hand and arm, for example, multiple key points are required for the palm and the back of the hand, and multiple key points are required for the outer side of the arm and the inner side of the arm.
- if the key points of the palm and the back of the hand are not distinguished in three dimensions and only two-dimensional key points are extracted, then it cannot be determined from the extracted key point information whether the palms or the backs of the hands are held against the chest.
- in this way, the key point data of the human face corresponding to the voice data, especially the key point data of the eyes, head posture, and lip shape corresponding to the voice data, is obtained based on the voice data, so that the information of the key points of the face generated from the voice is richer and can better reflect the emotion of the voice.
- FIG. 8 is a schematic flowchart of a method 800 of the present application.
- the video includes a human face, a pair of left and right hands and arms, and voice, wherein the facial movements, hand and arm movements, and voice are synchronized.
- the synchronization here can be understood as when the voice plays a certain vocabulary, the hands and arms are performing the sign language movements corresponding to the vocabulary, and the facial expressions of the human face also correspond to the vocabulary, which further assists in conveying the emotional information corresponding to the vocabulary.
- for example, the eyebrows can be furrowed or relaxed accordingly, the eyes can wander or gaze steadily accordingly, the nose can be wrinkled or the nostrils dilated accordingly, and the mouth movements can match the lip shape an ordinary person has when saying the word, or the mouth can grin or pucker accordingly.
- the synchronization here can also be understood as follows: the error between the playing times of one or more audio frames and one or more video frames corresponding to the same vocabulary is less than the first threshold,
- where the one or more video frames include the facial movements and the hand and arm movements expressing a certain vocabulary, and the one or more audio frames are the pronunciation of that vocabulary.
- the synchronization here can also be understood as that one or more voice audio frames correspond to one or more video frames in the same time period.
- the movement of the human face may also include head movement gestures, such as nodding or shaking the head, or natural head swings when a person speaks.
- the video data may be one or more continuous videos, and the faces in the one or more continuous videos belong to the same person.
- the hands and arms in the video may or may not belong to the same person as the face; the voice in the video may be spoken by the person corresponding to the face, or it may be someone else said.
- the speech rate of the speech in the video data is the speech rate of daily communication, and the content of the speech involves vocabulary in daily communication.
- the duration of the video data may be about 4 minutes.
- the video data here can be understood as a data set for training a neural network model.
- the video data here can be understood as a data set composed of each frame of pictures in the video and the corresponding audio, or the video data here can be understood as the data composed of video frames and audio frames corresponding to each word in the video Set, or the video data here can be understood as a data set composed of video frames and audio frames corresponding to a unit time period in the video.
- the voice input into model 1 can be the voice data itself, or a feature or feature vector extracted from the voice data; the face input into model 1 can be the face in the video, or a feature or feature vector extracted from the face.
- the characteristics of the speech include: sound intensity, sound intensity level, loudness, pitch, pitch period, fundamental frequency, signal-to-noise ratio, harmonic-to-noise ratio, frequency jitter, amplitude jitter (shimmer), normalized noise energy, Mel frequency cepstrum coefficients (MFCC), phonemes, and the like.
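- purely as an illustration, a few of the listed features can be extracted with librosa (the embodiment does not name a toolkit; the file name is a placeholder):

```python
import librosa

# The audio file name is a placeholder.
audio, sr = librosa.load("speech_to_process.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # Mel frequency cepstrum coefficients
f0 = librosa.yin(audio, fmin=50, fmax=500, sr=sr)        # fundamental (pitch) frequency
rms = librosa.feature.rms(y=audio)                       # energy, related to sound intensity
print(mfcc.shape, f0.shape, rms.shape)
```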
- the features of the human face here may be key points (landmarks) used to identify the positions of various parts on the human face, wherein the various parts on the human face include eyebrows, eyes, nose, mouth, cheeks, etc.;
- the key points of the eyes can be the upper eyelid, lower eyelid, inner eye corner, outer eye corner, pupil, etc.
- the key points of the eyebrows can be the inner side of the eyebrows, the middle point of the eyebrows, the outer side of the eyebrows, etc.
- the key points of the mouth can be the chin, the upper middle part of the inner lip, the lower middle part of the inner lip, the left inner lip corner, the right inner lip corner, etc.
- the key points of the nose can be the middle of the bridge of the nose, the left side of the nose, the right side of the nose wing, the tip of the nose, etc.
- the key points of the cheeks can be the cheekbones, the apple muscles, etc.
- the features of the human face may also include key points for identifying head movement. As an example, by determining the horizontal central axis and the vertical central axis of the face, some points on these axes can be taken as the key points of head movement, and the trajectories of these key points can be used to represent the head movement posture.
- the central axis of the bridge of the nose can be used as the vertical central axis of the face
- the intersection point of the line connecting the inner eyelids of the left and right eyes with the vertical central axis of the face can be the central point of the face
- the line that passes through the central point and is perpendicular to the vertical central axis of the face is the horizontal central axis of the face, and several points are taken on the horizontal central axis and the vertical central axis as the key points of head movement.
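- a small geometric sketch of this construction, with illustrative normalized coordinates (the axis directions and sampled offsets are assumptions):

```python
import numpy as np

# Illustrative normalized coordinates for the construction described above.
nose_bridge_top = np.array([0.50, 0.35])
nose_bridge_bottom = np.array([0.50, 0.55])
inner_eye_left = np.array([0.44, 0.40])
inner_eye_right = np.array([0.56, 0.40])

vertical_dir = nose_bridge_bottom - nose_bridge_top        # along the vertical central axis
horizontal_dir = inner_eye_right - inner_eye_left          # along the horizontal central axis
# the eye line crosses the vertical central axis at the face centre
face_center = np.array([nose_bridge_top[0], inner_eye_left[1]])

# sample a few points on both axes around the centre as head-movement key points
head_keypoints = [face_center + t * vertical_dir for t in (-0.5, 0.0, 0.5)] + \
                 [face_center + t * horizontal_dir for t in (-0.5, 0.5)]
print(np.round(head_keypoints, 2))
```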
- the facial movement information here includes information such as facial expressions, lip movements, and head posture changes when a person speaks, and can be embodied as a movement trajectory of key points that identify the positions of various parts of the human face.
- the facial expressions include the movements of the eyebrows, eyes, nose, mouth, cheeks, etc. when speaking,
- the lip movements are the mouth movements corresponding to the voice when speaking,
- and the head postures can be nodding or shaking the head to express emotions when speaking, or the natural head bobbing when speaking.
- the face action information here can be used to identify the face of each frame in the video in the form of key points, and the movement of the face can be reflected through the change of the position of the key points.
- the discriminator 1 in model 1 compares the face action information generated by the generator in model 1 with the face action information in the video to obtain the loss function value_1. The loss function value_1 is backpropagated, and the parameters that model 1 needs to train are adjusted.
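- a minimal adversarial-training sketch of this comparison; the discriminator architecture, the 68-landmark count, and the loss form are illustrative assumptions, not the specification of model 1:

```python
import torch
import torch.nn as nn

# The discriminator scores key-point sequences; loss value _1 is computed
# from its judgment of the generated face action information and is then
# backpropagated to adjust the generator's (model 1's) trainable parameters.
disc = nn.Sequential(nn.Linear(68 * 2, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

generated_kpts = torch.randn(8, 68 * 2, requires_grad=True)  # stand-in for model 1's output
loss_1 = bce(disc(generated_kpts), torch.ones(8, 1))         # "look real" objective
loss_1.backward()                                            # backpropagate loss function value_1
```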
- the speech and the key points of the arms and hands are extracted from the video data, the speech is converted into text, the text and the key points of the arms and hands are input into model 2, and model 2 is trained to predict the information of sign language movements.
- the sign language action information here is the change process of the position of the arm skeleton posture and the key points of the hand, or the movement trajectory of the arm skeleton posture and the key points of the hand.
- the sign language action information here includes the arm skeleton posture and the positions of the key points of the hand of the character in each video frame. It should be understood that the changes in the positions of these key points across multiple video frames visually form the sign language actions.
- the key points of the arm and hand need to reflect the three-dimensional human body structure of the hand and arm, for example, multiple key points are required for the palm and the back of the hand, and multiple key points are required for the outer side of the arm and the inner side of the arm.
- if the key points of the palm and the back of the hand are not distinguished in three dimensions and only two-dimensional key points are extracted, then it cannot be determined from the extracted key point information whether the palms or the backs of the hands are held against the chest.
- the discriminator 2 in model 2 compares the arm skeleton posture and the hand key point positions generated by the generator in model 2 with the arm skeleton posture and hand key point positions of the character in each video frame of the video, and obtains the loss function value_2. The loss function value_2 is backpropagated, and the parameters that model 2 needs to train are adjusted.
- Model 3 is trained to generate real face sequence pictures and sign language sequence pictures.
- Model 3 includes Model 3-1 and Model 3-2, which will be described in detail below through S804 and S805 respectively.
- the face image here is extracted from the video data collected in S801, and the information of the face action can be understood as the movement track of the key point position of the face, and the information of the face action and the face image are input into the model 3-1.
- the facial movements shown in the output face sequence pictures are realistic human facial movements. The "real" here can be understood as the facial movement obtained after the movement trajectories of the key points are combined with the face image, which looks more like a real person and is more realistic than the movement trajectories of the key point positions alone.
- the discriminator 3-1 in the model 3-1 compares the real face sequence pictures generated by the generator in the model 3-1 with the face actions in the video, and obtains the loss function value_3-1.
- the arm and hand images here are extracted from the video data collected in S801, and the information of sign language movements can be understood as the movement trajectories of the key point positions of the arms and hands; the information of sign language movements and the arm and hand images are input into the model 3-2, and the sign language movements presented in the sign language sequence pictures output after training the model 3-2 are realistic hand and arm movements. The "real" here can be understood as the hand and arm movements obtained after the movement trajectories of the key points are combined with the images of the arms and hands, which look more like the hands and arms of a real person and are more realistic than the movement trajectories of the key point positions alone.
- the discriminator 3-2 in the model 3-2 compares the real sign language sequence pictures generated by the generator in the model 3-2 with the sign language actions in the video, and obtains the loss function value_3-2.
- the discriminator 3-3 in model 3 compares the degree of synchronization between the real face sequence pictures and sign language sequence pictures generated by model 3-1 and model 3-2 with the degree of synchronization between the face movements and sign language movements in the video data collected in S801,
- and obtains the loss function value_3-3.
- as an example, the discriminator 3-3 judges whether the two are synchronized based on the similarity between them.
- the discriminator 3-3 gives a probability value related to the similarity after discrimination. The higher the probability value, the higher the similarity, the more synchronous. The lower the probability value, the lower the similarity, the less synchronous.
- a second threshold can be set artificially. If the probability value is higher than the second threshold, it can be judged as synchronous, otherwise it is not synchronous.
- as another example, m face sequence pictures corresponding to a given semantics are extracted from the face sequence pictures and n sign language sequence pictures corresponding to the same semantics are extracted from the sign language sequence pictures, where m and n are positive integers, and the discriminator 3-3 judges whether they are synchronized according to the error between their playing times.
- the discriminator 3-3 gives a probability value related to the error after discrimination. The higher the probability value, the larger the error and the more out of sync. The lower the probability value, the smaller the error and the more synchronous.
- a third threshold can be set artificially. If the probability value is lower than the third threshold, it can be judged as synchronous, otherwise it is not synchronous.
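- the two decision rules can be summarized as below; the threshold values are illustrative:

```python
# Thresholds are illustrative.
def synced_by_similarity(prob, second_threshold=0.8):
    # higher probability -> higher similarity -> more synchronous
    return prob >= second_threshold

def synced_by_timing_error(prob, third_threshold=0.3):
    # higher probability -> larger playing-time error -> less synchronous
    return prob < third_threshold

print(synced_by_similarity(0.9), synced_by_timing_error(0.1))  # True True
```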
- it is also possible to train the model 3-1 with the aim of minimizing the loss function value_3-1 and the loss function value_3-3: backpropagation is performed with the sum of the loss function value_3-1 and the loss function value_3-3, and the parameters that model 3-1 needs to train are adjusted.
- similarly, the model 3-2 can be trained with the aim of minimizing the loss function value_3-2 and the loss function value_3-3: backpropagation is performed with the sum of the loss function value_3-2 and the loss function value_3-3, and the parameters that model 3-2 needs to train are adjusted.
- the method 800 also includes S806: repeating all the above steps and stopping the training when the loss function values of all models reach the minimum, or in other words, when the sum of loss function value_1, loss function value_2, loss function value_3-1, and loss function value_3-2 reaches the minimum;
- alternatively, it can be specified that the training is stopped when the number of executions of the above steps reaches a certain threshold.
- a threshold can be set, and when the loss function values of all models are lower than the threshold, the loss function values can be considered to have reached the minimum.
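- a sketch of this stopping rule under an assumed threshold and iteration limit:

```python
# Threshold and iteration limit are assumptions for illustration.
def should_stop(loss_values, step, loss_threshold=0.05, max_steps=100_000):
    all_small = all(v < loss_threshold for v in loss_values)   # "reached the minimum"
    return all_small or step >= max_steps

print(should_stop([0.04, 0.03, 0.02, 0.01], step=12_000))  # True
```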
- in this way, a sequence of pictures in which facial movements and sign language movements are synchronized is obtained by using videos with synchronized facial movements, sign language movements, and voice as training data. Further, by adding a discriminator that judges whether the face sequence pictures and sign language sequence pictures generated by the two sub-models are synchronized, namely the discriminator 3-3, a neural network capable of generating synchronized face sequence pictures and sign language sequence pictures is obtained.
- the user uploads the voice and character image video to be translated.
- the character image video needs to include movements of the face, hands and arms.
- the video here can also be multiple pictures. It should be noted that whether it is a picture or a video, it needs to be able to present the key points of the three-dimensional structure of the hand and arm. Specifically, it is necessary to show the palm and back of the hand, the inner side of the arm and the outer side of the arm, etc.
- the face here can be dynamic or static.
- the human image video here may only include the above-mentioned hand and arm movements, and the face image may be acquired from a database or other local acquisition methods, which are not limited in this application.
- the voice here may be the same as or different from the voice in the video data collected in method 800.
- the character image here may be the same as or different from that in the video data collected in method 800.
- the user may use a front-end interface to use the system composed of the model trained by the method 800, for example, the user may use the system through a webpage or an application on a mobile terminal.
- the key points of the arms and hands need to reflect the three-dimensional human body structure of the hands and arms.
- the palm and the back of the hand need multiple key points respectively
- the outside of the arm and the inside of the arm need multiple key points respectively.
- if the key points of the palm and the back of the hand are not distinguished in three dimensions and only two-dimensional key points are extracted, then it cannot be determined from the extracted key point information whether the palm or the back of the hand is held against the chest.
- the key points (landmarks) of the human face here may be key points for identifying the positions of various parts on the human face, wherein the various parts on the human face include eyebrows, eyes, nose, mouth, cheeks, etc. ;
- the key points of the eyes can be the upper eyelid, lower eyelid, inner eye corner, outer eye corner, pupil, etc.;
- the key points of the nose can be the middle of the bridge of the nose, the left side of the nose, the right side of the nose wing, the tip of the nose, etc.
- the key points of the cheeks can be the cheekbones, the apple muscles, etc.
- the features of the human face may also include key points for identifying head movement.
- as an example, by determining the horizontal central axis and the vertical central axis of the face, some points on these axes can be taken as the key points of head movement, and the trajectories of these key points can be used to represent the head movement posture.
- the central axis of the bridge of the nose can be used as the vertical central axis of the face
- the intersection point of the line connecting the inner eyelids of the left and right eyes with the vertical central axis of the face can be the central point of the face
- the line that passes through the central point and is perpendicular to the vertical central axis of the face is the horizontal central axis of the face, and several points are taken on the horizontal central axis and the vertical central axis as the key points of head movement.
- the arm and hand images here need at least two images to be able to present the palm and back of the hand, the outer side of the arm and the inner side of the arm, etc.
- the face image can be obtained from the database or through other local acquisition methods, which is not limited in this application.
- the sign language action information here is the change process of the position of the arm skeleton posture and the key points of the hand, or the movement trajectory of the arm skeleton posture and the key points of the hand.
- the sign language action information here includes the arm skeleton posture and the positions of the key points of the hand of the character in each video frame. It should be understood that the changes in the positions of these key points across multiple video frames visually form the sign language actions.
- the facial movement information here includes information such as facial expressions, lip movements, and head posture changes when a person speaks, and can be embodied as a movement trajectory of key points that identify the positions of various parts of the human face.
- the facial expressions include the movements of the eyebrows, eyes, nose, mouth, cheeks, etc. when speaking,
- the lip movements are the mouth movements corresponding to the voice when speaking,
- and the head postures can be nodding or shaking the head to express emotions when speaking, or the natural head bobbing when speaking.
- the face action information here can be used to identify the face of each frame in the video in the form of key points, and the movement of the face can be reflected through the change of the position of the key points.
- S910 based on the sign language action information and arm and hand images, according to model 3-2, generate real sign language sequence pictures.
- The facial animation corresponding to the audio is generated to vividly display the emotional information of the audio, so that deaf-mute people can understand the meaning expressed by the audio more accurately. Furthermore, through the changes of the eyes and lip shape of the character in the generated animation, the matching degree between the facial animation and the audio is higher, and through the natural swing of the character's head when speaking, the animation is more realistic. This improves the user experience of deaf-mute people, promotes exchange and communication between deaf-mute people and ordinary people, and helps deaf-mute people better integrate into society.
- FIG. 10 is a schematic flowchart of a method 1000 of the present application.
- the video data here is similar to the video data in S801, and will not be repeated here.
- The input to model 4 here may be a facial animation, a sequence of face images, or the facial-expression features of a facial animation, where the facial-expression features may be represented by facial key points.
- For example, if the features input into model 4 are that the key points of the left and right eyebrows shift slightly outward, the key points of the lower eyelids move up, the key points of the left and right mouth corners move up, and the key points of the apple muscles move up, the voice emotion information output by model 4 can be "happy" or "joy". If the features input into model 4 are that the key points of the upper eyelids move down, the key points of the left and right mouth corners move down, and the key points of the middle of the lips move up, the voice emotion information output by model 4 can be "sad" or "lost". A toy illustration of this kind of mapping follows below.
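- The mapping from key-point displacements to voice emotion information is learned by model 4 from data; the toy rules below are not model 4 itself, only an illustration of the kind of relationship described in these examples (the key-point names and thresholds are invented for the sketch).

```python
import numpy as np

def emotion_from_displacements(disp):
    """Toy rule-of-thumb mapping from key-point displacements to an emotion label.

    `disp` maps a key-point name to its (dx, dy) displacement from a neutral
    frame, in image coordinates (y grows downward).  This only illustrates the
    kind of mapping model 4 learns from data; the real model is a trained
    neural network, not hand-written rules.
    """
    mouth_dy = np.mean([disp["mouth_left_corner"][1], disp["mouth_right_corner"][1]])
    eyelid_dy = disp["upper_eyelid"][1]
    if mouth_dy < -1.0:                        # mouth corners moved up
        return "happy"
    if mouth_dy > 1.0 and eyelid_dy > 0.5:     # corners and upper eyelids moved down
        return "sad"
    return "neutral"

print(emotion_from_displacements({
    "mouth_left_corner": (0.0, -2.0),
    "mouth_right_corner": (0.0, -2.5),
    "upper_eyelid": (0.0, 0.0),
}))  # -> "happy"
```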
- the discriminator 4 in model 4 compares the speech emotion information generated by the generator in model 4 with the speech emotion information in the video, and obtains the loss function value_4.
- the discriminator 5 in model 5 compares the text information generated by the generator in model 5 with the speech-converted text information in the video to obtain the loss function value_5.
- The emotional voice here can be understood as follows: the tone of the generated voice lets the listener feel the speaker's emotional tendency, and the emotional information assists the listener in understanding the semantics expressed by the speaker. For example, if the voice content is "what are you doing" and the voice emotion information generated in S404b is "joy" or "calm", then "what are you doing" can be understood as the speaker casually asking the listener what they are doing; if the voice emotion information generated in S404b is "anger", then "what are you doing" can be understood as the speaker berating the listener and telling them to stop. As another example, if the voice content is "haha" and the voice emotion information generated in S404b is "joy", then "haha" can be understood as the speaker suddenly laughing; if the voice emotion information generated in S404b is "helpless", then "haha" can be understood as the speaker's wry smile.
- The method 1000 also includes S1005: repeating all the above steps until the loss function values of all models reach the minimum, or in other words, until the sum of loss function value_4 and loss function value_5 reaches the minimum. In practice, a threshold can be set; when the loss function values of all models are lower than the threshold, they can be considered to have reached the minimum, as sketched in the training loop below.
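- A hedged sketch of such a training loop is given below. The model objects, the data-loader keys, and the threshold value are hypothetical stand-ins; only the stopping criterion (treating a sufficiently small loss_4 + loss_5 as the minimum) reflects the description above.

```python
import torch

def train_until_threshold(gen4, disc4, gen5, disc5, loader, opt,
                          threshold=0.05, max_epochs=100):
    """Repeat the training steps until loss_4 + loss_5 falls below a threshold.

    `gen4`/`disc4` stand in for the model that generates speech emotion
    information from facial animation and its discriminator; `gen5`/`disc5`
    stand in for the model that generates text from sign language actions and
    its discriminator.  Each discriminator is assumed to return a scalar loss
    tensor when given (generated output, reference from the video).
    """
    for epoch in range(max_epochs):
        total4, total5, n = 0.0, 0.0, 0
        for batch in loader:
            opt.zero_grad()
            loss4 = disc4(gen4(batch["face_animation"]), batch["speech_emotion"])
            loss5 = disc5(gen5(batch["sign_language"]), batch["speech_text"])
            (loss4 + loss5).backward()
            opt.step()
            total4 += loss4.item(); total5 += loss5.item(); n += 1
        if (total4 + total5) / n < threshold:
            break  # loss_4 + loss_5 is treated as having reached its minimum
    return gen4, gen5
```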
- By combining the model that generates speech emotion information from facial animation, the model that generates text information from sign language actions, and the model that generates emotional speech from the speech emotion information and the text information, a neural network capable of generating emotional speech can be obtained.
- the user uploads a video of sign language and facial animation.
- the content presented in the video here needs to include sign language movements and the corresponding facial movements of the sign language movements.
- The "correspondence" mentioned here can be understood as follows: when the sign language gesture expresses a certain meaning, the content or expression presented by the facial movements also corresponds to that meaning and to the accompanying voice.
- For example, when the sign language gesture expresses "sadness", the expression on the face is also sad, such as drooping eyebrows and drooping corners of the mouth, and the head may droop slightly; when the sign language gesture expresses "disgust", the face may frown slightly and the head may tilt back slightly.
- The user may use the system composed of the models trained by method 1000 through a front-end interface, for example through a webpage or through an application on a mobile terminal.
- In this way, speech with emotion is generated, so that the speech translated from sign language can express the meaning of the deaf-mute person more accurately, which promotes exchange and communication between deaf-mute people and ordinary people and helps deaf-mute people better integrate into society.
- the device of the embodiment of the present application will be described below with reference to FIG. 12 to FIG. 13 . It should be understood that the device described below can execute the method of the aforementioned embodiment of the present application. In order to avoid unnecessary repetition, repeated descriptions are appropriately omitted when introducing the device of the embodiment of the present application below.
- Fig. 12 is a schematic block diagram of an apparatus according to an embodiment of the present application.
- the apparatus 4000 shown in FIG. 12 includes an acquisition unit 4010 and a processing unit 4020 .
- The device 4000 can be used as a neural network training device, and the acquisition unit 4010 and the processing unit 4020 can be used to execute the neural network training method of the embodiments of the present application, for example, method 710, method 720, method 800, or method 1000.
- the obtaining unit 4010 is used to obtain the first training data
- The first training data includes a plurality of training audio frames and a plurality of training video frames, wherein the playback error between the training audio frames and the training video frames corresponding to the same semantics is less than or equal to a first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to a second threshold; the obtaining unit 4010 is also used to extract the first training data to obtain the first training voice data.
- The processing unit 4020 is configured to input the first training voice data into the first model to obtain training data of key points of the human face, where the key points of the human face include key points of first features, and the first features include at least one of eyes, head posture, and lip shape; it is also configured to calculate the first loss function value according to the difference between the training data of the key points of the human face and the data of the key points of the human face in the plurality of training video frames; and it is also configured to backpropagate the first loss function value to adjust the parameters that need to be trained.
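- The following sketch illustrates one possible form of this training step, assuming a PyTorch-style first model and mean squared error as the measure of difference; the tensor shapes and the choice of MSE are assumptions, not requirements of this application.

```python
import torch
import torch.nn.functional as F

def first_loss_step(model1, optimizer, train_voice, target_keypoints):
    """One hedged training step for the first model.

    `model1` is assumed to map first training voice data (e.g. acoustic
    features of shape [batch, T, n_mels]) to face key points of shape
    [batch, T, n_keypoints, 2]; `target_keypoints` are the key points taken
    from the training video frames.  Shapes and feature choices are
    illustrative assumptions.
    """
    optimizer.zero_grad()
    pred_keypoints = model1(train_voice)
    # First loss function value: difference between the predicted key-point
    # data and the key-point data in the training video frames (mean squared
    # error is one common choice).
    loss1 = F.mse_loss(pred_keypoints, target_keypoints)
    loss1.backward()          # backpropagate the first loss function value
    optimizer.step()          # adjust the parameters that need to be trained
    return loss1.item()
```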
- the obtaining unit 4010 is used to obtain the first training data
- The first training data includes a plurality of training audio frames and a plurality of training video frames, wherein the playback error between the training audio frames and the training video frames corresponding to the same semantics is less than or equal to the first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to the second threshold;
- the processing unit 4020 is configured to process the first training data to obtain training data of key points of the human face, where the key points of the human face include key points of first features, and the first features include at least one of eyes, head posture, and lip shape;
- the acquisition unit 4010 is further configured to extract the first training data to obtain a face training image
- the processing unit 4020 is further configured to input the training data of the key points of the human face and the training image of the human face into the first sub-model, with the purpose of reducing the value of the second loss function, to obtain a plurality of first training video frames, where a first training video frame is a training video frame of face action.
- the processing unit 4020 is further configured to process the first training data to obtain training data of key points of arms and hands;
- the acquiring unit 4010 is further configured to extract the first training data to obtain training images of arms and hands;
- the processing unit 4020 is further configured to input the training data of the key points of the arm and the hand and the training images of the arm and the hand into the second sub-model, with the purpose of reducing the value of the third loss function, to obtain a plurality of second training video frames, where a second training video frame is a training video frame of hand and arm movements, and the third model includes the first sub-model and the second sub-model.
- the processing unit 4020 is further configured to: with the purpose of reducing the value of the second loss function and the value of the fourth loss function, input the training data of the key points of the face and the training image of the face into the first sub-model to obtain a plurality of first training video frames;
- the processing unit 4020 is specifically further configured to: with the purpose of reducing the value of the third loss function and the value of the fourth loss function, input the training data of the key points of the arm and the hand and the training images of the arm and the hand into the second sub-model to obtain a plurality of second training video frames;
- the second loss function value is obtained by calculating the difference between the human faces in the plurality of first training video frames and the human faces in the plurality of training video frames,
- the third loss function value is calculated according to the difference between the hands and arms in the plurality of second training video frames and the hands and arms in the plurality of training video frames,
- the fourth loss function value is calculated according to the difference between a first similarity and a second similarity, where the first similarity is the similarity between the semantics corresponding to the first training video frames, among the plurality of first training video frames, played within a first time period and the semantics corresponding to the second training video frames, among the plurality of second training video frames, played within the first time period, and the second similarity is the similarity between the semantics corresponding to the face movements and the semantics corresponding to the hand and arm movements in the training video frames played within the first time period.
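- One way such a similarity-based loss could look is sketched below, assuming hypothetical semantic encoders that embed the frames played within the first time period and cosine similarity as the similarity measure; both assumptions are illustrative only.

```python
import torch
import torch.nn.functional as F

def fourth_loss(face_sem, hand_sem, target_face_sem, target_hand_sem):
    """Hedged sketch of the fourth loss function value.

    Each argument is a semantic embedding (one vector per first time period)
    produced by some hypothetical semantic encoder: `face_sem`/`hand_sem` for
    the generated first/second training video frames played in that period,
    and the `target_*` embeddings for the face and hand/arm motions in the
    original training video frames played in the same period.
    """
    sim1 = F.cosine_similarity(face_sem, hand_sem, dim=-1)                  # first similarity
    sim2 = F.cosine_similarity(target_face_sem, target_hand_sem, dim=-1)    # second similarity
    # The fourth loss function value is computed from the difference between them.
    return ((sim1 - sim2) ** 2).mean()
```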
- the obtaining unit 4010 is further configured to extract the first training data to obtain the first training voice data;
- the processing unit 4020 is further configured to input the first training voice data into the first model to obtain the key points of the human face, to calculate the first loss function value according to the difference between these key points and the data of the key points of the human face in the plurality of training video frames, and to backpropagate the first loss function value to adjust the parameters that need to be trained in the first model.
- the device 4000 may serve as a device for generating animation.
- the device includes an acquisition unit 4010 and a processing unit 4020 .
- the acquiring unit 4010 and the processing unit 4020 may be used to execute the method for generating an animation in the embodiment of the present application, for example, may be used to execute the method 500 or the method 900 .
- the acquiring unit 4010 is used to acquire the voice to be processed and the video to be processed;
- the processing unit 4020 is configured to obtain first preprocessed data according to the to-be-processed voice and the to-be-processed video, where the first preprocessed data includes data of key points of faces corresponding to multiple audio frames of the to-be-processed voice, the key points of the face include key points of a first feature, the position of at least one of the key points of the first feature corresponding to at least two audio frames among the plurality of audio frames is different, and the first feature includes at least one of eyes, head posture, and lip shape; it is also configured to obtain a plurality of image frames according to the first preprocessed data and the to-be-processed video, the first feature being different in at least two image frames among the plurality of image frames; and it is also configured to obtain an animation according to the plurality of image frames.
- the processing unit 4020 is further configured to input the speech to be processed and the video to be processed into a first model to obtain the first preprocessed data, where the first model is obtained by training based on first training data, the first training data includes a plurality of training audio frames and a plurality of training video frames, the playback error between the training audio frames and the training video frames corresponding to the same semantics is less than or equal to the first threshold, and the semantic similarity between the face action and the sign language action in the same training video frame is greater than or equal to the second threshold.
- the acquiring unit 4010 is further configured to extract the video to be processed to obtain a face image
- the processing unit 4020 is further configured to input the data of the key points of the face and the face image into the first sub-model to obtain a plurality of first image frames, and the first image frames are image frames of face movements , the first sub-model is obtained through training based on training data of key points of human faces and training images of human faces.
- the content of the to-be-processed video further includes arm and hand movements
- the first preprocessed data further includes data of key points of the arm and hand motions corresponding to the multiple audio frames of the to-be-processed speech; the processing unit 4020 is further configured to extract the video to be processed to obtain images of the arms and hands, and to input the data of the key points of the arms and hands and the images of the arms and hands into the second sub-model to obtain a plurality of second image frames, where the second image frames are image frames of hand and arm movements, the second sub-model is obtained by training based on training data of key points of arms and hands and training images of arms and hands, and the plurality of image frames includes the plurality of first image frames and the plurality of second image frames.
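- An end-to-end sketch of this generation flow is given below. All model callables and the way face and hand/arm frames are paired per time step are stand-ins for illustration; the actual models are the trained networks described in this application.

```python
def generate_animation_frames(first_model, sub_model_1, sub_model_2,
                              speech, video, face_image, arm_hand_images):
    """Hedged end-to-end sketch of the generation flow described above.

    `first_model` stands in for the network that turns the speech to be
    processed and the video to be processed into per-audio-frame key points of
    the face and of the arms and hands (the first preprocessed data);
    `sub_model_1` and `sub_model_2` stand in for the networks that turn key
    points plus reference images into image frames.
    """
    face_kpts, arm_hand_kpts = first_model(speech, video)
    first_frames = [sub_model_1(kp, face_image) for kp in face_kpts]            # face-motion frames
    second_frames = [sub_model_2(kp, arm_hand_images) for kp in arm_hand_kpts]  # hand/arm-motion frames
    # The plurality of image frames includes both the first and the second
    # image frames; here they are simply paired per time step before being
    # encoded into an animation by a downstream step.
    return list(zip(first_frames, second_frames))
```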
- the first sub-model is obtained by training with the purpose of reducing the second loss function value and the fourth loss function value, and the second sub-model is obtained by training with the purpose of reducing the third loss function value and the fourth loss function value; the second loss function value is calculated according to the difference between the faces in the plurality of first training video frames and the faces in the plurality of training video frames, where the plurality of first training video frames are obtained by inputting the training data of the key points of the human face and the training image of the human face into the first sub-model; the third loss function value is calculated according to the difference between the hands and arms in the plurality of second training video frames and the hands and arms in the plurality of training video frames, where the plurality of second training video frames are obtained by inputting the training data of the key points of the arms and hands and the training images of the arms and hands into the second sub-model; and the fourth loss function value is calculated according to the difference between the first similarity and the second similarity, where the first similarity is the similarity between the semantics corresponding to the first training video frames played within the first time period and the semantics corresponding to the second training video frames played within the first time period, and the second similarity is the similarity between the semantics corresponding to the face movements and the hand and arm movements in the training video frames played within the first time period.
- the device 4000 may serve as a device for generating speech.
- the device includes an acquisition unit 4010 and a processing unit 4020 .
- the acquiring unit 4010 and the processing unit 4020 may be used to execute the method for generating speech in this embodiment of the present application, for example, may be used to execute the method 600 or the method 1100 .
- the obtaining unit 4010 is used to obtain data to be processed, the data to be processed includes multiple image frames of facial movements and multiple image frames of sign language movements;
- the processing unit 4020 is used to obtain, according to the data to be processed, the emotion information corresponding to the facial actions and the first text corresponding to the sign language actions, and to obtain speech data with the emotional information according to the emotion information and the first text.
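- The sketch below illustrates this flow with hypothetical stand-ins for the trained models (an emotion classifier, a sign-to-text model, and an emotion-conditioned speech synthesizer); the function names and data keys are assumptions.

```python
def generate_emotional_speech(tts_model, emotion_classifier, sign_to_text_model, data):
    """Hedged sketch of the speech-generation flow.

    `data` holds the image frames of facial movements and of sign-language
    movements; the three callables are stand-ins for the trained networks: one
    infers emotion information from the facial frames, one translates the
    sign-language frames into the first text, and one synthesizes speech
    conditioned on both.
    """
    emotion = emotion_classifier(data["face_frames"])      # e.g. "joy", "anger"
    first_text = sign_to_text_model(data["sign_frames"])   # text corresponding to the sign actions
    # Speech data carrying the emotional information.
    return tts_model(text=first_text, emotion=emotion)
```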
- A "unit" here may be implemented in the form of software and/or hardware, which is not specifically limited.
- A "unit" may be a software program, a hardware circuit, or a combination of both that realizes the above functions.
- The hardware circuit may include an application-specific integrated circuit (ASIC), an electronic circuit, a processor for executing one or more software or firmware programs (such as a shared processor, a dedicated processor, or a group processor) and memory, a merged logic circuit, and/or other suitable components that support the described functions.
- the units of each example described in the embodiments of the present application can be realized by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
- FIG. 13 is a schematic diagram of a hardware structure of a device provided by an embodiment of the present application.
- the apparatus 6000 shown in FIG. 13 includes a memory 6001 , a processor 6002 , a communication interface 6003 and a bus 6004 .
- the memory 6001 , the processor 6002 , and the communication interface 6003 are connected to each other through a bus 6004 .
- the device 6000 can be used as a training device for a neural network.
- the memory 6001 may be a read only memory (read only memory, ROM), a static storage device, a dynamic storage device or a random access memory (random access memory, RAM).
- The memory 6001 may store a program, and when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 is configured to execute each step of the neural network training method of the embodiments of the present application. Specifically, the processor 6002 may execute steps S713, S714, S715, S722, and S724 in the method shown in FIG. 7 above, steps S802 to S805 in the method shown in FIG. 8 above, and steps S1002 and S1003 in the method shown in FIG. 10 above.
- The processor 6002 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is used to execute related programs to implement the neural network training method of the method embodiments of the present application.
- The processor 6002 may also be an integrated circuit chip with signal processing capabilities, for example, the chip shown in FIG. 4.
- each step of the neural network training method of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or instructions in the form of software.
- The above-mentioned processor 6002 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
- Various methods, steps, and logic block diagrams disclosed in the embodiments of the present application may be implemented or executed.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
- the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
- The software module can be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
- The storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required by the units included in the training device of the embodiments of the present application, or executes the neural network training method shown in FIG. 7, FIG. 8, and FIG. 10 of the method embodiments of the present application.
- the communication interface 6003 implements communication between the apparatus 6000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver. For example, training data can be obtained through the communication interface 6003 .
- the bus 6004 may include pathways for transferring information between various components of the device 6000 (eg, memory 6001 , processor 6002 , communication interface 6003 ).
- the device 6000 may serve as a device for generating animation.
- the memory 6001 can be ROM, static storage device and RAM.
- the memory 6001 may store a program, and when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 and the communication interface 6003 are used to execute each step of the method for generating animation according to the embodiment of the present application. Specifically, the processor 6002 may execute steps S520 to S540 in the method shown in FIG. 5 above.
- The processor 6002 can be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is used to execute related programs so as to realize the functions required by the units in the device for generating animation of the embodiments of the present application, or to execute the method for generating animation of the method embodiments of the present application.
- the processor 6002 may also be an integrated circuit chip with signal processing capabilities, for example, it may be the chip shown in FIG. 4 .
- each step of the method for generating animation in the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or instructions in the form of software.
- the above processor 6002 may also be a general processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
- the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
- the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in the field.
- The storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required to be performed by the units included in the device for generating animation of the embodiments of the present application, or executes the method for generating animation of the method embodiments of the present application.
- the device 6000 may serve as a device for generating voice.
- the memory 6001 can be ROM, static storage device and RAM.
- the memory 6001 may store a program, and when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 and the communication interface 6003 are used to execute each step of the method for generating speech in the embodiment of the present application. Specifically, the processor 6002 may execute steps S610 to S620 in the method shown in FIG. 6 above.
- The processor 6002 can be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is used to execute related programs so as to realize the functions required by the units in the device for generating speech of the embodiments of the present application, or to execute the method for generating speech of the method embodiments of the present application.
- the processor 6002 may also be an integrated circuit chip with signal processing capabilities, for example, it may be the chip shown in FIG. 4 .
- each step of the method for generating speech in the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or instructions in the form of software.
- the above processor 6002 may also be a general processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
- the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
- The software module can be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
- The storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required to be performed by the units included in the device for generating speech of the embodiments of the present application, or executes the method for generating speech of the method embodiments of the present application.
- the communication interface 6003 implements communication between the apparatus 6000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
- the data to be processed can be obtained through the communication interface 6003 .
- the bus 6004 may include pathways for transferring information between various components of the device 6000 (eg, memory 6001 , processor 6002 , communication interface 6003 ).
- the device 6000 may also include other devices necessary for normal operation during specific implementation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 6000 may also include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the device 6000 may also only include the components necessary to realize the embodiment of the present application, and does not necessarily include all the components shown in FIG. 13 .
- An embodiment of the present application provides a computer-readable medium. The computer-readable medium stores program code to be executed by a device, and the program code includes relevant content for executing the method for generating animation shown in FIG. 5 or FIG. 9, or relevant content for executing the method for generating speech shown in FIG. 6 or FIG. 11.
- An embodiment of the present application provides a computer-readable medium. The computer-readable medium stores program code to be executed by a device, and the program code includes relevant content for executing the neural network training method shown in FIG. 7, FIG. 8, or FIG. 10.
- An embodiment of the present application provides a computer program product.
- When the computer program product is run on a computer, the computer is made to execute the relevant content of the method for generating animation shown in FIG. 5 or FIG. 9, or the relevant content of the method for generating speech shown in FIG. 6 or FIG. 11.
- An embodiment of the present application provides a computer program product.
- When the computer program product is run on a computer, the computer is made to execute the relevant content of the neural network training method shown in FIG. 7, FIG. 8, or FIG. 10.
- An embodiment of the present application provides a chip. The chip includes a processor and a data interface; the processor reads, through the data interface, the instructions stored in a memory, and executes the method for generating animation shown in FIG. 5 or FIG. 9, or the method for generating speech shown in FIG. 6 or FIG. 11.
- An embodiment of the present application provides a chip. The chip includes a processor and a data interface; the processor reads, through the data interface, the instructions stored in a memory, and executes the neural network training method shown in FIG. 7, FIG. 8, or FIG. 10.
- The chip may further include a memory, where the memory stores instructions and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to execute the method for generating animation shown in FIG. 5 or FIG. 9, the method for generating speech shown in FIG. 6 or FIG. 11, or the neural network training method shown in FIG. 7, FIG. 8, or FIG. 10.
- the processor in the embodiment of the present application may be a central processing unit (central processing unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
- the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
- the non-volatile memory can be read-only memory (read-only memory, ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically programmable Erases programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
- Volatile memory can be random access memory (RAM), which acts as external cache memory.
- For example, many forms of RAM can be used, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), serial link dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
- the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
- the above-described embodiments may be implemented in whole or in part in the form of computer program products.
- the computer program product comprises one or more computer instructions or computer programs.
- When the computer instructions or computer programs are loaded and executed on the computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
- the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
- The computer instructions may be stored in one computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (such as infrared, radio, or microwave) manner.
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
- the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
- the semiconductor medium may be a solid state drive.
- the disclosed systems, devices and methods may be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of the units is only a logical function division. In actual implementation, there may be other division methods.
- Multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
- If the functions described above are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
- The technical solution of the present application, or the part that contributes to the prior art, or a part of the technical solution, can essentially be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Abstract
The present application relates to a method and apparatus for generating animation. The method for generating animation comprises: processing acquired speech to be processed and acquired video to be processed so as to obtain data of key points of a face corresponding to the speech, the key points of the face comprising key points of a first feature, the position of at least one of the key points of the first feature corresponding to at least two audio frames among a plurality of audio frames being different, and the first feature comprising at least one of the expression of the eyes, the head posture, and the lip shape; then obtaining a plurality of image frames according to the data of the key points of the face and the video; and then obtaining an animation according to the plurality of image frames. With the method and apparatus for generating animation provided in the present application, the facial expressions in a facial animation are enriched so as to present the emotional information of the audio more vividly, and the degree of matching between the facial animation and the speech is increased, so that a deaf person understands more precisely the meaning expressed by the audio, thereby improving the user experience of the deaf person.
Applications Claiming Priority (2)
- Application Number | Priority Date | Filing Date | Title |
- ---|---|---|---|
- CN202110795095.0A (published as CN115631267A) | 2021-07-14 | 2021-07-14 | 生成动画的方法及装置 |
- CN202110795095.0 | 2021-07-14 | | |
Publications (1)
- Publication Number | Publication Date |
- ---|---|
- WO2023284435A1 (fr) | 2023-01-19 |
Family
ID=84902115
Family Applications (1)
- Application Number | Title | Priority Date | Filing Date |
- ---|---|---|---|
- PCT/CN2022/096773 (WO2023284435A1, fr) | Procédé et appareil permettant de générer une animation | 2021-07-14 | 2022-06-02 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115631267A (fr) |
WO (1) | WO2023284435A1 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116071472B (zh) * | 2023-02-08 | 2024-04-30 | 华院计算技术(上海)股份有限公司 | 图像生成方法及装置、计算机可读存储介质、终端 |
- 2021-07-14: priority application CN202110795095.0A filed (CN), published as CN115631267A (zh), status: active, pending.
- 2022-06-02: international application PCT/CN2022/096773 filed (WO), published as WO2023284435A1 (fr), status: active, application filing.
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8566075B1 (en) * | 2007-05-31 | 2013-10-22 | PPR Direct | Apparatuses, methods and systems for a text-to-sign language translation platform |
CN104732590A (zh) * | 2015-03-09 | 2015-06-24 | 北京工业大学 | 一种手语动画的合成方法 |
CN109446876A (zh) * | 2018-08-31 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | 手语信息处理方法、装置、电子设备和可读存储介质 |
CN110570877A (zh) * | 2019-07-25 | 2019-12-13 | 咪咕文化科技有限公司 | 手语视频生成方法、电子设备及计算机可读存储介质 |
CN111862277A (zh) * | 2020-07-22 | 2020-10-30 | 北京百度网讯科技有限公司 | 用于生成动画的方法、装置、设备以及存储介质 |
CN112329451A (zh) * | 2020-12-03 | 2021-02-05 | 云知声智能科技股份有限公司 | 手语动作视频生成方法、装置、设备及存储介质 |
CN113077537A (zh) * | 2021-04-29 | 2021-07-06 | 广州虎牙科技有限公司 | 一种视频生成方法、存储介质及设备 |
Non-Patent Citations (3)
Title |
---|
"Master Thesis", 3 June 2019, NORTHWEST NORMAL UNIVERSITY, CN, article NAN SONG: "Research on Sign Language-to-Mandarin/Tibetan Emotional Speech Conversion by Combining Facial Expression Recognition", pages: 1 - 45, XP093025249 * |
SONG NAN, PEI-WEN WU, HONG-WU YANG: "Gesture-to-emotional Speech Conversion Based on Gesture Recognigion and Facial Expression Recognition", SHENGXUE-JISHU : JIKAN = TECHNICAL ACOUSTICS, SHENG XUE JI SHU BIAN JI BU, CN, vol. 37, no. 4, 31 August 2018 (2018-08-31), CN , pages 372 - 379, XP093024719, ISSN: 1000-3630, DOI: 10.16300/j.cnki.1000-3630.2018.04.014 * |
XINYA JI; HANG ZHOU; KAISIYUAN WANG; WAYNE WU; CHEN CHANGE LOY; XUN CAO; FENG XU: "Audio-Driven Emotional Video Portraits", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 20 May 2021 (2021-05-20), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081955157 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116342835A (zh) * | 2023-03-31 | 2023-06-27 | 华院计算技术(上海)股份有限公司 | 人脸三维表面网格生成方法、装置、计算设备及存储介质 |
CN116912373A (zh) * | 2023-05-23 | 2023-10-20 | 苏州超次元网络科技有限公司 | 一种动画处理方法和系统 |
CN116912373B (zh) * | 2023-05-23 | 2024-04-16 | 苏州超次元网络科技有限公司 | 一种动画处理方法和系统 |
CN117115315A (zh) * | 2023-07-12 | 2023-11-24 | 武汉人工智能研究院 | 语音驱动唇形生成方法、装置及存储介质 |
CN117635784A (zh) * | 2023-12-19 | 2024-03-01 | 世优(北京)科技有限公司 | 三维数字人脸部动画自动生成系统 |
CN117635784B (zh) * | 2023-12-19 | 2024-04-19 | 世优(北京)科技有限公司 | 三维数字人脸部动画自动生成系统 |
CN117809002A (zh) * | 2024-02-29 | 2024-04-02 | 成都理工大学 | 一种基于人脸表情识别与动作捕捉的虚拟现实同步方法 |
CN117809002B (zh) * | 2024-02-29 | 2024-05-14 | 成都理工大学 | 一种基于人脸表情识别与动作捕捉的虚拟现实同步方法 |
Also Published As
Publication number | Publication date |
---|---|
CN115631267A (zh) | 2023-01-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7557055B2 (ja) | 目標対象の動作駆動方法、装置、機器及びコンピュータプログラム | |
WO2023284435A1 (fr) | Procédé et appareil permettant de générer une animation | |
CN110688911B (zh) | 视频处理方法、装置、系统、终端设备及存储介质 | |
WO2022048403A1 (fr) | Procédé, appareil et système d'interaction multimodale sur la base de rôle virtuel, support de stockage et terminal | |
CN111145322B (zh) | 用于驱动虚拟形象的方法、设备和计算机可读存储介质 | |
US8224652B2 (en) | Speech and text driven HMM-based body animation synthesis | |
KR101558202B1 (ko) | 아바타를 이용한 애니메이션 생성 장치 및 방법 | |
Fan et al. | A deep bidirectional LSTM approach for video-realistic talking head | |
CN113421547B (zh) | 一种语音处理方法及相关设备 | |
CN114144790A (zh) | 具有三维骨架正则化和表示性身体姿势的个性化语音到视频 | |
US20230082830A1 (en) | Method and apparatus for driving digital human, and electronic device | |
CN111967334B (zh) | 一种人体意图识别方法、系统以及存储介质 | |
KR102373608B1 (ko) | 디지털 휴먼 영상 형성을 위한 전자 장치 및 방법과, 그를 수행하도록 컴퓨터 판독 가능한 기록 매체에 저장된 프로그램 | |
Rebol et al. | Passing a non-verbal turing test: Evaluating gesture animations generated from speech | |
Tuyen et al. | Conditional generative adversarial network for generating communicative robot gestures | |
Filntisis et al. | Video-realistic expressive audio-visual speech synthesis for the Greek language | |
CN117152843B (zh) | 数字人的动作控制方法及其系统 | |
WO2024066549A1 (fr) | Procédé de traitement de données et dispositif associé | |
Liu et al. | Real-time speech-driven animation of expressive talking faces | |
JP7526874B2 (ja) | 画像生成システムおよび画像生成方法 | |
CN115409923A (zh) | 生成三维虚拟形象面部动画的方法、装置及系统 | |
Fatima et al. | Use of affect context in dyadic interactions for continuous emotion recognition | |
Manglani et al. | Lip Reading Into Text Using Deep Learning | |
US20240265605A1 (en) | Generating an avatar expression | |
Chand et al. | Survey on Visual Speech Recognition using Deep Learning Techniques |
Legal Events
- Date | Code | Title | Description |
- ---|---|---|---|
- | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22841075; Country of ref document: EP; Kind code of ref document: A1 |
- | NENP | Non-entry into the national phase | Ref country code: DE |
- | 122 | Ep: pct application non-entry in european phase | Ref document number: 22841075; Country of ref document: EP; Kind code of ref document: A1 |