CN110688911B - Video processing method, device, system, terminal equipment and storage medium - Google Patents


Info

Publication number
CN110688911B
CN110688911B
Authority
CN
China
Prior art keywords
video
reply
image sequence
face
face image
Prior art date
Legal status
Active
Application number
CN201910838068.XA
Other languages
Chinese (zh)
Other versions
CN110688911A (en)
Inventor
刘炫鹏
Current Assignee
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN201910838068.XA
Publication of CN110688911A
Application granted
Publication of CN110688911B

Classifications

    • G06V 20/40 - Scenes; scene-specific elements in video content
    • G06V 40/165 - Human faces: detection, localisation, normalisation using facial parts and geometric relationships
    • G06V 40/168 - Human faces: feature extraction; face representation
    • G06V 40/172 - Human faces: classification, e.g. identification
    • G10L 13/00 - Speech synthesis; text-to-speech systems
    • G10L 15/1822 - Speech recognition: parsing for meaning understanding
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech-to-text systems
    • G10L 25/57 - Speech or voice analysis specially adapted for comparison or discrimination, for processing of video signals
    • G10L 2015/223 - Execution procedure of a spoken command

Abstract

The embodiment of the application discloses a video processing method, apparatus, system, terminal device and storage medium. The method comprises the following steps: acquiring a video to be processed; acquiring a target audio segment and a face image sequence in the video to be processed; performing emotion analysis on the face image sequence to obtain emotional characteristics; performing voice analysis on the target audio segment to obtain sentence characteristics, wherein the sentence characteristics are used for representing key words in the target audio segment; determining reply content and an expression behavior of a virtual character based on the emotional characteristics and the sentence characteristics; and generating and outputting a reply video for the video to be processed, wherein the reply video comprises voice content corresponding to the reply content and a virtual character executing the expression behavior. According to the embodiment of the application, the emotional characteristics and sentence characteristics of a person can be obtained from a video of the person speaking, and a virtual-character video matching those characteristics is generated as a reply, which improves the realism and naturalness of human-computer interaction.

Description

Video processing method, device, system, terminal equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of human-computer interaction, in particular to a video processing method, a device, a system, terminal equipment and a storage medium.
Background
With the development of artificial intelligence technology, robot customer service has become increasingly capable and applicable to more and more scenarios, which greatly improves customer service efficiency and saves human resources. However, most existing customer service robots converse with users only in text form; the interaction mode is limited and the accuracy of semantic understanding is not high. Some customer service robots can communicate with users face to face, but their expressions are mechanical and lack vividness, which greatly degrades the user experience.
Disclosure of Invention
In view of the above problems, embodiments of the present application provide a video processing method, apparatus, terminal device, and storage medium, which can improve the semantic understanding accuracy of a robot, improve the realism and naturalness of human-computer interaction, and optimize the human-computer interaction experience.
In a first aspect, an embodiment of the present application provides a video processing method, which may include: acquiring a video to be processed; acquiring a target audio segment and a face image sequence in the video to be processed, wherein the target audio segment comprises a human voice segment, the face image sequence comprises a plurality of face images, and the face image sequence corresponds to the target audio segment; performing emotion analysis on the face image sequence to obtain emotional characteristics, wherein the emotional characteristics are used for representing the emotion of the person in the face images; performing voice analysis on the target audio segment to obtain sentence characteristics, wherein the sentence characteristics are used for representing key words in the target audio segment; determining reply content and an expression behavior of a virtual character based on the emotional characteristics and the sentence characteristics; and generating and outputting a reply video for the video to be processed, wherein the reply video comprises voice content corresponding to the reply content and a virtual character executing the expression behavior.
Optionally, determining the reply content and the performance behavior of the virtual character based on the emotional characteristic and the sentence characteristic includes: obtaining semantic information of the target audio clip based on the emotional features and the sentence features; according to the semantic information, determining reply content corresponding to the semantic information; and searching the expression behaviors of the virtual character corresponding to the semantic information and the emotional characteristics from a pre-established rule base.
Optionally, the expression behavior includes expressions and motions, and generating and outputting a reply video for the video to be processed, where the reply video includes voice content corresponding to the reply content and a virtual character performing the expression behavior, includes: generating voice content corresponding to the reply content according to the reply content; acquiring expression driving parameters and action driving parameters corresponding to the expression behavior; driving the expression and the action of the virtual character based on the expression driving parameters and the action driving parameters to generate a first reply image sequence, wherein the first reply image sequence is formed by a plurality of continuous behavior images generated by driving the virtual character; and generating and outputting a reply video for the video to be processed according to the voice content and the first reply image sequence, wherein the first reply image sequence is played corresponding to the voice content.
Optionally, generating and outputting a reply video for the video to be processed according to the voice content and the first reply image sequence, where the first reply image sequence is played corresponding to the voice content, and the method includes: acquiring a mouth shape image corresponding to the reply content from a mouth shape database established in advance; synthesizing the mouth shape image to the mouth position of the virtual character in each behavior image of the corresponding first reply image sequence to obtain a second reply image sequence; and generating and outputting a reply video aiming at the video to be processed according to the voice content and the second reply image sequence, wherein the second reply image sequence is played corresponding to the voice content.
Optionally, performing emotion analysis on the face image sequence to obtain an emotion feature, including: extracting a face key point corresponding to each face image in the face image sequence; providing each face image and the face key point corresponding to each face image as input to a machine learning model to obtain a feature vector corresponding to each face image, wherein the machine learning model is pre-trained to output the feature vector corresponding to the face image according to the face image and the face key point corresponding to the face image; and determining the emotion characteristics corresponding to the feature vectors according to the mapping relation between the feature vectors and the emotion characteristics to obtain the emotion characteristics corresponding to each face image in the face image sequence.
Optionally, the obtaining a target audio segment and a face image sequence in the video to be processed includes: decomposing a video to be processed to obtain a complete audio stream and a video image sequence; acquiring audio segments except the interference audio in the complete audio stream as target audio segments in the video to be processed; acquiring a target image sequence corresponding to the time stamp from the video image sequence according to the time stamp of the target audio clip; and extracting all face images of the target person in the target image sequence, and taking all face images as the face image sequence in the video to be processed.
Optionally, extracting all face images of the target person in the target image sequence and taking all the face images as the face image sequence in the video to be processed includes: extracting face key points of the target person in each target image in the target image sequence; cropping, based on the face key points, the face image corresponding to the face key points from each target image; and preprocessing each face image to obtain a face image sequence in a specified format, wherein the preprocessing comprises at least one of enlargement, reduction and translation processing.
In a second aspect, an embodiment of the present application provides a video processing apparatus, which may include: the video acquisition module is used for acquiring a video to be processed; the audio image acquisition module is used for acquiring a target audio segment and a face image sequence in a video to be processed, wherein the target audio segment comprises a human voice segment, the face image sequence comprises a plurality of face images, and the face image sequence corresponds to the target audio segment; the emotion analysis module is used for carrying out emotion analysis on the face image sequence to obtain emotion characteristics, and the emotion characteristics are used for representing the emotion of people in the face image; the voice analysis module is used for carrying out voice analysis on the target audio clip to obtain sentence characteristics, and the sentence characteristics are used for representing key words in the target audio clip; the data determining module is used for determining reply content and the expression behavior of the virtual character based on the emotional characteristic and the sentence characteristic; and the video generation module is used for generating and outputting a reply video aiming at the video to be processed, and the reply video comprises voice content corresponding to the reply content and a virtual character executing the expression behavior.
Optionally, the data determination module comprises: the semantic acquiring unit is used for acquiring semantic information of the target audio clip based on the emotion characteristics and the sentence characteristics; the reply content determining unit is used for determining reply content corresponding to the semantic information according to the semantic information; and the expression behavior determining unit is used for searching the expression behaviors of the virtual character corresponding to the semantic information and the emotional characteristics from a pre-established rule base.
Optionally, the expression behavior includes expressions and motions, and the video generation module includes: the voice generating unit is used for generating voice content corresponding to the reply content according to the reply content; the parameter acquisition unit is used for acquiring expression driving parameters and action driving parameters corresponding to the expression behavior; the image generation unit is used for driving the expression and the action of the virtual character based on the expression driving parameters and the action driving parameters to generate a first reply image sequence, and the first reply image sequence is formed by a plurality of continuous behavior images generated by driving the virtual character; and the first video generation unit is used for generating and outputting a reply video for the video to be processed according to the voice content and the first reply image sequence, and the first reply image sequence is played corresponding to the voice content.
Optionally, the first video generating unit includes: the mouth shape obtaining subunit is used for obtaining a mouth shape image corresponding to the reply content from a mouth shape database established in advance; the mouth shape synthesizing subunit is used for synthesizing the mouth shape images to the mouth positions of the virtual characters in each behavior image of the corresponding first reply image sequence to obtain a second reply image sequence; and the second video generation subunit is used for generating and outputting a reply video aiming at the video to be processed according to the voice content and the second reply image sequence, and the second reply image sequence is played corresponding to the voice content.
Optionally, the emotion analysis module includes: the key point extraction unit is used for extracting a face key point corresponding to each face image in the face image sequence; the vector acquisition unit is used for providing each face image and the face key points corresponding to each face image as input to the machine learning model to obtain the feature vector corresponding to each face image, and the machine learning model is trained in advance to output the feature vector corresponding to the face image according to the face image and the face key points corresponding to the face image; and the feature determining unit is used for determining the emotion features corresponding to the feature vectors according to the mapping relation between the feature vectors and the emotion features to obtain the emotion features corresponding to each face image in the face image sequence.
Optionally, the audio image acquisition module includes: the decomposition unit is used for decomposing the video to be processed to obtain a complete audio stream and a video image sequence; the audio acquisition unit is used for acquiring audio segments except the interference audio in the complete audio stream as target audio segments in the video to be processed; the image acquisition unit is used for acquiring a target image sequence corresponding to the time stamp from the video image sequence according to the time stamp of the target audio clip; and the face image extraction unit is used for extracting all face images of the target person in the target image sequence and taking all the face images as the face image sequence in the video to be processed.
Optionally, the face image extraction unit includes: the data extraction subunit is used for extracting the face key points of the target person in each target image in the target image sequence; the image cropping subunit is used for cropping, based on the face key points, a face image corresponding to the face key points from each target image; and the image processing subunit is used for preprocessing each face image to obtain a face image sequence in a specified format as the face image sequence in the video to be processed, wherein the preprocessing comprises at least one of enlargement, reduction and translation processing.
In a third aspect, an embodiment of the present application provides a terminal device, where the terminal device may include: a memory; one or more processors coupled with the memory; one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs configured to perform the method of the first aspect as described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having program code stored therein, where the program code is called by a processor to execute the method according to the first aspect.
The embodiments of the application provide a video processing method, apparatus, terminal device and storage medium. A target audio segment and a face image sequence in a video to be processed are obtained, wherein the target audio segment comprises a human voice segment, the face image sequence comprises a plurality of face images, and the face image sequence corresponds to the target audio segment. Emotion analysis is then performed on the face image sequence to obtain emotional characteristics, and voice analysis is performed on the target audio segment to obtain sentence characteristics. Based on the emotional characteristics and the sentence characteristics, reply content and an expression behavior of a virtual character are determined so as to generate and output a reply video for the video to be processed, the reply video comprising voice content corresponding to the reply content and a virtual character executing the expression behavior. In this way, the interactive robot can accept video input and use the emotional characteristics and sentence characteristics of the person in the video to assist semantic understanding, improving the robot's semantic understanding accuracy; at the same time, a corresponding virtual character video is generated as a reply according to those characteristics, which improves the realism and naturalness of human-computer interaction and optimizes the human-computer interaction experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description are only some embodiments, not all embodiments, of the present application. All other embodiments and drawings obtained by a person skilled in the art based on the embodiments of the present application without any inventive step are within the scope of the present invention.
Fig. 1 shows a schematic diagram of an application environment suitable for the embodiment of the present application.
Fig. 2 is a flowchart illustrating a video processing method according to an embodiment of the present application.
Fig. 3 shows an interaction diagram of a video processing method provided by an embodiment of the present application.
Fig. 4 is a flowchart illustrating a video processing method according to another embodiment of the present application.
Fig. 5 shows a flowchart of the method of step S320 in fig. 4.
Fig. 6 shows a flowchart of the method of step S324 in fig. 5.
Fig. 7 shows a flowchart of the method of step S330 in fig. 4.
FIG. 8 is a diagram illustrating an Arousal-Valence emotion model provided in an embodiment of the present application.
Fig. 9 shows a flowchart of the method of step S350 in fig. 4.
Fig. 10 shows a flowchart of the method of step S360 in fig. 4.
Fig. 11 shows a flowchart of the method of step S364 in fig. 10.
Fig. 12 shows a block diagram of a video processing apparatus according to an embodiment of the present application.
Fig. 13 shows a block diagram of a terminal device according to an embodiment of the present application, configured to execute a video processing method according to an embodiment of the present application.
Fig. 14 shows a block diagram of a computer-readable storage medium for executing a video processing method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
With the development of artificial intelligence technology, customer service robots have become increasingly capable. When communicating with a customer service robot, a user usually expects an accurate reply. However, when the machine performs semantic understanding of what the user says, it is susceptible to ambiguity, which causes errors when the machine determines the user's intention. To reduce the influence of ambiguity and improve the accuracy of semantic understanding, the prior art generally introduces context information or has the user actively disambiguate through follow-up questions.

However, introducing context information for disambiguation requires acquiring and retaining the user's historical speech content, and the technique cannot be used when no context information exists; when the context information is unrelated to the current utterance, introducing it may even reduce the accuracy of semantic understanding. Having the user actively disambiguate through follow-up questions increases the user's communication cost and degrades the user experience.
To solve the above problems, the inventors studied the current difficulties in the interaction between customer service robots and users, considered the usage requirements of practical scenarios more comprehensively, and proposed the video processing method, apparatus, terminal device and storage medium of the embodiments of the present application.
In order to better understand a video processing method, an apparatus, a terminal device, and a storage medium provided in the embodiments of the present application, an application environment suitable for the embodiments of the present application is described below.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment suitable for the embodiment of the present application. The video processing method provided by the embodiment of the present application can be applied to the multimodal interaction system 100 shown in fig. 1. The multimodal interaction system 100 includes a terminal device 101 and a server 102, the server 102 being communicatively coupled to the terminal device 101. The server 102 may be a conventional server or a cloud server, which is not limited herein.
The terminal device 101 may be various electronic devices having a display screen and supporting data input, including but not limited to a smart phone, a tablet computer, a laptop portable computer, a desktop computer, a wearable electronic device, and the like. Specifically, the data input may be based on a voice module provided on the terminal device 101 to input voice, a character input module to input characters, an image input module to input images, a video input module to input video, and the like, or may be based on a gesture recognition module provided on the terminal device 101, so that a user may implement an interaction manner such as gesture input.
The terminal device 101 may have a client application installed, and the user may communicate with the server 102 based on the client application (e.g. an APP, a WeChat applet, etc.). Specifically, the server 102 has a corresponding server-side application installed; the user may register a user account with the server 102 through the client application and communicate with the server 102 based on that user account. For example, the user logs into the user account in the client application and, based on that account, may input text information, voice information, image information, video information and the like through the client application. After receiving the information input by the user, the client application may send the information to the server 102, so that the server 102 can receive, process and store the information; the server 102 may also return corresponding output information to the terminal device 101 according to the received information.
In some embodiments, the client application may be used to provide customer service to a user and, in customer service communication with the user, may interact with the user based on a virtual robot. In particular, the client application may receive information input by the user and respond to the information based on the virtual robot. The virtual robot is a software program based on visual graphics; when executed, the program presents to the user a robot form that simulates biological behaviors or thoughts. The virtual robot may be a robot simulating a real person, for example a lifelike robot created from the appearance of the user or of another person, or a robot with an animation effect, such as a robot in the form of an animal or a cartoon character.
In some embodiments, after acquiring reply information corresponding to the information input by the user, the terminal device 101 may display a virtual robot image corresponding to the reply information on a display screen of the terminal device 101 or another image output device connected thereto. As one mode, while the virtual robot image is played, the audio corresponding to the virtual robot image may be played through a speaker of the terminal device 101 or another audio output device connected thereto, and text or graphics corresponding to the reply information may be displayed on the display screen of the terminal device 101, thereby realizing multimodal interaction with the user in terms of image, voice, text and the like.
In some embodiments, the means for processing the information input by the user may also be disposed on the terminal device 101, so that the terminal device 101 can interact with the user without relying on establishing communication with the server 102; in this case, the multimodal interaction system 100 may include only the terminal device 101.
The above application environments are only examples for facilitating understanding, and it is to be understood that the embodiments of the present application are not limited to the above application environments.
The following describes in detail a video processing method, an apparatus, a terminal device, and a storage medium provided in embodiments of the present application with specific embodiments.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a video processing method according to an embodiment of the present application. The video processing method of this embodiment can be applied to a terminal device having a display screen or another image output device, and can also be applied to a server. The terminal device can be an electronic device such as a smart phone, a tablet computer or a wearable intelligent terminal. In a specific embodiment, the video processing method can be applied to the video processing apparatus 900 shown in fig. 12 and the terminal device 600 shown in fig. 13. As will be described in detail with respect to the flow shown in fig. 2, the video processing method may specifically include the following steps:
step S210: and acquiring a video to be processed.
At present, robots bring great convenience to people's life and work. For example, some tedious interaction can be completed by a virtual digital person instead of a real person. In the process of interaction between a robot and a user, the user's intention is usually determined from the text content or voice content input by the user, but such input is easily affected by ambiguity, so the robot makes errors when judging the user's intention and cannot interact accurately. Therefore, in the embodiment of the application, a to-be-processed video containing a video of the user speaking is obtained, and the user's intention and emotion are analyzed from this video to determine accurate reply content, which can effectively improve the accuracy of the robot's semantic understanding.
The video to be processed is a video stream containing a user face and a user audio, and the user face corresponds to the user audio. The terminal equipment can acquire the video to be processed in various modes. In some embodiments, the video to be processed may be a speaking video of the user, which is acquired by the terminal device in real time by using an audio acquisition device such as a microphone and an image acquisition device such as a camera when the user has a conversation with the interactive robot. Specifically, as a manner, when an application program corresponding to the interactive robot is run in a system foreground of the terminal device, each hardware module of the terminal device may be called to collect a speaking video of the user.
In other embodiments, the video to be processed may also be a recorded video, provided that the person and audio in the recorded video are consistent with the interactive robot's current conversation partner. As one mode, when the application corresponding to the interactive robot runs in the system foreground of the terminal device, a recorded video input by the user at the application interface corresponding to the interactive robot may be acquired through the background of the application. The recorded video may be a video acquired from a third-party client program, or a recorded video downloaded from the Internet or from a remote source. It can be understood that the source of the video to be processed is not limited, as long as the video to be processed includes the face and the audio of the user currently talking with the interactive robot; the possibilities are not listed one by one here.
Step S220: the method comprises the steps of obtaining a target audio segment and a face image sequence in a video to be processed, wherein the target audio segment comprises a human voice segment, the face image sequence comprises a plurality of face images, and the face image sequence corresponds to the target audio segment.
In some embodiments, after the to-be-processed video is acquired by the terminal device, a target audio segment and a face image sequence in the to-be-processed video may be acquired, where the target audio segment includes a human voice segment, the face image sequence includes a plurality of face images, and the face image sequence corresponds to the target audio segment. Wherein, the human voice segment can be understood as the speaking segment of the user who is currently interacting with the interactive robot.
In some embodiments, the video to be processed may be decomposed to extract the target audio segment and the face image sequence. The face image sequence may be the video images including the user's face selected from the decomposed video images. For example, a 1-minute video at 30 FPS is decomposed into 1800 video images (1 minute × 60 seconds/minute × 30 frames/second) and 1 minute of audio, from which a 30-second target audio segment and the video images containing a human face corresponding to that 30-second segment are extracted.
In some embodiments, the face image sequence may be a chronological and continuous set of face images. For example, if the user's face appears and speaks throughout a period of time in the video, the acquired face image sequence is a set of face images that is chronological and continuous. In other embodiments, the face image sequence may also be a discontinuous, timestamped set of face images; for example, when the user's face information cannot be acquired in a certain time period because the user lowers their head or turns around, the resulting face image sequence is a discontinuous set of face images in chronological order.
Step S230: and performing emotion analysis on the face image sequence to obtain emotion characteristics, wherein the emotion characteristics are used for representing the emotion of the person in the face image.
In the embodiment of the application, after the terminal device acquires the face image sequence in the video to be processed, emotion analysis can be performed on the face image sequence to acquire the emotional characteristics of the user. The emotional characteristics can be used for representing the emotion of the person in the face images. In some embodiments, the emotion characterized by the emotional characteristics may include positive emotions such as excitement, pleasure, happiness, satisfaction, relaxation and calmness, and may also include negative emotions such as fatigue, boredom, depression, anger and tension, which is not limited herein.
In some embodiments, emotion analysis may be performed on the face image sequence through deep learning techniques. As one way, the face image sequence may be input into a trained emotion recognition model, and the output result of the emotion recognition model is obtained, where the output result may include the emotional characteristics of the person in the images. Specifically, in some embodiments, the emotion recognition model may be obtained in advance through neural network training, based on training samples consisting of face image sequences of a large number of real persons speaking and the emotional features presented by their faces. The training samples can comprise input samples and output samples: the input samples can comprise a face image sequence, and the output samples can be the emotional characteristics of the person in the images, so that the trained emotion recognition model can output the emotional characteristics of the person in the images according to the obtained face image sequence.
The emotion recognition model may adopt a machine learning model such as an RNN (Recurrent Neural Network), a CNN (Convolutional Neural Network), a BLSTM (Bi-directional Long Short-Term Memory) network, a VAE (Variational Auto-Encoder), BERT (Bidirectional Encoder Representations from Transformers), a Support Vector Machine (SVM), or the like, which is not limited herein. For example, the emotion recognition model may also be a variant or a combination of the above machine learning models.
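As a non-authoritative illustration only (not part of the patent), the sketch below shows one possible shape for such an emotion recognition model: a small per-frame CNN fused with the face key points, followed by an LSTM over the image sequence. PyTorch, the layer sizes, the 68 key points and the seven-emotion output are all assumptions made for demonstration.

```python
import torch
import torch.nn as nn

class EmotionRecognizer(nn.Module):
    """Illustrative sketch: per-frame CNN + key-point features, then an LSTM over the sequence."""

    def __init__(self, num_keypoints=68, num_emotions=7, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-image feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fuse = nn.Linear(32 + num_keypoints * 2, hidden)  # fuse image and key-point features
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # temporal modelling of the sequence
        self.head = nn.Linear(hidden, num_emotions)            # per-frame emotion logits

    def forward(self, images, keypoints):
        # images: (batch, seq, 3, H, W); keypoints: (batch, seq, num_keypoints * 2)
        b, s = images.shape[:2]
        img_feat = self.cnn(images.flatten(0, 1)).view(b, s, -1)
        fused = torch.relu(self.fuse(torch.cat([img_feat, keypoints], dim=-1)))
        out, _ = self.lstm(fused)
        return self.head(out)  # one emotion feature vector per face image in the sequence
```

Any of the model families listed above (RNN, CNN, BLSTM, VAE, BERT, SVM or combinations thereof) could play the same role; this sketch merely makes the input/output contract of step S230 concrete.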
It is understood that the specific training method of the machine learning model may be an existing training method, which is not limited in the embodiments of the present application. For example, the structure, training method and objective function of the model may be further refined according to actual requirements. As one way, the machine learning model may first be trained on a data set that does not distinguish a specific field (for text) or a specific person (for images), and then fine-tuned for the specific field or person, so as to achieve the desired effect quickly.
In some embodiments, the emotion recognition model may run in a server, and the server converts the face image sequence into the corresponding emotional characteristics through the emotion recognition model. As one mode, after the terminal device obtains the face image sequence, it may send the face image sequence to the server, and the server recognizes and converts the face image sequence into the corresponding emotional characteristics; that is, the data processing of converting to emotional characteristics may be completed by the server. Deploying the emotion recognition model in the server can reduce the occupation of the storage capacity and computing resources of the terminal device; the server only needs to receive a small amount of data (the face image sequence information is small in size), which greatly reduces the pressure of data transmission and improves its efficiency.
In other embodiments, the emotion recognition model may also run locally at the terminal device, so that the interactive robot may provide services in an offline environment.
Step S240: and carrying out voice analysis on the target audio clip to obtain sentence characteristics, wherein the sentence characteristics are used for representing key words in the target audio clip.
In the embodiment of the application, after the terminal device acquires the target audio segment in the video to be processed, the terminal device may perform voice analysis on the target audio segment to acquire the sentence characteristics of the target audio segment. The sentence characteristics can be used to characterize keywords in the target audio segment. For example, if a sentence in the target audio segment is "what is the weather in Beijing tomorrow", the sentence characteristics extracted after the terminal device performs voice analysis may be keywords such as "tomorrow", "Beijing" and "weather".
In some embodiments, performing voice analysis on the target audio segment to obtain the sentence characteristics may comprise performing speech-to-text processing on the target audio segment to obtain the text content corresponding to the target audio segment, and then extracting keywords from the text content as the sentence characteristics. The speech-to-text processing of the target audio segment can be performed in various ways. As one way, the target audio segment may be converted to text content through deep learning techniques. Specifically, the target audio segment may be input into a trained speech recognition model to obtain the text content output by the speech recognition model for the target audio segment. The speech recognition model may be obtained in advance through neural network training, based on training samples consisting of a large amount of audio of real persons speaking and the text content corresponding to that audio.
In some implementations, the speech recognition model can run in a server, and the server converts the target audio segment into the corresponding text content through the speech recognition model. Alternatively, the model can run locally on the terminal device, so that the interactive robot can provide services in an offline environment.
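A minimal sketch of the speech analysis of step S240, under stated assumptions: the `asr_model.transcribe()` interface is a hypothetical placeholder for whatever trained speech recognition model is used, and jieba's TF-IDF keyword extraction is an assumed choice for Chinese text, not something prescribed by the method.

```python
import jieba.analyse  # assumed third-party keyword extractor for Chinese text

def extract_sentence_features(target_audio_path, asr_model, top_k=5):
    """Speech-to-text on the target audio segment, then keyword extraction as sentence features.

    `asr_model.transcribe()` is a hypothetical interface standing in for any
    trained speech recognition model as described above.
    """
    text = asr_model.transcribe(target_audio_path)            # e.g. "明天北京的天气怎么样"
    keywords = jieba.analyse.extract_tags(text, topK=top_k)   # e.g. ["明天", "北京", "天气"]
    return text, keywords
```

The returned keywords correspond to the "tomorrow / Beijing / weather" example above and feed step S250 together with the emotional characteristics.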
Step S250: based on the emotional characteristics and the sentence characteristics, the reply content and the expression behavior of the virtual character are determined.
In the embodiment of the application, after the terminal device obtains the emotion characteristics and the sentence characteristics, reply contents to be replied by the virtual character and the expression behaviors of the virtual character can be determined based on the emotion characteristics and the sentence characteristics.
The virtual character is a virtual robot form that simulates human behaviors or thoughts and is presented to the user by executing a software program of visual graphics, i.e., a virtual character simulating a human form. It can be the virtual character that the interactive robot obtains for presentation to the user in response to the sentences and emotions in the user's speaking video, such as a virtual anchor or a virtual customer service agent. The reply content of the virtual character can be content, obtained in response to the sentences and emotions of the user in the video, with which the virtual character replies to the user, and it can be presented in the form of text, audio, images and the like. The expression behavior of the virtual character may be the body language of the virtual character obtained for presentation to the user. In some embodiments, the expression behavior of the virtual character may include a facial expression of the virtual character, such as happiness, a dazed look, surprise or concern, and may also include a body movement of the virtual character, such as touching the chin, bowing or shaking hands, which is not limited herein and may be set according to the actual scenario.
It can be understood that the same sentence conveys different semantics under different emotions. For example, for the same sentence "what does this mean", the semantic information understood under a negative emotion may be a challenge, venting, or the like, while the semantic information understood under a positive emotion may be a question, a consultation, or the like. Therefore, in the embodiment of the application, semantic understanding can be performed according to the user's sentence and the user's emotional state when speaking that sentence, so that the user's intention can be accurately determined and the robot can adopt different reply contents and different expression behaviors. For example, the reply content under a negative emotion may be "please calm down" or the like, while the reply content under a positive emotion may be "XX means ……" or the like.
In some embodiments, after the emotion features and the sentence features are obtained, the terminal device may determine, by using a deep learning technique, semantic information related to a conversation, such as an intention of a user, a word slot, and the like, so as to determine, according to the semantic information, corresponding reply content and an expression behavior of a virtual character. As one mode, the emotion feature and the sentence feature are input into the trained feature recognition model to obtain semantic information output by the feature recognition model, and then the corresponding reply content and the expression behavior of the virtual character are generated according to the semantic information and the emotion feature of the user. The feature recognition model can be obtained by training a neural network based on a large number of input samples of the sentence features and the emotion features and output samples of semantic information corresponding to the sentence features under the emotion features.
The feature recognition model may be the machine learning model, and is not limited herein. The feature recognition model can be run locally on the terminal device or in the server.
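The decision in step S250 can be pictured with the sketch below: a hypothetical `intent_model` stands in for the trained feature recognition model, and a pre-established rule base maps (intent, emotion) pairs to reply content and an expression behavior. The rule-base keys, reply templates and behavior names are invented examples, not entries defined by the patent.

```python
# Illustrative rule base: (intent, dominant emotion) -> (reply template, expression behavior)
RULE_BASE = {
    ("ask_meaning", "negative"): ("Please calm down, let me explain.", "hands_on_chest_apologetic"),
    ("ask_meaning", "positive"): ("That term means ...",               "smile_open_palm"),
    ("ask_weather", "neutral"):  ("Let me check the weather for you.", "nod_smile"),
}

def decide_reply(sentence_features, emotion_labels, intent_model):
    """Fuse sentence and emotion features into semantic information, then pick reply + behavior.

    `intent_model.predict()` is a placeholder for the trained feature recognition
    model mentioned above; `emotion_labels` is one label per face image.
    """
    intent = intent_model.predict(sentence_features, emotion_labels)
    dominant = max(set(emotion_labels), key=emotion_labels.count)  # most frequent emotion label
    reply, behavior = RULE_BASE.get((intent, dominant),
                                    ("Sorry, I did not catch that.", "neutral_idle"))
    return reply, behavior
```

The lookup mirrors the earlier optional step of searching a pre-established rule base for the expression behavior corresponding to the semantic information and the emotional characteristics.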
Step S260: and generating and outputting a reply video aiming at the video to be processed, wherein the reply video comprises voice content corresponding to the reply content and a virtual character executing the expression behavior.
In this embodiment, after determining the reply content and the expression behavior of the virtual character, the terminal device may generate a reply video for the video to be processed, where the reply video includes the voice content corresponding to the reply content and the personalized virtual character executing the expression behavior. The reply video containing the personalized virtual character and the reply voice content may then be output, so as to present to the user a simulated appearance, voice and behavior close to those of a real-person customer service agent.
In some embodiments, the reply content may be processed by an audio conversion device to obtain the corresponding voice content. Further, the timbre and tone of the voice used to broadcast the voice content may be matched to the user's emotional characteristics and sentence characteristics; for example, a softer voice may be selected for broadcasting under a negative emotion, thereby optimizing the human-computer interaction experience.
In some embodiments, after determining the expression behavior of the virtual character, the terminal device may obtain corresponding expression behavior driving parameters, such as expression driving parameters, motion driving parameters, and the like, and then drive the virtual character to generate the expression behavior according to the expression behavior driving parameters, thereby generating an image stream containing the virtual character executing the expression behavior, and then correspond the image stream with the voice content to generate a reply video of the video to be processed.
Furthermore, in some embodiments, the lip movements of the virtual character can be determined according to the voice content, so that the lip movements of the virtual character correspond to the output voice content. In this way, the virtual character can imitate a real person speaking, and multimodal interaction with the user is performed using voice together with natural expressions and behaviors, which improves the realism and naturalness of human-computer interaction and optimizes and enriches the human-computer interaction experience.
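An end-to-end sketch of step S260 under heavy assumptions: `tts`, `renderer` and `mouth_db` are hypothetical interfaces standing in for the text-to-speech, character-driving and mouth-shape modules described above, and muxing frames with ffmpeg is an assumed implementation detail rather than part of the claimed method.

```python
import subprocess
import cv2

def build_reply_video(reply_text, behavior, tts, renderer, mouth_db,
                      fps=25, out_path="reply.mp4"):
    """Assemble the reply video from the speech content and the driven virtual character.

    All of `tts`, `renderer` and `mouth_db` are placeholder components used only
    to make the data flow of the reply-video generation concrete.
    """
    audio_path, duration_s = tts.synthesize(reply_text)        # voice content for the reply
    params = renderer.driving_parameters(behavior)             # expression + action driving parameters

    for i in range(int(duration_s * fps)):
        frame = renderer.render_frame(params, i / fps)         # behavior image ("first reply image sequence")
        mouth = mouth_db.lookup(reply_text, i / fps)           # mouth shape for this instant
        frame = renderer.paste_mouth(frame, mouth)             # overlay -> "second reply image sequence"
        cv2.imwrite(f"frame_{i:05d}.png", frame)

    # Mux the image sequence with the speech so they play in correspondence.
    subprocess.run(["ffmpeg", "-y", "-framerate", str(fps), "-i", "frame_%05d.png",
                    "-i", audio_path, "-c:v", "libx264", "-c:a", "aac",
                    "-shortest", out_path], check=True)
    return out_path
```

The two intermediate sequences mirror the optional steps described earlier: driving the character yields the first reply image sequence, and compositing the mouth shapes yields the second, which is then played in correspondence with the voice content.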
In a specific application scenario, as shown in fig. 3, a user may open an application client (e.g., a WeChat applet or a standalone APP) through a terminal device to enter an interactive interface with a virtual robot. The interactive interface includes a video interface, and the user may directly hold a face-to-face conversation with the virtual robot displayed on the video interface. Meanwhile, the terminal device may acquire the user's speaking video by calling hardware modules such as the camera and the microphone, and then perform emotion analysis and voice analysis on the speaking video to generate reply content for it, such as "I am so sorry", and an expression behavior for it, such as "hands placed on the chest with an apologetic expression on the face". A reply video is then generated according to the reply content and the expression behavior (including the voice content corresponding to the reply content and the virtual robot performing the expression behavior) and played on the video interface of the interactive interface (the female figure in the reply video shown in fig. 3 is a virtual character simulating a real person). In some scenarios, a text message corresponding to the reply content, such as "I am so sorry", may also be displayed at the bottom of the video.
In some embodiments, when the terminal device has established a communication connection with the server, the terminal device may also send the acquired to-be-processed video to the server; the server performs emotion analysis and voice analysis on the to-be-processed video, determines the reply content corresponding to the to-be-processed video and the expression behavior of the virtual character, and generates a reply video including the voice content corresponding to the reply content and the virtual character performing the expression behavior. The reply video is then output to the terminal device, which acquires, plays and displays it.
It can be understood that, in this embodiment, each step may be performed locally by the terminal device, may also be performed in the server, and may also be performed by the terminal device and the server separately, and according to different actual application scenarios, tasks may be allocated according to requirements, so as to implement an optimized virtual robot customer service experience, which is not limited herein.
According to the video processing method provided by the embodiment of the application, a to-be-processed video of a user conversing with the robot is obtained; voice analysis is then performed on the target audio segment in the to-be-processed video to obtain sentence characteristics, and emotion analysis is performed on the face image sequence corresponding to the target audio segment to obtain emotional characteristics. Based on the emotional characteristics and the sentence characteristics, the robot's reply content and the expression behavior of the virtual character are determined, so that a reply video for the to-be-processed video is generated and output, the reply video comprising the voice content corresponding to the reply content and the virtual character executing the expression behavior. In this way, the interactive robot can accept video input and use the emotional characteristics and sentence characteristics of the person in the video to assist semantic understanding, improving the robot's semantic understanding accuracy; at the same time, a corresponding personalized virtual character video is generated as a reply according to those characteristics, which improves the realism and naturalness of human-computer interaction and optimizes the human-computer interaction experience.
Referring to fig. 4, fig. 4 is a flowchart illustrating a video processing method according to another embodiment of the present application. As will be described in detail with respect to the flow shown in fig. 4, the video processing method may specifically include the following steps:
step S310: and acquiring a video to be processed.
Step S320: the method comprises the steps of obtaining a target audio segment and a face image sequence in a video to be processed, wherein the target audio segment comprises a human voice segment, the face image sequence comprises a plurality of face images, and the face image sequence corresponds to the target audio segment.
In the embodiment of the present application, the content in the foregoing embodiment can be referred to for the specific description of step S310 and step S320, which is not described herein again.
In some embodiments, referring to fig. 5, the acquiring a target audio segment and a face image sequence in a video to be processed may include:
step S321: and decomposing the video to be processed to obtain a complete audio stream and a video image sequence.
A video is formed by combining an audio stream and a video image stream, and the video image stream is formed by splicing video images frame by frame in chronological order. Thus, in some embodiments, the video to be processed may be decomposed by various existing video decomposition software to obtain the complete audio stream and video image sequence of the video to be processed, where the timestamp lengths of the complete audio stream and the video image sequence are identical and may be the same as the length of the video. The video image sequence may be understood as the set of consecutive video image frames, in chronological order, generated after the video is decomposed into a plurality of video images.
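As an illustrative sketch of step S321 (not a component of the patent), the decomposition can be approximated with common open-source tools; the use of OpenCV for frame extraction, an ffmpeg binary for the audio track, and the file paths are assumptions for demonstration only.

```python
import subprocess
import cv2  # OpenCV: frame-by-frame decoding of the video image stream

def decompose_video(video_path, audio_path="full_audio.wav"):
    """Split a to-be-processed video into a video image sequence and a complete audio stream."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)          # consecutive frames in chronological order
    capture.release()

    # Extract the complete audio stream, whose timestamp length matches the video.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "pcm_s16le", audio_path],
        check=True,
    )
    timestamps = [i / fps for i in range(len(frames))]  # per-frame timestamps in seconds
    return frames, timestamps, audio_path
```

For a 1-minute video at 30 FPS this yields 1800 frames and 1 minute of audio, matching the worked example given earlier.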
Step S322: and acquiring audio segments except the interference audio in the complete audio stream as target audio segments in the video to be processed.
In the embodiment of the application, after the complete audio stream is obtained, interfering audio can be removed from the complete audio stream, so that the audio segments other than the interfering audio are obtained and serve as the target audio segment in the video to be processed. The interfering audio may be noise, background audio, silent audio, the audio of other users, and other audio unrelated to the user conversing with the robot, so that the speech audio of the user conversing with the interactive robot can be extracted.
In some embodiments, when the sounds of multiple users are included in the complete audio stream, the sound segment of the user with the clearest sound or the largest volume may be selected from the complete audio stream as the audio segment of the user having a conversation with the robot, that is, as the target audio segment in the video to be processed.
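A minimal sketch of how the voice segments of step S322 might be separated from interfering audio using simple short-time-energy voice activity detection; the frame length, the threshold heuristic and the implicit assumption that the loudest speaker is the conversation partner are illustrative choices, not requirements of the method.

```python
import numpy as np

def find_voice_segments(samples, sample_rate, frame_ms=30, energy_ratio=4.0):
    """Return (start_s, end_s) spans whose short-time energy exceeds an estimated noise floor.

    Assumes `samples` is a mono float array of the complete audio stream.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    energies = np.array([
        np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    threshold = energy_ratio * np.median(energies)  # crude noise-floor estimate
    voiced = energies > threshold

    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * frame_ms / 1000.0, i * frame_ms / 1000.0))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000.0, n_frames * frame_ms / 1000.0))
    return segments
```

The resulting spans approximate the target audio segment, i.e. the human voice portions of the complete audio stream with silent and background audio excluded.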
Step S323: and acquiring a target image sequence corresponding to the time stamp from the video image sequence according to the time stamp of the target audio clip.
In this embodiment of the application, after the target audio segment in the video to be processed is obtained, the target image sequence corresponding to the timestamps may be obtained from the decomposed video image sequence according to the timestamps of the target audio segment. In this way, the emotional state of the user while speaking the target audio segment can be determined from the target image sequence.
Because the target audio clip is partial audio extracted from the complete audio stream, it may be one continuous segment or a combination of several continuous sub-segments arranged in chronological order. Accordingly, the target image sequence corresponding to the target audio clip may be a continuous, time-ordered image set or a discontinuous, time-ordered image set; this is not limited here, as long as the timestamps of the target image sequence correspond to the timestamps of the target audio clip.
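As a small illustration of the timestamp correspondence described above (assuming per-frame timestamps in seconds were recorded during decomposition), the target image sequence can be gathered as follows:

```python
def select_target_images(frames, timestamps, speech_spans):
    """Keep the frames whose timestamps fall inside any target audio span."""
    target = []
    for frame, t in zip(frames, timestamps):
        if any(start <= t <= end for start, end in speech_spans):
            target.append((t, frame))
    return target  # chronological, possibly discontinuous, as described above
```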
Step S324: and extracting all face images of the target person in the target image sequence, and taking all face images as the face image sequence in the video to be processed.
In the embodiment of the application, after the target image sequence is obtained, all face images of a target person in the target image sequence can be extracted, and all face images are used as the face image sequence in the video to be processed. Wherein the target person is a user having a conversation with the interactive robot.
In some embodiments, when there are multiple users in a certain image in the target image sequence, the face image with the largest occupied area may be used as the target person, or the user closest to the camera may be used as the target person, which is not limited herein.
Further, because the user may lower the head, turn around, or perform similar actions while speaking, extracting the face image only according to face area or distance from the camera may fail to extract the correct face image from some images. Therefore, all face images of the target person may be obtained by first identifying the facial feature information of a face image already determined to belong to the target person, and then selecting from the target image sequence all face images that match this facial feature information. In this way, the emotional state of the user conversing with the interactive robot can be acquired accurately.
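One possible way to realize this feature-based matching, sketched here with the open-source face_recognition library as an assumed tool, is to compare every detected face against a reference encoding of the target person (computed beforehand, e.g. from a frame where the target was identified by largest face area):

```python
import cv2
import face_recognition
import numpy as np

def collect_target_faces(target_images, reference_encoding, max_dist=0.6):
    """Keep, for each target image, the face that matches the target person."""
    face_sequence = []
    for t, frame in target_images:                    # (timestamp, BGR frame) pairs
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        locations = face_recognition.face_locations(rgb)
        encodings = face_recognition.face_encodings(rgb, locations)
        if not encodings:
            continue                                  # e.g. head lowered or turned away
        dists = face_recognition.face_distance(encodings, reference_encoding)
        best = int(np.argmin(dists))
        if dists[best] <= max_dist:                   # matches the target person's features
            top, right, bottom, left = locations[best]
            face_sequence.append((t, frame[top:bottom, left:right]))
    return face_sequence
```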
In some embodiments, referring to fig. 6, in order to segment the image portion where the face is located from each complete single target image, face key point extraction may be performed on the target image. Specifically, the extracting all face images of the target person in the target image sequence, and taking all face images as the face image sequence in the video to be processed may include:
step S3241: and extracting the face key points of the target person in each target image in the target image sequence.
In some embodiments, the face of the target person in each target image of the target image sequence may be identified and 68 face key points of the target person extracted, so as to determine the face region of the target person in each target image based on the positional distribution of the face key points. The number of extracted face key points is not limited in this embodiment of the present application and may also be 128, for example.
Step S3242: and intercepting the face image corresponding to the face key points in each target image based on the face key points.
It can be understood that after the face key points of the target person in the target image are extracted, the key region positions of the face of the target person, such as eyebrows, eyes, a nose, a mouth, a face contour and the like, can be located according to the face key points, so that the face image corresponding to the face key points in each target image can be intercepted according to the position distribution of the face key points.
Step S3243: and preprocessing each face image to obtain a face image sequence in a specified format, wherein the preprocessing comprises at least one of amplification, reduction and movement processing.
In some embodiments, after obtaining the face image of the target person, each face image may be preprocessed to obtain a face image sequence in a specified format as the face image sequence in the video to be processed. Wherein the preprocessing includes at least one of a zoom-in, zoom-out, and move processing. Therefore, the face image can be processed into the face image in the specified format through operations such as scaling, translation and the like, and subsequent analysis is facilitated.
In some embodiments, the specified format may be a specified image size, such as 256 pixels by 256 pixels, or may be a specified proportion of a face in an image, where the face is located in the center of the image, and is not limited herein and is set reasonably according to actual situations. For example, when a face image is recognized by using a neural network model, a specified format of a face image sequence may be set according to an input requirement of the neural network model.
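For illustration, a sketch of steps S3241 to S3243 using dlib's 68-landmark predictor and OpenCV is given below; the landmark model path, the crop margin, and the 256 x 256 output size are assumptions.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Path to the standard 68-landmark model is an assumption; any landmark model works.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def normalize_face(image, size=256, margin=0.2):
    """Extract 68 key points, crop the face region, and scale to the specified format."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)
    if not rects:
        return None, None
    shape = predictor(gray, rects[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])
    # Crop around the landmarks with a small margin (slicing clamps at image borders).
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    pad = int(margin * max(x1 - x0, y1 - y0))
    x0, y0 = max(x0 - pad, 0), max(y0 - pad, 0)
    face = image[y0:y1 + pad, x0:x1 + pad]
    face = cv2.resize(face, (size, size))             # e.g. 256 x 256 pixels
    return face, pts
```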
Step S330: and performing emotion analysis on the face image sequence to obtain emotion characteristics, wherein the emotion characteristics are used for representing the emotion of the person in the face image.
In some embodiments, referring to fig. 7, the performing emotion analysis on the face image sequence to obtain an emotional characteristic may include:
step S331: and extracting face key points corresponding to each face image in the face image sequence.
It can be understood that the position distribution of the face key points is different when the face makes different expressions, so that the face key points corresponding to each face image in the face image sequence can be extracted for emotion analysis, so as to improve the accuracy of emotion analysis. The number of the face key points may be 68.
Step S332: and providing each face image and the face key points corresponding to each face image as input to a machine learning model to obtain the feature vector corresponding to each face image, and pre-training the machine learning model to output the feature vector corresponding to the face image according to the face image and the face key points corresponding to the face image.
Specifically, in some embodiments, the machine learning model may encode the face image and its corresponding face key points separately to obtain a first feature vector and a second feature vector. The machine learning model may then align and splice the two feature vectors to generate a third feature vector. By performing this processing on each face image in the face image sequence and its corresponding face key points, a feature sequence composed of third feature vectors is obtained and used as the actual input of the machine learning model. For example, the machine learning model may encode face image 1 and the 68 face key points corresponding to face image 1 as feature vector a and feature vector b, and then align and splice feature vector a and feature vector b into a feature vector c of the form [a, b]; after repeating this processing for the multiple face images and their corresponding face key points, a feature sequence composed of the feature vectors c is obtained as the actual input of the machine learning model.
After the terminal device inputs each face image and the face key point corresponding to each face image into the machine learning model, a 2-dimensional feature vector corresponding to each face image output by the machine learning model can be obtained, and the 2-dimensional feature vector can be used for analyzing the emotional state of the user in the image.
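The patent does not disclose the network architecture; purely as an assumed sketch, a model that encodes the face crop and the 68 key points separately, splices the two feature vectors in the "[a, b]" manner described above, and outputs a 2-dimensional feature vector might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class EmotionNet(nn.Module):
    """Sketch of a model that fuses a face crop with its 68 landmarks
    and outputs a 2-D (arousal, valence) feature vector."""
    def __init__(self):
        super().__init__()
        self.img_enc = nn.Sequential(                  # first feature vector (image)
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64),
        )
        self.kpt_enc = nn.Sequential(                  # second feature vector (key points)
            nn.Linear(68 * 2, 64), nn.ReLU(),
        )
        self.head = nn.Linear(128, 2)                  # spliced vector -> 2-D output

    def forward(self, face, keypoints):
        a = self.img_enc(face)                         # [B, 64]
        b = self.kpt_enc(keypoints.flatten(1))         # [B, 64]
        c = torch.cat([a, b], dim=1)                   # align-and-splice, "[a, b]"
        return self.head(c)                            # [B, 2]
```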
Step S333: and determining the emotion characteristics corresponding to the feature vectors according to the mapping relation between the feature vectors and the emotion characteristics to obtain the emotion characteristics corresponding to each face image in the face image sequence.
In some embodiments, the mapping relationship between the 2-dimensional feature vector and the emotional feature may be embodied by an Arousal-Valence emotion model, in which the two dimensions of the feature vector correspond to the Arousal axis and the Valence axis, respectively. Specifically, the Arousal-Valence emotion space onto which the 2-dimensional feature vector is mapped may be divided into 12 equal subspaces according to a designed scheme, the 12 subspaces corresponding to 12 emotional states. The 12 emotional states are divided into 6 positive emotions (such as excited, happy, satisfied, relaxed, and calm) and 6 negative emotions (such as tired, bored, depressed, angry, and tense). A coordinate point in the emotion space can then be uniquely determined from the value of each dimension of the 2-dimensional feature vector, and the emotional state corresponding to the 2-dimensional feature vector is determined by finding the emotional state of the subspace into which that coordinate point falls. For example, referring to FIG. 8, FIG. 8 is a diagram of an Arousal-Valence emotion model.
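The exact partition of the Arousal-Valence space is not specified here; a common choice, assumed for illustration, is 12 equal 30-degree sectors around the origin. Two of the twelve state names are left as placeholders because only five positive and five negative emotions are named above.

```python
import math

# Two of the twelve states are placeholders; only five positive and five
# negative emotions are named in the text above.
EMOTION_STATES = [
    "excited", "happy", "satisfied", "relaxed", "calm", "positive_6",
    "tired", "bored", "depressed", "angry", "tense", "negative_6",
]

def emotion_from_vector(arousal, valence):
    """Map a 2-D (arousal, valence) point to one of 12 equal angular subspaces."""
    angle = math.atan2(arousal, valence) % (2 * math.pi)   # angle in [0, 2*pi)
    index = int(angle // (2 * math.pi / 12))               # 30-degree sectors
    return EMOTION_STATES[index]
```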
Step S340: and carrying out voice analysis on the target audio clip to obtain sentence characteristics, wherein the sentence characteristics are used for representing key words in the target audio clip.
Step S350: based on the emotional characteristics and the sentence characteristics, the reply content and the expression behavior of the virtual character are determined.
In the embodiment of the present application, the content in the foregoing embodiment can be referred to for the specific description of step S340 and step S350, and is not described herein again.
In some embodiments, referring to fig. 9, the determining the reply content and the performance behavior of the virtual character based on the emotional characteristic and the sentence characteristic may include:
step S351: and obtaining semantic information of the target audio clip based on the emotion characteristics and the sentence characteristics.
In some embodiments, a pre-trained sentence recognition model may be used to encode the sentence features to obtain a high-dimensional feature vector of each keyword, and the high-dimensional feature vector and a 2-dimensional feature vector of emotion are used together for semantic understanding, so as to obtain semantic information related to a conversation, such as a user intention and a word slot. The sentence recognition model may be the machine learning model, and is not limited in this respect.
In this embodiment of the present application, the high-dimensional keyword feature vector and the 2-dimensional emotion feature vector may be combined for semantic understanding in various ways, which is not limited here. In one way, the high-dimensional feature vector and the 2-dimensional emotion feature vector may be fed into different machine learning models, and the outputs of those models may then be spliced (concatenated) and input into another machine learning model for semantic understanding; that is, the semantic understanding model is composed of several machine learning sub-models. Alternatively, the high-dimensional feature vector and the 2-dimensional emotion feature vector may be spliced directly into one vector, in the "[a, b]" form described above, and input into a new machine learning model for semantic understanding.
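As an assumed sketch of the second option (direct vector splicing), the high-dimensional keyword feature and the 2-dimensional emotion feature could be concatenated and fed to an intent classifier; the dimensions and number of intents below are placeholders:

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    """Sketch of the direct-splice option: concatenate the high-dimensional keyword
    feature with the 2-D emotion feature and classify the user intent."""
    def __init__(self, keyword_dim=256, num_intents=32):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(keyword_dim + 2, 128), nn.ReLU(),
            nn.Linear(128, num_intents),
        )

    def forward(self, keyword_vec, emotion_vec):
        fused = torch.cat([keyword_vec, emotion_vec], dim=-1)   # "[a, b]" style splice
        return self.classifier(fused)                           # intent logits
```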
Step S352: and determining reply content corresponding to the semantic information according to the semantic information.
In some embodiments, a question-answer model (which may be a machine learning model) may be established from semantic information and reply content. The question-answer model may be trained on a large amount of semantic information and the corresponding reply content. For example, massive semantic information and corresponding replies obtained from the communication records of human customer service may be used as training samples, with the user-side information as the input and the customer-service-side reply as the expected output; the question-answer model is then trained with a machine learning method. The reply content corresponding to the semantic information can thus be obtained through the question-answer model.
Step S353: and searching the expression behaviors of the virtual character corresponding to the semantic information and the emotional characteristics from a pre-established rule base.
In some embodiments, based on the obtained semantic information and emotional characteristics, suitable expressions and actions can be retrieved by rule matching from a manually designed and pre-built "robot expression and action database", so as to guide the generation of the virtual character video. A large number of pre-designed rules are stored in the database; for example, one rule may be "semantics: calling for help; emotion: tense; robot expression with expression parameters (a0, a1, …); robot action: gently waving the hands up and down, with action driving parameters (b0, b1, …)". In this way, the expression behavior of the virtual character corresponding to the semantic information and the emotional characteristics can be matched from the database.
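A toy version of such a rule base, with placeholder keys and parameter values, might be organized as follows:

```python
# A toy rule base keyed by (intent, emotion); all values are placeholders.
RULE_BASE = {
    ("ask_for_help", "tense"): {
        "expression": "concerned",
        "expression_params": [0.2, 0.7],        # (a0, a1, ...)
        "action": "wave_hands_gently",
        "action_params": [0.1, 0.4],            # (b0, b1, ...)
    },
}

def lookup_behavior(intent, emotion):
    """Match the expression behavior of the virtual character from the rule base."""
    default = {"expression": "neutral", "expression_params": [],
               "action": "idle", "action_params": []}
    return RULE_BASE.get((intent, emotion), default)
```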
Step S360: and generating and outputting a reply video aiming at the video to be processed, wherein the reply video comprises voice content corresponding to the reply content and a virtual character executing the expression behavior.
In some embodiments, referring to fig. 10, the generating and outputting a reply video for the video to be processed, where the reply video includes the voice content corresponding to the reply content and the avatar performing the performance behavior, may include:
step S361: and generating voice content corresponding to the reply content according to the reply content.
In some embodiments, the reply content may be converted to audio through a deep learning technique to generate the speech content corresponding to the reply content. Specifically, the reply content may be input to a pre-trained speech synthesis model to obtain the speech content output by that model. The speech synthesis model may be the machine learning model described above and is not limited here. In one mode, a CNN model may be selected as the speech synthesis model; it performs feature extraction through convolution kernels and generates the speech content by associating each phoneme in the phoneme sequence corresponding to the reply content one-to-one with spectrum information and fundamental frequency information.
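The patent contemplates a pre-trained speech synthesis model; purely to illustrate the text-to-speech interface, the assumed sketch below substitutes an off-the-shelf engine (pyttsx3) for that model:

```python
import pyttsx3

def synthesize_reply(reply_text, out_path="reply.wav"):
    """Convert the reply content to speech with an off-the-shelf TTS engine."""
    engine = pyttsx3.init()
    engine.save_to_file(reply_text, out_path)   # writes the speech content to disk
    engine.runAndWait()
    return out_path
```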
Step S362: and acquiring expression driving parameters and action driving parameters corresponding to the expression behaviors.
In some embodiments, when the expression behavior of the virtual character includes both an expression and an action of the virtual character, the corresponding expression driving parameters and action driving parameters may be acquired. The expression driving parameters may be a series of expression parameters for adjusting the face model of the virtual character, and the action driving parameters may be a series of limb parameters for adjusting the body model of the virtual character. The face model of the virtual character may be a three-dimensional face model built with 3D face reconstruction technology based on 3D Morphable Models (3DMM), and the details of the face model may closely resemble a real human face. The body model of the virtual character may be a three-dimensional body model created with three-dimensional modeling software.
It can be understood that, in this embodiment of the present application, the acquired expression driving parameters and action driving parameters are multiple groups of parameter sequences that vary over time, and each group corresponds to a set of preset three-dimensional model key points of the face model, aligned in time with the speech content. For example, if the reply video of the virtual character requires 30 video images per second, then 30 sets of three-dimensional model key points of the face model are required per second (each set includes the position and depth information of every key point in three-dimensional space). If the audio duration of the speech content is 10 seconds (i.e., 10 seconds of video images are required), then 300 sets of three-dimensional model key points are required in total; that is, 300 groups of expression driving parameters and action driving parameters are required, and these 300 groups are aligned in time with the 10 seconds of speech content.
Therefore, when the virtual character informs the user of the content through voice, the corresponding expression can be exposed on the face, and the body can make corresponding action. For example, when the interactive robot informs a child of a specific route by voice, the interactive robot performs an action of "pointing the direction of the road or a displayed electronic map" while exposing an expression of "the other party of interest" on the face.
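As an assumed sketch of the temporal alignment described above (e.g. 30 frames per second times 10 seconds of speech gives 300 parameter sets), a parameter sequence can be resampled onto the frame timeline by linear interpolation:

```python
import numpy as np

def align_driving_params(params, audio_seconds, fps=30):
    """Resample a parameter sequence so one parameter set exists per output frame
    (e.g. 30 fps x 10 s of speech -> 300 sets)."""
    params = np.asarray(params, dtype=np.float32)      # shape [T, D]
    n_frames = int(round(audio_seconds * fps))
    src = np.linspace(0.0, 1.0, num=len(params))
    dst = np.linspace(0.0, 1.0, num=n_frames)
    # Interpolate every parameter dimension onto the frame timeline.
    return np.stack(
        [np.interp(dst, src, params[:, d]) for d in range(params.shape[1])], axis=1
    )
```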
Step S363: and driving the expression and the action of the virtual character based on the expression driving parameter and the action driving parameter to generate a first recovery image sequence, wherein the first recovery image sequence is formed by a plurality of continuous action images generated by driving the virtual character.
In the embodiment of the application, the expression and the action of the virtual character are driven through the expression driving parameters and the action driving parameters, so that the human face model of the virtual character can be driven to present different expressions, and the body model of the virtual character can be driven to present different actions.
As one way, the expression driving parameters and the action driving parameters may first be aligned so that they cover the same video duration; a face detail sequence A = (a1, a2, …, an) and a limb action sequence B = (b1, b2, …, bn) are then generated from the two aligned parameter sequences, and the corresponding first reply image sequence C = (c1, c2, …, cn) is generated with face detail and limb action in one-to-one correspondence, where f(a1, b1) = c1, f(a2, b2) = c2, and so on.
It is understood that the face model of the virtual character has a plurality of three-dimensional model key points corresponding to different feature positions on the face model, and the body model likewise has a plurality of three-dimensional model key points corresponding to different feature positions on the body model. These three-dimensional model key points may be a set of key points for describing the whole or partial form of the face model and the body model, recording the position of each key point on the face model and the body model in three-dimensional space. For example, corresponding to the lip shape of a human face, a plurality of points distributed at intervals may be selected as needed on the contour line of the face model corresponding to the lip part, as three-dimensional model key points describing the lip shape; likewise, corresponding to a human arm, a plurality of points distributed at intervals may be selected as needed on the contour line of the body model corresponding to the arm part, as three-dimensional model key points describing the arm shape.
Of course, the portions of the face model of the virtual character that can be driven by the expression driving parameters may include the eye shape, nose shape, face shape, forehead shape, and the like, in addition to the lip shape. In this way, when the face model and the body model of the virtual character are driven by the expression driving parameters and the action driving parameters, detailed, vivid, and realistic behavior images of the virtual character can be obtained, so that the first reply image sequence can be generated from a succession of virtual character image frames.
Step S364: and generating and outputting a reply video aiming at the video to be processed according to the voice content and the first reply image sequence, wherein the first reply image sequence is played corresponding to the voice content.
In some embodiments, the terminal device may synthesize the acquired voice content with the first reply image sequence to generate a reply video for the video to be processed and output the reply video, with the first reply image sequence played in correspondence with the voice content. In this way, the appearance, sound, and behavior simulated by the customer service robot are close to those of a real person, which caters to the user's emotions and improves the user experience.
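As an assumed sketch of this synthesis step, the frame sequence can be written with OpenCV and muxed with the speech content using ffmpeg; codecs and file names are placeholders:

```python
import subprocess
import cv2

def write_reply_video(frames, audio_path, out_path="reply.mp4", fps=30):
    """Write the reply image sequence to disk and mux it with the speech content."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter("frames_only.mp4",
                             cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    # Combine video and audio so the image sequence plays in step with the speech.
    subprocess.run(["ffmpeg", "-y", "-i", "frames_only.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_path], check=True)
    return out_path
```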
Further, referring to fig. 11, the generating and outputting the reply video for the video to be processed according to the voice content and the first reply image sequence, where the playing of the first reply image sequence and the voice content in correspondence may include:
step S3641: and acquiring a mouth shape image corresponding to the reply content from a mouth shape database established in advance.
Various reply contents and the corresponding mouth-shape images are pre-stored in the mouth-shape database. It will be appreciated that, once the reply content is obtained, a mouth image sequence matching the reply content may be searched for in the mouth-shape database. Specifically, the reply content may be transmitted to the mouth-shape database through the data-processing interface of the terminal device, so that the database can find the best-matching mouth image sequence for the reply content. The mouth image sequence can be understood as a sequence of different mouth-shape images corresponding to the reply content.
Optionally, because the facial muscles change with the expression and the mouth shape changes continuously during the speech of the person, the terminal device may further find out the mouth image sequence (or the target mouth image) matching the reply content (or a certain character) by analyzing the change trend of the local face when acquiring the reply content.
In some embodiments, the mouth-shape image corresponding to the reply content may also be obtained from the speech content corresponding to the reply content. As one approach, the speech content may be input to a mouth-shape prediction model to obtain the mouth image sequence output by the model. The mouth-shape prediction model may be obtained by training a neural network on a large number of real-person speaking videos (including real-person speaking images and the corresponding real-person speaking audio) together with samples of real mouth images during speech; it may be the machine learning model described above and is not limited here.
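A toy illustration of the mouth-shape lookup, with an assumed phoneme-keyed database and placeholder image paths:

```python
# Toy mouth-shape database keyed by phoneme; image paths are placeholders.
MOUTH_DB = {
    "AA": "mouth_open_wide.png",
    "M":  "mouth_closed.png",
    "OW": "mouth_rounded.png",
}

def lookup_mouth_sequence(phonemes, default="mouth_neutral.png"):
    """Return one mouth-shape image per phoneme of the reply content."""
    return [MOUTH_DB.get(p, default) for p in phonemes]
```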
Step S3642: and synthesizing the mouth images to the mouth positions of the virtual characters in each behavior image of the corresponding first reply image sequence to obtain a second reply image sequence.
In some embodiments, after the mouth image corresponding to the reply content is obtained, the mouth image may be synthesized to the position of the mouth of the virtual character in each behavior image of the corresponding first reply image sequence, resulting in a second reply image sequence.
In some embodiments, when the mouth image is synthesized onto the mouth position of the virtual character in a behavior image, the mouth image may simply be overlaid on the mouth portion of the virtual character; alternatively, after the overlay, the whole mouth may be completely filled and replaced using the lip outline as the boundary, so that the lip details remain highly correlated with the facial expression.
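One assumed way to perform the second, boundary-respecting variant is seamless cloning, which blends the mouth image into the face so lip detail stays consistent with the surrounding expression; the mouth position is supplied by the caller:

```python
import cv2
import numpy as np

def paste_mouth(behavior_image, mouth_image, mouth_center):
    """Blend the mouth image onto the avatar's mouth position in one behavior image.

    `mouth_center` is the (x, y) pixel position of the mouth in the behavior image.
    """
    mask = 255 * np.ones(mouth_image.shape[:2], dtype=np.uint8)
    # Seamless cloning keeps lip detail consistent with the surrounding face.
    return cv2.seamlessClone(mouth_image, behavior_image, mask,
                             mouth_center, cv2.NORMAL_CLONE)
```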
Step S3643: and generating and outputting a reply video aiming at the video to be processed according to the voice content and the second reply image sequence, wherein the second reply image sequence is played corresponding to the voice content.
In some embodiments, the terminal device may synthesize the acquired voice content with the second reply image sequence to generate a reply video for the video to be processed and output the reply video, with the second reply image sequence played in correspondence with the voice content. In this way, when the voice content in the reply video is played, the expression, action, and lip shape of the virtual character presented to the user change dynamically with the voice content, which improves the realism and naturalness of human-computer interaction and optimizes the human-computer interaction experience.
It can be understood that, in this embodiment, each step may be performed locally by the terminal device, performed on the server, or divided between the terminal device and the server; depending on the actual application scenario, tasks may be allocated as required to achieve an optimal virtual-robot customer-service experience, which is not limited here.
According to the video processing method provided by this embodiment of the present application, the to-be-processed video of the user conversing with the robot is acquired; voice analysis is performed on the target audio segment in the to-be-processed video to obtain sentence characteristics; emotion analysis is performed on the face image sequence corresponding to the target audio segment to obtain emotional characteristics; the reply content of the robot and the expression behavior of the virtual character are determined based on the emotional characteristics and the sentence characteristics; the mouth-shape image of the virtual character and the expression driving parameters and action driving parameters corresponding to the expression behavior are determined according to the reply content; and the reply video for the to-be-processed video is generated and output. The interactive robot can thus be presented to the user with a more vivid and natural image, improving the realism and naturalness of human-computer interaction. By taking video as input, the robot is assisted in semantic understanding by the emotional characteristics and sentence characteristics of the person in the video; the user does not need to intervene in the semantic-understanding process, and the irrelevant-context problem that can arise when context information is used does not occur. While the accuracy of the robot's semantic understanding is improved, the interactive robot can also perceive the user's emotional state, thereby generating a video reply that better fits the user's emotion and optimizing the human-computer interaction experience.
It should be understood that, although the steps in the flowcharts of fig. 2, 4, 5, 6, 7, 9, 10, and 11 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on the execution order of these steps, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2, 4, 5, 6, 7, 9, 10, and 11 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Referring to fig. 12, fig. 12 is a block diagram illustrating a video processing apparatus according to an embodiment of the present application. As will be explained below with respect to the block diagram of fig. 12, the video processing apparatus 900 includes: video acquisition module 910, audio image acquisition module 920, emotion analysis module 930, voice analysis module 940, data determination module 950, and video generation module 960, wherein:
a video obtaining module 910, configured to obtain a video to be processed;
the audio image acquiring module 920 is configured to acquire a target audio segment and a face image sequence in a video to be processed, where the target audio segment includes a human voice segment, the face image sequence includes a plurality of face images, and the face image sequence corresponds to the target audio segment;
in some embodiments, the audio image acquisition module 920 may include: the decomposition unit is used for decomposing the video to be processed to obtain a complete audio stream and a video image sequence; the audio acquisition unit is used for acquiring audio segments except the interference audio in the complete audio stream as target audio segments in the video to be processed; the image acquisition unit is used for acquiring a target image sequence corresponding to the time stamp from the video image sequence according to the time stamp of the target audio clip; and the face image extraction unit is used for extracting all face images of the target person in the target image sequence and taking all the face images as the face image sequence in the video to be processed.
Further, as one mode, the face image extraction unit may include: the data extraction subunit is used for extracting the face key points of the target person in each target image in the target image sequence; the image intercepting subunit is used for intercepting a face image corresponding to the face key point in each target image based on the face key point; and the image processing subunit is used for preprocessing each face image to obtain a face image sequence in a specified format, and the face image sequence is used as the face image sequence in the video to be processed, wherein the preprocessing comprises at least one of amplification, reduction and movement processing.
The emotion analysis module 930 is configured to perform emotion analysis on the face image sequence to obtain emotion characteristics, where the emotion characteristics are used to represent emotion of a person in the face image;
in some implementations, the emotion analysis module 930 may include: the key point extraction unit is used for extracting a face key point corresponding to each face image in the face image sequence; the vector acquisition unit is used for providing each face image and the face key points corresponding to each face image as input to the machine learning model to obtain the feature vector corresponding to each face image, and the machine learning model is trained in advance to output the feature vector corresponding to the face image according to the face image and the face key points corresponding to the face image; and the feature determining unit is used for determining the emotion features corresponding to the feature vectors according to the mapping relation between the feature vectors and the emotion features to obtain the emotion features corresponding to each face image in the face image sequence.
The voice analysis module 940 is configured to perform voice analysis on the target audio segment to obtain a sentence feature, where the sentence feature is used to represent a keyword in the target audio segment;
a data determining module 950 for determining the reply content and the expression behavior of the virtual character based on the emotional characteristic and the sentence characteristic.
In some embodiments, the data determination module 950 may include: the semantic acquiring unit is used for acquiring semantic information of the target audio clip based on the emotion characteristics and the sentence characteristics; the reply content determining unit is used for determining reply content corresponding to the semantic information according to the semantic information; and the expression behavior determining unit is used for searching the expression behaviors of the virtual character corresponding to the semantic information and the emotional characteristics from a pre-established rule base.
The video generating module 960 is configured to generate and output a reply video for the video to be processed, where the reply video includes the voice content corresponding to the reply content and the avatar performing the performance.
In some embodiments, the performance behaviors include expressions and actions, and the video generation module 960 may include: the voice generating unit is used for generating voice content corresponding to the reply content according to the reply content; the parameter acquisition unit is used for acquiring expression driving parameters and action driving parameters corresponding to the expression behaviors; the image generation unit is used for driving the expression and the action of the virtual character based on the expression driving parameter and the action driving parameter to generate a first reply image sequence, and the first reply image sequence is formed by a plurality of continuous action images generated by driving the virtual character; and the first video generation unit is used for generating and outputting a reply video aiming at the video to be processed according to the voice content and the first reply image sequence, and the first reply image sequence is played corresponding to the voice content.
Further, as one mode, the first video generating unit may include: the mouth shape obtaining subunit is used for obtaining a mouth shape image corresponding to the reply content from a mouth shape database established in advance; the mouth shape synthesizing subunit is used for synthesizing the mouth shape images to the mouth positions of the virtual characters in each behavior image of the corresponding first reply image sequence to obtain a second reply image sequence; and the second video generation subunit is used for generating and outputting a reply video aiming at the video to be processed according to the voice content and the second reply image sequence, and the second reply image sequence is played corresponding to the voice content.
The video processing apparatus provided in the embodiment of the present application is used to implement the corresponding video processing method in the foregoing method embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
It can be clearly understood by those skilled in the art that the video processing apparatus provided in the embodiment of the present application can implement each process in the foregoing method embodiment, and for convenience and brevity of description, specific working processes of the apparatus and the modules described above may refer to corresponding processes in the foregoing method embodiment, and are not described herein again.
In the several embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be in an electrical, mechanical or other form.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 13, a block diagram of a terminal device 600 according to an embodiment of the present disclosure is shown. The terminal device 600 may be a terminal device capable of running applications, such as a smart phone, a tablet computer, or an e-book reader. The terminal device 600 in the present application may comprise one or more of the following components: a processor 610, a memory 620, and one or more applications, wherein the one or more applications may be stored in the memory 620 and configured to be executed by the one or more processors 610, with the one or more applications configured to perform the method described in the foregoing method embodiments.
The processor 610 may include one or more processing cores. The processor 610 connects the various parts of the entire terminal device 600 using various interfaces and lines, and performs the various functions of the terminal device 600 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 620 and by calling data stored in the memory 620. Optionally, the processor 610 may be implemented in hardware in at least one of the forms of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 610 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and so on; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 610 and may instead be implemented by a separate communication chip.
The memory 620 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 620 may be used to store instructions, programs, code sets, or instruction sets. The memory 620 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may also store data created by the terminal device 600 during use (such as a phone book, audio and video data, and chat log data), and the like.
Further, the terminal device 600 may further include a display screen, which may be foldable, and the display screen may be a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The display screen is used to display information entered by the user, information provided to the user, and various graphical user interfaces, which may be composed of graphics, text, icons, numbers, video, and any combination thereof.
Those skilled in the art will appreciate that the structure shown in fig. 13 is a block diagram of only the portion of the structure relevant to the present application and does not constitute a limitation on the terminal device to which the present application is applied; a particular terminal device may include more or fewer components than those shown in fig. 13, combine certain components, or have a different arrangement of components.
Referring to fig. 14, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 1100 has stored therein a program code 1110, the program code 1110 being invokable by the processor for performing the method described in the above-described method embodiments.
The computer-readable storage medium 1100 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 1100 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 1100 has storage space for program code 1110 that performs any of the method steps described above. The program code may be read from or written into one or more computer program products. The program code 1110 may, for example, be compressed in a suitable form.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a smart gateway, a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, the present embodiments are not limited to the above embodiments, which are merely illustrative and not restrictive, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention.

Claims (9)

1. A method of video processing, the method comprising:
acquiring a video to be processed;
decomposing the video to be processed to obtain a complete audio stream and a video image sequence;
acquiring audio segments except the interference audio in the complete audio stream as target audio segments in the video to be processed;
acquiring a target image sequence corresponding to the time stamp from the video image sequence according to the time stamp of the target audio clip;
extracting all face images of a target figure in the target image sequence, and taking all face images as the face image sequence in the video to be processed, wherein the target audio segment comprises a human voice segment, the face image sequence comprises a plurality of face images, and the face image sequence corresponds to the target audio segment;
performing emotion analysis on the face image sequence to obtain emotion characteristics, wherein the emotion characteristics are used for representing the emotion of people in the face image;
performing voice analysis on the target audio clip to obtain sentence characteristics, wherein the sentence characteristics are used for representing key words in the target audio clip;
determining reply content and expression behaviors of the virtual character based on the emotional characteristics and the sentence characteristics;
and generating and outputting a reply video aiming at the video to be processed, wherein the reply video comprises voice content corresponding to the reply content and a virtual character executing the expression behavior, the reply content comprises sound information of the virtual character broadcasting the voice content, the sound information is matched with the emotion characteristics and the sentence characteristics, and the sound information comprises timbre and/or tone.
2. The method of claim 1, wherein determining the reply content and the expressive behavior of the avatar based on the emotional characteristic and the sentence characteristic comprises:
obtaining semantic information of the target audio clip based on the emotion characteristics and the sentence characteristics;
according to the semantic information, determining reply content corresponding to the semantic information;
and searching the expression behaviors of the virtual character corresponding to the semantic information and the emotional characteristics from a pre-established rule base.
3. The method of claim 1, wherein the expressive behavior comprises expressions and actions, and wherein generating and outputting a reply video for the to-be-processed video, the reply video containing voice content corresponding to the reply content and a virtual character performing the expressive behavior comprises:
generating voice content corresponding to the reply content according to the reply content;
obtaining expression driving parameters and action driving parameters corresponding to the expression behaviors;
driving the expression and the action of the virtual character based on the expression driving parameter and the action driving parameter to generate a first reply image sequence, wherein the first reply image sequence is composed of a plurality of continuous action images generated by driving the virtual character;
and generating and outputting a reply video aiming at the video to be processed according to the voice content and the first reply image sequence, wherein the first reply image sequence is played corresponding to the voice content.
4. The method according to claim 3, wherein the generating and outputting a reply video for the video to be processed according to the voice content and the first reply image sequence, the first reply image sequence being played corresponding to the voice content, comprises:
acquiring a mouth shape image corresponding to the reply content from a pre-established mouth shape database;
synthesizing the mouth shape image to the mouth position of the virtual character in each behavior image of the corresponding first reply image sequence to obtain a second reply image sequence;
and generating and outputting a reply video aiming at the video to be processed according to the voice content and the second reply image sequence, wherein the second reply image sequence is played corresponding to the voice content.
5. The method according to any one of claims 1 to 4, wherein the performing emotion analysis on the face image sequence to obtain emotion characteristics comprises:
extracting a face key point corresponding to each face image in the face image sequence;
providing each face image and the face key point corresponding to each face image as input to a machine learning model to obtain a feature vector corresponding to each face image, wherein the machine learning model is trained in advance to output the feature vector corresponding to the face image according to the face image and the face key point corresponding to the face image;
and determining the emotional characteristics corresponding to the feature vectors according to the mapping relation between the feature vectors and the emotional characteristics to obtain the emotional characteristics corresponding to each face image in the face image sequence.
6. The method of claim 1, wherein the extracting all face images of the target person in the target image sequence, and using all face images as the face image sequence in the video to be processed, comprises:
extracting the face key points of the target person in each target image in the target image sequence;
based on the face key points, intercepting face images corresponding to the face key points in each target image;
and preprocessing each face image to obtain a face image sequence in a specified format, wherein the face image sequence in the specified format is used as the face image sequence in the video to be processed, and the preprocessing comprises at least one of amplification, reduction and movement processing.
7. A video processing apparatus, characterized in that the apparatus comprises:
the video acquisition module is used for acquiring a video to be processed;
the audio image acquisition module is used for decomposing the video to be processed to obtain a complete audio stream and a video image sequence; acquiring audio segments except the interference audio in the complete audio stream as target audio segments in the video to be processed; acquiring a target image sequence corresponding to the time stamp from the video image sequence according to the time stamp of the target audio clip; extracting all face images of a target figure in the target image sequence, and taking all face images as the face image sequence in the video to be processed, wherein the target audio segment comprises a human voice segment, the face image sequence comprises a plurality of face images, and the face image sequence corresponds to the target audio segment;
the emotion analysis module is used for carrying out emotion analysis on the face image sequence to obtain emotion characteristics, and the emotion characteristics are used for representing the emotion of a person in the face image;
the voice analysis module is used for carrying out voice analysis on the target audio clip to obtain sentence characteristics, and the sentence characteristics are used for representing key words in the target audio clip;
the data determining module is used for determining reply content and the expression behavior of the virtual character based on the emotion characteristics and the sentence characteristics;
the video generation module is used for generating and outputting a reply video aiming at the video to be processed, wherein the reply video comprises voice content corresponding to the reply content and a virtual character executing the expression behavior, the reply content comprises sound information of the voice content broadcasted by the virtual character, the sound information is matched with the emotion characteristics and the sentence characteristics, and the sound information comprises timbre and/or tone.
8. A terminal device, comprising:
a memory;
one or more processors coupled with the memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-6.
9. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 6.
CN201910838068.XA 2019-09-05 2019-09-05 Video processing method, device, system, terminal equipment and storage medium Active CN110688911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910838068.XA CN110688911B (en) 2019-09-05 2019-09-05 Video processing method, device, system, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910838068.XA CN110688911B (en) 2019-09-05 2019-09-05 Video processing method, device, system, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110688911A CN110688911A (en) 2020-01-14
CN110688911B true CN110688911B (en) 2021-04-02

Family

ID=69107855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910838068.XA Active CN110688911B (en) 2019-09-05 2019-09-05 Video processing method, device, system, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110688911B (en)

Families Citing this family (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274417B (en) * 2020-01-17 2023-05-12 新华网股份有限公司 Emotion labeling method and device, electronic equipment and computer readable storage medium
CN111415662A (en) * 2020-03-16 2020-07-14 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video
CN111489424A (en) * 2020-04-10 2020-08-04 网易(杭州)网络有限公司 Virtual character expression generation method, control method, device and terminal equipment
CN111597381A (en) * 2020-04-16 2020-08-28 国家广播电视总局广播电视科学研究院 Content generation method, device and medium
CN111225237B (en) * 2020-04-23 2020-08-21 腾讯科技(深圳)有限公司 Sound and picture matching method of video, related device and storage medium
CN113689880A (en) * 2020-05-18 2021-11-23 北京搜狗科技发展有限公司 Method, device, electronic equipment and medium for driving virtual human in real time
CN113689879A (en) * 2020-05-18 2021-11-23 北京搜狗科技发展有限公司 Method, device, electronic equipment and medium for driving virtual human in real time
CN111738210B (en) * 2020-07-20 2020-12-08 平安国际智慧城市科技股份有限公司 Audio and video based student psychological state analysis method, device, terminal and medium
CN111862279A (en) * 2020-07-23 2020-10-30 中国工商银行股份有限公司 Interaction processing method and device
CN111966671A (en) * 2020-08-04 2020-11-20 深圳追一科技有限公司 Digital human training data cleaning method and device, electronic equipment and storage medium
CN111966855A (en) * 2020-08-04 2020-11-20 深圳追一科技有限公司 Digital human training data acquisition method and device, electronic equipment and storage medium
CN112151027A (en) * 2020-08-21 2020-12-29 深圳追一科技有限公司 Specific person inquiry method, device and storage medium based on digital person
CN112100352A (en) * 2020-09-14 2020-12-18 北京百度网讯科技有限公司 Method, device, client and storage medium for interacting with virtual object
CN112215927B (en) * 2020-09-18 2023-06-23 腾讯科技(深圳)有限公司 Face video synthesis method, device, equipment and medium
CN112329586A (en) * 2020-10-30 2021-02-05 中国平安人寿保险股份有限公司 Client return visit method and device based on emotion recognition and computer equipment
CN112383722B (en) * 2020-11-13 2023-04-07 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112446938B (en) * 2020-11-30 2023-08-18 重庆空间视创科技有限公司 Multi-mode-based virtual anchor system and method
CN112633110B (en) * 2020-12-16 2024-02-13 中国联合网络通信集团有限公司 Data processing method and device
CN112286366B (en) * 2020-12-30 2022-02-22 北京百度网讯科技有限公司 Method, apparatus, device and medium for human-computer interaction
CN112767520A (en) * 2021-01-07 2021-05-07 深圳追一科技有限公司 Digital human generation method and device, electronic equipment and storage medium
CN112785667A (en) * 2021-01-25 2021-05-11 北京有竹居网络技术有限公司 Video generation method, device, medium and electronic equipment
CN112927712A (en) * 2021-01-25 2021-06-08 网易(杭州)网络有限公司 Video generation method and device and electronic equipment
CN112967212A (en) * 2021-02-01 2021-06-15 北京字节跳动网络技术有限公司 Virtual character synthesis method, device, equipment and storage medium
CN113570686A (en) * 2021-02-07 2021-10-29 腾讯科技(深圳)有限公司 Virtual video live broadcast processing method and device, storage medium and electronic equipment
CN113822967A (en) * 2021-02-09 2021-12-21 北京沃东天骏信息技术有限公司 Man-machine interaction method, device, system, electronic equipment and computer medium
CN112883896B (en) * 2021-03-10 2022-10-11 山东大学 Micro-expression detection method based on BERT network
CN113192161B (en) * 2021-04-22 2022-10-18 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN113179449B (en) * 2021-04-22 2022-04-12 清华珠三角研究院 Method, system, device and storage medium for driving image by voice and motion
CN113269066B (en) * 2021-05-14 2022-10-04 网易(杭州)网络有限公司 Speaking video generation method and device and electronic equipment
CN113658254B (en) * 2021-07-28 2022-08-02 深圳市神州云海智能科技有限公司 Method and device for processing multi-modal data and robot
CN113627301B (en) * 2021-08-02 2023-10-31 科大讯飞股份有限公司 Real-time video information extraction method, device and system
CN113923462A (en) * 2021-09-10 2022-01-11 阿里巴巴达摩院(杭州)科技有限公司 Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
CN113851145A (en) * 2021-09-23 2021-12-28 厦门大学 Virtual human action sequence synthesis method combining voice and semantic key actions
CN114245215B (en) * 2021-11-24 2023-04-07 清华大学 Method, device, electronic equipment, medium and product for generating speaking video
CN114302153B (en) * 2021-11-25 2023-12-08 阿里巴巴达摩院(杭州)科技有限公司 Video playing method and device
CN114187405B (en) * 2021-12-07 2023-05-05 北京百度网讯科技有限公司 Method, apparatus, medium and product for determining avatar
CN114222077A (en) * 2021-12-14 2022-03-22 惠州视维新技术有限公司 Video processing method and device, storage medium and electronic equipment
CN114422862A (en) * 2021-12-24 2022-04-29 上海浦东发展银行股份有限公司 Service video generation method, device, equipment, storage medium and program product
CN114567693B (en) * 2022-02-11 2024-01-30 维沃移动通信有限公司 Video generation method and device and electronic equipment
CN114661885B (en) * 2022-05-26 2022-10-11 深圳追一科技有限公司 Question-answer processing method, device, computer equipment and storage medium
CN115187727B (en) * 2022-06-29 2023-06-13 北京百度网讯科技有限公司 Virtual face image generation method, device, equipment and storage medium
CN115204127B (en) * 2022-09-19 2023-01-06 深圳市北科瑞声科技股份有限公司 Form filling method, device, equipment and medium based on remote flow adjustment
CN115375809B (en) * 2022-10-25 2023-03-14 科大讯飞股份有限公司 Method, device and equipment for generating virtual image and storage medium
CN116151917B (en) * 2023-01-04 2024-02-13 上海铱维思智能科技有限公司 Transaction right determining method and system based on three-dimensional model
CN115996303B (en) * 2023-03-23 2023-07-25 科大讯飞股份有限公司 Video generation method, device, electronic equipment and storage medium
CN116820285A (en) * 2023-07-13 2023-09-29 亿迅信息技术有限公司 Virtual character interaction method based on video customer service and deep learning
CN117218224A (en) * 2023-08-21 2023-12-12 华院计算技术(上海)股份有限公司 Face emotion image generation method and device, readable storage medium and terminal

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105807933A (en) * 2016-03-18 2016-07-27 北京光年无限科技有限公司 Man-machine interaction method and apparatus used for intelligent robot
CN107294837A (en) * 2017-05-22 2017-10-24 北京光年无限科技有限公司 Method and system for dialogue interaction using a virtual robot
CN107340859A (en) * 2017-06-14 2017-11-10 北京光年无限科技有限公司 Multi-modal interaction method and system for a multi-modal virtual robot
CN107340865A (en) * 2017-06-29 2017-11-10 北京光年无限科技有限公司 Multi-modal virtual robot interaction method and system
CN107577661A (en) * 2017-08-07 2018-01-12 北京光年无限科技有限公司 Interaction output method and system for a virtual robot
CN107808191A (en) * 2017-09-13 2018-03-16 北京光年无限科技有限公司 Output method and system for multi-modal interaction of a virtual human
CN107818785A (en) * 2017-09-26 2018-03-20 平安普惠企业管理有限公司 Method and terminal device for extracting information from a multimedia file
CN107958433A (en) * 2017-12-11 2018-04-24 吉林大学 Online education man-machine interaction method and system based on artificial intelligence
CN108000526A (en) * 2017-11-21 2018-05-08 北京光年无限科技有限公司 Dialogue interaction method and system for an intelligent robot
CN108326855A (en) * 2018-01-26 2018-07-27 上海器魂智能科技有限公司 Robot interaction method, apparatus, device and storage medium
CN108942919A (en) * 2018-05-28 2018-12-07 北京光年无限科技有限公司 Interaction method and system based on a virtual human
CN109271018A (en) * 2018-08-21 2019-01-25 北京光年无限科技有限公司 Interaction method and system based on virtual human behavioral standards
CN109948153A (en) * 2019-03-07 2019-06-28 张博缘 Man-machine communication system involving video and audio multimedia information processing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108259989B (en) * 2018-01-19 2021-09-17 广州方硅信息技术有限公司 Video live broadcast method, computer readable storage medium and terminal equipment

Also Published As

Publication number Publication date
CN110688911A (en) 2020-01-14

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
WO2022048403A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
CN110807388B (en) Interaction method, interaction device, terminal equipment and storage medium
CN106653052B (en) Virtual human face animation generation method and device
US20230042654A1 (en) Action synchronization for target object
US9082400B2 (en) Video generation based on text
Busso et al. Rigid head motion in expressive speech animation: Analysis and synthesis
CN113454708A (en) Linguistic style matching agent
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN108763190A (en) Voice-based mouth shape animation synthesis apparatus and method, and readable storage medium
US20150042662A1 (en) Synthetic audiovisual storyteller
CN110688008A (en) Virtual image interaction method and device
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
CN109801349B (en) Sound-driven three-dimensional animation character real-time expression generation method and system
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
KR102174922B1 (en) Interactive sign language-voice translation apparatus and voice-sign language translation apparatus reflecting user emotion and intention
CN110148406B (en) Data processing method and device for data processing
CN114357135A (en) Interaction method, interaction device, electronic equipment and storage medium
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
KR101738142B1 (en) System for generating digital life based on emotion and control method therefor
WO2023284435A1 (en) Method and apparatus for generating animation
CN115953521B (en) Remote digital person rendering method, device and system
CN117078816A (en) Virtual image generation method, device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant