CN115222857A - Method, apparatus, electronic device and computer readable medium for generating avatar - Google Patents

Method, apparatus, electronic device and computer readable medium for generating avatar

Info

Publication number
CN115222857A
Authority
CN
China
Prior art keywords
user
audio
features
video
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210889973.XA
Other languages
Chinese (zh)
Inventor
田野 (Tian Ye)
汤跃忠 (Tang Yuezhong)
张晓灿 (Zhang Xiaocan)
陈云坤 (Chen Yunkun)
陈骁 (Chen Xiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Original Assignee
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute Of China Electronics Technology Group Corp, Beijing Zhongdian Huisheng Technology Co ltd filed Critical Third Research Institute Of China Electronics Technology Group Corp
Priority to CN202210889973.XA priority Critical patent/CN115222857A/en
Publication of CN115222857A publication Critical patent/CN115222857A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

Embodiments of the present disclosure disclose methods, apparatuses, devices, and media for generating an avatar. One embodiment of the method comprises: displaying a first display interface; determining avatar information according to a selection operation; acquiring a user video and user audio; performing feature extraction on the user video and the user audio to obtain video features and audio features; generating a target text according to the video features and the audio features; and generating a target avatar according to the target text and the avatar information. This embodiment makes the generated avatar richer and better matched to user preferences, brings it closer to the user, improves the realism of the virtual digital figure, and provides better emotional soothing and psychological counseling for the user.

Description

Method, apparatus, electronic device and computer readable medium for generating avatar
Technical Field
Embodiments of the present disclosure relate to the technical field of intelligent computer interaction, and in particular to a method, an apparatus, an electronic device, and a computer-readable medium for generating an avatar.
Background
A virtual digital human or virtual digital pet with intelligent human-computer interaction capability can stand in for human conversation and can be deployed rapidly at low cost, and is therefore expected to become a new tool for emotional soothing and psychological counseling. How to improve the realism of such a virtual digital figure and its service capability in emotional soothing and psychological counseling has become a problem that urgently needs to be solved.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a method, apparatus, electronic device and computer readable medium for generating an avatar to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a method for generating an avatar, the method comprising: displaying a first display interface, wherein the first display interface comprises an avatar control and a timbre control; in response to detecting a selection operation by a user on a target control in the first display interface, determining avatar information according to the selection operation; acquiring a user video and user audio; performing feature extraction on the user video and the user audio to obtain video features and audio features; generating a target text according to the video features and the audio features; and generating a target avatar according to the target text and the avatar information.
In a second aspect, some embodiments of the present disclosure provide an avatar generating apparatus, the apparatus comprising: a display unit configured to display a first display interface, wherein the first display interface comprises an avatar control and a timbre control; a determining unit configured to, in response to detecting a selection operation by a user on a target control in the first display interface, determine avatar information according to the selection operation; an audio/video acquisition unit configured to acquire a user video and user audio; a feature extraction unit configured to perform feature extraction on the user video and the user audio to obtain video features and audio features; a target text generating unit configured to generate a target text according to the video features and the audio features; and an avatar generating unit configured to generate a target avatar based on the target text and the avatar information.
In a third aspect, an embodiment of the present application provides an electronic device, the electronic device comprising: one or more processors; and a storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
One of the above-described embodiments of the present disclosure has the following beneficial effects: avatar information is determined according to user preferences by interacting with the user through a display interface; a text enabling question-and-answer interaction with the user is generated from the acquired user video and audio; and finally the user's preferred appearance is combined with that text to generate a soothing avatar for the user. As a result, the generated avatar is richer and better matches the user's preferences, is closer to the user, the realism of the virtual digital figure is improved, and better emotional soothing and psychological counseling are provided for the user.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements are not necessarily drawn to scale.
FIG. 1 is a schematic illustration of one application scenario of a method of generating an avatar according to some embodiments of the present disclosure;
FIG. 2 is a flow diagram of some embodiments of a method of generating an avatar according to the present disclosure;
FIG. 3 is a schematic structural diagram of some embodiments of an avatar generation apparatus according to the present disclosure;
FIG. 4 is a schematic block diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand that they should be read as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of an application scenario of a method of generating an avatar according to some embodiments of the present disclosure.
As shown in fig. 1, the execution body (server 101) may display a first display interface, determine avatar information 107 according to the user's selection operation, acquire a user video 102 and user audio 103, perform feature extraction on the user video 102 and the user audio 103 to obtain a video feature 104 and an audio feature 105, generate a target text 106 according to the video feature 104 and the audio feature 105, and finally generate a target avatar 108 according to the target text 106 and the avatar information 107.
It is understood that the avatar generating method may be performed by a terminal device or by the server 101; the execution body of the method may also be a device formed by integrating the terminal device and the server 101 through a network, or the method may be performed by various software programs. The terminal device may be any of various electronic devices with information processing capability, including but not limited to a smart phone, a tablet computer, an e-book reader, a laptop portable computer, a desktop computer, and the like. The execution body may also be embodied as the server 101, software, or the like. When the execution body is software, it can be installed in the electronic devices listed above. It may be implemented, for example, as multiple pieces of software or software modules for providing distributed services, or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the number of servers in fig. 1 is merely illustrative. There may be any number of servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of some embodiments of a method of generating an avatar according to the present disclosure is shown. The method for generating the virtual image comprises the following steps:
step 201, displaying a first display interface.
In some embodiments, the execution body of the avatar generating method (e.g., the server shown in fig. 1) may display the first display interface. Here, the first display interface includes a timbre control and an avatar control. The timbre control is used by the user to select a timbre for the avatar or to edit the selected timbre. The avatar control is used by the user to select the appearance of the avatar or to edit the selected appearance.
Step 202, in response to detecting a selection operation of a user for a target control in the first display interface, determining avatar information according to the selection operation.
In some embodiments, the execution body (e.g., the server shown in fig. 1) may determine the avatar information according to the user's selection operation on the target control. The target control is generally a control selected or edited by the user, and may be a timbre control and/or an avatar control. Specifically, the execution body may use the timbre and the appearance selected or edited by the user as the avatar information.
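As a purely illustrative sketch (not part of the original disclosure), the avatar information determined in this step could be represented roughly as follows in Python; all field and function names are assumptions introduced here for explanation only:

    from dataclasses import dataclass, field

    # Hypothetical container for the avatar information described above.
    @dataclass
    class AvatarInfo:
        appearance_id: str = ""     # appearance chosen or edited via the avatar control
        timbre_id: str = ""         # timbre chosen or edited via the timbre control
        edits: dict = field(default_factory=dict)  # optional user edits (height, hairstyle, etc.)

    def on_control_selected(info: AvatarInfo, control_type: str, value: str) -> AvatarInfo:
        """Update the avatar information in response to a selection on the first display interface."""
        if control_type == "avatar":
            info.appearance_id = value
        elif control_type == "timbre":
            info.timbre_id = value
        return info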
Step 203, acquiring a user video and a user audio.
In some embodiments, the execution body may obtain the user video and the user audio from a terminal of a user through a wired connection manner or a wireless connection manner. It is noted that the wireless connection means may include, but is not limited to, a 3G/4G connection, a WiFi connection, a bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connection means now known or developed in the future.
As an example, the user video generally refers to a video containing the user. The user audio mentioned above generally refers to audio containing a user's voice. As yet another example, the user video and user audio described above may also be video and audio of a target user recorded by the user. Specifically, the target user may be a friend of the user, or the like.
And step 204, performing feature extraction on the user video and the user audio to obtain video features and audio features.
In some embodiments, the execution body may perform feature extraction on the user video and the user audio to obtain video features and audio features. Specifically, the video features generally refer to facial image features of the user, and the audio features generally refer to the feature extraction result of the user audio. The video features may be represented in a variety of ways, and there are likewise various ways of extracting the features; the details are not described here.
In some optional implementations of some embodiments, the execution body may perform noise reduction processing on the user audio to obtain noise-reduced audio. Specifically, in real-time audio acquisition the captured audio is environmental sound that contains the various sounds of the user's surroundings, such as traffic noise outside a window, the operating sound of a washing machine at home, or air-conditioner noise. The user's speech signal is extracted from this environmental sound using techniques such as microphone-array direction finding, beamforming, noise reduction, voice activity detection, and echo cancellation.
The noise-reduced audio is then input into a pre-trained audio recognition model to obtain the audio features corresponding to the noise-reduced audio. Specifically, the audio recognition model is generally used to characterize the correspondence between noise-reduced audio and audio features. As an example, the audio recognition model may be a correspondence table generated by researchers from a large amount of noise-reduced audio and audio feature data. As another example, the audio recognition model may be a neural network pre-trained on a large amount of noise-reduced audio and audio feature data.
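The following is a minimal, non-authoritative Python sketch of this pipeline, assuming a very simple spectral-gating noise reducer and a generic pre-trained recognizer exposed as a callable; none of the function names or parameters come from the patent:

    import numpy as np

    def spectral_gate_denoise(audio: np.ndarray, margin_db: float = 6.0) -> np.ndarray:
        """Crude spectral-gating noise reduction, for illustration only."""
        spectrum = np.fft.rfft(audio)
        magnitude = np.abs(spectrum)
        noise_floor = np.percentile(magnitude, 10)          # quietest bins approximate the noise floor
        keep = magnitude > noise_floor * (10 ** (margin_db / 20.0))
        return np.fft.irfft(spectrum * keep, n=len(audio))

    def extract_audio_features(user_audio: np.ndarray, audio_recognition_model) -> np.ndarray:
        """Denoise, then feed the audio to a pre-trained model assumed to return feature vectors."""
        denoised = spectral_gate_denoise(user_audio)
        return audio_recognition_model(denoised)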
In some optional implementations of some embodiments, the audio recognition model is trained using methods such as sharing acoustic features common to multiple accents, sharing acoustic-model training-network parameters, transfer learning, and semi-supervised training.
Specifically, to meet the needs of user groups in different regions with different accents and local colloquial expressions, a speech recognition technology supporting multiple accents is adopted. To address the scarcity of resources, such as speech data, text data, phoneme sets, and pronunciation dictionaries, that is common for local accents, methods such as sharing acoustic features common to multiple accents, sharing acoustic-model training-network parameters, transfer learning, and semi-supervised training are used: the universal phoneme representations and the acoustic feature extraction network learned from the massive training data of major languages are transferred into the training of the audio recognition model, and a large amount of unlabeled speech data is labeled by the semi-supervised method, so that the audio recognition model can be developed under low-resource conditions. In real-time processing, the audio recognition model is invoked on the user's real-time speech data and the speech is transcribed into text.
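A hedged sketch of the transfer-learning and semi-supervised ideas mentioned above is given below: a shared acoustic encoder pre-trained on a high-resource language is frozen, only a small accent-specific head is trained on the scarce labeled data, and unlabeled accented speech is pseudo-labeled. The module and loader names are assumptions for illustration, not the authors' implementation:

    import torch
    from torch import nn

    def finetune_for_accent(shared_encoder: nn.Module, accent_head: nn.Module,
                            labeled_loader, unlabeled_loader, epochs: int = 3):
        """Transfer learning plus simple pseudo-labeling for a low-resource accent."""
        for p in shared_encoder.parameters():
            p.requires_grad = False                     # share/freeze the common acoustic features
        optimizer = torch.optim.Adam(accent_head.parameters(), lr=1e-4)
        criterion = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for feats, labels in labeled_loader:        # scarce labeled accented speech
                loss = criterion(accent_head(shared_encoder(feats)), labels)
                optimizer.zero_grad(); loss.backward(); optimizer.step()
            for feats in unlabeled_loader:              # semi-supervised: pseudo-label unlabeled speech
                with torch.no_grad():
                    pseudo = accent_head(shared_encoder(feats)).argmax(dim=-1)
                loss = criterion(accent_head(shared_encoder(feats)), pseudo)
                optimizer.zero_grad(); loss.backward(); optimizer.step()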
And step 205, generating a target text according to the video characteristics and the audio characteristics.
In some embodiments, the execution body may generate the target text according to the video features and the audio features. Specifically, the target text generally refers to a response to what the user says in the user video or user audio. As an example, the execution body may determine the user's expression features in the user video through image recognition, perform speech recognition on the user audio to obtain the text spoken by the user, determine a response text through a preset correspondence table (a table in which expression features and texts correspond to response texts), and use the response text as the target text.
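To make the correspondence-table example concrete, here is a small illustrative sketch; the table entries and function names are invented for explanation and are not taken from the disclosure:

    # Hypothetical preset table mapping (expression, keyword) pairs to response texts.
    RESPONSE_TABLE = {
        ("sad", "lonely"): "I'm here with you. Would you like to talk about it?",
        ("anxious", "exam"): "Take a slow breath; let's break the problem into smaller steps.",
    }

    def lookup_target_text(expression: str, recognized_text: str) -> str:
        """Return a response text for the recognized expression and speech, with a fallback."""
        for (expr, keyword), reply in RESPONSE_TABLE.items():
            if expr == expression and keyword in recognized_text:
                return reply
        return "I hear you. Please tell me more about how you are feeling."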
In some optional implementation manners of some embodiments, the executing body may perform feature extraction on the audio features by using a pre-trained TextCNN model (text classification model) to obtain text features; performing feature extraction on the video features by using a pre-trained Vision Transformer model (ViT model, a model for image classification) to obtain image features; fusing the text features, the audio features and the image features through a pre-trained deep network to obtain user emotion features; and generating a target text according to the user emotional characteristics and the audio characteristics.
Specifically, to meet the need for automatic monitoring and recognition of the user's emotional and psychological state, the execution body may adopt a multi-modal emotion analysis technique based on video images, speech, and text content. First, speech features are extracted with audio signal processing techniques, text features of the speech recognition result are extracted with a TextCNN architecture, and the user's image (video) features are obtained with a Vision Transformer architecture; the three kinds of features are then deeply fused through a deep network, and finally a softmax classifier is used to determine the user's current emotional state. Compared with emotion analysis based on a single feature, multi-modal emotion analysis improves accuracy by fusing speech, text, expression, and other information.
In the deep network training stage, training data sets corresponding to voice features, text features and expression features under different emotions and psychological states (such as anger, sadness, depression, anxiety, fear and the like) are constructed, and the deep network is endowed with the capability of identifying the corresponding emotions and psychological states through training. In the real-time data processing, the current emotion and psychological state information of the user can be analyzed by the deep network according to the input current voice, text and expression information of the user.
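A minimal sketch of such a fusion network is shown below in PyTorch. The feature dimensions, hidden sizes, and emotion classes are assumptions chosen for illustration; the patent does not specify the network architecture:

    import torch
    from torch import nn

    class MultimodalEmotionClassifier(nn.Module):
        """Fuses text (e.g. TextCNN), audio, and image (e.g. ViT) features and classifies emotion."""
        def __init__(self, text_dim=256, audio_dim=128, image_dim=768, num_emotions=6):
            super().__init__()
            self.fusion = nn.Sequential(
                nn.Linear(text_dim + audio_dim + image_dim, 512), nn.ReLU(),
                nn.Linear(512, 128), nn.ReLU(),
            )
            # e.g. anger, sadness, depression, anxiety, fear, neutral
            self.classifier = nn.Linear(128, num_emotions)

        def forward(self, text_feat, audio_feat, image_feat):
            fused = self.fusion(torch.cat([text_feat, audio_feat, image_feat], dim=-1))
            return torch.softmax(self.classifier(fused), dim=-1)   # emotion probabilities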
In some optional implementations of some embodiments, the execution body may construct a psychology knowledge graph and a user information graph. Here, a knowledge graph (Knowledge Graph) is, in library and information science, a family of graphs that display the relationships between the development process and the structure of knowledge; it describes knowledge resources and their carriers using visualization technology, and mines, analyzes, constructs, draws, and displays knowledge and the interrelations among its elements.
Specifically, to meet the need for psychological counseling, ontology construction is carried out on psychological expert knowledge data using expert knowledge, named entity recognition and relation extraction are performed with a deep learning sequence labeling model, and entity alignment is performed using vectorized text similarity, thereby completing the construction of the knowledge graph; a graph database query interface is also designed. During psychological counseling, the knowledge graph is queried to obtain professional data in the field of psychology, which improves the quality of the psychological counseling text.
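The graph database query interface mentioned above is not specified further; as a stand-in, the sketch below uses a plain in-memory triple store with invented example entities and relations:

    from typing import Optional

    # Illustrative triples standing in for the psychology knowledge graph / user information graph.
    TRIPLES = [
        ("anxiety", "relieved_by", "breathing exercise"),
        ("anxiety", "possible_cause", "exam stress"),
        ("sadness", "relieved_by", "social support"),
    ]

    def query_knowledge(entity: str, relation: Optional[str] = None):
        """Return triples about an entity, optionally filtered by relation."""
        return [t for t in TRIPLES
                if t[0] == entity and (relation is None or t[1] == relation)]

    # Example: facts that could enrich a counseling reply about anxiety.
    facts = query_knowledge("anxiety", "relieved_by")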
The psychology knowledge graph, the user information graph, the audio features, and the user emotion features are then input into a pre-trained text generation model to obtain the target text, where the text generation model is obtained by training an autoencoder (the training samples being dialogue texts with emotion labels) using a Generative Pre-Training (GPT) method.
Specifically, a text generation technique that meets the user's emotional expectations is adopted to satisfy the user's needs for emotional soothing and psychological counseling. Based on the knowledge graph built from psychological knowledge and user-related knowledge, together with the user's current emotional and psychological state output by the deep network, a combined deep learning model based on a conditional autoencoder and a GPT (Generative Pre-trained Transformer) architecture is used; the model is trained on dialogue texts with emotion labels, the emotional state is analyzed when user input is received, and the analysis result is used to improve the quality of the generated text. In addition, when a user question requires an online query, real-time information is obtained through a networked search function and used as additional input to the text generation model.
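As a non-authoritative illustration of how the emotion state and retrieved knowledge might condition a GPT-style generator, the sketch below simply assembles them into the generator's input; the prompt format and the `generator` callable are assumptions, since the patent does not define the model interface:

    def build_generation_input(user_text: str, emotion: str, knowledge_facts, history) -> str:
        """Assemble conditioning input for a GPT-style text generator (illustrative format)."""
        facts = "; ".join(f"{h} {r} {t}" for h, r, t in knowledge_facts)
        turns = "\n".join(history[-4:])                       # keep a short dialogue window
        return (f"[emotion: {emotion}]\n"
                f"[knowledge: {facts}]\n"
                f"{turns}\nUser: {user_text}\nAssistant:")

    def generate_target_text(generator, user_text, emotion, knowledge_facts, history) -> str:
        """`generator` stands in for the trained conditional autoencoder / GPT model."""
        prompt = build_generation_input(user_text, emotion, knowledge_facts, history)
        return generator(prompt)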
Step 206, generating a target avatar according to the target text and the avatar information.
In some embodiments, the execution body may generate a target avatar according to the target text and the avatar information. Here, the target avatar generally refers to an avatar that takes the user-selected appearance and reads the target text aloud in the user-selected timbre.
Specifically, the execution body may generate an avatar appearance according to the appearance selected by the user in step 201, generate avatar speech according to the timbre selected by the user in step 201 and the target text, and finally generate the target avatar from the avatar appearance and the avatar speech. Here, the avatar may age over time, and the avatar's voice may age accordingly.
Specifically, the execution body may use 3D digital human generation technology, including lip synchronization, 3D modeling, and animation generation. Basic 3D digital figures of different genders, ages, and categories are designed using tools such as MetaHuman, and the facial expressions and body movements of the basic avatars are designed using motion capture, achieving high fidelity to real humans in appearance, expression, voice, and movement. In addition, on the basis of the basic virtual digital figure, the user is allowed to adjust characteristics such as height, weight, body type, hairstyle, and facial features according to preference. For special requirements, a high-fidelity 3D digital figure can be designed from historical image and video data of a real person. Lip synchronization is used to match the mouth shape of the virtual digital figure to the speech content so that the two remain synchronized and consistent, and animation generation is used to obtain dynamic audio and video data of the virtual digital figure.
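This step can be summarized as an orchestration of the components just described; the sketch below shows one plausible flow, where `tts`, `lipsync`, and `renderer` are assumed callables standing in for the speech synthesis, mouth-shape synchronization, and animation generation components (they are not APIs named by the patent), and `avatar_info` reuses the hypothetical fields from the earlier sketch:

    def build_target_avatar(avatar_info, target_text, tts, lipsync, renderer):
        """Orchestration sketch for generating the target avatar."""
        speech_wav = tts(text=target_text, timbre=avatar_info.timbre_id)        # avatar voice in chosen timbre
        mouth_curves = lipsync(audio=speech_wav)                                # per-frame mouth shapes
        video = renderer(appearance=avatar_info.appearance_id,                  # chosen appearance
                         mouth=mouth_curves, audio=speech_wav)
        return video                                                            # dynamic audio-video of the avatar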
In some optional implementations of some embodiments, the execution body may further determine whether the user emotion features meet a preset condition and, if so, send the user emotion feature information, the user video, and the user audio to a target device. Specifically, the user emotion features are compared with preset danger and sensitivity keywords; once a match is found, the user video and user audio can be sent to the target device for manual intervention, and a preset emergency contact of the user is notified.
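A minimal sketch of this escalation check is given below; the keyword list and the callback names are illustrative assumptions:

    DANGER_KEYWORDS = {"self-harm", "hopeless", "give up"}     # illustrative preset keywords

    def maybe_escalate(emotion_label: str, recognized_text: str, user_video, user_audio,
                       send_to_target_device, notify_emergency_contact) -> bool:
        """Escalate for manual intervention when emotion analysis matches preset danger keywords."""
        matched = emotion_label in DANGER_KEYWORDS or any(
            kw in recognized_text for kw in DANGER_KEYWORDS)
        if matched:
            send_to_target_device(emotion_label, user_video, user_audio)
            notify_emergency_contact()
        return matched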
One of the above-described embodiments of the present disclosure has the following beneficial effects: avatar information is determined according to user preferences by interacting with the user through a display interface; a text enabling question-and-answer interaction with the user is generated from the acquired user video and audio; and finally the user's preferred appearance is combined with that text to generate a soothing avatar for the user. As a result, the generated avatar is richer and better matches the user's preferences, is closer to the user, the realism of the virtual digital figure is improved, and better emotional soothing and psychological counseling are provided for the user.
With further reference to fig. 3, as an implementation of the methods illustrated in the above figures, the present disclosure provides some embodiments of an avatar generation apparatus, corresponding to those of the method embodiments illustrated in fig. 2, which may be particularly applicable in various electronic devices.
As shown in fig. 3, the avatar generating apparatus 300 of some embodiments includes: a display unit 301, a determination unit 302, an audio-video acquisition unit 303, a feature extraction unit 304, a target text generation unit 305, and an avatar generation unit 306. The display unit 301 is configured to display a first display interface, where the first display interface includes an avatar control and a timbre control; a determining unit 302, configured to, in response to detecting a selection operation of a user for a target control in the first presentation interface, determine avatar information according to the selection operation; an audio/video acquisition unit 303 configured to acquire a user video and a user audio; a feature extraction unit 304, configured to perform feature extraction on the user video and the user audio to obtain a video feature and an audio feature; a target text generation unit 305 configured to generate a target text based on the video feature and the audio feature; an avatar generating unit 306 configured to generate a target avatar based on the target text and the avatar information.
In an optional implementation of some embodiments, the feature extraction unit is further configured to: carrying out noise reduction processing on the user audio to obtain noise reduction audio; and inputting the noise reduction audio into a pre-trained audio recognition model to obtain the audio characteristics corresponding to the noise reduction audio.
In an optional implementation of some embodiments, the audio recognition model is trained using methods such as sharing acoustic features common to multiple accents, sharing acoustic-model training-network parameters, transfer learning, and semi-supervised training.
In an optional implementation of some embodiments, the target text generation unit is further configured to: performing feature extraction on the audio features by using a pre-trained TextCNN model to obtain text features; performing feature extraction on the video features by using a pre-trained Vision Transformer model to obtain image features; fusing the text features, the audio features and the image features through a pre-trained deep network to obtain user emotion features; and generating a target text according to the user emotional characteristics and the audio characteristics.
In an optional implementation of some embodiments, the target text generation unit is further configured to: construct a psychology knowledge graph and a user information graph; and input the psychology knowledge graph, the user information graph, the audio features, and the user emotion features into a pre-trained text generation model to obtain the target text, wherein the text generation model is obtained by training an autoencoder using a Generative Pre-Training (GPT) method.
In an optional implementation manner of some embodiments, the apparatus further includes a sending unit configured to: and judging whether the user emotional characteristics meet preset conditions or not, and if the user emotional characteristics meet the preset conditions, sending the user emotional characteristic information, the user video and the user audio to target equipment.
It will be understood that the units described in the apparatus 300 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 300 and the units included therein, and are not described herein again.
One of the above-described embodiments of the present disclosure has the following beneficial effects: avatar information is determined according to user preferences by interacting with the user through a display interface; a text enabling question-and-answer interaction with the user is generated from the acquired user video and audio; and finally the user's preferred appearance is combined with that text to generate a soothing avatar for the user. As a result, the generated avatar is richer and better matches the user's preferences, is closer to the user, the realism of the virtual digital figure is improved, and better emotional soothing and psychological counseling are provided for the user.
Referring now to fig. 4, a schematic diagram of an electronic device (e.g., the server of fig. 1) 400 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 4, electronic device 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate with other devices, either wirelessly or by wire, to exchange data. While fig. 4 illustrates an electronic device 400 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 4 may represent one device or may represent multiple devices, as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 409, or from the storage device 408, or from the ROM 402. The computer program, when executed by the processing apparatus 401, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: display a first display interface, wherein the first display interface comprises an avatar control and a timbre control; in response to detecting a selection operation by a user on a target control in the first display interface, determine avatar information according to the selection operation; acquire a user video and user audio; perform feature extraction on the user video and the user audio to obtain video features and audio features; generate a target text according to the video features and the audio features; and generate a target avatar according to the target text and the avatar information.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software, and may also be implemented by hardware. The described units may also be provided in a processor, and may be described as: a processor comprises a display unit, a determination unit, an audio and video acquisition unit, a feature extraction unit, a target text generation unit and an avatar generation unit. Where the names of the elements do not in some cases constitute a limitation on the elements themselves, for example, a display element may also be described as an "element displaying a first presentation interface".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description is only of preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above features, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A method for generating an avatar, comprising:
displaying a first display interface, wherein the first display interface comprises an avatar control and a timbre control;
in response to detecting a selection operation by a user on a target control in the first display interface, determining avatar information according to the selection operation;
acquiring a user video and user audio;
performing feature extraction on the user video and the user audio to obtain video features and audio features;
generating a target text according to the video features and the audio features; and
generating a target avatar according to the target text and the avatar information.
2. The method of claim 1, wherein said feature extracting said user video and said user audio comprises the steps of:
carrying out noise reduction processing on the user audio to obtain noise reduction audio;
and inputting the noise reduction audio into a pre-trained audio recognition model to obtain the audio characteristics corresponding to the noise reduction audio.
3. The method of claim 2, wherein the audio recognition model is trained using methods of sharing acoustic features common to multiple accents, sharing acoustic-model training network parameters, transfer learning, and semi-supervised training.
4. The method of claim 1, wherein said generating target text from said video features and said audio features comprises the steps of:
performing feature extraction on the audio features by using a pre-trained TextCNN model to obtain text features;
performing feature extraction on the video features by using a pre-trained Vision Transformer model to obtain image features;
fusing the text features, the audio features and the image features through a pre-trained deep network to obtain user emotion features;
and generating a target text according to the user emotional characteristics and the audio characteristics.
5. The method of claim 4, wherein the generating target text from the user emotion characteristics and the audio characteristics comprises:
constructing a psychology knowledge graph and a user information graph;
and inputting the psychology knowledge graph, the user information graph, the audio features and the user emotion features into a pre-trained text generation model to obtain a target text, wherein the text generation model is obtained by training an autoencoder using a Generative Pre-Training (GPT) method.
6. The method of claim 4, wherein the method further comprises:
judging whether the emotional characteristics of the user meet preset conditions, and
and if the user emotional characteristics meet preset conditions, sending user emotional characteristic information, the user video and the user audio to target equipment.
7. An apparatus for generating an avatar, comprising:
the display unit is configured to display a first display interface, wherein the first display interface comprises an avatar control and a timbre control;
the determining unit is configured to respond to the detection of the selection operation of a user for the target control in the first display interface, and determine the avatar information according to the selection operation;
an audio/video acquisition unit configured to acquire a user video and a user audio;
the feature extraction unit is configured to perform feature extraction on the user video and the user audio to obtain video features and audio features;
a target text generation unit configured to generate a target text according to the video feature and the audio feature;
an avatar generating unit configured to generate a target avatar according to the target text and the avatar information.
8. The apparatus according to claim 7, wherein the target text generation unit is configured to perform the steps of:
performing feature extraction on the audio features by using a pre-trained TextCNN model to obtain text features;
performing feature extraction on the video features by using a pre-trained Vision Transformer model to obtain image features;
fusing the text features, the audio features and the image features through a pre-trained deep network to obtain user emotion features;
and generating a target text according to the user emotional characteristics and the audio characteristics.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more computer programs stored thereon,
the one or more computer programs, when executed by the one or more processors, implement the method of any of claims 1-6.
10. A computer-readable medium, on which a computer program is stored, wherein the computer program, when executed, implements the method of any of claims 1-6.
CN202210889973.XA 2022-07-27 2022-07-27 Method, apparatus, electronic device and computer readable medium for generating avatar Pending CN115222857A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210889973.XA CN115222857A (en) 2022-07-27 2022-07-27 Method, apparatus, electronic device and computer readable medium for generating avatar

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210889973.XA CN115222857A (en) 2022-07-27 2022-07-27 Method, apparatus, electronic device and computer readable medium for generating avatar

Publications (1)

Publication Number Publication Date
CN115222857A true CN115222857A (en) 2022-10-21

Family

ID=83613629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210889973.XA Pending CN115222857A (en) 2022-07-27 2022-07-27 Method, apparatus, electronic device and computer readable medium for generating avatar

Country Status (1)

Country Link
CN (1) CN115222857A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993876A (en) * 2023-09-28 2023-11-03 世优(北京)科技有限公司 Method, device, electronic equipment and storage medium for generating digital human image
CN116993876B (en) * 2023-09-28 2023-12-29 世优(北京)科技有限公司 Method, device, electronic equipment and storage medium for generating digital human image

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN107657017B (en) Method and apparatus for providing voice service
CN105843381B (en) Data processing method for realizing multi-modal interaction and multi-modal interaction system
CN110298906B (en) Method and device for generating information
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
CN109767765A (en) Talk about art matching process and device, storage medium, computer equipment
CN109710748B (en) Intelligent robot-oriented picture book reading interaction method and system
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN107463700B (en) Method, device and equipment for acquiring information
CN109272984A (en) Method and apparatus for interactive voice
JP2017016566A (en) Information processing device, information processing method and program
KR20210001859A (en) 3d virtual figure mouth shape control method and device
CN111414506B (en) Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium
CN110880198A (en) Animation generation method and device
CN110262665A (en) Method and apparatus for output information
CN109739605A (en) The method and apparatus for generating information
CN109885277A (en) Human-computer interaction device, mthods, systems and devices
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
KR101738142B1 (en) System for generating digital life based on emotion and controlling method therefore
CN112364144A (en) Interaction method, device, equipment and computer readable medium
CN115222857A (en) Method, apparatus, electronic device and computer readable medium for generating avatar
CN113205569A (en) Image drawing method and device, computer readable medium and electronic device
CN112381926A (en) Method and apparatus for generating video
CN111354362A (en) Method and device for assisting hearing-impaired communication
JP2022531994A (en) Generation and operation of artificial intelligence-based conversation systems

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Tian Ye

Inventor after: Tang Yuezhong

Inventor after: Zhang Xiaocan

Inventor after: Chen Yunkun

Inventor after: Chen Xiao

Inventor after: Wang Lei

Inventor before: Tian Ye

Inventor before: Tang Yuezhong

Inventor before: Zhang Xiaocan

Inventor before: Chen Yunkun

Inventor before: Chen Xiao

CB03 Change of inventor or designer information