CN113160819B - Method, apparatus, device, medium, and product for outputting animation - Google Patents

Method, apparatus, device, medium, and product for outputting animation

Info

Publication number
CN113160819B
Authority
CN
China
Prior art keywords
information
animation
voice
user
emotion
Prior art date
Legal status
Active
Application number
CN202110461816.4A
Other languages
Chinese (zh)
Other versions
CN113160819A (en)
Inventor
钟鹏飞
任晓华
车炜春
廖加威
黄晓琳
赵慧斌
董粤强
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110461816.4A
Publication of CN113160819A
Application granted
Publication of CN113160819B
Status: Active
Anticipated expiration

Classifications

    • G06T 13/00 Animation
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G10L 25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L 2015/226 Procedures used during a speech recognition process using non-speech characteristics
    • G10L 2015/227 Procedures used during a speech recognition process using non-speech characteristics of the speaker; Human-factor methodology

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a method, apparatus, device, medium, and product for outputting an animation, relating to the field of computers and, more specifically, to the technical field of artificial intelligence. The implementation scheme is as follows: acquire voice information; determine user characteristic information and voice characteristic information based on the voice information and a preset voice processing model; determine a target animation based on the user characteristic information and the voice characteristic information; and output the target animation. This implementation can improve the diversity of the display effect and makes the animation output more targeted.

Description

Method, apparatus, device, medium, and product for outputting animation
Technical Field
The present disclosure relates to the field of computers, more particularly to the field of artificial intelligence, and specifically to methods, apparatus, devices, media, and products for outputting animations.
Background
At present, human-machine voice interaction is applied more and more widely, and new intelligent voice assistants emerge one after another. A user can input a question to be consulted by voice; the intelligent voice assistant determines the meaning of the question based on speech recognition technology, generates answer information matching that meaning, and outputs the answer information to complete a round of dialogue with the user.
In practical applications, the human-computer interaction interface can present a preset animation display effect. However, because the animation is preset, it is difficult to output different animation effects for different users, so the display effect is monotonous and poorly targeted.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, medium, and product for outputting an animation.
According to a first aspect, there is provided a method for outputting an animation, comprising: acquiring voice information; determining user characteristic information and voice characteristic information based on the voice information and a preset voice processing model; determining a target animation based on the user characteristic information and the voice characteristic information; and outputting the target animation.
According to a second aspect, there is provided an apparatus for outputting an animation, comprising: an information acquisition unit configured to acquire voice information; a feature determination unit configured to determine user feature information and voice feature information based on the voice information and a preset voice processing model; an animation determination unit configured to determine a target animation based on the user feature information and the voice feature information; and an animation output unit configured to output the target animation.
According to a third aspect, there is provided an electronic device that performs a method for outputting an animation, comprising: one or more computing units; a storage unit for storing one or more programs; when the one or more programs are executed by the one or more computing units, the one or more computing units implement the method for outputting an animation as in any of the above.
According to a fourth aspect, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method for outputting an animation as any one of the above.
According to a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a computing unit, implements a method for outputting an animation as in any of the above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method for outputting an animation according to the present application;
FIG. 3 is a schematic illustration of one application scenario of a method for outputting an animation according to the present application;
FIG. 4 is a flow chart of another embodiment of a method for outputting an animation according to the present application;
FIG. 5 is a schematic diagram of the structure of one embodiment of an apparatus for outputting animation according to the present application;
fig. 6 is a block diagram of an electronic device for implementing a method for outputting an animation according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is an exemplary system architecture diagram according to a first embodiment of the present disclosure, which illustrates an exemplary system architecture 100 to which an embodiment of the method for outputting animation of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages and the like. The terminal devices 101, 102, 103 may be mobile phones, computers, tablet computers, etc.; an intelligent voice assistant may be installed on them, and the user can obtain reply messages from the intelligent voice assistant by asking it questions.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, televisions, smartphones, tablets, electronic book readers, car-mounted computers, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. Which may be implemented as multiple software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein.
The server 105 may be a server providing various services, for example, may acquire voice information input by a user and received by the terminal devices 101, 102, 103, and the content of the voice information may be a question asking an intelligent voice assistant. Thereafter, the server 105 may determine user feature information for indicating the user feature and voice feature information for indicating the voice feature based on the voice information and a preset voice processing model. And determining a target animation based on the user characteristic information and the voice characteristic information. Thereafter, the server 105 may return the target animation to the terminal device 101, 102, 103 for outputting the target animation on the human-machine interaction interface of the terminal device 101, 102, 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. When server 105 is software, it may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that, the method for outputting animation provided in the embodiment of the present application may be executed by the terminal devices 101, 102, 103, or may be executed by the server 105. Accordingly, the means for outputting the animation may be provided in the terminal devices 101, 102, 103 or may be provided in the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for outputting an animation according to the present application is shown. The method for outputting an animation of the present embodiment includes the steps of:
step 201, obtaining voice information.
In this embodiment, the executing body (such as the server 105 or the terminal devices 101, 102, 103 in Fig. 1) may acquire the voice information produced during human-machine interaction on an electronic device running the intelligent voice assistant. The voice information may be voice input by the user, such as a question to be queried by voice or a term for voice search, or it may be voice with which the intelligent voice assistant replies to the user, such as a spoken answer to the user's question or a voice broadcast of a term; this embodiment does not limit it. Optionally, the human-machine interaction interface may be preset with a number of common questions, and the user may select a specific common question to trigger on the interface so as to search for its answer. In that case, the executing body may control the voice broadcast of the specified common question and determine the broadcast voice of that question as the voice information.
Step 202, determining user characteristic information and voice characteristic information based on the voice information and a preset voice processing model.
In this embodiment, the executing body may use a pre-trained speech processing model to process the voice information and, based on that analysis, obtain the user characteristic information and the voice characteristic information. The user characteristic information describes the user, such as age, gender, emotional state, and the user's historical questions; the voice characteristic information describes the voice itself, such as volume, timbre, and the textual emotional state of the voice information; this embodiment does not limit these. When the voice information is produced by the intelligent voice assistant, the user characteristic information may be preset characteristic information matching the assistant, such as the parameters currently set for its voice broadcast function, and the voice characteristic information may be the textual emotional state of the currently broadcast voice, the volume of the broadcast, and so on; this embodiment does not limit these either.
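As a minimal illustration of step 202, the sketch below (in Python) shows one way the two kinds of characteristic information could be represented and filled from a pre-trained model. The class names, field names, and the `speech_model.analyze` interface are assumptions made for the example; the patent does not prescribe any particular data layout or model API.

```python
from dataclasses import dataclass

# Hypothetical containers for the two kinds of features described in step 202.
@dataclass
class UserFeatures:
    age: int        # estimated from the acoustic signal
    emotion: str    # e.g. "positive", "neutral", "negative"

@dataclass
class VoiceFeatures:
    text_emotion: str   # emotion inferred from the recognized text
    volume: float       # loudness of the utterance
    pitch: float        # fundamental frequency
    timbre: float       # quantized tone color

def extract_features(audio: bytes, speech_model) -> tuple[UserFeatures, VoiceFeatures]:
    """Run the pre-trained speech processing model on raw audio and split its
    output into user features and voice features (step 202)."""
    result = speech_model.analyze(audio)   # assumed model interface
    user = UserFeatures(age=result["age"], emotion=result["user_emotion"])
    voice = VoiceFeatures(
        text_emotion=result["text_emotion"],
        volume=result["volume"],
        pitch=result["pitch"],
        timbre=result["timbre"],
    )
    return user, voice
```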
It should be noted that, for the above-mentioned speech processing model, the existing speech recognition and artificial intelligence related technologies may be adopted to implement analysis of speech to obtain the above-mentioned feature information, and specific implementation principles are not repeated here.
Step 203, determining the target animation based on the user characteristic information and the voice characteristic information.
In this embodiment, the target animation is the animation effect presented by the human-machine interaction interface during human-machine interaction. The target animation may include a spectrum animation, an expression animation, a custom animation, and the like, which is not limited in this embodiment. Optionally, the executing body may store in advance the correspondence between user characteristic information, voice characteristic information, and target animations; after obtaining the user characteristic information and the voice characteristic information, it determines the corresponding target animation from this correspondence. A target animation determined this way reflects both the user characteristics and the voice characteristics, so the animation effect is better targeted. For example, suppose the steps above determine that the user is six years old and happy, and the voice characteristic information is derived from the text of the voice information; a target animation that suits a six-year-old with a lively, happy mood can then be looked up in the preset database based on the correspondence.
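Continuing the sketch above, step 203 can be pictured as a lookup in a preset correspondence table. The table keys, animation names, and the age bracketing below are invented purely for illustration; only the existence of such a correspondence is required by the description.

```python
# Illustrative correspondence from (age bracket, user emotion, text emotion)
# to an animation identifier; entries are placeholders.
ANIMATION_TABLE = {
    ("child", "positive", "positive"): "lively_happy_spectrum",
    ("child", "negative", "negative"): "soft_soothing_spectrum",
    ("adult", "neutral",  "neutral"):  "default_spectrum",
}

def age_bracket(age: int) -> str:
    # Hypothetical bracketing; the patent does not define age groups.
    return "child" if age < 12 else "adult"

def lookup_target_animation(user, voice) -> str:
    """Step 203: pick the target animation from the preset correspondence."""
    key = (age_bracket(user.age), user.emotion, voice.text_emotion)
    return ANIMATION_TABLE.get(key, "default_spectrum")
```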
Step 204, outputting the target animation.
In this embodiment, after the execution subject determines the target animation, the target animation may be output on the human-computer interaction interface. The human-computer interaction interface can be divided into a plurality of display areas in advance, and each display area can be correspondingly provided with different animations. After the target animation is determined, the target animation can be output to a display area corresponding to the target animation, so that various display requirements are realized. In addition, in the case where the voice information is voice input by the user, the target animation may be displayed for a period of time during which the user inputs voice. In the case where the voice information is voice output by the intelligent voice assistant, the target animation may be displayed during a period of time when the voice is output by the intelligent voice assistant. In addition, the execution body may also receive a user-defined setting of the target animation display period, which is not limited in this embodiment.
With continued reference to fig. 3, a schematic diagram of one application scenario of a method for outputting animation according to the present application is shown. In the application scenario of fig. 3, the execution body may be a terminal device, and the terminal device supports running a related application of the intelligent voice assistant, where functions such as man-machine interaction voice query may be implemented. The man-machine interaction interface 301 of the intelligent voice assistant in the terminal device may display a question 302 of user voice input, and may also display an intelligent voice assistant reply message 303 output for the question 302. Questions 302 may be derived by parsing speech input by the user. After the voice input by the user is obtained, the user characteristic information and the voice characteristic information can be determined based on a preset voice processing model, the target animation 304 is generated based on the user characteristic information and the voice characteristic information, and the target animation 304 is output on the human-computer interaction interface. It will be appreciated that the target animation 304 in FIG. 3 is merely an example and is not limiting of the display form of the target animation.
The method for outputting the animation provided by the embodiment of the application can extract the user characteristic information and the voice characteristic information based on the voice information and the preset voice processing model, and determine the target animation based on the user characteristic information and the voice characteristic information. Under the scene of man-machine interaction, the output target animation can be determined in a targeted manner based on the user characteristic information and the voice characteristic information, and compared with playing the preset animation, the richness and the pertinence of the display effect can be improved.
With continued reference to FIG. 4, a flow 400 of another embodiment of a method for outputting an animation according to the present application is shown. As shown in fig. 4, the method for outputting an animation of the present embodiment may include the steps of:
step 401, in response to detecting the voice wake-up instruction, outputting a preset initial animation and acquiring voice information.
In this embodiment, the executing body may be in a sleep state when no human-machine interaction is taking place. It then detects whether a voice wake-up instruction is received. The voice wake-up instruction is used to switch the executing body from the sleep state to the awake state, and its specific form may be a preset keyword. If the keyword is detected, a preset initial animation may be output and voice information acquired. Optionally, the wake-up instruction may simply be the detection of speech: once the executing body detects that the user is speaking, it may output the preset initial animation and obtain what the user actually says, that is, the voice information described above.
It should be noted that, for the step of obtaining the voice information, please refer to the description of step 201, and the description is omitted here.
Step 402, determining user characteristic information and voice characteristic information based on the voice information and a preset voice processing model; the user characteristic information at least comprises user age and/or user emotion information; the voice characteristic information at least comprises text emotion information, volume, pitch, and/or timbre.
In this embodiment, the user characteristic information includes the user's age and user emotion information. The age can be obtained by voice analysis of the voice information, and the user emotion information by analyzing its acoustic characteristics. The voice characteristic information includes text emotion information, volume, pitch, and timbre. The text emotion information can be obtained by extracting the semantics of the voice information and analyzing them; the volume, pitch, and timbre can be obtained by analyzing the acoustic characteristics of the voice information. Optionally, the user emotion information may be determined by a preset emotion recognition model, which can be trained as follows: acquire a sample set to be trained and the emotion labeling information corresponding to each sample voice in the set; input each sample voice into the model to be trained, which performs audio feature analysis and outputs an emotion recognition result; and adjust the model parameters based on the difference between the emotion recognition result and the emotion labeling information until the model converges, yielding the emotion recognition model. For text emotion information, the voice information can first be converted into text, which is then recognized by a trained text emotion recognition model. The training of the text emotion recognition model follows the same principle as that of the emotion recognition model and is not repeated here.
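The training procedure described above is a standard supervised loop. The sketch below shows it with PyTorch under the assumption that the sample voices have already been turned into fixed-size audio-feature tensors and the emotion labels into integers; the network architecture, optimizer, and epoch count are illustrative stand-ins, not part of the patent.

```python
import torch
from torch import nn

def train_emotion_recognizer(features, labels, num_emotions=10, epochs=20):
    """Train an emotion recognition model from labeled sample voices.
    features: FloatTensor [num_samples, feature_dim]; labels: LongTensor [num_samples]."""
    model = nn.Sequential(
        nn.Linear(features.shape[1], 128),
        nn.ReLU(),
        nn.Linear(128, num_emotions),   # one logit per basic emotion
    )
    loss_fn = nn.CrossEntropyLoss()     # difference between prediction and labels
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(epochs):             # "until the model converges" is simplified here
        optimizer.zero_grad()
        logits = model(features)        # audio feature analysis by the model
        loss = loss_fn(logits, labels)  # compare with the emotion labeling information
        loss.backward()                 # adjust parameters based on the difference
        optimizer.step()
    return model
```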
It should be noted that, the detailed description of step 402 is referred to the detailed description of step 202, and will not be repeated here.
Step 403, determining animation color information based on the user age, the user emotion information, and/or the text emotion information.
In this embodiment, the animation color information is used to render a specified graphic, so that the target animation is generated from the rendered graphic. The specified graphic in Fig. 3, for example, is a row of equally spaced deformed circles. The specified graphic may be adjusted to the user's actual needs, which is not limited in this embodiment.
In some optional implementations of the present embodiment, determining the animated color information based on the user age, user mood information, and/or text mood information includes: determining an age quantified value corresponding to the age of the user, a user emotion quantified value corresponding to the user emotion information and/or a text emotion quantified value corresponding to the text emotion information; the animation color information is determined based on the age quantization value, the user emotion quantization value, and/or the text emotion quantization value.
In this implementation, to determine the animation color information, the user age, the user emotion information, and/or the text emotion information described above are converted into quantized values. Specifically, for the user's age, a specified age region may be mapped proportionally onto a specified quantized-value interval. For example, the age range of 3 to 92 years may be mapped proportionally onto the numerical range 0 to 360, with a greater age giving a greater quantized value. For the user emotion, the user emotion information obtained by the emotion recognition model described above may cover ten basic emotions, specifically interest, pleasure, surprise, sadness, anger, aversion, contempt, fear, shyness, and timidity. Based on these basic emotions, the probability that the user emotion information is positive, neutral, or negative may be further determined, and that probability mapped onto a specified quantized-value region according to a specified mapping relation. For example, the probabilities may be mapped onto the range 0 to 360, with a larger probability of a positive emotion giving a larger quantized value. For the text emotion, the emotion obtained by the text emotion recognition model may cover the same ten basic emotions, and by the same principle the probability that the text emotion information is positive, neutral, or negative can be mapped onto the designated quantized-value region. A margin of error may also be allowed for these quantized values, for example a fluctuation of 10 up or down.
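A minimal sketch of the quantization described above, assuming straightforward linear mappings (the patent specifies only that the mappings are proportional and that larger ages or more positive probabilities give larger values):

```python
def quantize_age(age: float, lo: float = 3.0, hi: float = 92.0) -> float:
    """Map the stated age range (3 to 92 years) proportionally onto 0..360."""
    age = min(max(age, lo), hi)              # clamp to the specified region
    return (age - lo) / (hi - lo) * 360.0

def quantize_emotion(p_positive: float) -> float:
    """Map the probability that the emotion is positive (0..1) onto 0..360;
    a larger probability yields a larger quantized value."""
    return min(max(p_positive, 0.0), 1.0) * 360.0

# Example: a 6-year-old user whose speech is judged 80% likely to be positive.
age_q = quantize_age(6)                  # about 12.1
user_emotion_q = quantize_emotion(0.8)   # 288.0
```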
Further, determining the animation color information based on the age quantization value, the user emotion quantization value, and/or the text emotion quantization value may include: determining, in the specified graphic, a lower-layer hue corresponding to the age quantization value, a middle-layer hue corresponding to the user emotion quantization value, and an upper-layer hue corresponding to the text emotion quantization value. For example, the animation color information may be expressed in the HSB color model, in which H ranges from 0 to 360, S from 0 to 100, and B from 0 to 100. In this embodiment, 0 to 360 may be used as the range of the quantized values, and the hue (H value) of each layer is determined from the corresponding quantized value. The S and B values of each layer may be preset fixed values.
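Building on the quantized values from the previous sketch, the per-layer hue assignment could look like the following; the fixed S and B values are arbitrary choices for the example, since the description only says they are preset:

```python
def layer_colors(age_q: float, user_emotion_q: float, text_emotion_q: float,
                 saturation: int = 80, brightness: int = 90) -> dict:
    """Assign an HSB color to each layer of the specified graphic:
    lower-layer hue from the age value, middle-layer hue from the user emotion,
    upper-layer hue from the text emotion. S and B are preset fixed values."""
    return {
        "lower":  {"h": age_q % 360,          "s": saturation, "b": brightness},
        "middle": {"h": user_emotion_q % 360, "s": saturation, "b": brightness},
        "upper":  {"h": text_emotion_q % 360, "s": saturation, "b": brightness},
    }
```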
At step 404, animation deformation information is determined based on the volume, pitch, and/or timbre.
In this embodiment, the animation deformation information is used to deform the specified graphic so that the target animation is generated from the deformed graphic. For example, if a row of equally spaced circles is deformed, e.g., stretched, a row of equally spaced deformed circles is obtained, and a target animation such as target animation 304 in Fig. 3 is generated from them. Further, the specified graphic may include an upper layer, a middle layer, and a lower layer; the deformation of the lower-layer graphic may be controlled by the volume, that of the middle-layer graphic by the pitch, and that of the upper-layer graphic by the timbre. Optionally, a range may be set for the deformation: for a change in the vertical height of the graphic, the height value may be confined to a specified interval. For example, the maximum vertical height of the graphic controlled by the volume may be set to 10 and the minimum to 1; for the pitch, a maximum of 8 and a minimum of 1; and for the timbre, a maximum of 6 and a minimum of 1. A Fourier transform may be applied to the timbre to obtain a quantized timbre value, and the deformation of the vertical height of the graphic is then controlled by that value.
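The sketch below illustrates this deformation control. The spectral-centroid definition of the timbre quantity and the assumption that volume and pitch are already normalized to 0..1 are choices made for the example; the description only requires a Fourier-transform-based timbre value and the height ranges 1..10, 1..8, and 1..6.

```python
import numpy as np

def _clamp01(x: float) -> float:
    return min(max(x, 0.0), 1.0)

def timbre_value(frame: np.ndarray) -> float:
    """Quantize timbre via a Fourier transform: here the normalized spectral
    centroid of one audio frame (an illustrative choice)."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.arange(len(spectrum))
    centroid = (spectrum * freqs).sum() / (spectrum.sum() + 1e-9)
    return centroid / len(spectrum)          # roughly in 0..1

def layer_heights(volume: float, pitch: float, timbre: float) -> dict:
    """Map each feature (assumed normalized to 0..1) to the vertical height of
    its layer: volume -> lower layer (1..10), pitch -> middle layer (1..8),
    timbre -> upper layer (1..6), per the example ranges above."""
    return {
        "lower":  1 + _clamp01(volume) * (10 - 1),
        "middle": 1 + _clamp01(pitch)  * (8 - 1),
        "upper":  1 + _clamp01(timbre) * (6 - 1),
    }
```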
Step 405, generating a spectrum animation according to the animation color information and/or the animation deformation information.
In this embodiment, the animation color information and/or the animation deformation information described above are used to generate the spectrum animation. The spectrum animation controls the rate of animation change of the graphic based on changes in volume, pitch, and timbre, and/or controls the color change of the graphic based on changes in the user's age, emotion, and text emotion, so it can reflect both the user characteristics and the voice characteristics.
In some optional implementations of the present embodiments, generating the spectral animation according to the animation color information and/or the animation morphing information includes: acquiring each preset layer; performing color rendering on each preset layer based on animation color information to obtain a first processing result; performing deformation processing on each preset layer based on animation deformation information to obtain a second processing result; and generating the spectrum animation based on the first processing result and/or the second processing result.
In this implementation, the preset layers may be the upper, middle, and lower layers of the specified graphic; the number of layers may also differ from three, which is not limited in this embodiment. The animation color information is used to color-render each layer, and the animation deformation information to deform the graphics in each layer. The first processing result is the set of layers after color rendering, and the second processing result is the set of layers after deformation. The final spectrum animation may be obtained by combining the layers of the first processing result, by combining the layers of the second processing result, or by combining the layers of both, that is, an animation that has undergone both color rendering and deformation.
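Combining the two processing results can be as simple as pairing, per layer, the rendered color with the deformed height for each animation frame, as in this sketch (the frame representation is hypothetical; a real implementation would drive the renderer of the deformed circles in Fig. 3):

```python
def render_spectrum_frame(colors: dict, heights: dict) -> list:
    """Combine the color rendering (first processing result) and the deformation
    (second processing result) into one frame of the spectrum animation."""
    frame = []
    for name in ("lower", "middle", "upper"):
        frame.append({
            "layer": name,
            "color": colors[name],    # from the animation color information
            "height": heights[name],  # from the animation deformation information
        })
    return frame

# One frame built from the sketches above (text emotion value chosen arbitrarily):
# frame = render_spectrum_frame(layer_colors(age_q, user_emotion_q, 200.0),
#                               layer_heights(0.6, 0.4, 0.3))
```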
Step 406, determining a speech emotion category based on the user emotion information and/or the text emotion information.
In this embodiment, the user emotion information is mainly obtained based on the sound characteristic analysis of the voice information, the text emotion information is mainly obtained by identifying the text content of the voice information, and more accurate voice emotion types can be obtained by combining the two. The categories of speech emotion may include positive, neutral, and negative.
Step 407, determining the expression animation matched with the voice emotion type.
In this embodiment, an expression database may be preset, including a plurality of expression animations that are positively matched, a plurality of expression animations that are neutral matched, and a plurality of expression animations that are negatively matched.
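A toy version of such an expression database and the matching in step 407 might look as follows; the category names mirror the positive/neutral/negative classes above, while the file names and the random selection are illustrative assumptions:

```python
import random

# Illustrative expression database: several candidate expression animations per
# speech emotion category. The file names are placeholders.
EXPRESSION_DB = {
    "positive": ["smile.webp", "laugh.webp", "wink.webp"],
    "neutral":  ["blink.webp", "nod.webp"],
    "negative": ["concern.webp", "comfort.webp"],
}

def pick_expression(speech_emotion: str) -> str:
    """Steps 406-407: choose an expression animation matching the emotion category."""
    candidates = EXPRESSION_DB.get(speech_emotion, EXPRESSION_DB["neutral"])
    return random.choice(candidates)
```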
It should be noted that, the steps 406-407 and the steps 403-405 are used to generate the expression animation and the spectrum animation, respectively. In practical application, the generation can be selected, or the generation can be performed simultaneously. For the case of simultaneous generation, steps 406-407 may be performed after steps 403-405, before steps 403-405, or steps 406-407 may be performed simultaneously with steps 403-405, which is not limited in this embodiment.
Step 408, outputting the target animation.
In this embodiment, the target animation may include the above-mentioned expression animation and/or spectrum animation, and may be selectively output or simultaneously output. For a detailed description of step 408, please refer to the detailed description of step 204, which is not repeated here.
Step 409, determining answer text information corresponding to the voice information.
In the present embodiment, if the voice information is voice input by the user, the execution subject may generate the answer text information based on a preset answer policy. The answer text information is information having an association relationship with the voice information.
Step 410, a text emotion category of the answer text information is determined.
In this embodiment, the text emotion types may also include the above positive, neutral and negative directions, and for determining the text emotion type of the answer text information, a text emotion recognition model may be preset, and the training principle of the text emotion recognition model is referred to the above description and will not be repeated here.
Step 411, generating an answer expression corresponding to the text emotion category.
In this embodiment, the answer expressions corresponding to the text emotion categories may include a plurality of expression animations matching the positive category, a plurality matching the neutral category, and a plurality matching the negative category.
Step 412, outputting the answer text information and the answer expression.
In this embodiment, the executing body may output the answer text information and the answer expression at the same time, or may alternatively output the answer text information and the answer expression, which is not limited in this embodiment. When outputting the answer expression and the expression animation, one or more expression animations corresponding to the emotion categories may be selected and output, or a plurality of expression animations may be simultaneously output.
Further, if in the scenario of multi-round human-computer interaction, steps 402 to 412 may be repeatedly performed after the initial animation is output.
The method for outputting an animation provided in the above embodiment of the application may further determine animation color information based on the user's age, user emotion information, and/or text emotion information, and determine animation deformation information based on volume, pitch, and/or timbre, thereby generating a spectrum animation with color change and/or deformation and further enriching the display effect. In addition, the color of the spectrum animation can reflect the user emotion information and/or the text emotion information, and the deformation can reflect the volume, pitch, and/or timbre, which increases the amount of information the animation carries. The method can also output an expression animation consistent with the emotion of the voice information and an answer expression reflecting the emotion of the answer text information, which adds emotional character to the animation and improves the human-machine interaction experience.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of an apparatus for outputting an animation, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various servers.
As shown in fig. 5, the apparatus 500 for outputting an animation of the present embodiment includes: an information acquisition unit 501, a feature determination unit 502, an animation determination unit 503, and an animation output unit 504.
The information acquisition unit 501 is configured to acquire voice information.
The feature determining unit 502 is configured to determine user feature information and voice feature information based on the voice information and a preset voice processing model.
The animation determination unit 503 is configured to determine a target animation based on the user feature information and the voice feature information.
The animation output unit 504 is configured to output a target animation.
In some optional implementations of the present embodiment, the user characteristic information includes at least user age and/or user mood information; the voice characteristic information at least comprises text emotion information, volume, tone and/or tone color; the target animation at least comprises a spectrum animation; and, the animation determination unit 503 is further configured to: determining animation color information based on the user age, the user mood information, and/or the text mood information; determining animation deformation information based on volume, tone, and/or timbre; and generating the frequency spectrum animation according to the animation color information and/or the animation deformation information.
In some optional implementations of the present embodiment, the animation determination unit 503 is further configured to: determining an age quantified value corresponding to the age of the user, a user emotion quantified value corresponding to the user emotion information and/or a text emotion quantified value corresponding to the text emotion information; the animation color information is determined based on the age quantization value, the user emotion quantization value, and/or the text emotion quantization value.
In some optional implementations of the present embodiment, the animation determination unit 503 is further configured to: acquiring each preset layer; performing color rendering on each preset layer based on animation color information to obtain a first processing result; performing deformation processing on each preset layer based on animation deformation information to obtain a second processing result; and generating the spectrum animation based on the first processing result and/or the second processing result.
In some optional implementations of the present embodiment, the target animation further includes an expression animation; and, the animation determination unit 503 is further configured to: determining a speech emotion category based on the user emotion information and/or the text emotion information; and determining the expression animation matched with the voice emotion type.
In some optional implementations of this embodiment, the apparatus further includes: an answer determining unit configured to determine answer text information corresponding to the voice information; a category determining unit configured to determine a text emotion category of the answer text information; an expression generating unit configured to generate an answer expression corresponding to the text emotion category; and an expression output unit configured to output the answer text information and the answer expression.
In some optional implementations of the present embodiment, the information acquisition unit 501 is further configured to: and responding to the detection of the voice wake-up instruction, outputting a preset initial animation and acquiring voice information.
It should be understood that the units 501 to 504 described in the apparatus 500 for outputting an animation correspond to the respective steps in the method described with reference to fig. 2, respectively. Thus, the operations and features described above with respect to the method of outputting animation are equally applicable to the apparatus 500 and the units contained therein, and are not described in detail herein.
According to embodiments of the present application, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a block diagram of an electronic device 600 for implementing a method for outputting an animation according to an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, for example, a method for outputting an animation. For example, in some embodiments, the method for outputting an animation may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method for outputting an animation described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method for outputting the animation in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or a middleware component (e.g., an application server), or a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A method for outputting an animation, comprising:
acquiring voice information;
determining user characteristic information and voice characteristic information based on the voice information and a preset voice processing model;
determining a target animation based on the user characteristic information and the voice characteristic information;
outputting the target animation; the user characteristic information at least comprises user age and/or user emotion information; the voice characteristic information at least comprises text emotion information, volume, pitch and timbre; the target animation at least comprises a frequency spectrum animation; and
the determining a target animation based on the user characteristic information and the voice characteristic information comprises the following steps:
determining animation color information based on the user age, the user mood information, and/or the text mood information;
determining animation deformation information based on the volume, the pitch and the timbre, wherein the animation deformation information is used for performing deformation processing on a specified graphic, the specified graphic comprises an upper layer, a middle layer and a lower layer, and determining the animation deformation information based on the volume, the pitch and the timbre comprises the following steps: controlling the deformation of the lower-layer graphic based on the volume, controlling the deformation of the middle-layer graphic based on the pitch, and controlling the deformation of the upper-layer graphic based on the timbre;
and generating the frequency spectrum animation according to the animation color information and the animation deformation information.
2. The method of claim 1, wherein the determining animated color information based on the user age, the user mood information, and/or the text mood information comprises:
determining an age quantified value corresponding to the user age, a user emotion quantified value corresponding to the user emotion information, and/or a text emotion quantified value corresponding to the text emotion information;
the animation color information is determined based on the age quantization value, the user emotion quantization value, and/or the text emotion quantization value.
3. The method of claim 1, wherein said generating the frequency spectrum animation according to the animation color information and the animation deformation information comprises:
acquiring each preset layer;
performing color rendering on each preset layer based on the animation color information to obtain a first processing result;
performing deformation processing on each preset layer based on the animation deformation information to obtain a second processing result;
and generating the spectrum animation based on the first processing result and the second processing result.
4. The method of claim 1, wherein the target animation further comprises an expression animation; and
the determining a target animation based on the user characteristic information and the voice characteristic information comprises the following steps:
determining a speech emotion category based on the user emotion information and/or the text emotion information;
determining the expression animation matched with the voice emotion type.
5. The method of claim 1, wherein the method further comprises:
determining answer text information corresponding to the voice information;
determining a text emotion category of the answer text information;
generating an answer expression corresponding to the text emotion type;
and outputting the answer text information and the answer expression.
6. The method of claim 1, wherein the obtaining speech information comprises:
in response to detecting a voice wake-up instruction, outputting a preset initial animation and acquiring the voice information.
7. An apparatus for outputting an animation, comprising:
an information acquisition unit configured to acquire voice information;
a feature determination unit configured to determine user feature information and voice feature information based on the voice information and a preset voice processing model;
an animation determination unit configured to determine a target animation based on the user feature information and the voice feature information;
an animation output unit configured to output the target animation; the user characteristic information at least comprises user age and/or user emotion information; the voice characteristic information at least comprises text emotion information, volume, pitch and timbre; the target animation at least comprises a frequency spectrum animation; and
the animation determination unit is further configured to:
determining animation color information based on the user age, the user mood information, and/or the text mood information;
determining animation deformation information based on the volume, the pitch and the timbre, wherein the animation deformation information is used for performing deformation processing on a specified graphic, the specified graphic comprises an upper layer, a middle layer and a lower layer, and determining the animation deformation information based on the volume, the pitch and the timbre comprises the following steps: controlling the deformation of the lower-layer graphic based on the volume, controlling the deformation of the middle-layer graphic based on the pitch, and controlling the deformation of the upper-layer graphic based on the timbre;
and generating the spectrum animation according to the animation color information and the animation deformation information.
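One possible reading of the per-layer deformation rule in claim 7, with volume driving the lower layer, pitch the middle layer, and timbre the upper layer; the normalization ranges and the linear scaling are assumptions made for the example.

```python
def deformation_info(volume_db: float, pitch_hz: float, timbre_brightness: float) -> dict:
    """Map the three voice features to scale factors for the three preset layers."""
    def norm(value, lo, hi):
        # Clamp the feature into [0, 1] over an assumed working range.
        return min(max((value - lo) / (hi - lo), 0.0), 1.0)

    return {
        "lower":  1.0 + norm(volume_db, 30.0, 90.0),         # louder speech -> larger lower layer
        "middle": 1.0 + norm(pitch_hz, 80.0, 400.0),         # higher pitch -> larger middle layer
        "upper":  1.0 + norm(timbre_brightness, 0.0, 1.0),   # brighter timbre -> larger upper layer
    }

print(deformation_info(volume_db=65.0, pitch_hz=220.0, timbre_brightness=0.7))
```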
8. The apparatus of claim 7, wherein the animation determination unit is further configured to:
determining an age quantified value corresponding to the user age, a user emotion quantified value corresponding to the user emotion information, and/or a text emotion quantified value corresponding to the text emotion information;
and determining the animation color information based on the age quantified value, the user emotion quantified value, and/or the text emotion quantified value.
9. The apparatus of claim 7, wherein the animation determination unit is further configured to:
acquiring each preset layer;
performing color rendering on each preset layer based on the animation color information to obtain a first processing result;
performing deformation processing on each preset layer based on the animation deformation information to obtain a second processing result;
and generating the spectrum animation based on the first processing result and the second processing result.
10. The apparatus of claim 7, wherein the target animation further comprises an expression animation; and
the animation determination unit is further configured to:
determining a speech emotion category based on the user emotion information and/or the text emotion information;
determining the expression animation matched with the speech emotion category.
11. The apparatus of claim 7, wherein the apparatus further comprises:
an answer determining unit configured to determine answer text information corresponding to the voice information;
a category determining unit configured to determine a text emotion category of the answer text information;
an expression generating unit configured to generate an answer expression corresponding to the text emotion category;
and an expression output unit configured to output the answer text information and the answer expression.
12. The apparatus of claim 7, wherein the information acquisition unit is further configured to:
responding to a voice wake-up instruction, outputting a preset initial animation, and acquiring the voice information.
13. An electronic device that performs a method for outputting an animation, comprising:
at least one computing unit; and
a storage unit in communication with the at least one computing unit; wherein
the storage unit stores instructions executable by the at least one computing unit to enable the at least one computing unit to perform the method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.
CN202110461816.4A 2021-04-27 2021-04-27 Method, apparatus, device, medium, and product for outputting animation Active CN113160819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110461816.4A CN113160819B (en) 2021-04-27 2021-04-27 Method, apparatus, device, medium, and product for outputting animation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110461816.4A CN113160819B (en) 2021-04-27 2021-04-27 Method, apparatus, device, medium, and product for outputting animation

Publications (2)

Publication Number Publication Date
CN113160819A CN113160819A (en) 2021-07-23
CN113160819B true CN113160819B (en) 2023-05-26

Family

ID=76871896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110461816.4A Active CN113160819B (en) 2021-04-27 2021-04-27 Method, apparatus, device, medium, and product for outputting animation

Country Status (1)

Country Link
CN (1) CN113160819B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113707146A (en) * 2021-08-31 2021-11-26 北京达佳互联信息技术有限公司 Information interaction method and information interaction device
CN113763968B (en) * 2021-09-08 2024-05-07 北京百度网讯科技有限公司 Method, apparatus, device, medium, and product for recognizing speech
CN113744369A (en) * 2021-09-09 2021-12-03 广州梦映动漫网络科技有限公司 Animation generation method, system, medium and electronic terminal

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI454955B (en) * 2006-12-29 2014-10-01 Nuance Communications Inc An image-based instant message system and method for providing emotions expression
CN105930035A (en) * 2016-05-05 2016-09-07 北京小米移动软件有限公司 Interface background display method and apparatus
CN106328164A (en) * 2016-08-30 2017-01-11 上海大学 Ring-shaped visualized system and method for music spectra
CN109885277A (en) * 2019-02-26 2019-06-14 百度在线网络技术(北京)有限公司 Human-computer interaction device, mthods, systems and devices
CN112287129A (en) * 2019-07-10 2021-01-29 阿里巴巴集团控股有限公司 Audio data processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN113160819A (en) 2021-07-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant