CN112529992B - Dialogue processing method, device, equipment and storage medium of virtual image


Info

Publication number
CN112529992B
CN112529992B (application CN201910818804.5A)
Authority
CN
China
Prior art keywords
information, audio, data, avatar, dimensional model
Prior art date
Legal status
Active
Application number
CN201910818804.5A
Other languages
Chinese (zh)
Other versions
CN112529992A (en)
Inventor
吴淑明
王思杰
陈永波
刘宗杰
王甫
林冠芠
周芷慧
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910818804.5A
Publication of CN112529992A
Application granted
Publication of CN112529992B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q 30/00 Commerce
    • G06Q 30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0281 Customer communication at a business location, e.g. providing product or service information, consulting
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • G06T 2213/00 Indexing scheme for animation
    • G06T 2213/12 Rule based animation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Mathematical Physics (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computer Graphics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the invention provide a dialogue processing method, apparatus, device, and storage medium for an avatar. The method includes: determining dialogue information of the current avatar, the dialogue information including voice information; generating rendering data for the avatar that currently needs to be rendered according to the playback timing of the voice information; and rendering the rendering data to display the current avatar. A realistic interactive scene can be provided to the user through the body language and/or mouth movements of the 3D object, which increases the user's engagement in the human-computer interaction process and improves the user experience.

Description

Dialogue processing method, device, equipment and storage medium of virtual image
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, device, and storage medium.
Background
With the rapid development of Artificial Intelligence (AI) and the Internet of Things (IoT), the Artificial Intelligence of Things (AIoT) has emerged.
AIoT is widely used in many different fields, for example in online voice-interaction robots and online navigation robots. At present, an online voice-interaction robot can hold an interactive question-and-answer session with a user through a display interface; however, the robot's visual presentation and its voice dialogue are often only loosely matched, so the robot's on-screen image is displayed poorly.
Disclosure of Invention
One or more embodiments of the invention describe a data processing method, apparatus, device, and storage medium that address the problem of the robot's image being displayed poorly during human-computer interaction.
To solve this technical problem, the invention is implemented as follows:
according to a first aspect, there is provided a dialog processing method of an avatar, the method may include:
determining dialog information of the current virtual image, wherein the dialog information comprises voice information;
generating rendering data of the virtual image to be rendered currently according to the playing time sequence of the voice information;
and rendering the rendering data to show the current virtual image.
According to a second aspect, there is provided a method of processing a three-dimensional model, the method may comprise:
determining display information of the current three-dimensional model, wherein the display information comprises voice information and expression information;
generating rendering data of the three-dimensional image model needing to be rendered at present according to the playing time sequence of the voice information and the display characteristics of the three-dimensional model corresponding to the expression information;
and rendering the rendering data to show the conversation and the action of the current three-dimensional image model.
According to a third aspect, there is provided a data processing method, which may comprise:
acquiring response text information corresponding to the query information based on the received query information, wherein the response text information comprises at least one emotion tag;
acquiring action information corresponding to the response text information according to the emotion label;
converting the response text information into voice information played according to time sequence;
associating the action information and the expression information with the voice information played in time order to determine rendering data, wherein the expression information is derived from the voice information;
and performing 3D rendering on the rendering data to obtain rendering data of the virtual image.
According to a fourth aspect, there is provided a dialog processing apparatus of an avatar, the apparatus may include:
the processing module is used for determining the dialogue information of the current virtual image, and the dialogue information comprises voice information;
the generating module is used for generating rendering data of the virtual image to be rendered currently according to the playing time sequence of the voice information;
and the rendering module is used for rendering the rendering data so as to display the current virtual image.
According to a fifth aspect, there is provided an apparatus for processing a three-dimensional model, the apparatus may comprise:
the processing module is used for determining display information of the current three-dimensional model, wherein the display information comprises voice information and expression information;
the generating module is used for generating rendering data of the three-dimensional image model needing rendering at present according to the playing time sequence of the voice information and the display characteristics of the three-dimensional model corresponding to the expression information;
and the rendering module is used for rendering the rendering data so as to display the conversation and the action of the current three-dimensional image model.
According to a sixth aspect, there is provided a data processing apparatus, which may include:
the first acquisition module is used for acquiring response text information corresponding to the query information based on the received query information, and the response text information comprises at least one emotion tag;
the second acquisition module is used for acquiring action information corresponding to the response text information according to the emotion label;
the conversion module is used for converting the response text information into voice information played according to time sequence;
the processing module is used for associating the action information and the expression information with the voice information played in time order and determining rendering data, wherein the expression information is derived from the voice information;
and the rendering module is used for performing 3D rendering on the rendering data to obtain rendering data of the virtual image.
According to a seventh aspect, there is provided a computing device comprising at least one processor and a memory, the memory storing computer program instructions, the processor being configured to execute the program in the memory so as to control a server to implement the method of the first, second, or third aspect.
According to an eighth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first, second, or third aspect.
According to the solutions of the embodiments of the invention, the rendering process of the avatar is driven by the voice information in the dialogue information, so that the avatar's actions corresponding to the voice information can be generated from that voice information, and the avatar's voice is tightly fused with the image being displayed.
In addition, because the avatar's actions are tightly integrated with the voice, the avatar can present exaggerated actions and expressions in response to voice information carrying strong tone, offering the user a more engaging interactive experience and improving both the user's participation in the human-computer interaction process and the overall user experience.
Drawings
The present invention may be better understood from the following description of specific embodiments of the invention taken in conjunction with the accompanying drawings, in which like or similar reference numerals identify like or similar features.
Fig. 1 illustrates an application scenario diagram of a dialog processing method of an avatar according to an embodiment;
FIG. 2 illustrates a flow diagram of a dialog processing method for an avatar according to one embodiment;
FIG. 3 illustrates a schematic structural diagram of determining expression information according to one embodiment;
FIG. 4 illustrates a schematic structural diagram of determining an emotion tag according to one embodiment;
FIG. 5 illustrates a block diagram of determining motion information according to one embodiment;
FIG. 6 shows a flow diagram of a data processing method according to one embodiment;
FIG. 7 is a schematic diagram illustrating a decoding flow of speech information according to one embodiment;
FIG. 8 illustrates a structural diagram of determining rendering data according to one embodiment;
FIG. 9 illustrates an interface diagram including an avatar according to one embodiment;
FIG. 10 illustrates a flow diagram of a method of processing a three-dimensional model according to one embodiment;
FIG. 11 illustrates a block diagram of a dialog processing device for an avatar, according to one embodiment;
FIG. 12 shows a block diagram of a data processing apparatus according to an embodiment;
FIG. 13 shows a block diagram of a processing device of a three-dimensional model according to one embodiment;
FIG. 14 illustrates a schematic structural diagram of a computing device, according to one embodiment.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In order to solve the problems of the prior art, embodiments of the present invention provide a dialogue processing method, apparatus, device, and storage medium for an avatar, described in detail below.
First, an application scenario of the dialogue processing method for an avatar according to an embodiment of the present invention is described with reference to fig. 1.
Fig. 1 illustrates an application scenario diagram of a dialog processing method of an avatar according to an embodiment.
As shown in fig. 1, when a merchant wants to promote a product, the merchant may want to combine the functions of an online voice-interaction robot with an avatar matching the merchant's brand image, so as to serve a large number of user clients.
In this case, the user may be guided to ask related questions by prompt information displayed on the data processing device 10 (for example, the size of a product or its promotions); alternatively, the user may directly ask the data processing device 10 the relevant question, provided the microphone array of the data processing device 10 can receive the user's query data. Alternatively, the user may wake up the avatar in the data processing device 10 with a specific instruction so that it serves them. The form of this avatar may change according to the age and/or gender of the user, for example: when the user is a child, the avatar may be a cartoon character; when the user is an adult male, the avatar may be a female character. In addition, the user may also customize a personalized avatar through the data processing device 10.
In this way, when the data processing device 10 receives the user's query information, it obtains the response text information corresponding to the query information, where the response text information includes at least one emotion tag. For example, the query information includes: "Does a certain merchant's product currently have any promotion?"; the response text information includes: "Yes, a certain product recently launched a buy-one-get-one-free promotion, very favorable!"; accordingly, the corresponding emotion tags may be "surprise", "happy", and "exaggeration".
Next, the data processing device 10 acquires action information corresponding to the response text information according to the at least one emotion tag. For example: the action information of the avatar corresponding to the emotion tag "surprise" includes making a "yeah!" gesture with the hand; the action information corresponding to the emotion tag "happy" includes the feet and legs performing at least one jumping action; and the action information corresponding to the emotion tag "exaggeration" includes performing an eyebrow-raising action at least once.
Then, the data processing device 10 converts the response text information into voice information with a playback timing, generates rendering data for the avatar that currently needs to be rendered according to the playback timing of the voice information, and renders the rendering data to display the current avatar. In this way, while watching the avatar, the user hears the voice information played in time order and sees the avatar's facial expression and body movements change in step with the voice information associated with that playback timing.
Driving the rendering process of the avatar by the voice information in the dialogue information means that the avatar's actions corresponding to the voice information can be generated from that voice information, so the avatar's voice is tightly fused with the image being displayed. In addition, because the avatar's actions are tightly integrated with the voice, the avatar can present exaggerated actions and expressions in response to voice information carrying strong tone, offering the user a more engaging interactive experience and improving both the user's participation in the human-computer interaction process and the overall user experience.
In addition, the data processing method provided by the embodiments of the invention can be applied to other scenarios that involve artificial-intelligence voice technology, such as devices for querying traffic information at transport hub stations, transportation equipment (for example, unmanned vehicles), smart retail shopping guides in shopping malls, introductions to a particular area, map navigation, and the like.
Second, based on the scenarios described above, an embodiment of the invention provides a dialogue processing method for an avatar.
The following describes in detail the dialog processing method of the avatar provided by the embodiment of the present invention with reference to fig. 2 to 5.
Fig. 2 illustrates a flowchart of a dialog processing method of an avatar according to an embodiment.
As shown in fig. 2, the method may include steps 210 to 230:
First, in step 210, dialogue information of the current avatar is determined, where the dialogue information includes voice information; next, in step 220, rendering data for the avatar that currently needs to be rendered is generated according to the playback timing of the voice information; then, in step 230, the rendering data is rendered to display the current avatar.
The above steps are described in detail below:
First, regarding step 210: before the dialogue information of the current avatar is determined, the voice information with a playback timing is determined, and before that, the user's query information needs to be obtained. There are two ways to obtain it.
Method 1: the data processing device displays at least one prompt message (for example, "the size of the product you want to ask about", "current promotions for the product", and the like), receives a preset operation by which the user selects a target prompt message from the at least one prompt message, and obtains the query data corresponding to the target prompt message according to the preset operation.
Method 2: an audio signal produced by the user is captured by the microphone array of the data processing device, the audio signal is recognized, and the query information in the audio signal is determined.
The following takes the second method as an example and describes it in detail.
When the audio signal produced by the user is acquired, it is preprocessed (noise reduction, clutter removal, frequency-domain to time-domain conversion, and the like) to obtain target audio data; the target audio data is converted into query information corresponding to the audio signal using speech recognition; and the response text information corresponding to the query information is matched using natural-language understanding.
Each keyword in the response text information is then time-marked, yielding response text information carrying word timestamps; the response text information carrying the word timestamps is converted into voice information with a playback timing, where the playback timing follows the chronological order of the word timestamps.
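To make the word-timestamping step concrete, the following TypeScript sketch shows one way the keywords of the response text could be tagged with playback times before text-to-speech conversion. The interface and function names are illustrative assumptions, not identifiers from the patent, and a real TTS engine would normally supply per-word offsets itself.

```typescript
// Hypothetical word-timestamp structure; names are assumptions, not from the patent.
interface WordStamp {
  word: string;     // one keyword of the response text
  startMs: number;  // playback start time within the utterance, in milliseconds
  endMs: number;    // playback end time within the utterance, in milliseconds
}

// Assign each keyword a time slot in playback order. A fixed duration per word
// is used only for illustration; a TTS engine would return real offsets.
function tagWordTimestamps(keywords: string[], msPerWord = 1000): WordStamp[] {
  return keywords.map((word, index) => ({
    word,
    startMs: index * msPerWord,
    endMs: (index + 1) * msPerWord,
  }));
}

// The tagged text is then handed to TTS so that the generated voice
// information follows the chronological order of the word timestamps.
const stamps = tagWordTimestamps(["yes", "buy one get one free", "very favorable"]);
console.log(stamps[0]); // { word: "yes", startMs: 0, endMs: 1000 }
```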
Next, regarding step 220: rendering data for the avatar that currently needs to be rendered is generated according to the playback timing of the voice information.
Specifically, the audio frames in the voice information are split to obtain at least one audio clip with a playback timing, where each audio clip includes the audio data and the word timestamps corresponding to that clip;
the audio data and word timestamps corresponding to each audio clip are then associated with the avatar data that currently needs to be rendered, generating the rendering data for the avatar that currently needs to be rendered.
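The split-and-associate step can be pictured with the following sketch, again under assumed type names: each timestamped audio clip is paired with the avatar data (expression and action information) that should be shown while that clip plays.

```typescript
// Assumed data shapes for the rendering data described above; not patent identifiers.
interface WordStamp { word: string; startMs: number; endMs: number; }

interface AudioClip {
  audioData: Uint8Array;    // the audio payload of this segment
  wordStamps: WordStamp[];  // the word timestamps covered by this segment
}

// Avatar data to show while the clip plays: expression information (e.g. mouth
// coordinates derived from phonemes) and action information (from emotion tags).
interface AvatarFrameData {
  expression: Record<string, [number, number]>;  // e.g. lip/corner landmark coordinates
  actions: string[];                             // e.g. ["raise-eyebrows"]
}

interface RenderingData { clip: AudioClip; avatar: AvatarFrameData; }

// Associate each clip with the avatar data that currently needs to be rendered.
function buildRenderingData(
  clips: AudioClip[],
  avatarFor: (clip: AudioClip) => AvatarFrameData,
): RenderingData[] {
  return clips.map((clip) => ({ clip, avatar: avatarFor(clip) }));
}
```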
Based on this, the avatar data must be determined before the rendering data is generated. In one possible embodiment, the avatar data includes facial expression information.
The embodiment of the invention provides a mode for determining expression information, which is specifically as follows:
identifying a phoneme characteristic of each audio fragment in the speech information;
and acquiring expression information having a preset association relation with the phoneme characteristics according to the phoneme characteristics.
Specifically, identifying the phoneme characteristics of each keyword in the voice information;
and obtaining expression information (such as mouth data) having a preset association relation with the phoneme characteristics according to the phoneme characteristics.
For example, as shown in FIG. 3, the pronunciation of "I" and "O" contains the phoneme "o"; to produce the "o" sound, the mouth must open so that the whole mouth forms a circle, so the mouth coordinates corresponding to that circular shape (upper-lip coordinate A, left mouth-corner coordinate B, lower-lip coordinate C, and right mouth-corner coordinate D) are determined as the mouth-shape data for "I" and "O". If the word for "very" is spoken next, its pronunciation contains the phoneme "e"; the mouth shape for "e" is smaller than that for "o", so the circular mouth shape contracts into an elliptical one, yielding elliptical mouth coordinates (upper-lip coordinate A1, left mouth-corner coordinate B1, lower-lip coordinate C1, and right mouth-corner coordinate D1), and these reduced mouth coordinates are determined as the mouth-shape data for "very". In this way, the mouth coordinates change with the pronunciation of each character, and the mouth coordinates corresponding to each character are determined as its mouth-shape data.
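A minimal sketch of the phoneme-to-mouth-shape lookup described above follows; the coordinate values and table entries are invented placeholders, and a production system would interpolate between shapes rather than switch abruptly.

```typescript
// Mouth-shape data as four landmark coordinates (A, B, C, D in the text above).
interface MouthShape {
  upperLip: [number, number];     // coordinate A
  leftCorner: [number, number];   // coordinate B
  lowerLip: [number, number];     // coordinate C
  rightCorner: [number, number];  // coordinate D
}

// Preset association between phoneme features and mouth shapes:
// "o" opens into a round mouth, "e" into a smaller elliptical one.
const MOUTH_SHAPES: Record<string, MouthShape> = {
  o: { upperLip: [0, 10], leftCorner: [-10, 0], lowerLip: [0, -10], rightCorner: [10, 0] },
  e: { upperLip: [0, 5], leftCorner: [-8, 0], lowerLip: [0, -5], rightCorner: [8, 0] },
};

// Return the expression information associated with a recognized phoneme feature,
// or undefined if no preset association exists.
function mouthShapeFor(phoneme: string): MouthShape | undefined {
  return MOUTH_SHAPES[phoneme];
}
```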
In another possible embodiment, the avatar data may also include action information. An embodiment of the invention determines the action information as follows:
searching the feature vocabulary library for the feature word corresponding to each keyword in the response text information;
determining the emotion tag corresponding to the feature word as the emotion tag of the response text information;
and obtaining the action information corresponding to the voice information according to the emotion tag.
As noted in step 210, an emotion tag capable of representing the response text information is matched to it. An embodiment of the invention obtains the emotion tag corresponding to the response text information as follows:
splitting the response text data into at least one keyword by using natural semantics;
searching a characteristic vocabulary corresponding to each keyword in a characteristic vocabulary base;
and determining the emotion label corresponding to the characteristic vocabulary as the emotion label of the response text information.
In one possible example, when the response text contains only one keyword, that keyword may correspond to at least one feature word, and the emotion tags corresponding to those feature words are used as the emotion tags of the response text information.
For example: the response text is split into the single keyword "great"; the feature word "like" corresponding to "great" is found in the feature vocabulary library, and the emotion tags "happy" and "encouragement" corresponding to "like" are determined as the emotion tags of the response text information.
In another possible example, the response text may include a plurality of keywords; the feature word of each of these keywords is looked up in the feature vocabulary library, and the emotion tag corresponding to each feature word is determined as an emotion tag of the response text information.
For example, as shown in fig. 4, the response text information includes: "Yes, a certain product recently launched a buy-one-get-one-free promotion, very favorable!" The response text is split into the 3 keywords "yes", "buy one get one free", and "very favorable"; the feature vocabulary library is queried, giving the feature word "affirmation" for "yes", "quantity" for "buy one get one free", and "strong tone" for "very favorable"; then the emotion tag "happy" corresponding to "affirmation", the emotion tag "surprise" corresponding to "quantity", and the emotion tag "exaggeration" corresponding to "strong tone" are determined as the emotion tags of the response text information.
It should be noted that, in this embodiment of the invention, one keyword may correspond to at least one feature word, and each feature word may in turn correspond to at least one emotion tag. A lookup of this kind is sketched below.
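The following sketch illustrates that two-level lookup, keyword to feature word to emotion tag; the table contents mirror the example above but are otherwise invented, and the names are not from the patent.

```typescript
// Feature vocabulary library: each keyword may map to one or more feature words.
const FEATURE_WORDS: Record<string, string[]> = {
  "yes": ["affirmation"],
  "buy one get one free": ["quantity"],
  "very favorable": ["strong tone"],
};

// Each feature word may in turn map to one or more emotion tags.
const EMOTION_TAGS: Record<string, string[]> = {
  "affirmation": ["happy"],
  "quantity": ["surprise"],
  "strong tone": ["exaggeration"],
};

// Collect the emotion tags of the response text from its keywords.
function emotionTagsFor(keywords: string[]): string[] {
  const tags = new Set<string>();
  for (const keyword of keywords) {
    for (const feature of FEATURE_WORDS[keyword] ?? []) {
      for (const tag of EMOTION_TAGS[feature] ?? []) tags.add(tag);
    }
  }
  return [...tags];
}

console.log(emotionTagsFor(["yes", "buy one get one free", "very favorable"]));
// ["happy", "surprise", "exaggeration"]
```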
After the at least one emotion tag is obtained through the steps above, action information corresponding to each of the emotion tags is also obtained so that the avatar fits the response text information more closely. An emotion tag may additionally indicate the tone intensity of the response text information, and that intensity can be reflected in the magnitude of the action the avatar performs, via the action information and/or the expression information.
Based on the obtained emotion tags, an association list relating emotion tags to action information is queried, and the action information corresponding to the voice information is obtained according to the preset associations between emotion tags and action information in that list.
For example, continuing the example above and as shown in fig. 5, the action information corresponding to the emotion tag "surprise" includes making a "yeah!" gesture with the hand; the action information corresponding to the emotion tag "happy" includes the feet and legs performing at least one jumping action; and the action information corresponding to the emotion tag "exaggeration" includes performing an eyebrow-raising action at least once.
It should be noted that, in this embodiment of the invention, each emotion tag may correspond to a single action, for example the emotion tag "surprise" corresponding to the hand making a "yeah!" gesture; alternatively, each emotion tag may correspond to a set of actions, for example the emotion tag "happy" corresponding to the feet and legs performing at least one jumping action. A possible association list is sketched below.
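A possible shape for that association list is sketched below; the action identifiers are placeholders chosen to match the example, not names defined by the patent.

```typescript
// Association list between emotion tags and action information. A tag may map to
// a single action or to a set of actions.
const ACTION_LIST: Record<string, string[]> = {
  surprise: ["hand-yeah-gesture"],   // make a "yeah!" gesture with the hand
  happy: ["feet-legs-jump"],         // feet and legs perform at least one jump
  exaggeration: ["raise-eyebrows"],  // perform an eyebrow-raising action at least once
};

// Obtain the action information corresponding to the voice information's emotion tags.
function actionsFor(emotionTags: string[]): string[] {
  return emotionTags.flatMap((tag) => ACTION_LIST[tag] ?? []);
}
```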
Rendering data for the avatar that currently needs to be rendered can therefore be generated from the obtained audio data, the word timestamps corresponding to each audio clip, and the expression information and action information.
The audio data and the word timestamps corresponding to each audio clip are associated with the avatar data that currently needs to be rendered, generating the rendering data for the avatar that currently needs to be rendered.
Specifically, the display characteristics of the three-dimensional model corresponding to the expression information are obtained from the expression information, and the display characteristics of the three-dimensional model corresponding to the action information are obtained from the action information;
the audio data, the word timestamps corresponding to each audio clip, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information are then associated to generate the rendering data for the avatar that currently needs to be rendered.
Then, regarding step 230: the rendering data is rendered to display the current avatar.
The rendering data is 3D-rendered by the rendering model to obtain the 3D data of the current avatar to be displayed.
For example, the rendering data may be 3D-rendered with three.js (ThreeJS) to obtain the 3D data of the current avatar to be displayed.
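As a rough sketch of what the three.js rendering step could look like, the snippet below loads an avatar model and renders it in an animation loop; the model path, camera placement, and the idea of driving body actions from an AnimationMixer are assumptions for illustration rather than details taken from the patent.

```typescript
// Minimal three.js setup for displaying the avatar's 3D data (illustrative only).
import * as THREE from "three";
import { GLTFLoader } from "three/examples/jsm/loaders/GLTFLoader.js";

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(45, window.innerWidth / window.innerHeight, 0.1, 100);
camera.position.set(0, 1.5, 3);

const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

scene.add(new THREE.AmbientLight(0xffffff, 0.8));

// "avatar.glb" is a placeholder asset name, not a file referenced by the patent.
new GLTFLoader().load("avatar.glb", (gltf) => {
  scene.add(gltf.scene);
  const mixer = new THREE.AnimationMixer(gltf.scene);
  const clock = new THREE.Clock();
  // Body actions and mouth-shape morphs would be triggered here in sync with
  // the word timestamps of the audio clip currently being played.
  renderer.setAnimationLoop(() => {
    mixer.update(clock.getDelta());
    renderer.render(scene, camera);
  });
});
```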
In this embodiment of the invention, the rendering process of the avatar can be driven by the voice information in the dialogue information, so that the avatar's actions corresponding to the voice information can be generated from that voice information, and the avatar's voice is tightly fused with the image being displayed.
In addition, because the avatar's actions are tightly integrated with the voice, the avatar can present exaggerated actions and expressions in response to voice information carrying strong tone, offering the user a more engaging interactive experience and improving both the user's participation in the human-computer interaction process and the overall user experience.
Third, based on the contents of fig. 2 to 5, an embodiment of the invention provides a concrete data processing method describing how the avatar is generated.
The data processing method provided by the embodiment of the invention is described in detail below with reference to fig. 6.
FIG. 6 shows a flow diagram of a data processing method according to one embodiment.
First, in step 610, response text information corresponding to the received query information is obtained, where the response text information includes at least one emotion tag;
then, in step 620, action information corresponding to the response text information is obtained according to the emotion tag;
next, in step 630, the response text information is converted into voice information played in time order;
in step 640, the action information and the expression information are associated with the voice information played in time order, and the rendering data is determined, where the expression information is derived from the voice information;
finally, in step 650, the rendering data is 3D-rendered to obtain the rendering data of the avatar.
The above steps are described in detail below:
First, regarding step 610: before the response text information is determined, the query information needs to be obtained. The embodiment of the invention provides the following two ways of obtaining the query information raised by the user.
Method 1: the data processing device displays at least one prompt message (for example, "the size of the product you want to ask about", "current promotions for the product", and the like), receives a preset operation by which the user selects a target prompt message from the at least one prompt message, and obtains the query information corresponding to the target prompt message according to the preset operation.
Method 2: an audio signal produced by the user is captured by the microphone array of the data processing device, the audio signal is recognized, and the query information in the audio signal is determined.
The following takes the second method as an example and describes it in detail.
When the audio signal produced by the user is acquired, it is preprocessed (noise reduction, clutter removal, frequency-domain to time-domain conversion, and the like) to obtain target audio data; the target audio data is converted into text data corresponding to the audio signal using speech recognition; and the response text information corresponding to the text data is matched using natural-language understanding.
After the response text information is determined, an emotion tag capable of representing it is matched according to the response text information. An embodiment of the invention obtains the emotion tag corresponding to the response text information as follows:
splitting the response text information into at least one keyword by utilizing natural semantics;
searching a characteristic vocabulary of each keyword in at least one keyword in a characteristic vocabulary library;
and determining the emotion label corresponding to the characteristic vocabulary as the emotion label of the response text information.
In one possible example, when the response text contains only one keyword, that keyword may correspond to at least one feature word, and the emotion tags corresponding to those feature words are used as the emotion tags of the response text information.
For example: the response text information includes: "Hello, how can I help you?" The response text is split into the keyword "hello"; the feature word "smile" corresponding to "hello" is found in the feature vocabulary library, and the emotion tags "happy" and "lovely" corresponding to "smile" are determined as the emotion tags of the response text information.
In another possible example, the response text may include a plurality of keywords; the feature word of each of these keywords is looked up in the feature vocabulary library, and the emotion tag corresponding to each feature word is determined as an emotion tag of the response text information.
For example, referring to fig. 4, the response text information includes: "Yes, a certain product recently launched a buy-one-get-one-free promotion, very favorable!" The response text is split into the 3 keywords "yes", "buy one get one free", and "very favorable"; the feature vocabulary library is queried, giving the feature word "affirmation" for "yes", "quantity" for "buy one get one free", and "strong tone" for "very favorable"; then the emotion tag "happy" corresponding to "affirmation", the emotion tag "surprise" corresponding to "quantity", and the emotion tag "exaggeration" corresponding to "strong tone" are determined as the emotion tags of the response text information.
It should be noted that a keyword according to the embodiment of the present invention may correspond to at least one feature vocabulary, and each feature vocabulary may also correspond to at least one emotion tag.
After the at least one emotion tag is obtained through the steps above, action information corresponding to each of the emotion tags is also obtained so that the avatar fits the response text information more closely. An emotion tag may additionally indicate the tone intensity of the response text information, and that intensity can be reflected in the magnitude of the action the avatar performs via the action information. This is described in detail in connection with step 620.
Next, regarding step 620: in one possible embodiment, an association list relating emotion tags to action information is obtained, and the action information corresponding to the response text information is obtained according to the preset associations between emotion tags and action information in that list.
For example, continuing the example above and referring to fig. 5, the action information of the avatar corresponding to the emotion tag "surprise" includes making a "yeah!" gesture with the hand; the action information corresponding to the emotion tag "happy" includes the feet and legs performing at least one jumping action; and the action information corresponding to the emotion tag "exaggeration" includes performing an eyebrow-raising action at least once.
It should be noted that, in this embodiment of the invention, each emotion tag may correspond to a single action, for example the action information of the avatar corresponding to the emotion tag "surprise" including the hand making a "yeah!" gesture; alternatively, each emotion tag may correspond to a set of actions, for example the action information corresponding to the emotion tag "happy" including the feet and legs performing at least one jumping action.
At this time, the action information corresponding to the response text information has been determined according to the response text information, and then, the response text information may be converted into voice information to be played so as to perform voice interaction with the user.
Next, regarding step 630: in this embodiment of the invention, the voice information includes streaming audio data and word timestamps. The word timestamps mark the start time and end time of each audio segment in the streaming audio data, as well as the start time and end time of the streaming audio data as a whole. Voice information played in time order can thus be characterized by streaming audio data that carries word timestamps.
For example: the response text "Yes, a certain product recently launched a buy-one-get-one-free promotion, very favorable!" is converted into voice information in which playback starts at 10:00, one word is played per second (punctuation marks are counted as words), and playback ends 0.4 minutes past 10:00.
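One possible layout for streaming audio data carrying word timestamps is sketched below; the field names are assumptions made for illustration.

```typescript
// Streaming audio data with per-word timestamps plus overall start/end times.
interface TimedWord {
  text: string;       // the word (punctuation marks are counted as words)
  startMs: number;    // start time of this word's audio
  endMs: number;      // end time of this word's audio
  audio: Uint8Array;  // audio payload for this word
}

interface StreamingAudio {
  streamStartMs: number;  // start time of the whole stream
  streamEndMs: number;    // end time of the whole stream
  words: TimedWord[];     // word-level segments in playback order
}
```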
In order to put the first time stream (the voice information played in time order), the second time stream (the action information obtained in step 620), and the third time stream (the expression information determined from the voice information) on the same time baseline, the audio response data needs to be decoded before step 640 to obtain the audio data and word timestamps, as follows:
splitting an audio frame in the voice information to obtain at least one audio clip played according to a time sequence;
and decoding each audio segment in the at least one audio segment to obtain audio data and word time stamps in each audio segment.
For example, as shown in FIG. 7, the voice information "Yes, a certain product recently launched a buy-one-get-one-free promotion, very favorable!" is split into 4 audio segments:
Audio segment 1: word timestamp: 10:00; audio data: ("Yes");
Audio segment 2: word timestamp: 0.05 minutes past 10:00; audio data: ("a certain product recently launched");
Audio segment 3: word timestamp: 0.17 minutes past 10:00; audio data: ("a buy-one-get-one-free promotion");
Audio segment 4: word timestamp: 0.3 minutes past 10:00; audio data: ("very favorable").
The 4 audio segments are decoded in turn, yielding the audio data and word timestamps of each audio segment.
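Decoding the segments in turn might look like the sketch below; the codec is passed in as a function because the patent does not name a specific audio format, and the listed start times mirror the example above (decimal minutes past 10:00).

```typescript
// Decode each audio segment in playback order to recover its audio data and text.
interface EncodedSegment { startTime: string; payload: Uint8Array; }
interface DecodedSegment { startTime: string; text: string; samples: Float32Array; }

type Codec = (payload: Uint8Array) => { text: string; samples: Float32Array };

function decodeSegments(segments: EncodedSegment[], decode: Codec): DecodedSegment[] {
  // Decoding in turn keeps the word timestamps in ascending playback order.
  return segments.map(({ startTime, payload }) => ({ startTime, ...decode(payload) }));
}

// The four segments from the example (payloads omitted for brevity).
const segments: EncodedSegment[] = [
  { startTime: "10:00.00", payload: new Uint8Array() }, // "Yes"
  { startTime: "10:00.05", payload: new Uint8Array() }, // "a certain product recently launched"
  { startTime: "10:00.17", payload: new Uint8Array() }, // "a buy-one-get-one-free promotion"
  { startTime: "10:00.30", payload: new Uint8Array() }, // "very favorable"
];
```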
Further, regarding step 640: before the rendering data is determined, in one possible embodiment the method may further include obtaining the expression information from the voice information:
Identifying the phoneme characteristics of each keyword in the voice information;
and obtaining expression information having a preset association relation with the phoneme characteristics according to the phoneme characteristics.
For example, referring to fig. 3, the pronunciation of "I" contains the phoneme "o"; to produce the "o" sound, the mouth must open so that the whole mouth forms a circle, so the mouth coordinates corresponding to that circular shape (upper-lip coordinate A, left mouth-corner coordinate B, lower-lip coordinate C, and right mouth-corner coordinate D) are determined as the expression information for "I". If the word for "very" is spoken next, its pronunciation contains the phoneme "e"; the mouth shape for "e" is smaller than that for "o", so the circular mouth shape contracts into an elliptical one, yielding elliptical mouth coordinates (upper-lip coordinate A1, left mouth-corner coordinate B1, lower-lip coordinate C1, and right mouth-corner coordinate D1), and these reduced mouth coordinates are determined as the expression information for "very". In this way, the mouth coordinates change with the pronunciation of each character, and the mouth coordinates corresponding to each character are determined as its expression information.
In one possible embodiment, the rendering data is determined by associating the action information, the expression information, the audio data, and the word timestamps.
Specifically, the display characteristics of the three-dimensional model corresponding to the expression information are obtained from the expression information, and the display characteristics of the three-dimensional model corresponding to the action information are obtained from the action information;
the display characteristics of the three-dimensional model corresponding to the expression information and the display characteristics of the three-dimensional model corresponding to the action information are each associated with the word timestamps in each audio clip, yielding associated data;
the rendering data is then determined by associating the associated data with the audio data in each audio clip.
Taking audio segment 4 as an example of this step: as shown in fig. 8, the word timestamp of audio segment 4 (0.3 minutes past 10:00) and its audio data ("very favorable") are obtained.
The first 3D driving model obtains, from the expression information, the display characteristics of the three-dimensional model corresponding to the expression information; the second 3D driving model obtains, from the action information, the display characteristics of the three-dimensional model corresponding to the action information.
Each word timestamp in audio segment 4, together with the segment's start time of 0.3 minutes past 10:00, is associated with the display characteristics of the three-dimensional model corresponding to the expression information; each word timestamp in audio segment 4, together with the segment's start time, is likewise associated with the display characteristics of the three-dimensional model corresponding to the action information.
The audio data in audio segment 4 then associates the display characteristics of the three-dimensional model corresponding to the expression information with those corresponding to the action information, so that while the avatar utters the audio data ("very favorable") starting at 0.3 minutes past 10:00, its expression changes in step with each word, and the avatar performs an eyebrow-raising action at least once during this time.
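The association performed for audio segment 4 can be pictured as building two small time-aligned tracks, one for expression display features and one for action display features; the sketch below uses assumed names and millisecond offsets for illustration.

```typescript
// A display feature scheduled at an absolute playback time.
interface TimedFeature { atMs: number; feature: string; }

// Associate a segment's word timestamps with expression display features (one per
// word) and anchor an action display feature to the segment's start time.
function associateSegment(
  segmentStartMs: number,
  wordOffsetsMs: number[],  // per-word offsets within the segment
  mouthShapes: string[],    // expression display feature for each word
  segmentAction: string,    // action display feature for the whole segment
): { expressionTrack: TimedFeature[]; actionTrack: TimedFeature[] } {
  const expressionTrack = wordOffsetsMs.map((offsetMs, i) => ({
    atMs: segmentStartMs + offsetMs,
    feature: mouthShapes[i],
  }));
  const actionTrack = [{ atMs: segmentStartMs, feature: segmentAction }];
  return { expressionTrack, actionTrack };
}
```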
Then, regarding step 650: in one possible embodiment, in addition to the voice information played in time order and the action information and expression information associated with it as shown above, the rendering data may also include text data associated with at least one of the following: the response text information, the voice information, or the query information.
For example, as shown in fig. 9, to ensure the accuracy of the voice interaction and improve its efficiency, when the avatar is displayed by the data processing device, a text prompt associated with the response text information and/or the voice information may also be displayed (for example, "Wow, congratulations, you answered correctly, you're great! Here is another riddle"), ensuring that the user can find the answer matching the query information in the text prompt and the voice information.
Alternatively, a text prompt associated with the query information may be displayed, ensuring that the user finds the desired answer data, and data associated with the current or next query information, in at least one of the voice information and the text information.
The 3D audio-visual data can then be displayed by the data processing device, so that while watching the avatar the user hears the voice information played in time order and sees the avatar's body and/or facial expression change according to the action information and expression information associated with that voice information. The avatar's body language and/or face can thus provide a relatively realistic interactive scene for the user, improving the user's participation in the human-computer interaction process.
In addition, when the avatar replies to the user's question, the action information and/or expression information can present exaggerated actions and expressions according to the tone intensity indicated by the emotion tag, providing the user with a more engaging interactive experience and enhancing the user experience while improving user participation.
Fourth, based on the contents of fig. 2 to 5, an embodiment of the invention provides a processing method that embodies the avatar as a three-dimensional model. The processing method of the three-dimensional model provided by the embodiment of the invention is described in detail below with reference to fig. 10.
FIG. 10 shows a flow diagram of a method of processing a three-dimensional model according to one embodiment.
First, in step 1010, display information of the current three-dimensional model is determined, where the display information includes voice information and expression information; then, in step 1020, first rendering data for the three-dimensional character model that currently needs to be rendered is generated according to the playback timing of the voice information and the display characteristics of the three-dimensional model corresponding to the expression information; then, in step 1030, the rendering data is rendered to display the dialogue and actions of the current three-dimensional character model.
The above steps are described in detail below:
involving step 1010: here, since the presentation information includes voice information and facial expression information. Here, how to determine the voice information and the expression information is explained, respectively.
(1) Determining the voice information with a playback timing.
Determining response text information corresponding to the query information according to the received query information;
time marking is carried out on each keyword in the response text information respectively, and response text information with word time stamps is obtained;
converting the response text information carrying the word time stamp into voice information with a playing time sequence; wherein, the playing time sequence is according to the time arrangement sequence of the word time stamps.
(2) Determining the expression information.
Identifying a phoneme characteristic of each audio fragment in the speech information;
and obtaining expression information having a preset association relation with the phoneme characteristics according to the phoneme characteristics.
In addition, in one possible embodiment, the display information may also include action information. The embodiment of the invention may determine the action information through the following steps:
searching a characteristic vocabulary of each keyword in the response text information in a characteristic vocabulary base;
determining the emotion label corresponding to the characteristic vocabulary as an emotion label for responding text information;
and obtaining action information corresponding to the voice information according to the emotion label.
Accordingly, the method may further include: obtaining the display characteristics of the three-dimensional model corresponding to the action information from the action information; and generating second rendering data for the avatar that currently needs to be rendered from the audio data, the word timestamps corresponding to each audio clip, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information.
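A compact sketch of assembling that second rendering data is shown below; the structure and names are assumptions, intended only to show the four inputs being carried together.

```typescript
// Second rendering data: audio, word timestamps, and the display characteristics of
// the three-dimensional model for both expression and action information.
interface SecondRenderingData {
  audioData: Uint8Array;
  wordStampsMs: number[];
  expressionFeatures: string[];  // from the phoneme-driven expression information
  actionFeatures: string[];      // from the emotion-tag-driven action information
}

function buildSecondRenderingData(
  audioData: Uint8Array,
  wordStampsMs: number[],
  expressionFeatures: string[],
  actionFeatures: string[],
): SecondRenderingData {
  return { audioData, wordStampsMs, expressionFeatures, actionFeatures };
}
```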
Fifth: based on the methods provided in the above-mentioned fig. 2, 6 and 10, the following describes in detail the blocks of the dialogue processing device, the data processing device and the processing device of the three-dimensional model of the avatar in conjunction with fig. 11-13.
Fig. 11 shows a block diagram of a dialog processing device of an avatar according to an embodiment.
As shown in fig. 11, the avatar dialogue processing apparatus 110 may include:
a processing module 1101, configured to determine dialog information of a current avatar, where the dialog information includes voice information;
a generating module 1102, configured to generate rendering data of a current avatar to be rendered according to a playing timing sequence of the voice information;
and a rendering module 1103, configured to perform rendering processing on the rendering data to display the current avatar.
The processing module 1101 may be specifically configured to: receive the user's query information;
determine the response text information corresponding to the query information according to the query information;
time-mark each keyword in the response text information to obtain response text information carrying word timestamps;
and convert the response text information carrying the word timestamps into voice information with a playback timing, where the playback timing follows the chronological order of the word timestamps.
The generating module 1102 may be specifically configured to split the audio frames in the voice information to obtain at least one audio clip with a playback timing, where each audio clip includes the audio data and word timestamps corresponding to that clip;
and to associate the audio data and word timestamps corresponding to each audio clip with the avatar data that currently needs to be rendered, generating the rendering data for the avatar that currently needs to be rendered.
The dialog processing device 110 of the avatar of the embodiment of the present invention may further include a determining module 1104 for recognizing a phoneme characteristic of each audio clip in the speech information; and obtaining expression information having a preset association relation with the phoneme characteristics according to the phoneme characteristics.
The determining module 1104 may be further configured to search a feature vocabulary corresponding to each keyword in the response text information in the feature vocabulary library;
determining the emotion label corresponding to the characteristic vocabulary as the emotion label of the response text information;
and obtaining action information corresponding to the voice information according to the emotion label.
In a possible embodiment, the generating module 1102 may be specifically configured to obtain, according to the expression information, a display feature of a three-dimensional model corresponding to the expression information; acquiring display characteristics of the three-dimensional model corresponding to the action information according to the action information;
and associating the audio data, the word time stamp corresponding to the audio fragment, the display characteristics of the three-dimensional model corresponding to the expression information and the display characteristics of the three-dimensional model corresponding to the action information to generate rendering data of the virtual image to be rendered currently.
The rendering module 1103 may be specifically configured to perform 3D rendering on the rendering data through the rendering model to obtain 3D data of the current avatar to be displayed.
According to the solution of the embodiment of the invention, the rendering process of the avatar is driven by the voice information in the dialogue information, so that the avatar's actions corresponding to the voice information can be generated from that voice information, and the avatar's voice is tightly fused with the image being displayed.
In addition, because the avatar's actions are tightly integrated with the voice, the avatar can present exaggerated actions and expressions in response to voice information carrying strong tone, offering the user a more engaging interactive experience and improving both the user's participation in the human-computer interaction process and the overall user experience.
FIG. 12 shows a block diagram of a data processing apparatus according to an embodiment.
As shown in fig. 12, the data processing apparatus 120 may include:
a first obtaining module 1201, configured to obtain response text information corresponding to the query information based on the received query information, where the response text information includes at least one emotion tag;
a second obtaining module 1202, configured to obtain, according to the emotion tag, action information corresponding to the response text information;
a converting module 1203, configured to convert the response text information into voice information played according to a time sequence;
the processing module 1204 is configured to associate the action information and the expression information with the voice information played in time sequence, and to determine rendering data; wherein the expression information is obtained from the voice information;
and a rendering module 1205 for performing 3D rendering on the rendering data to obtain rendering data of the avatar.
The processing module 1204 of the embodiment of the present invention may be specifically configured to split the audio frames in the voice information to obtain at least one audio segment with a playing time sequence; decode each of the at least one audio segment to obtain the audio data and word timestamps in each audio segment;
and associate the action information, the expression information, the audio data, and the word timestamps to determine the rendering data.
Further, the processing module 1204 may be specifically configured to obtain, according to the expression information, the display characteristics of the three-dimensional model corresponding to the expression information, and to obtain, according to the action information, the display characteristics of the three-dimensional model corresponding to the action information;
respectively associate the display characteristics of the three-dimensional model corresponding to the expression information and those corresponding to the action information with the word timestamps in each audio segment to obtain associated data;
and determine the rendering data based on the association of the associated data with the audio data in each audio segment.
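One possible, purely illustrative shape for this association step: each word timestamp in a segment is bound to the expression and action display features, and the result is then bound to that segment's audio data to form the rendering data. The dictionary keys and segment layout below are assumptions, not prescribed by the patent.

```python
from typing import Dict, List


def associate_rendering_data(segments: List[Dict],
                             expression_features: Dict,
                             action_features: Dict) -> List[Dict]:
    """Bind the display features of the three-dimensional model for the expression
    and action information to the word timestamps of each audio segment, then
    associate the result with that segment's audio data to form rendering data."""
    rendering_data = []
    for segment in segments:
        associated = [{"timestamp": ts,
                       "expression": expression_features,
                       "action": action_features}
                      for ts in segment["word_timestamps"]]
        rendering_data.append({"audio": segment["audio"], "associated": associated})
    return rendering_data
```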
FIG. 13 shows a block diagram of a processing device of a three-dimensional model according to one embodiment.
As shown in fig. 13, the processing device 130 of the three-dimensional model may include:
the processing module 1301 is configured to determine display information of the current three-dimensional model, where the display information includes voice information and expression information;
the generating module 1302 is configured to generate first rendering data of a three-dimensional image model to be rendered currently according to a playing time sequence of the voice information and a display characteristic of the three-dimensional model corresponding to the expression information;
and the rendering module 1303 is used for rendering the rendering data to display the dialog and the action of the current three-dimensional image model.
The processing module 1301 of the embodiment of the present invention may be specifically configured to determine, according to the received query information, response text information corresponding to the query information;
time-stamp each keyword in the response text information to obtain response text information carrying word timestamps;
and convert the response text information carrying the word timestamps into voice information with a playing time sequence, where the playing time sequence follows the chronological order of the word timestamps.
Further, the processing module 1301 may be specifically configured to identify the phoneme features of each audio segment in the voice information, and to obtain, according to the phoneme features, expression information having a preset association with those features.
The processing device 130 of the three-dimensional model according to the embodiment of the present invention may further include a determining module 1304 configured to search the feature vocabulary library for the characteristic vocabulary of each keyword in the response text information;
determine the emotion label corresponding to the characteristic vocabulary as the emotion label of the response text information;
and obtain, according to the emotion label, the action information corresponding to the voice information.
The generating module 1302 of the embodiment of the present invention may be further configured to obtain, according to the action information, the display characteristics of the three-dimensional model corresponding to the action information;
and to generate second rendering data of the avatar currently to be rendered according to the audio data, the word timestamps corresponding to the audio segments, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information.
In summary, the apparatus according to the embodiment of the present invention can display 3D audio/video data through the data processing apparatus, so that the user sees the avatar while the voice information is played in time sequence, and the avatar's body and/or facial expression changes according to the action information and expression information associated with the voice information being played. The avatar's body language and/or facial expression can provide a relatively realistic interactive scene for the user, thereby increasing the user's degree of participation in the human-computer interaction process.
In addition, when the avatar replies to the user's question, the action information and/or expression information can present exaggerated actions and expressions according to the intensity of tone represented by the emotion tag, providing a more engaging interactive experience and enhancing the user experience while increasing user participation.
Sixth: based on the methods provided in fig. 1-10 above, a computing device for the above method and apparatus is described in detail below in conjunction with fig. 14.
FIG. 14 illustrates a schematic structural diagram of a computing device, according to one embodiment.
FIG. 14 shows a block diagram of an exemplary hardware architecture of a computing device capable of implementing the data processing methods and apparatuses according to the embodiments of the present invention.
The device may include a processor 1401 and a memory 1402 storing computer program instructions.
Specifically, the processor 1401 may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
Memory 1402 may include mass storage for data or instructions. By way of example, and not limitation, memory 1402 may include a Hard Disk Drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive or a combination of two or more of these. Memory 1402 may include removable or non-removable (or fixed) media, where appropriate. Memory 1402 may be internal or external to the integrated gateway device, where appropriate. In a particular embodiment, the memory 1402 is a non-volatile solid-state memory. In certain embodiments, memory 1402 comprises Read Only Memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory, or a combination of two or more of these.
The processor 1401 implements any of the methods of fig. 1-10 in the above-described embodiments by reading and executing computer program instructions stored in the memory 1402.
The transceiver 1403 is mainly used for communication between the apparatuses in the embodiments of the present invention and other devices.
In one example, the device can also include a bus 1404. As shown in fig. 14, the processor 1401, the memory 1402, and the transceiver 1403 are connected via a bus 1404 and communicate with each other.
The bus 1404 includes hardware, software, or both. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), another suitable bus, or a combination of two or more of these. Bus 1404 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the present application, any suitable buses or interconnects are contemplated.
In one possible embodiment, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method steps shown in fig. 1-10 of an embodiment of the present invention.
It is to be understood that the invention is not limited to the particular arrangements and instrumentality described in the above embodiments and shown in the drawings. For convenience and brevity of description, detailed descriptions of known methods are omitted here; for the specific working processes of the systems, modules, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
It will be apparent to those skilled in the art that the method processes of the present invention are not limited to the specific steps described and illustrated. Various changes, modifications, additions, equivalent substitutions, and reorderings of steps that fall within the technical scope of the present disclosure, and that can be made by those skilled in the art after appreciating the spirit of the present invention, are intended to be covered by the scope of the present invention.

Claims (16)

1. A dialog processing method of an avatar, comprising:
determining dialog information of a current avatar, wherein the dialog information comprises voice information;
generating rendering data of the avatar currently to be rendered according to the playing time sequence of the voice information;
rendering the rendering data to display the current avatar;
wherein generating the rendering data of the avatar currently to be rendered according to the playing time sequence of the voice information comprises: splitting the audio frames in the voice information to obtain at least one audio segment with a playing time sequence, wherein the audio segment comprises audio data and word timestamps corresponding to the audio segment; and associating the audio data and the word timestamps corresponding to the audio segment with the avatar data currently to be rendered, to generate the rendering data of the avatar currently to be rendered; wherein the avatar data includes expression information and action information, and associating the audio data and the word timestamps corresponding to the audio segment with the avatar data currently to be rendered to generate the rendering data of the avatar currently to be rendered comprises:
obtaining, according to the expression information, display characteristics of a three-dimensional model corresponding to the expression information, and obtaining, according to the action information, display characteristics of the three-dimensional model corresponding to the action information;
and associating the audio data, the word timestamps corresponding to the audio segments, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information, to generate the rendering data of the avatar currently to be rendered, so that while the avatar utters the audio data the avatar's expression changes in correspondence with each word in the audio data, and during the period in which the audio data is uttered the avatar performs the corresponding action information.
2. The method of claim 1, wherein determining dialog information for a current avatar comprises:
receiving inquiry information of a user;
determining response text information corresponding to the inquiry information according to the inquiry information;
time-stamping each keyword in the response text information to obtain response text information carrying word timestamps;
converting the response text information carrying the word timestamps into voice information with a playing time sequence; wherein the playing time sequence follows the chronological order of the word timestamps.
3. The method of claim 1, wherein the method further comprises:
identifying a phoneme feature of each audio segment in the speech information;
and acquiring the expression information with a preset association relation with the phoneme characteristics according to the phoneme characteristics.
4. The method of claim 2, wherein the method further comprises:
searching a characteristic vocabulary corresponding to each keyword in the response text information in a characteristic vocabulary bank;
determining the emotion label corresponding to the characteristic vocabulary as the emotion label of the response text information;
and obtaining action information corresponding to the voice information according to the emotion label.
5. The method of claim 1, wherein rendering the rendering data comprises:
performing 3D rendering on the rendering data through a rendering model to obtain 3D data of the avatar currently to be displayed.
6. A method of processing a three-dimensional model, comprising:
determining display information of a current three-dimensional model, wherein the display information comprises voice information, expression information and action information; generating rendering data of the three-dimensional image model needing to be rendered at present according to the playing time sequence of the voice information and the display characteristics of the three-dimensional model corresponding to the expression information;
rendering the rendering data to display the dialogue and the action of the current three-dimensional image model;
wherein generating the rendering data of the three-dimensional image model currently to be rendered according to the playing time sequence of the voice information and the display characteristics of the three-dimensional model corresponding to the expression information comprises the following steps:
splitting audio frames in the voice information to obtain at least one audio clip with a playing time sequence, wherein the audio clip comprises: audio data and word timestamps corresponding to the audio segments;
acquiring display characteristics of a three-dimensional model corresponding to the expression information according to the expression information, and acquiring display characteristics of the three-dimensional model corresponding to the action information according to the action information;
associating the audio data, the word timestamps corresponding to the audio segments, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information, to generate the rendering data of the avatar currently to be rendered, so that while the avatar utters the audio data the avatar's expression changes in correspondence with each word in the audio data, and during the period in which the audio data is uttered the avatar performs the corresponding action information.
7. The method of claim 6, wherein determining presentation information for the current three-dimensional model comprises:
determining response text information corresponding to the query information according to the received query information;
time-stamping each keyword in the response text information to obtain response text information carrying word timestamps;
converting the response text information carrying the word timestamps into voice information with a playing time sequence; wherein the playing time sequence follows the chronological order of the word timestamps.
8. The method of claim 6, wherein the method further comprises:
identifying a phoneme feature of each audio segment in the speech information; and acquiring the expression information with a preset association relation with the phoneme characteristics according to the phoneme characteristics.
9. The method of claim 7, wherein the method further comprises:
searching a characteristic vocabulary of each keyword in the response text information in a characteristic vocabulary base;
determining the emotion label corresponding to the characteristic vocabulary as the emotion label of the response text information;
and obtaining action information corresponding to the voice information according to the emotion label.
10. A data processing method, comprising:
acquiring response text information corresponding to the query information based on the received query information, wherein the response text information comprises at least one emotion tag;
acquiring action information corresponding to the response text information according to the emotion label;
converting the response text information into voice information played according to time sequence;
associating the action information and the expression information with the voice information played in time sequence to determine rendering data; wherein the expression information is obtained from the voice information;
performing 3D rendering on the rendering data to obtain rendering data of an avatar;
wherein the associating the action information and the expression information with the voice information played in time sequence to determine rendering data includes:
splitting an audio frame in the voice information to obtain at least one audio clip with a playing time sequence; decoding each audio segment of the at least one audio segment to obtain audio data and a word time stamp in each audio segment;
acquiring display characteristics of a three-dimensional model corresponding to the expression information according to the expression information, and acquiring display characteristics of the three-dimensional model corresponding to the action information according to the action information;
associating the audio data, the word timestamps corresponding to the audio segments, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information, to generate the rendering data, so that while the avatar utters the audio data the avatar's expression changes in correspondence with each word in the audio data, and during the period in which the audio data is uttered the avatar performs the corresponding action information.
11. The method of claim 10, wherein associating the audio data, the word timestamps corresponding to the audio segments, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information to generate the rendering data comprises:
respectively associating the display characteristics of the three-dimensional model corresponding to the expression information and the display characteristics of the three-dimensional model corresponding to the action information with the word time stamp in each audio clip to obtain associated data;
and determining rendering data according to the association of the associated data and the audio data in each audio fragment.
12. A dialog processing device of an avatar, comprising:
the processing module is used for determining the dialogue information of the current virtual image, and the dialogue information comprises voice information;
the generating module is used for generating rendering data of the virtual image to be rendered currently according to the playing time sequence of the voice information;
the rendering module is used for rendering the rendering data to display the current avatar; wherein the avatar data comprises expression information and action information;
the generation module is specifically configured to:
splitting an audio frame in the voice information to obtain at least one audio clip with a playing time sequence; the audio clip includes: audio data and word timestamps corresponding to the audio segments;
acquiring display characteristics of a three-dimensional model corresponding to the expression information according to the expression information, and acquiring display characteristics of the three-dimensional model corresponding to the action information according to the action information;
associating the audio data, the word timestamps corresponding to the audio segments, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information, to generate the rendering data of the avatar currently to be rendered, so that while the avatar utters the audio data the avatar's expression changes in correspondence with each word in the audio data, and during the period in which the audio data is uttered the avatar performs the corresponding action information.
13. A processing apparatus of a three-dimensional model, comprising:
the processing module is used for determining display information of the current three-dimensional model, and the display information comprises voice information, expression information and action information;
the generation module is used for generating rendering data of a three-dimensional image model needing to be rendered currently according to the playing time sequence of the voice information and the display characteristics of the three-dimensional model corresponding to the expression information;
the rendering module is used for rendering the rendering data so as to display the conversation and the action of the current three-dimensional image model;
the generation module is specifically configured to:
splitting audio frames in the voice information to obtain at least one audio clip with a playing time sequence, wherein the audio clip comprises: audio data and word timestamps corresponding to the audio segments;
acquiring display characteristics of a three-dimensional model corresponding to the expression information according to the expression information, and acquiring display characteristics of the three-dimensional model corresponding to the action information according to the action information;
associating the audio data, the word timestamps corresponding to the audio segments, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information, to generate the rendering data of the avatar currently to be rendered, so that while the avatar utters the audio data the avatar's expression changes in correspondence with each word in the audio data, and during the period in which the audio data is uttered the avatar performs the corresponding action information.
14. A data processing apparatus comprising:
the first acquisition module is used for acquiring response text information corresponding to the query information based on the received query information, and the response text information comprises at least one emotion tag;
the second acquisition module is used for acquiring action information corresponding to the response text information according to the emotion label;
the conversion module is used for converting the response text information into voice information played according to time sequence;
the processing module is used for associating the action information and the expression information with the voice information played in time sequence and determining rendering data; wherein the expression information is obtained from the voice information;
the rendering module is used for performing 3D rendering on the rendering data to obtain rendering data of an avatar;
wherein, the processing module is specifically configured to:
splitting an audio frame in the voice information to obtain at least one audio clip with a playing time sequence; decoding each audio segment of the at least one audio segment to obtain audio data and a word time stamp in each audio segment;
acquiring display characteristics of a three-dimensional model corresponding to the expression information according to the expression information, and acquiring display characteristics of the three-dimensional model corresponding to the action information according to the action information;
associating the audio data, the word timestamps corresponding to the audio segments, the display characteristics of the three-dimensional model corresponding to the expression information, and the display characteristics of the three-dimensional model corresponding to the action information, to generate the rendering data, so that while the avatar utters the audio data the avatar's expression changes in correspondence with each word in the audio data, and during the period in which the audio data is uttered the avatar performs the corresponding action information.
15. A computing device, wherein the device comprises at least one processor and a memory for storing computer program instructions, the processor being configured to execute the computer program instructions stored in the memory so as to control the computing device to implement the dialogue processing method of an avatar according to any one of claims 1-5, the processing method of a three-dimensional model according to any one of claims 6-9, or the data processing method according to any one of claims 10-11.
16. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to execute the dialogue processing method of an avatar according to any one of claims 1-5, the processing method of a three-dimensional model according to any one of claims 6-9, or the data processing method according to any one of claims 10-11.
CN201910818804.5A 2019-08-30 2019-08-30 Dialogue processing method, device, equipment and storage medium of virtual image Active CN112529992B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910818804.5A CN112529992B (en) 2019-08-30 2019-08-30 Dialogue processing method, device, equipment and storage medium of virtual image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910818804.5A CN112529992B (en) 2019-08-30 2019-08-30 Dialogue processing method, device, equipment and storage medium of virtual image

Publications (2)

Publication Number Publication Date
CN112529992A CN112529992A (en) 2021-03-19
CN112529992B true CN112529992B (en) 2022-08-19

Family

ID=74974053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910818804.5A Active CN112529992B (en) 2019-08-30 2019-08-30 Dialogue processing method, device, equipment and storage medium of virtual image

Country Status (1)

Country Link
CN (1) CN112529992B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2929228A1 (en) * 2021-05-25 2022-11-25 Wework Factory Sl SYSTEM AND PROCEDURE FOR INTERACTION WITH A DIGITAL AVATAR THAT SHOWS IN REAL TIME CHARACTERISTICS OF A HUMAN INTERLOCUTOR (Machine-translation by Google Translate, not legally binding)
CN113538645A (en) * 2021-07-19 2021-10-22 北京顺天立安科技有限公司 Method and device for matching body movement and language factor of virtual image
CN113868399A (en) * 2021-10-18 2021-12-31 深圳追一科技有限公司 Server over-selling implementation method and device, storage medium and electronic equipment
CN113850899A (en) * 2021-10-18 2021-12-28 深圳追一科技有限公司 Digital human rendering method, system, storage medium and electronic device
CN113886551A (en) * 2021-10-18 2022-01-04 深圳追一科技有限公司 Digital human rendering method and device, storage medium and electronic equipment
CN113850898A (en) * 2021-10-18 2021-12-28 深圳追一科技有限公司 Scene rendering method and device, storage medium and electronic equipment
CN114286021B (en) * 2021-12-24 2024-05-28 北京达佳互联信息技术有限公司 Rendering method, rendering device, server, storage medium, and program product
CN114125490B (en) * 2022-01-19 2023-09-26 阿里巴巴(中国)有限公司 Live broadcast playing method and device
CN114157897B (en) * 2022-01-25 2022-07-15 阿里巴巴(中国)有限公司 Virtual live broadcast control method and device
CN114969282B (en) * 2022-05-05 2024-02-06 迈吉客科技(北京)有限公司 Intelligent interaction method based on rich media knowledge graph multi-modal emotion analysis model
CN116843805B (en) * 2023-06-19 2024-03-19 上海奥玩士信息技术有限公司 Method, device, equipment and medium for generating virtual image containing behaviors
CN117710543A (en) * 2024-02-04 2024-03-15 淘宝(中国)软件有限公司 Digital person-based video generation and interaction method, device, storage medium, and program product

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106663127A (en) * 2016-07-07 2017-05-10 深圳狗尾草智能科技有限公司 An interaction method and system for virtual robots and a robot
CN107294837A (en) * 2017-05-22 2017-10-24 北京光年无限科技有限公司 Engaged in the dialogue interactive method and system using virtual robot
CN108109620A (en) * 2017-11-24 2018-06-01 北京物灵智能科技有限公司 A kind of intelligent robot exchange method and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10304013B2 (en) * 2016-06-13 2019-05-28 Sap Se Real time animation generator for voice content representation
CN107329990A (en) * 2017-06-06 2017-11-07 北京光年无限科技有限公司 A kind of mood output intent and dialogue interactive system for virtual robot
CN107392783B (en) * 2017-07-05 2020-07-07 龚少卓 Social contact method and device based on virtual reality
CN107765852A (en) * 2017-10-11 2018-03-06 北京光年无限科技有限公司 Multi-modal interaction processing method and system based on visual human
CN107944542A (en) * 2017-11-21 2018-04-20 北京光年无限科技有限公司 A kind of multi-modal interactive output method and system based on visual human
CN109410297A (en) * 2018-09-14 2019-03-01 重庆爱奇艺智能科技有限公司 It is a kind of for generating the method and apparatus of avatar image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106663127A (en) * 2016-07-07 2017-05-10 深圳狗尾草智能科技有限公司 An interaction method and system for virtual robots and a robot
CN107294837A (en) * 2017-05-22 2017-10-24 北京光年无限科技有限公司 Engaged in the dialogue interactive method and system using virtual robot
CN108109620A (en) * 2017-11-24 2018-06-01 北京物灵智能科技有限公司 A kind of intelligent robot exchange method and system

Also Published As

Publication number Publication date
CN112529992A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN112529992B (en) Dialogue processing method, device, equipment and storage medium of virtual image
CN114578969B (en) Method, apparatus, device and medium for man-machine interaction
CN107016994B (en) Voice recognition method and device
EP2862164B1 (en) Multiple pass automatic speech recognition
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
JP6129134B2 (en) Voice dialogue apparatus, voice dialogue system, terminal, voice dialogue method, and program for causing computer to function as voice dialogue apparatus
US12008336B2 (en) Multimodal translation method, apparatus, electronic device and computer-readable storage medium
CN104240703A (en) Voice message processing method and device
CN112185363B (en) Audio processing method and device
WO2023088080A1 (en) Speaking video generation method and apparatus, and electronic device and storage medium
US11810556B2 (en) Interactive content output
CN111028842A (en) Method and equipment for triggering voice interaction response
CN111554281B (en) Vehicle-mounted man-machine interaction method for automatically identifying languages, vehicle-mounted terminal and storage medium
CN111613215A (en) Voice recognition method and device
US11315552B1 (en) Responding with unresponsive content
TW202248994A (en) Method for driving interactive object and processing phoneme, device and storage medium
CN109065019B (en) Intelligent robot-oriented story data processing method and system
CN114255754A (en) Speech recognition method, electronic device, program product, and storage medium
CN113593522A (en) Voice data labeling method and device
JP2013050605A (en) Language model switching device and program for the same
WO2022041192A1 (en) Voice message processing method and device, and instant messaging client
CN115630150A (en) Reply text generation method, device, equipment and storage medium
CN112820281B (en) Voice recognition method, device and equipment
CN112235183B (en) Communication message processing method and device and instant communication client
CN114842826A (en) Training method of speech synthesis model, speech synthesis method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant