CN114610158A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN114610158A
Authority
CN
China
Prior art keywords: voice, data, reply, time, real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210302689.8A
Other languages
Chinese (zh)
Inventor
王辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202210302689.8A
Publication of CN114610158A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/165 Management of the audio stream, e.g. setting of volume, audio stream path

Abstract

The disclosed embodiments relate to a data processing method and apparatus, an electronic device, and a storage medium in the field of computer technology. The data processing method includes: receiving real-time input information sent by a client, and generating a reply voice according to the real-time input information; segmenting the reply voice to obtain a plurality of voice segments, and obtaining appearance control data of a virtual object matched with each voice segment; generating a data packet according to each voice segment and the appearance control data of the virtual object matched with it; and sending the data packet to the client so that the client performs interactive control on the virtual object according to the data packet. The technical scheme of the present disclosure improves the synchronism of data transmission and the accuracy of interaction.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, a data processing apparatus, an electronic device, and a computer-readable storage medium.
Background
Virtual human technology can model any real person to generate a virtual object, and the virtual object is displayed on a terminal device so that a user can interact with it.
In the related art, because of differences between transmission channels or the influence of network delay, data is difficult to present synchronously during interaction with a virtual object, so the timeliness of the interaction is poor and its accuracy is affected. In addition, the interaction process involves a large amount of computation and consumes considerable power and network traffic, so the interaction is not smooth.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides a data processing method and apparatus, an electronic device, and a storage medium, which overcome, at least to some extent, the problem that data cannot be transmitted synchronously owing to the limitations and disadvantages of the related art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a data processing method applied to a server, including: receiving real-time input information sent by a client, and generating reply voice according to the real-time input information; segmenting the reply voice to obtain a plurality of voice segments, and obtaining appearance control data of a virtual object matched with each voice segment; generating a data packet according to the voice fragments and appearance control data of the virtual object matched with each voice fragment; and sending the data packet to a client so that the client carries out interactive control on the virtual object in the client according to the data packet.
According to a second aspect of the present disclosure, there is provided a data processing method, applied to a client, including: sending real-time input information to a server so that the server generates reply voice for the real-time input information, and generating a data packet according to a plurality of voice segments obtained by dividing the reply voice and appearance control data of a virtual object matched with the voice segments; and receiving a data packet returned by the server, and performing interactive control on the virtual object according to the data packet.
According to a third aspect of the present disclosure, there is provided a data processing apparatus applied to a server, including: the reply voice generation module is used for receiving the real-time input information sent by the client and generating reply voice according to the real-time input information; the voice segmentation module is used for segmenting the reply voice to obtain a plurality of voice segments and obtaining appearance control data of the virtual object matched with each voice segment; the data packet generating module is used for generating data packets according to the voice fragments and appearance control data of the virtual objects matched with the voice fragments; and the data packet sending module is used for sending the data packet to the client so that the client can carry out interactive control on the virtual object in the client according to the data packet.
According to a fourth aspect of the present disclosure, there is provided a data processing apparatus, applied to a client, including: the information receiving module is used for sending real-time input information to the server so that the server generates reply voice aiming at the real-time input information, and generates a data packet according to a plurality of voice segments obtained by dividing the reply voice and appearance control data of a virtual object matched with the voice segments; and the interactive control module is used for receiving a data packet returned by the server and carrying out interactive control on the virtual object according to the data packet.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the data processing method of the first or second aspect and possible implementations thereof via execution of the executable instructions.
According to a sixth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data processing method of the above first or second aspect and possible implementations thereof.
In the data processing method, the data processing apparatus, the electronic device, and the computer-readable storage medium provided in the embodiments of the present disclosure, on the one hand, the reply voice is divided into a plurality of voice segments, appearance control data of the virtual object matched with each voice segment is generated, a plurality of data packets are obtained from the voice segments and the matched appearance control data, and the data packets are then sent to the client so that the client performs interactive control on the virtual object according to them. Because the reply voice and the matched appearance control data are cut into fine-grained data packets for transmission, and each data packet carries the voice and the appearance control data together, the data delivered to the client stay synchronized. This avoids the asynchronous data transmission that network delay or separate transmission channels may cause in the related art, improves the synchronism of data transmission, enables synchronous rendering and accurate interaction, and improves the synchronism, timeliness, and stability of the interaction. On the other hand, because the data in each data packet are lightweight, transmitting them packet by packet reduces the resources and traffic consumed, reduces the amount of computation, avoids system stalls, and improves the fluency of the interactive control of the virtual object.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 shows a schematic diagram of a system architecture to which the data processing method of the embodiments of the present disclosure can be applied.
Fig. 2 schematically illustrates a schematic diagram of a data processing method in an embodiment of the present disclosure.
Fig. 3 schematically illustrates a flow chart of dynamically adjusting the granularity of reference time division in an embodiment of the present disclosure.
Fig. 4 schematically illustrates a schematic diagram of time-slicing granularity in an embodiment of the present disclosure.
FIG. 5 schematically illustrates a flow chart for determining skin control data in an embodiment of the present disclosure.
Fig. 6 schematically illustrates a schematic diagram of a partitioned data packet in an embodiment of the present disclosure.
Fig. 7 schematically illustrates a schematic diagram of a server performing data transmission processing in an embodiment of the present disclosure.
Fig. 8 schematically shows a flow chart of another data processing method in the embodiment of the present disclosure.
Fig. 9 schematically shows a flow diagram of the overall interaction in the embodiment of the present disclosure.
Fig. 10 schematically illustrates a flow chart of a data flow-based interaction method in an embodiment of the present disclosure.
Fig. 11 schematically shows a block diagram of a data processing apparatus in an embodiment of the present disclosure.
Fig. 12 schematically shows a block diagram of another data processing apparatus in the embodiment of the present disclosure.
Fig. 13 schematically illustrates a block diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the related art, when a virtual human is rendered, multiple streams of multimedia data are fed in at the same time to drive the virtual human, which may cause the system to stall and harm fluency. In addition, because body language, facial emotional expression, lip shape, eyes, and voice fall out of synchronization during transmission, the virtual human is rendered inconsistently, the delay cannot be resolved, and timely interaction cannot be achieved.
In order to solve the above technical problem, an embodiment of the present disclosure provides a data processing method, which may be applied to an application scenario in which a user and a virtual human perform a dialogue interaction.
Fig. 1 is a schematic diagram illustrating a system architecture to which the data processing method and apparatus according to the embodiments of the present disclosure can be applied.
As shown in fig. 1, the system architecture 100 may include a client 101 and a server 102. The client 101 may be an intelligent device such as a smart phone, a computer, a tablet computer, or a smart speaker. An avatar application may be installed on the client 101 so that a virtual object corresponding to the application can interact with the user. The server 102 may be a background system that provides the data processing services described in the embodiments of the present disclosure, and may consist of one electronic device with computing capability, such as a portable computer, a desktop computer, or a smart phone, or of a cluster of such devices, for processing the data sent by the client.
The data processing method can be applied to a scenario in which a user interacts with a virtual human in a client, and can also be applied to scenarios in which multiple kinds of media information are transmitted and synchronized. Referring to fig. 1, a user 103 sends voice to the client 101, and the client 101 may forward the voice to the server 102. The server 102 first determines whether real-time input information has been received. If real-time input information is received, the server performs semantic analysis on it to generate a reply voice; it then segments the reply voice to obtain a plurality of voice segments and generates appearance control data matched with each voice segment; it next generates a plurality of data packets from the voice segments and the appearance control data of the virtual object matched with each voice segment, so that the data are synchronized through the data packets; and finally it sends the data packets to the client. Based on this, the client 101 may control the virtual object to perform the corresponding interactive operations according to the data in the received data packets, for example controlling the virtual object to perform the action corresponding to the reply voice and to display the expression corresponding to the reply voice.
The server 102 may be the same kind of device as the client 101, i.e. both may be smart devices such as smartphones, or it may be a different kind of device; this is not particularly limited here.
It should be noted that the data processing method provided by the embodiment of the present disclosure may be executed by the server 102. Accordingly, the data processing method may be provided in the server 102 by a program or the like.
Next, a data processing method in the embodiment of the present disclosure is explained in detail with reference to fig. 2.
In step S210, real-time input information sent by the client is received, and a reply voice is generated according to the real-time input information.
In the embodiment of the present disclosure, the client may be any intelligent device capable of carrying out a voice conversation, such as a smart phone, a tablet computer, a smart speaker, a smart television, or a wearable device. An avatar application may be installed in the client, and the avatar application may build the virtual object. The virtual object is a virtual human or avatar that can interact with the user by voice, thereby implementing an intelligent conversation function. The virtual human is generated by modeling any user so as to reproduce the appearance of the real person; it can be deployed on various intelligent devices such as mobile phones and computers and can hold interactive conversations with the user. The virtual objects in the client and the server, together with the user, may constitute a dialogue system that enables the dialogue interaction.
In an interactive scenario, the real-time input information may be information collected by the client, for example information that the user enters through an application installed on the client. The real-time input information may be voice information or text information and is used to interact with the virtual object on the client. When the real-time input information is text, the text can be converted into voice by text-to-speech. In the following, real-time voice is taken as the example of real-time input information.
In an embodiment, in order to avoid interference from other intelligent devices in the same space, voice recognition may be performed on the collected voice information: the part actually spoken by the user is kept as the real-time voice, and voice with different accents is also recognized, which improves the accuracy of the received voice. For example, it can be detected whether voice emitted by other intelligent devices has been picked up; if so, that voice can be filtered out and only the voice input by the user is retained as the real-time voice. The other devices may be any type of smart device, such as a smart speaker or a smart terminal.
The reply voice refers to answer information for the real-time input information, and may be an answer corresponding to the real-time voice or a search result, for example. When the real-time input information is real-time speech, the reply speech may be determined according to real-time semantics and real-time intent of the real-time speech. For example, the real-time speech may be analyzed semantically to obtain corresponding semantics, and the intention of the real-time speech may be determined by intention recognition.
Semantic analysis uses the semantic structure of a sentence to represent the structure of the language. Intent recognition refers to recognizing the purpose and intent of the real-time speech. For example, for real-time voice input by the user, the probability of each intention is calculated with an intention recognition model, and the intention with the highest probability is taken as the intention of the real-time speech. The intention recognition model is used to determine which intention an input belongs to in an interaction scenario: its input is the dialogue data (the real-time speech) and its output is the intention to which the dialogue data belongs. The intention recognition model may be, for example, a convolutional neural network model, a long short-term memory network model, a BERT (Bidirectional Encoder Representations from Transformers) model, or another text classification model. The models listed above are only examples; in practical applications the intention recognition model may adopt other types of text classification models, and the present disclosure does not specifically limit this.
In the embodiment of the present disclosure, the server includes a background service that provides speech and semantic services, and the background service includes an NLP (Natural Language Processing) computation engine. For example, the background service in the server may be invoked to receive the real-time speech collected by the client and perform semantic analysis and intent recognition on it. The NLP computation engine may be used to compute the text information of the reply voice corresponding to the real-time speech. At the same time, semantic analysis may be performed on the text information of the reply voice to obtain the corresponding reply semantics, and intent recognition may be performed on it to obtain the corresponding reply intent. Further, text-to-speech (TTS) conversion may be performed on the text information of the reply voice to generate the corresponding reply voice.
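As a minimal sketch of this server-side flow, the following Python fragment strings the ASR, NLP, and TTS stages together. The helper objects and their methods (transcribe, analyze_semantics, recognize_intent, compute_reply, synthesize) are assumptions made for illustration only; they are not components specified by the disclosure.

```python
# Minimal sketch of the server-side reply-generation flow, assuming
# hypothetical ASR, NLP and TTS helper objects passed in by the caller.
from dataclasses import dataclass

@dataclass
class Reply:
    text: str        # text information of the reply voice
    semantics: str   # reply semantics from semantic analysis
    intent: str      # reply intent from intent recognition
    audio: bytes     # reply voice synthesized by TTS

def generate_reply(real_time_speech: bytes, asr, nlp, tts) -> Reply:
    """Turn real-time speech from the client into a reply voice."""
    query_text = asr.transcribe(real_time_speech)       # speech to text
    semantics = nlp.analyze_semantics(query_text)        # semantic analysis
    intent = nlp.recognize_intent(query_text)            # intent recognition
    reply_text = nlp.compute_reply(query_text, semantics, intent)
    return Reply(text=reply_text,
                 semantics=nlp.analyze_semantics(reply_text),
                 intent=nlp.recognize_intent(reply_text),
                 audio=tts.synthesize(reply_text))        # text-to-speech
```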
In step S220, the reply voice is segmented to obtain a plurality of voice segments, and appearance control data of a virtual object matching each of the voice segments is obtained.
In the embodiment of the present disclosure, segmenting the reply voice refers to segmenting a complete reply voice into a plurality of voice segments. The speech segments may be part of the content of the reply speech and the lengths of the speech segments may be the same or different. When the reply voice is divided, the division may be performed sequentially on the basis of the time order.
In one embodiment, the reply speech may be segmented in time order based on a time segmentation granularity to obtain a plurality of speech segments. The reply voice for division here may be a unit time reply voice. The unit time may be any time length, such as per second or per minute, and the like, and is not particularly limited herein. Time-slicing granularity refers to a criterion for dividing speech, which can be used to determine the number of speech segments. The time segmentation granularity is different, the length of the divided voice segments and the number of the obtained voice segments are also different, the size of the time segmentation granularity is in negative correlation with the number of the voice segments, and the size of the time segmentation granularity is in positive correlation with the length of the voice segments. That is, the larger the time division granularity is, the longer the length of the voice segment is, and the smaller the number of voice segments is; the smaller the granularity of time segmentation, the shorter the length of the speech segment and the greater the number of speech segments. In the embodiment of the disclosure, under the condition that the time division granularity is determined, the reply voice can be quickly divided based on the time division granularity, so that the response speed of the voice segment and the data transmission efficiency are improved.
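As an illustration of the fixed-granularity case, the sketch below splits a buffer of reply-voice samples into segments of equal duration; the 16 kHz sample rate and 50 ms granularity are assumptions chosen only for the example.

```python
# Illustrative only: split a buffer of reply-voice samples into segments of
# fixed duration. A 16 kHz sample rate and 50 ms granularity are assumptions.
def split_by_granularity(samples, sample_rate: int = 16000,
                         granularity_ms: int = 50) -> list:
    samples_per_segment = sample_rate * granularity_ms // 1000
    return [samples[i:i + samples_per_segment]
            for i in range(0, len(samples), samples_per_segment)]

# One second of audio at 50 ms granularity yields 20 voice segments,
# matching the example given later in the description.
```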
For the same reply speech, the division may be performed with one or more time segmentation granularities. In the embodiment of the present disclosure, in order to improve the accuracy of the speech segmentation, the time segmentation granularity may be adjusted dynamically, so that the reply speech can be divided effectively by combining several granularities. When speech segmentation is performed, a reference time segmentation granularity is obtained first and is then adjusted dynamically to obtain an appropriate time segmentation granularity. The reference time segmentation granularity is the default segmentation granularity. For example, it may be determined from evaluation parameters, which represent the evaluation criteria for the speech segmentation and may be one or more of a performance parameter, a reduction degree, and an integrity degree. The performance parameter may be a power-consumption parameter, a segmentation-efficiency parameter, or the like; the reduction degree is the similarity between the set of voice segments and the reply speech; the integrity degree is the completeness of the voice segments with respect to the reply speech. A test speech can be segmented with several candidate granularities to obtain, for each candidate granularity, a set of test segments. The test segments are then compared with the test speech to determine the evaluation parameters, i.e. the performance parameter, the reduction degree, and the integrity degree. The candidate granularities are then ranked by these parameters, and the candidate granularity with the largest performance parameter, the largest reduction degree, and/or the highest integrity degree is determined as the reference time segmentation granularity. Determining the reference time segmentation granularity from the performance parameter, reduction degree, and integrity degree makes the default segmentation granularity accurate.
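A hypothetical sketch of this selection step follows: each candidate granularity is scored by segmenting a test speech and evaluating the resulting segments. The split function and the three scoring callables are placeholders, and combining the three scores into a simple sum is an assumption made only for illustration.

```python
# Hypothetical selection of the reference time segmentation granularity.
# split_fn, performance, reduction and integrity are caller-supplied
# placeholders for the splitting and evaluation procedures.
def choose_reference_granularity(test_speech, candidate_granularities_ms,
                                 split_fn, performance, reduction, integrity):
    best, best_score = None, float("-inf")
    for granularity in candidate_granularities_ms:
        segments = split_fn(test_speech, granularity)
        # Each evaluation parameter is computed by comparing the test
        # segments with the original test speech.
        score = (performance(segments)
                 + reduction(segments, test_speech)
                 + integrity(segments, test_speech))
        if score > best_score:
            best, best_score = granularity, score
    return best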
In the embodiment of the present disclosure, the unit time is one second, and the reference time segmentation granularity may be, for example, 50 ms, or of course 20 ms or another granularity; 50 ms is used in the following description. With a reference time segmentation granularity of 50 ms, one second of reply voice can be divided into 20 voice segments.
To achieve accurate segmentation of speech, the reference time segmentation granularity may be dynamically adjusted based on real-time attribute information of the reply speech. The real-time attribute information may include the speech rate, emotion tag status, and integrity of the reply speech, among others. The description is given by taking the real-time attribute information as the speech speed of the reply speech as an example.
Fig. 3 schematically shows a flow chart for dynamically adjusting the reference time segmentation granularity. Referring to fig. 3, the process mainly includes the following steps:
in step S310, it is determined whether the speech rate of the reply speech satisfies the first adjustment condition. If yes, go to step S320. If not, go to step S330.
In step S320, the reference time-division granularity is reduced to obtain the time-division granularity.
In step S330, it is determined whether the speech rate of the reply speech satisfies the second adjustment condition. If yes, go to step S340. If not, go to step S350.
In step S340, the reference time-division granularity is increased to obtain the time-division granularity.
In step S350, the reference time-division granularity is taken as the time-division granularity.
In the embodiment of the present disclosure, when the real-time attribute information is the speech rate of the reply speech, the first adjustment condition is that the speech rate is greater than a standard speech rate; for example, the first adjustment condition may be 2 times or 4 times the standard speech rate. Specifically, the speech rate of the reply speech is determined from information such as the frequency and the length of pauses in the real-time speech or the reply speech, and it is then judged whether the speech rate satisfies the first or the second adjustment condition. If the speech rate of the reply speech satisfies the first adjustment condition, i.e. is greater than the standard speech rate, the reference time segmentation granularity may be reduced and the reduced value is taken as the time segmentation granularity. The degree of reduction may be determined from the speech rate of the reply speech: the degree of reduction is positively correlated with the speech rate, and the resulting time segmentation granularity is negatively correlated with it. That is, the faster the speech rate, the greater the reduction and the smaller the time segmentation granularity. If the speech rate of the reply speech is not greater than the standard speech rate, it is further judged whether the speech rate satisfies the second adjustment condition.
The second adjustment condition is that the speech rate is smaller than the standard speech rate, and in order to avoid invalid segmentation, the reference time segmentation granularity can be increased. Since too large a partition granularity may affect the synchronization effect, the reference time partition granularity may be increased by a predetermined multiple to obtain the time partition granularity. The preset multiple may be determined according to actual requirements, and may be, for example, 1 time or 2 times, and so on. It should be noted that, when the speech rate of the reply speech is smaller than the standard speech rate, the reference time partition granularity may be increased by 1 time or 2 times, and at this time, the reference time partition granularity may not be positively or negatively correlated with the speech rate of the reply speech.
If the speech rate of the reply speech does not satisfy the first adjustment condition or does not satisfy the second adjustment condition, that is, the speech rate of the reply speech is the standard speech rate, the reference time division granularity is used as the time division granularity.
For example, referring to fig. 4, suppose it is determined from the frequency, pause length, and other information of the real-time speech that the first 0.4 seconds of the reply speech are at the normal speech rate, the next 0.4 seconds are at 2 times speed, and the last 0.2 seconds are at 0.5 times speed. The first 0.4 seconds may be divided using the reference time segmentation granularity as time segmentation granularity 1. For the adjacent 0.4 seconds, the reference time segmentation granularity may be halved to obtain time segmentation granularity 2, which is used for the division. For the last 0.2 seconds, the reference time segmentation granularity may be doubled to obtain time segmentation granularity 3, which is used for the division. Time segmentation granularity 1 is therefore greater than time segmentation granularity 2, and time segmentation granularity 3 is greater than time segmentation granularity 1.
In addition, the real-time attribute information may be an emotion tag status of the reply speech, and the emotion tag status may be used to indicate whether the reply speech includes an emotion tag, so that the reference time partition granularity may be dynamically adjusted to obtain the time partition granularity based on whether the reply speech includes an emotion tag. If not, the reference time partition granularity is increased. And if so, reducing the reference time division granularity according to the number of the emotion labels, wherein the number of the emotion labels is inversely related to the time division granularity. That is, the larger the number of emotion labels, the smaller the time division granularity.
In addition, the reference time partition granularity can be dynamically adjusted according to the completeness of the reply voice. Illustratively, when the completeness is low, the reference time partition granularity may be reduced.
In the embodiment of the disclosure, dynamically adjusting the reference time segmentation granularity according to the real-time attribute information of the reply voice yields an accurate time segmentation granularity, so the reply voice can be divided precisely into a plurality of voice segments that fit the actual situation, which makes the segmentation more reasonable.
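The following sketch illustrates steps S310 to S350 for the speech-rate case. The concrete scaling factors are assumptions for illustration; the disclosure only requires that faster speech reduces the granularity, slower speech increases it by a preset multiple, and the standard rate leaves it unchanged.

```python
# Sketch of dynamically adjusting the reference time segmentation granularity
# according to the speech rate of the reply voice (steps S310-S350).
def adjust_granularity(reference_ms: float, speech_rate: float,
                       standard_rate: float) -> float:
    if speech_rate > standard_rate:
        # First adjustment condition: faster speech reduces the granularity;
        # the reduction grows with the speech rate (illustrative formula).
        return reference_ms * standard_rate / speech_rate
    if speech_rate < standard_rate:
        # Second adjustment condition: slower speech increases the granularity
        # by a preset multiple (here assumed to be 2x).
        return reference_ms * 2
    # Standard speech rate: keep the reference granularity unchanged.
    return reference_ms
```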
After the reply voice has been segmented into a plurality of voice segments, appearance control data of the virtual object matched with each voice segment may be generated. The appearance control data represent rendering data for the virtual object. For one virtual object, the appearance control data may include data for several preset parts, for example, but not limited to, motion data (body data) and expression driving data. The motion data are data of parts that can perform motions, for example, but not limited to, hand motions, foot motions, and leg motions. The expression driving data consist of driving data for several parts whose states can be adjusted, so the expression driving data may include several parameters. A part whose state can be adjusted is one whose state can change rather than being fixed; the driving data of such parts may include, but are not limited to, facial expression data, lip data, and eye data. The appearance control data may therefore include, but are not limited to, motion data together with driving data such as facial expression data, lip data, and eye data.
Illustratively, an emotion tag may be determined from the text information of the reply voice, and the appearance control data matched with each voice segment may be obtained by combining each voice segment with the emotion tag. An emotion tag is a tag that expresses the emotion of the reply voice generated for the real-time voice; it may, for example, be a tag such as happy or smiling. The emotion tag can be generated by an emotion model in an emotion engine. The emotion model may be a trained machine learning model or any other type of classification model. The machine learning model can be trained on training data and their true labels: the training data are fed into the model to obtain predicted labels, and the model parameters are adjusted according to the predicted and true labels until they agree, at which point the trained machine learning model is used as the emotion model. The training data may be any type of speech data. On this basis, after the text information of the reply voice corresponding to the real-time voice is generated, it can be input into the emotion engine, and the emotion tag of the text information of the reply voice is determined by the emotion model in the emotion engine.
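A schematic sketch of this emotion model is shown below. The model object and its methods (predict_label, update) are hypothetical placeholders; any text classifier could fill this role, and a real training procedure would iterate until predictions and true labels agree.

```python
# Schematic sketch, under assumed interfaces, of training and using the
# emotion model inside the emotion engine described above.
def train_emotion_model(model, training_samples):
    """training_samples: iterable of (text, true_label) pairs."""
    for text, true_label in training_samples:
        predicted = model.predict_label(text)
        if predicted != true_label:
            # Adjust the model parameters when the prediction disagrees with
            # the true label; a real procedure repeats this until they agree.
            model.update(text, true_label)
    return model

def tag_emotion(reply_text: str, emotion_model) -> str:
    """Return an emotion tag such as 'happy' or 'smiling' for the reply text."""
    return emotion_model.predict_label(reply_text)
```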
Since the appearance control data may include motion data as well as expression driving data, they may be determined in different ways. Fig. 5 schematically shows a flow chart for determining the appearance control data. Referring to fig. 5, the process mainly includes the following steps:
in step S510, voice action conversion is performed on each of the voice segments, and action data of each of the voice segments is acquired.
In this step, first, voice action conversion may be performed on each voice segment, and the voice segment may be converted into corresponding action data. The action data here may be represented as a floating point type array or in a string form. The voice motion conversion means converting the reply voice into driving data for predicting a motion that the virtual object may perform, and the motion data may include motion amplitude data for representing a motion amplitude and motion posture data for representing a motion posture.
It should be noted that, for the same reply voice, the corresponding action data may be the same or different. The action data may specifically be determined according to voice attributes of voice segments into which the reply voice is cut. The voice attributes may include the volume and speed of the reply voice, and so on. The volume of the reply voice may be positively correlated with the motion amplitude data in the motion data, and the speech rate of the reply voice may also be positively correlated with the motion amplitude data in the motion data. The same reply voice may correspond to a plurality of candidate motion poses, and one of the candidate motion poses may be selected as motion pose data. For example, it may be determined whether the candidate motion poses have been executed, and if all the candidate motion poses have been executed, one candidate motion pose may be randomly selected as the motion pose. And if the candidate motion gestures which are not executed exist, randomly selecting one candidate motion gesture from the candidate motion gestures which are not executed as the motion gesture. Based on this, motion magnitude data and motion pose data can be determined from voice attributes of the voice segments, thereby accurately determining motion data based on multiple dimensions.
In step S520, determining expression driving data corresponding to each voice segment according to each voice segment and the emotion tag.
In this step, in parallel with determining the motion data, the expression driving data corresponding to a voice segment may be determined from the voice segment and the emotion tag obtained from the text information of the reply voice. The expression driving data are generated by combining the driving data of several parts whose states can be adjusted; the driving data of each such part represent the state of that part. For example, the facial expression data may be data corresponding to smiling, sadness, and so on; the eye driving data may include, but are not limited to, data corresponding to blinking or closing the eyes; and the lip driving data may, for example, be data corresponding to opening or closing the mouth.
In the embodiment of the disclosure, the reply voice is divided into a plurality of voice segments according to the time segmentation granularity, and the motion data and expression driving data corresponding to each voice segment are generated in combination with the emotion tag. The motion data and expression driving data are thus generated in small segments, and because each voice segment is matched with its appearance control data, every voice segment stays synchronized with the appearance control data corresponding to it.
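As an illustration of steps S510 and S520 above, the following sketch derives motion data from the voice attributes of a segment and pairs them with expression driving data obtained from the emotion tag. The weighting constants, field names, and the expression lookup table are assumptions made for illustration, not values prescribed by the disclosure.

```python
# Hedged sketch of steps S510/S520: motion data from voice attributes,
# expression driving data from the emotion tag.
import random

def motion_data_for_segment(volume: float, speech_rate: float,
                            candidate_poses: list, executed: set) -> dict:
    # Motion amplitude is positively correlated with volume and speech rate
    # (illustrative weighting).
    amplitude = 0.5 * volume + 0.5 * speech_rate
    # Prefer a candidate motion pose that has not been performed yet;
    # if all have been performed, pick one at random.
    unused = [pose for pose in candidate_poses if pose not in executed]
    pose = random.choice(unused) if unused else random.choice(candidate_poses)
    executed.add(pose)
    return {"motion_amplitude": amplitude, "motion_pose": pose}

def expression_data_for_segment(emotion_tag: str) -> dict:
    # Placeholder mapping from the emotion tag to facial, lip and eye driving
    # data; a real system would also take the voice segment into account.
    presets = {"happy": {"face": "smile", "lip": "open", "eye": "blink"},
               "sad": {"face": "sad", "lip": "closed", "eye": "lowered"}}
    return presets.get(emotion_tag,
                       {"face": "neutral", "lip": "closed", "eye": "open"})
```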
Next, with continuing reference to fig. 2, in step S230, a data packet is generated according to the voice segments and appearance control data of the virtual object matched with each voice segment.
In the embodiment of the present disclosure, a plurality of data packets may be generated from the plurality of voice segments and the appearance control data matched with each voice segment; specifically, each voice segment and the appearance control data matched with it are combined synchronously to obtain the data packet corresponding to that voice segment. Each data packet therefore includes a voice segment, the motion data corresponding to the voice segment, and the expression driving data, where the expression driving data may include several parameters, i.e. the driving data of the parts whose states can be adjusted, such as, but not limited to, facial expression data, lip data, and eye data. It should be added that synchronization is maintained both between the voice segments and the appearance control data and within the appearance control data itself: the expression driving data and the motion data contained in the appearance control data are kept synchronized, and the several parameters contained in the expression driving data are kept synchronized with one another. The data packet therefore keeps each voice segment synchronized with its corresponding appearance control data, avoiding the data desynchronization caused by network anomalies or other reasons in the related art.
For example, referring to fig. 6, the reply voice may be divided into data packet 1, data packet 2, ..., data packet n. Data packet 1 may contain voice segment 1, motion data 1 of voice segment 1, and expression driving data 1; data packet n may contain voice segment n, motion data n of voice segment n, and expression driving data n. The types of data contained in each data packet are the same (for example, a voice segment, motion data, and expression driving data), while the specific values of each type of data may be the same or different from packet to packet.
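An illustrative packet structure is sketched below. The field names are assumptions; the disclosure only requires that the voice segment, motion data, and expression driving data stay synchronized inside one packet.

```python
# Illustrative structure of one synchronized data packet and of the packet
# assembly step; combining index-aligned lists keeps the data synchronized.
from dataclasses import dataclass

@dataclass
class DataPacket:
    voice_segment: bytes   # one slice of the reply voice
    motion_data: list      # body/action driving data
    expression_data: dict  # facial expression, lip and eye driving data

def build_packets(voice_segments, motion_list, expression_list):
    return [DataPacket(v, m, e)
            for v, m, e in zip(voice_segments, motion_list, expression_list)]
```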
Continuing to refer to fig. 2, in step S240, the data packet is sent to the client, so that the client performs interactive control on the virtual object in the client according to the data packet.
In the embodiment of the present disclosure, after a plurality of data packets are generated, the data packets may be sent to the client in a time sequence. Since the data in each data packet is synchronized data, the voice segment and the appearance control data can be kept synchronized. After receiving the data packet, the client can send the action data and the expression driving data in the data packet to a rendering engine of the virtual object, so that the rendering engine generates a corresponding instruction according to the data in the data packet to render the virtual object, drives the virtual object to execute an action corresponding to the action data, and drives the virtual object to display an expression corresponding to the expression driving data, thereby realizing interaction with a user.
Referring to fig. 7, a user 701 sends real-time input information a to a client 702, the client 702 sends the real-time input information to a server 703, and the server 703 parses the real-time input information to generate a reply voice B corresponding to the real-time input information, and segments the reply voice B to obtain a data packet 1, a data packet 2, and a data packet n. The server 703 returns packets to the client 702 in chronological order to control the interaction of the virtual object 704 in the client 702 with the user 701.
In the embodiment of the present disclosure, while the data packets are sent to the client, the reply voice is also passed to the audio player for synchronous playback. Because every frame of data in the reply voice has been cut into small voice segments and every frame has been synchronized, every frame the client receives keeps the voice and the appearance control data synchronized, which guarantees that the states of the virtual object's appearance, facial expression, lips, eyes, and other parts are consistent with one another and with the voice. The data stream is segmented at a small granularity every second, and the voice data and appearance control data are synchronized on the server side, so the data received by the client are always synchronized and cannot fall out of step because of network effects or similar causes. The complex data are processed in the server back end: the expression algorithms for actions, facial expressions, lips, and eyes, as well as the intelligent speech semantic services, are all deployed on the server side. The client is only responsible for rendering the 3D appearance of the virtual human and for receiving the driving data, which keeps the transmitted data lightweight, saves power on the client, avoids stalls caused by rendering, improves the smoothness and stability of the rendering, and also improves the accuracy of the interaction.
In the embodiment of the present disclosure, a data processing method is further provided, which may be applied to a client, and as shown in fig. 8, the method mainly includes the following steps S810 and S820:
in step S810, sending the real-time input information to a server, so that the server generates a reply voice for the real-time input information, and generates a data packet according to a plurality of voice segments obtained by segmenting the reply voice and appearance control data of a virtual object matched with the voice segments;
in step S820, a data packet returned by the server is received, and the virtual object is interactively controlled according to the data packet.
The client can collect the real-time input information produced by the user and send it to the server. The server may generate a reply voice for the real-time input information according to steps S210 to S240, divide the reply voice into a plurality of voice segments according to the time segmentation granularity, generate the appearance control data corresponding to each voice segment, and combine each voice segment with its corresponding appearance control data into one data packet. The data packets are then returned to the client in chronological order.
The client can receive the data packets, parse the voice segment and appearance control data in each data packet, and generate instructions from them to control and render the virtual object installed in the client.
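A sketch of this client-side handling is shown below, reusing the illustrative DataPacket structure sketched earlier; the audio player and rendering engine interfaces are hypothetical placeholders.

```python
# Sketch of client-side handling of returned data packets: play each voice
# segment and hand the matched appearance control data to the rendering
# engine so the virtual object performs the corresponding action/expression.
def handle_packets(packets, audio_player, render_engine):
    for packet in packets:                       # packets arrive in time order
        audio_player.enqueue(packet.voice_segment)
        render_engine.drive_motion(packet.motion_data)
        render_engine.drive_expression(packet.expression_data)
```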
In the embodiment of the present disclosure, each synchronized data packet contains only driving data, and the driving data contain only 1224 float values. Such lightweight network transmission improves data transmission efficiency, makes the rendering of the virtual object smoother and its expressions and actions more accurate, and reduces power consumption. The client receives the data packets into which the server has cut the reply voice and the matched appearance control data at a fine granularity, and performs the interaction control through these data packets. This avoids the asynchronous data transmission that network delay or separate transmission channels may cause in the related art, improves the synchronism of data transmission, enables synchronous rendering and thus accurate interaction, and improves the timeliness and stability of the interaction.
It should be added that the whole interaction process is realized mainly by the server and the virtual object in the client. Referring to the service architecture shown in fig. 9, the server 901 includes an ASR (Automatic Speech Recognition) background service, an emotion markup engine, TTS (Text to Speech), data segmentation, an action engine, and data synchronization. The client 902 contains UI display (a 3D rendering engine) and audio playback.
Based on this service framework, the background service receives the real-time input information sent by the client, recognizes the semantics and intent of the real-time input information, and sends them to the NLP computation engine, which computes the text information of the reply voice. The emotion markup engine converts the semantics, intent, and other information corresponding to the text information of the reply voice, passed on by the ASR background service, into an emotion tag. The TTS receives the text information of the reply voice, performs text-to-speech conversion to generate the reply voice, and converts the reply voice into a voice stream that is passed to the action engine.
The action engine converts the voice stream and the emotion tag corresponding to the reply voice into appearance control data, which include motion data and Euler data (expression driving data composed of facial expression, lip, and eye data). The action engine then divides the reply voice according to the time segmentation granularity to obtain a plurality of voice segments, synchronizes each voice segment with its corresponding appearance control data, and generates a plurality of data packets. For example, the reply voice is divided into 50 ms frames, i.e. 20 data packets per second, and the data packets are returned to the client. The data in each data packet are synchronized: the driving data for body posture, facial expression, lip shape, eyes, and so on in each packet are synchronized with one another, and the appearance control data are also synchronized with the sound track of the voice data.
After receiving each data packet, the client analyzes the appearance control data in the data packet and sends the appearance control data to a 3D rendering engine of the virtual object so as to render images in real time, specifically including but not limited to driving preset parts of the virtual object such as actions, facial expressions, lips and eyes, so that the virtual object can display the actions and expressions corresponding to the voice fragments.
Fig. 10 schematically shows a data flow based interaction flow chart, and referring to fig. 10, the method mainly includes the following steps:
in step S1010, the client receives real-time input information sent by the user, which may be real-time voice, for example. An application in the client can configure the virtual object.
In step S1020, the client sends the real-time input information (real-time speech) to the background service of the server and the TTS to generate text information of the reply speech through NLP, and converts the text information of the reply speech into a speech stream through the TTS.
In step S1030, the voice stream is transmitted to the emotion markup engine, and an emotion tag is generated.
In step S1040, the voice stream is divided into a plurality of voice segments according to the time division granularity.
In step S1050, the emotion tag is sent to the action engine, and the voice stream is sent to the action engine, so as to generate appearance control data corresponding to the voice segment.
In step S1060, a plurality of packets are generated from the voice segment and the corresponding skin control data.
In step S1070, the voice stream and the skin control data in the plurality of data packets are synchronized and transmitted to the client.
In step S1080, the client renders according to the appearance control data.
According to the technical scheme above, the server handles the computation-intensive operations such as generating the motion data and expression driving data, which reduces the computing power consumed by the client and makes the client run more smoothly. The reply voice is divided into fine-grained voice segments, each voice segment and its corresponding appearance control data are combined synchronously into a data packet, and the data packets are sent to the client for the interaction. This avoids the asynchronous data transmission caused by network problems or other reasons in the related art, prevents the voice and the appearance control data from becoming inconsistent, improves the synchronism and fluency of the actions, and makes timely and accurate interaction possible.
In an embodiment of the present disclosure, a data processing apparatus is provided, and referring to fig. 11, the data processing apparatus 1100 may include:
the reply voice generation module 1101 is configured to receive real-time input information sent by the client, and generate a reply voice according to the real-time input information;
a voice segmentation module 1102, configured to segment the reply voice to obtain a plurality of voice segments, and obtain appearance control data matched with each of the voice segments;
a data packet generating module 1103, configured to generate a data packet according to the voice segments and appearance control data of the virtual object matched with each voice segment;
and the data packet sending module 1104 is configured to send the data packet to the client, so that the client performs interactive control on the virtual object in the client according to the data packet.
In an exemplary embodiment of the present disclosure, the voice segmentation module includes: and the segmentation control module is used for segmenting the reply voice according to the time sequence based on the time segmentation granularity to obtain a plurality of voice segments.
In an exemplary embodiment of the present disclosure, the segmentation control module includes: and the dynamic adjustment module is used for dynamically adjusting the reference time segmentation granularity according to the real-time attribute information of the reply voice to obtain the time segmentation granularity, and segmenting the reply voice according to the time segmentation granularity to obtain the plurality of voice fragments.
In an exemplary embodiment of the present disclosure, the dynamic adjustment module includes: the first adjusting module is used for reducing the reference time segmentation granularity to obtain the time segmentation granularity if the real-time attribute information of the reply voice meets a first adjusting condition; and the second adjusting module is used for increasing the reference time segmentation granularity to obtain the time segmentation granularity if the real-time attribute information of the reply voice meets a second adjusting condition.
In an exemplary embodiment of the present disclosure, the apparatus further includes: a reference granularity determination module for determining the reference time segmentation granularity according to the evaluation parameter; the evaluation parameter comprises at least one of a performance parameter, a reduction degree and a completeness degree.
In an exemplary embodiment of the present disclosure, the voice segmentation module includes: and the appearance control data determining module is used for determining an emotion label according to the reply voice and acquiring the appearance control data matched with each voice segment by combining each voice segment and the emotion label.
In an exemplary embodiment of the present disclosure, the appearance control data determination module includes: an action data acquisition module, configured to perform voice-to-action conversion on each voice segment to obtain action data of each voice segment; and an expression driving data acquisition module, configured to determine expression driving data corresponding to each voice segment according to each voice segment and the emotion label.
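The following sketch only illustrates the data flow implied by these two modules; the returned action and expression structures, and the conversion logic itself, are placeholders rather than the disclosed models.

```python
from typing import Dict, List

def voice_to_action(segment: bytes) -> List[Dict]:
    # Placeholder for voice-to-action conversion; a real system would map the audio
    # in this segment to animation keyframes for the virtual object.
    return [{"bone": "head", "rotation": [0.0, 0.0, 0.0]}]

def voice_to_expression(segment: bytes, emotion_label: str) -> List[Dict]:
    # Placeholder for expression driving data; a real system would combine the
    # segment's phonetic content with the emotion label derived from the reply voice.
    return [{"blendshape": "mouth_open", "weight": 0.4, "emotion": emotion_label}]

def appearance_control_for_segment(segment: bytes, emotion_label: str) -> Dict:
    return {
        "action": voice_to_action(segment),
        "expression": voice_to_expression(segment, emotion_label),
    }
```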
In an exemplary embodiment of the present disclosure, the data packet generation module includes: a combination module, configured to synchronously combine each voice segment with the appearance control data corresponding to that voice segment to obtain a plurality of data packets.
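One possible serialization of this synchronous combination is sketched below; the length-prefixed JSON header followed by raw audio is an assumed layout, since the disclosure does not fix a wire format.

```python
import json
from typing import Dict, List

def combine(voice_segments: List[bytes], controls: List[Dict]) -> List[bytes]:
    # Pair each voice segment with the appearance control data matched to it,
    # so both arrive at the client inside the same packet.
    packets = []
    for i, (segment, control) in enumerate(zip(voice_segments, controls)):
        header = json.dumps({"index": i, "control": control}).encode("utf-8")
        packets.append(len(header).to_bytes(4, "big") + header + segment)
    return packets
```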
It should be noted that the specific details of each module in the data processing apparatus have already been described in detail in the corresponding data processing method, and are therefore not repeated here.
Further, in an embodiment of the present disclosure, a data processing apparatus is provided; referring to fig. 12, the data processing apparatus 1200 may include:
the information receiving module 1201 is configured to send real-time input information to a server, so that the server generates a reply voice for the real-time input information, and generates a data packet according to a plurality of voice segments obtained by dividing the reply voice and appearance control data of a virtual object matched with the voice segments;
and the interaction control module 1202 is configured to receive a data packet returned by the server, and perform interaction control on the virtual object according to the data packet.
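On the client side, the interaction loop can be pictured as below. The socket framing, the assumed JSON packet fields, and the play_audio and drive_avatar helpers are stand-ins for whatever transport and rendering stack the client actually uses.

```python
import json
import socket
from typing import Dict, Optional

def receive_packet(conn: socket.socket) -> Optional[Dict]:
    # Placeholder framing: a 4-byte big-endian length prefix followed by a JSON body,
    # which is assumed to carry the voice segment (e.g. base64) and the control data.
    prefix = conn.recv(4)
    if not prefix:
        return None
    body = conn.recv(int.from_bytes(prefix, "big"))
    return json.loads(body)

def play_audio(voice_segment: str) -> None:
    print(f"playing segment of {len(voice_segment)} characters")  # stand-in for real playback

def drive_avatar(appearance_control: Dict) -> None:
    print(f"applying appearance control: {appearance_control}")   # stand-in for rendering

def interact(server_addr: tuple, real_time_input: bytes) -> None:
    # Send the real-time input, then consume the returned data packets in order,
    # playing each voice segment while applying its matched appearance control data.
    with socket.create_connection(server_addr) as conn:
        conn.sendall(real_time_input)
        packet = receive_packet(conn)
        while packet is not None:
            play_audio(packet["voice_segment"])
            drive_avatar(packet["appearance_control"])
            packet = receive_packet(conn)
```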
Fig. 13 shows a schematic diagram of an electronic device suitable for implementing exemplary embodiments of the present disclosure. The terminal of the present disclosure may be configured in the form of the electronic device shown in fig. 13; however, it should be noted that the electronic device shown in fig. 13 is only an example and should not limit the functions or scope of use of the embodiments of the present disclosure.
The electronic device of the present disclosure includes at least a processor and a memory for storing one or more programs which, when executed by the processor, cause the processor to implement the method of the exemplary embodiments of the present disclosure.
Specifically, as shown in fig. 13, the electronic device 1300 may include: processor 1310, internal memory 1321, external memory interface 1322, Universal Serial Bus (USB) interface 1330, charging management module 1340, power management module 1341, battery 1342, antenna 1, antenna 2, mobile communication module 1350, wireless communication module 1360, audio module 1370, speaker 1371, receiver 1372, microphone 1373, headphone interface 1374, sensor module 1380, display 1390, camera module 1391, indicator 1392, motor 1393, button 1394, Subscriber Identity Module (SIM) card interface 1395, and the like. The sensor module 1380 may include a depth sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 1300. In other embodiments of the present application, the electronic device 1300 may include more or fewer components than illustrated, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 1310 may include one or more processing units. For example, the processor 1310 may include an application processor, a modem processor, a graphics processor, an image signal processor, a controller, a video codec, a digital signal processor, a baseband processor, and/or a Neural-Network Processing Unit (NPU), among others. The different processing units may be independent devices or may be integrated into one or more processors. In addition, a memory may be provided in the processor 1310 for storing instructions and data. The data processing method in the present exemplary embodiment may be performed by the application processor, the graphics processor, or the image signal processor, and, when the method involves neural-network-related processing, may be performed by the NPU.
Internal memory 1321 may be used to store computer-executable program code, including instructions. The internal memory 1321 may include a program storage area and a data storage area. The external memory interface 1322 may be used for connecting an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 1300.
The communication function of the electronic device 1300 may be implemented by the mobile communication module, the antenna 1, the wireless communication module, the antenna 2, the modem processor, the baseband processor, and the like. The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. The mobile communication module may provide mobile communication solutions such as 2G, 3G, 4G, and 5G applied to the electronic device 1300. The wireless communication module may provide wireless communication solutions such as wireless LAN, Bluetooth, and near field communication applied to the electronic device 1300.
The display screen is used to implement display functions, such as displaying user interfaces, images, and videos. The camera module is used to implement shooting functions, such as capturing images and videos. The audio module is used to implement audio functions, such as audio playback and voice acquisition. The power module is used to implement power management functions, such as charging the battery, powering the device, and monitoring the battery status.
The present application also provides a computer-readable storage medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device.
A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable storage medium may transmit, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The computer-readable storage medium carries one or more programs which, when executed by such an electronic device, cause the electronic device to implement the method described in the above embodiments.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, a network device, or the like) to execute the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims. It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (13)

1. A data processing method applied to a server, characterized by comprising the following steps:
receiving real-time input information sent by a client, and generating reply voice according to the real-time input information;
segmenting the reply voice to obtain a plurality of voice segments, and obtaining appearance control data of a virtual object matched with each voice segment;
generating a data packet according to the voice segments and appearance control data of the virtual object matched with each voice segment;
and sending the data packet to a client so that the client carries out interactive control on the virtual object in the client according to the data packet.
2. The data processing method of claim 1, wherein the segmenting the reply speech to obtain a plurality of speech segments comprises:
segmenting, based on a time segmentation granularity, the reply voice in time order to obtain the plurality of voice segments.
3. The data processing method of claim 2, wherein the segmenting, based on the time segmentation granularity, the reply voice in time order to obtain the plurality of voice segments comprises:
dynamically adjusting a reference time segmentation granularity according to real-time attribute information of the reply voice to obtain the time segmentation granularity, and segmenting the reply voice according to the time segmentation granularity to obtain the plurality of voice segments.
4. The data processing method according to claim 3, wherein the dynamically adjusting a reference time segmentation granularity according to the real-time attribute information of the reply voice to obtain the time segmentation granularity comprises:
if the real-time attribute information of the reply voice meets a first adjustment condition, reducing the reference time segmentation granularity to obtain the time segmentation granularity;
and if the real-time attribute information of the reply voice meets a second adjustment condition, increasing the reference time segmentation granularity to obtain the time segmentation granularity.
5. The data processing method of claim 3, wherein the method further comprises:
determining the reference time segmentation granularity according to an evaluation parameter, wherein the evaluation parameter comprises at least one of a performance parameter, a reduction degree, and a completeness degree.
6. The data processing method according to claim 1, wherein the obtaining appearance control data of the virtual object matched with each of the voice segments comprises:
determining an emotion label according to the reply voice, and obtaining the appearance control data matched with each voice segment by combining each voice segment with the emotion label.
7. The data processing method of claim 6, wherein the appearance control data includes action data and expression driving data; and the obtaining the appearance control data matched with each voice segment by combining each voice segment with the emotion label comprises:
performing voice-to-action conversion on each voice segment to obtain the action data of each voice segment;
and determining expression driving data corresponding to each voice segment according to each voice segment and the emotion label.
8. The data processing method according to claim 1, wherein the generating a data packet according to the voice segments and the appearance control data of the virtual object matched with each of the voice segments comprises:
synchronously combining each voice segment with the appearance control data corresponding to that voice segment to obtain a plurality of data packets.
9. A data processing method is applied to a client, and is characterized by comprising the following steps:
sending real-time input information to a server so that the server generates reply voice for the real-time input information, and generating a data packet according to a plurality of voice segments obtained by dividing the reply voice and appearance control data of a virtual object matched with the voice segments;
and receiving a data packet returned by the server, and performing interactive control on the virtual object according to the data packet.
10. A data processing device applied to a server is characterized by comprising:
the reply voice generation module is used for receiving the real-time input information sent by the client and generating reply voice according to the real-time input information;
the voice segmentation module is used for segmenting the reply voice to obtain a plurality of voice segments and obtaining appearance control data of the virtual object matched with each voice segment;
the data packet generation module is used for generating a data packet according to the voice segments and appearance control data of the virtual object matched with each voice segment;
and the data packet sending module is used for sending the data packet to the client so that the client can carry out interactive control on the virtual object in the client according to the data packet.
11. A data processing device applied to a client side is characterized by comprising:
the information receiving module is used for sending real-time input information to the server so that the server generates reply voice aiming at the real-time input information, and generates a data packet according to a plurality of voice segments obtained by dividing the reply voice and appearance control data of a virtual object matched with the voice segments;
and the interactive control module is used for receiving a data packet returned by the server and carrying out interactive control on the virtual object according to the data packet.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data processing method of any one of claims 1-9 via execution of the executable instructions.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 9.
CN202210302689.8A 2022-03-25 2022-03-25 Data processing method and device, electronic equipment and storage medium Pending CN114610158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210302689.8A CN114610158A (en) 2022-03-25 2022-03-25 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210302689.8A CN114610158A (en) 2022-03-25 2022-03-25 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114610158A true CN114610158A (en) 2022-06-10

Family

ID=81867279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210302689.8A Pending CN114610158A (en) 2022-03-25 2022-03-25 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114610158A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116112716A (en) * 2023-04-14 2023-05-12 世优(北京)科技有限公司 Virtual person live broadcast method, device and system based on single instruction stream and multiple data streams
CN116112716B (en) * 2023-04-14 2023-06-09 世优(北京)科技有限公司 Virtual person live broadcast method, device and system based on single instruction stream and multiple data streams
CN117253485A (en) * 2023-11-20 2023-12-19 翌东寰球(深圳)数字科技有限公司 Data processing method, device, equipment and storage medium
CN117253485B (en) * 2023-11-20 2024-03-08 翌东寰球(深圳)数字科技有限公司 Data processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US20220044463A1 (en) Speech-driven animation method and apparatus based on artificial intelligence
CN110531860B (en) Animation image driving method and device based on artificial intelligence
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
CN111145322B (en) Method, apparatus, and computer-readable storage medium for driving avatar
EP3665676B1 (en) Speaking classification using audio-visual data
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN113454708A (en) Linguistic style matching agent
WO2021196643A1 (en) Method and apparatus for driving interactive object, device, and storage medium
CN114610158A (en) Data processing method and device, electronic equipment and storage medium
CN112633208A (en) Lip language identification method, service equipment and storage medium
CN110299152A (en) Interactive output control method, device, electronic equipment and storage medium
CN111432267A (en) Video adjusting method and device, electronic equipment and storage medium
US20230047858A1 (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program product for video communication
WO2023284435A1 (en) Method and apparatus for generating animation
CN114895817B (en) Interactive information processing method, network model training method and device
CN111538456A (en) Human-computer interaction method, device, terminal and storage medium based on virtual image
CN113536007A (en) Virtual image generation method, device, equipment and storage medium
CN113421547A (en) Voice processing method and related equipment
US20230343011A1 (en) Realtime ai sign language recognition with avatar
CN115049016A (en) Model driving method and device based on emotion recognition
CN113689879A (en) Method, device, electronic equipment and medium for driving virtual human in real time
CN116958342A (en) Method for generating actions of virtual image, method and device for constructing action library
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium
CN117370605A (en) Virtual digital person driving method, device, equipment and medium
CN113362432B (en) Facial animation generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination