CN111459451A - Interactive object driving method, device, equipment and storage medium

Interactive object driving method, device, equipment and storage medium

Info

Publication number
CN111459451A
Authority
CN
China
Prior art keywords
data
target data
interactive object
control parameter
target
Prior art date
Legal status
Pending
Application number
CN202010245772.7A
Other languages
Chinese (zh)
Inventor
孙林
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202010245772.7A priority Critical patent/CN111459451A/en
Publication of CN111459451A publication Critical patent/CN111459451A/en
Priority to SG11202109201XA priority patent/SG11202109201XA/en
Priority to JP2021549865A priority patent/JP2022531056A/en
Priority to PCT/CN2020/129830 priority patent/WO2021196647A1/en
Priority to KR1020217027681A priority patent/KR20210124306A/en
Priority to TW109146471A priority patent/TWI759039B/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04847Interaction techniques to control parameter settings, e.g. interaction with sliders or dials

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A method, apparatus, device, and storage medium for driving an interactive object are disclosed. The method includes: acquiring sound driving data of an interactive object displayed by a display device; acquiring, based on at least one target data contained in the sound driving data, control parameters of a set action of the interactive object matching the target data; and controlling the action of the interactive object displayed by the display device according to the obtained control parameters.

Description

Interactive object driving method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for driving an interactive object.
Background
Most human-computer interaction is based on key presses, touch, and voice input, with responses presented as images, text, or virtual characters on a display screen. At present, virtual characters are mostly improvements built on voice assistants, and the interaction between the user and the virtual character remains superficial.
Disclosure of Invention
The embodiment of the disclosure provides a driving scheme for an interactive object.
According to an aspect of the present disclosure, there is provided a method for driving an interactive object, the method including: acquiring sound driving data of an interactive object displayed by a display device; acquiring, based on at least one target data contained in the sound driving data, control parameters of a set action of the interactive object matching the target data; and controlling the action of the interactive object displayed by the display device according to the obtained control parameters.
In combination with any embodiment provided by the present disclosure, the method further includes: controlling the display device to output voice according to voice information corresponding to the sound driving data, and/or to display text according to text information corresponding to the sound driving data.
In combination with any embodiment provided by the present disclosure, the controlling, according to the obtained control parameters, the action of the interactive object displayed by the display device includes: determining voice information corresponding to the target data; acquiring time information for outputting the voice information; determining, according to the time information, the execution time of the set action corresponding to the target data; and controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data.
In combination with any embodiment provided in this disclosure, the control parameters of the set action include a control parameter sequence, and the controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data includes: scanning each group of control parameters in the control parameter sequence at a set rate, so that the interactive object displays the action corresponding to each group of control parameters.
In combination with any embodiment provided in this disclosure, the control parameters of the set action include a control parameter sequence, and the controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data includes: determining a scanning rate of the control parameter sequence according to the execution time; and scanning each group of control parameters in the control parameter sequence at the scanning rate, so that the interactive object displays the action corresponding to each group of control parameters.
In combination with any embodiment provided in this disclosure, the control parameters of the set action include a control parameter sequence, and the controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data includes: starting to scan the control parameter sequence corresponding to the target data a set time before the voice information corresponding to the target data is output, so that the interactive object starts to execute the set action.
In combination with any embodiment provided by the present disclosure, the sound driving data includes a plurality of target data, and the controlling the action of the interactive object displayed by the display device according to the obtained control parameters includes: detecting that adjacent target data among the plurality of target data overlap; and controlling the interactive object to execute the set action according to the control parameters corresponding to the earlier-arranged target data, while ignoring the later-arranged target data.
In combination with any embodiment provided by the present disclosure, the sound driving data includes a plurality of target data, and the controlling the interactive object displayed by the display device to execute the set action according to the control parameters corresponding to the target data includes: detecting that the control parameter sequences corresponding to adjacent target data among the plurality of target data overlap in execution time, and fusing the overlapping control parameters.
In combination with any embodiment provided by the present disclosure, the acquiring, based on at least one target data contained in the sound driving data, control parameters of a set action of the interactive object matching the target data includes: in response to the sound driving data including audio data, performing speech recognition on the audio data and determining the target data contained in the audio data according to the speech content it contains; and in response to the sound driving data including text data, determining the target data contained in the text data according to the text content it contains.
In combination with any embodiment provided by the disclosure, the target data includes target syllable data, and the control parameters include control parameters of a set mouth shape; the target syllable data belongs to a pre-divided syllable type, and the syllable type matches one set mouth shape; the acquiring, based on at least one target data contained in the sound driving data, control parameters of a set action of the interactive object matching the target data includes: determining at least one target syllable data contained in the sound driving data; and acquiring, based on the syllable type to which the at least one target syllable data belongs, the control parameters of the set mouth shape matching the target syllable data.
In combination with any embodiment provided by the present disclosure, the method further includes: acquiring first data other than the target data in the sound driving data; acquiring posture control parameters matching acoustic features of the first data; and controlling the posture of the interactive object according to the posture control parameters.
According to an aspect of the present disclosure, there is provided an apparatus for driving an interactive object, the apparatus including: a first acquiring unit, configured to acquire sound driving data of an interactive object displayed by a display device; a second acquiring unit, configured to acquire, based on at least one target data contained in the sound driving data, control parameters of a set action of the interactive object matching the target data; and a driving unit, configured to control the action of the interactive object displayed by the display device according to the obtained control parameters.
In combination with any embodiment provided by the present disclosure, the apparatus further includes an output unit, configured to control the display device to output voice according to voice information corresponding to the sound driving data, and/or to display text according to text information corresponding to the sound driving data.
In combination with any embodiment provided by the present disclosure, the driving unit is specifically configured to: determine voice information corresponding to the target data; acquire time information for outputting the voice information; determine, according to the time information, the execution time of the set action corresponding to the target data; and control, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data.
In combination with any embodiment provided in this disclosure, the control parameters of the set action include a control parameter sequence; when controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data, the driving unit is specifically configured to: scan each group of control parameters in the control parameter sequence at a set rate, so that the interactive object displays the action corresponding to each group of control parameters.
In combination with any embodiment provided in this disclosure, the control parameters of the set action include a control parameter sequence; when controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data, the driving unit is specifically configured to: determine a scanning rate of the control parameter sequence according to the execution time; and scan each group of control parameters in the control parameter sequence at the scanning rate, so that the interactive object displays the action corresponding to each group of control parameters.
In combination with any embodiment provided in this disclosure, the control parameters of the set action include a control parameter sequence; when controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data, the driving unit is specifically configured to: start scanning the control parameter sequence corresponding to the target data a set time before the voice information corresponding to the target data is output, so that the interactive object starts to execute the set action.
In combination with any embodiment provided by the present disclosure, the sound driving data includes a plurality of target data, and the driving unit is specifically configured to: detect that adjacent target data among the plurality of target data overlap; and control the interactive object to execute the set action according to the control parameters corresponding to the earlier-arranged target data, while ignoring the later-arranged target data.
In combination with any embodiment provided by the present disclosure, the sound driving data includes a plurality of target data, and the driving unit is specifically configured to: detect that the control parameter sequences corresponding to adjacent target data among the plurality of target data overlap in execution time, and fuse the overlapping control parameters.
In combination with any embodiment provided by the present disclosure, the second acquiring unit is specifically configured to: in response to the sound driving data including audio data, perform speech recognition on the audio data and determine the target data contained in the audio data according to the speech content it contains; and in response to the sound driving data including text data, determine the target data contained in the text data according to the text content it contains.
In combination with any embodiment provided by the disclosure, the target data includes target syllable data, and the control parameters include control parameters of a set mouth shape; the target syllable data belongs to a pre-divided syllable type, and the syllable type matches one set mouth shape; the second acquiring unit is specifically configured to: determine at least one target syllable data contained in the sound driving data; and acquire, based on the syllable type to which the at least one target syllable data belongs, the control parameters of the set mouth shape matching the target syllable data.
In combination with any embodiment provided by the present disclosure, the apparatus further includes a posture control unit, configured to: acquire first data other than the target data in the sound driving data; acquire posture control parameters matching acoustic features of the first data; and control the posture of the interactive object according to the posture control parameters.
According to an aspect of the present disclosure, there is provided an electronic device, the device including a memory for storing computer instructions executable on a processor, and the processor being configured to implement a driving method of an interactive object according to any one of the embodiments provided in the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the driving method of an interactive object according to any one of the embodiments provided in the present disclosure.
According to the driving method, apparatus, device, and computer-readable storage medium for an interactive object provided above, control parameters of a set action of the interactive object matching at least one target data contained in the sound driving data of the interactive object displayed by the display device are obtained to control the action of the interactive object displayed by the display device. The interactive object thus makes the action corresponding to the target data contained in the sound driving data, so that its speaking state is natural and vivid and the interactive experience of the target object is improved.
Drawings
In order to more clearly illustrate one or more embodiments of the present specification or technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in one or more embodiments of the present specification, and other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of a display device in a driving method of an interactive object according to at least one embodiment of the present disclosure;
fig. 2 is a flowchart of a driving method of an interactive object according to at least one embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
At least one embodiment of the present disclosure provides a driving method for an interactive object. The driving method may be performed by an electronic device such as a terminal device or a server; the terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertisement machine, a kiosk, or a vehicle-mounted terminal, and the server includes a local server or a cloud server. The method may also be implemented by a processor calling computer-readable instructions stored in a memory.
In the embodiment of the present disclosure, the interactive object may be any virtual image capable of interacting with a target object, such as a virtual character, a virtual animal, a virtual article, or a cartoon figure; the virtual image may be presented in 2D or 3D form, which is not limited in the present disclosure. The target object may be a user, a robot, or another intelligent device. The interaction between the interactive object and the target object may be active or passive. In one example, the target object issues a demand by making a gesture or body movement, and the interactive object is triggered to interact with it in an active interaction manner. In another example, the interactive object may actively greet the target object or prompt the target object to make an action, so that the target object interacts with the interactive object in a passive manner.
The interactive object may be displayed through a terminal device, and the terminal device may be a television, an all-in-one machine with a display function, a projector, a Virtual Reality (VR) device, an Augmented Reality (AR) device, or the like.
Fig. 1 illustrates a display device proposed by at least one embodiment of the present disclosure. As shown in fig. 1, the display device has a transparent display screen, on which a stereoscopic picture can be displayed to present a virtual scene with a stereoscopic effect and an interactive object. For example, the interactive object displayed on the transparent display screen in fig. 1 is a virtual cartoon character. In some embodiments, the terminal device described in the present disclosure may also be the above display device with a transparent display screen; the display device is configured with a memory and a processor, the memory is used to store computer instructions executable on the processor, and the processor is used to implement the driving method for the interactive object provided in the present disclosure when executing the computer instructions, so as to drive the interactive object displayed on the transparent display screen to respond to the target object.
In some embodiments, in response to the terminal device receiving sound driving data for driving the interactive object to output voice, the interactive object may emit a specified voice to the target object. The sound driving data may be generated according to the actions, expressions, identity, preferences, and the like of the target object around the terminal device, so that the interactive object is driven to respond with the specified voice, thereby providing an anthropomorphic service to the target object. However, if, while the interactive object is driven to emit the specified voice according to the sound driving data, it cannot also be driven to make facial movements synchronized with that voice, the interactive object appears stiff and unnatural when speaking, which affects the interactive experience of the target object. Based on this, at least one embodiment of the present disclosure provides a driving method for an interactive object, so as to improve the experience of interaction between the target object and the interactive object.
Fig. 2 shows a flowchart of a driving method of an interactive object according to at least one embodiment of the present disclosure, and as shown in fig. 2, the method includes steps 201 to 203.
In step 201, sound driving data of an interactive object presented by a display device is acquired.
In the disclosed embodiment, the sound driving data may include audio data (voice data), text, and the like. The sound driving data may be generated by the server or the terminal device according to the actions, expressions, identity, preferences, and the like of the target object interacting with the interactive object, or may be directly acquired by the terminal device, for example retrieved from an internal memory. The present disclosure does not limit the manner of acquiring the sound driving data.
In step 202, control parameters of a set action of the interactive object matching the target data are acquired based on at least one target data contained in the sound driving data.
In the embodiment of the present disclosure, each target data is matched in advance with a set action, and each set action is realized through corresponding control parameters, so each target data is matched with the control parameters of a set action. The target data may be set keywords, words, sentences, and the like. Taking the keyword "waving" as an example, if the sound driving data includes "waving" in text form and/or "waving" in voice form, it can be determined that the sound driving data contains target data.
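To make the keyword matching concrete, a minimal sketch in Python is given below; the keyword list, the placeholder parameter names, and the search strategy are illustrative assumptions and not the patent's implementation.

```python
# Minimal sketch (assumed names): detecting set keywords in text-form sound driving data.
SET_ACTIONS = {
    "waving": "wave_hand_params",   # placeholder for a control parameter sequence
    "nodding": "nod_head_params",
}

def find_target_data(text: str) -> list[tuple[int, str]]:
    """Return (position, keyword) pairs for every set keyword found in the text."""
    hits = []
    for keyword in SET_ACTIONS:
        start = text.find(keyword)
        while start != -1:
            hits.append((start, keyword))
            start = text.find(keyword, start + 1)
    return sorted(hits)

# Example: the driving text contains the keyword "waving",
# so its matched control parameters would then be retrieved.
print(find_target_data("the interactive object is waving to greet you"))
```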
The setting action can be realized by using a general unit animation, the unit animation can comprise a sequence of image frames, each image frame in the sequence corresponds to one posture of the interactive object, and the interactive object can realize the setting action through the change of the postures between the image frames. Wherein the interactive object pose in an image frame may be achieved by a set of control parameters, e.g. a set of control parameters formed by displacements of a plurality of bone points. Therefore, the posture change of the interactive object is controlled by using the control parameter sequence formed by the plurality of groups of control parameters, and the interactive object can be controlled to realize the setting action.
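The idea of a control parameter sequence built from bone-point displacements can be sketched as follows; the data layout and the avatar.move_bone hook are assumptions made for illustration.

```python
from dataclasses import dataclass

# Illustrative sketch: one set of control parameters describes a single pose as
# displacements of several bone points; a sequence of such sets forms a unit animation.
@dataclass
class ControlParameters:
    bone_displacements: dict[str, tuple[float, float, float]]  # bone name -> (dx, dy, dz)

@dataclass
class ControlParameterSequence:
    frames: list[ControlParameters]  # one entry per image frame of the unit animation

def apply_pose(avatar, params: ControlParameters) -> None:
    # Hypothetical renderer hook: move each bone point by its displacement.
    for bone, (dx, dy, dz) in params.bone_displacements.items():
        avatar.move_bone(bone, dx, dy, dz)
```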
In some embodiments, the target data may include target syllable data, which corresponds to control parameters of a set mouth shape; the target syllable data belongs to a pre-divided syllable type, and that syllable type matches one set mouth shape.
Syllable data is a speech unit formed by combining at least one phoneme, and includes syllable data of pinyin (phonetically spelled) languages and syllable data of non-pinyin languages (for example, Chinese). A syllable type groups syllable data whose pronunciation actions are consistent or basically consistent; one syllable type may correspond to one action of the interactive object, specifically to one set mouth shape, that is, one pronunciation action, made when the interactive object speaks. Different types of syllable data are therefore matched with control parameters of different set mouth shapes. For example, the pinyin syllables "ma", "man", and "ang" can be regarded as the same type because their pronunciation actions are basically consistent, and can correspond to the control parameters of the "open mouth" shape made when the interactive object speaks, so that when the sound driving data is detected to include such target data, the interactive object can be controlled to make the corresponding mouth shape. Furthermore, multiple groups of control parameters for different mouth shapes can be matched with multiple different types of syllable data, so that the mouth shape change of the interactive object can be controlled with a control parameter sequence formed by the multiple groups of control parameters, enabling the interactive object to present an anthropomorphic speaking state.
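A minimal sketch of matching syllable types to set mouth shapes, under the assumption of a simple lookup table; the syllable groupings and parameter values are illustrative only.

```python
# Illustrative mapping (assumed values): syllable types whose pronunciation actions
# are basically the same share one set of mouth-shape control parameters.
MOUTH_SHAPE_BY_SYLLABLE_TYPE = {
    ("ma", "man", "mang"): {"jaw_open": 0.8, "lip_round": 0.1},
    ("o", "ou"):           {"jaw_open": 0.4, "lip_round": 0.9},
}

def mouth_params_for(syllable: str):
    for syllable_types, params in MOUTH_SHAPE_BY_SYLLABLE_TYPE.items():
        if syllable in syllable_types:
            return params
    return None  # not a target syllable; no set mouth shape is triggered
```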
In step 203, the action of the interactive object displayed by the display device is controlled according to the obtained control parameter.
For each target data contained in the sound driving data, the control parameters of the corresponding set action can be obtained, and the action of the interactive object is controlled according to the obtained control parameters, that is, the set action corresponding to each target data in the sound driving data is realized.
In the embodiment of the disclosure, according to at least one target data contained in the sound driving data of the interactive object displayed by the display device, the control parameters of the set action of the interactive object matching the target data are obtained to control the action of the interactive object displayed by the display device, so that the interactive object makes the action corresponding to the target data contained in the sound driving data. The speaking state of the interactive object thus becomes natural and vivid, and the interactive experience of the target object is improved.
In this embodiment of the present disclosure, the display device may further be controlled to output voice according to the voice information corresponding to the sound driving data, and/or to display text according to the text information corresponding to the sound driving data.
While the display device is controlled to output the voice corresponding to the sound driving data, the interactive object is controlled to execute the corresponding actions in sequence according to the control parameters matching each target data in the sound driving data, so that the interactive object makes actions matching the content of the voice while the voice is output. The speaking state of the interactive object is therefore natural, and the interactive experience of the target object is improved.
The display device may also be controlled to display the text corresponding to the sound driving data while outputting the corresponding voice, and the interactive object is again controlled to execute the corresponding actions in sequence according to the control parameters matching each target data, so that the interactive object makes actions matching the content of the voice and the text while the voice is output and the text is displayed. The expressive state of the interactive object is therefore natural and vivid, and the interactive experience of the target object is improved.
In the embodiment of the disclosure, an image frame sequence corresponding to variable content can be formed simply by setting control parameters for the specified actions, which reduces the iteration cost and improves the efficiency of driving the interactive object. In addition, target data can be added or modified as needed to cope with changing content, which facilitates maintenance and updating of the driving system.
In some embodiments, the method is applied to a server, including a local server or a cloud server, and the server processes sound driving data of an interactive object, generates a posture parameter value of the interactive object, and performs rendering by using a three-dimensional rendering engine according to the posture parameter value to obtain a response animation of the interactive object. The server can send the response animation to the terminal for displaying to respond to the target object, and can also send the response animation to the cloud end, so that the terminal can obtain the response animation from the cloud end to respond to the target object. After the server generates the attitude parameter value of the interactive object, the attitude parameter value can be sent to the terminal so that the terminal can complete the processes of rendering, generating response animation and displaying.
In some embodiments, the method is applied to a terminal, the terminal processes sound driving data of an interactive object, generates a posture parameter value of the interactive object, and renders the interactive object by using a three-dimensional rendering engine according to the posture parameter value to obtain a response animation of the interactive object, and the terminal can display the response animation to respond to a target object.
In response to the sound driving data including audio data, speech recognition may be performed on the audio data to obtain the speech content it contains, and the target data contained in the audio data is determined accordingly. By matching the speech content against the target data, the target data contained in the sound driving data can be determined.
In response to the sound driving data including text data, the target data contained in the text data is determined according to the text content contained in the text data.
In some embodiments, in the case where the target data includes target syllable data, splitting the sound driving data yields at least one syllable data. It should be understood by those skilled in the art that there may be more than one way of splitting the sound driving data, that different splitting modes may yield different combinations of syllable data, and that the splitting modes may be assigned priorities so that the syllable data obtained with the higher-priority splitting mode is taken as the splitting result.
The split syllable data is then matched against the target syllable data; if a syllable data matches any one of the syllable types included in the target syllable data, it is determined that the syllable data matches the target syllable data. For example, if the target syllable data includes syllable data of the types "ma", "man", and "mang", then in response to the driving data containing syllable data matching any one of "ma", "man", and "mang", it is determined that the driving data contains the target syllable data.
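The splitting and matching step might look roughly like the sketch below, assuming a greedy longest-match split stands in for the higher-priority splitting mode; the syllable inventory and type group are illustrative.

```python
# Sketch under assumptions: split pinyin-like driving text into syllables, preferring
# the longest match, then check each syllable against the pre-divided target types.
TARGET_SYLLABLE_TYPES = {"ma", "man", "mang"}   # illustrative type group

def split_syllables(text: str, syllable_inventory: set[str]) -> list[str]:
    """Greedy longest-match split; the longer-match mode is given higher priority."""
    result, i = [], 0
    while i < len(text):
        for length in range(min(4, len(text) - i), 0, -1):   # try longest candidate first
            candidate = text[i:i + length]
            if candidate in syllable_inventory:
                result.append(candidate)
                i += length
                break
        else:
            i += 1  # skip characters that form no known syllable
    return result

syllables = split_syllables("mangguo", {"man", "mang", "guo", "ma"})
contains_target = any(s in TARGET_SYLLABLE_TYPES for s in syllables)
print(syllables, contains_target)   # ['mang', 'guo'] True
```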
When the sound driving data contains target syllable data, the control parameters of the set mouth shape matching the target syllable data are acquired according to the syllable type of the target syllable data, and the interactive object is controlled to make the corresponding mouth shape. In this way, the mouth shape change of the interactive object can be controlled according to the mouth shape control parameter sequence corresponding to the sound driving data, so that the interactive object presents an anthropomorphic speaking state.
In some embodiments, the voice information corresponding to the target data may be determined; time information for outputting the voice information is acquired; the execution time of the set action corresponding to the target data is determined according to the time information; and, according to the execution time, the interactive object is controlled to execute the set action with the control parameters corresponding to the target data.
In the case where the display device is controlled to output a voice according to the voice information corresponding to the sound driving data, time information of outputting the voice information corresponding to the target data, such as a time when the voice information corresponding to the target data starts to be output, a time when the output is finished, and a duration, may be determined. The execution time of the setting action corresponding to the target data can be determined according to the time information, and the interactive object is controlled to execute the setting action by the control parameter corresponding to the target data during the execution time or within a certain range of the execution time.
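A small sketch of deriving the execution window from the voice timing, assuming a hypothetical TTS timing interface that reports start and end times per keyword.

```python
# Minimal sketch (assumed TTS timing interface): the window in which a set action
# should be executed is derived from when the corresponding voice segment is output.
def execution_window(tts_timings: dict[str, tuple[float, float]], target: str,
                     lead_in: float = 0.0) -> tuple[float, float]:
    """tts_timings maps a target keyword to (start_s, end_s) of its speech output."""
    start, end = tts_timings[target]
    return max(0.0, start - lead_in), end

# e.g. the keyword "waving" is spoken from 1.2 s to 1.8 s of the output voice
print(execution_window({"waving": (1.2, 1.8)}, "waving"))
```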
In the disclosed embodiment, the duration of outputting the voice according to the sound driving data is the same as or close to the duration of controlling the action of the interactive object according to the control parameters; and for each target data, the duration of outputting the corresponding voice is the same as or close to the duration of controlling the action with the corresponding control parameters, so that the time at which the interactive object speaks matches the time at which it makes the action, and the voice and the action of the interactive object are synchronized and coordinated.
In the embodiment of the present disclosure, the attitude change of the interactive object may be controlled by using a control parameter sequence formed by a plurality of sets of control parameters, so that the interactive object realizes the setting action.
In some embodiments, each set of control parameters in the sequence of control parameters may be scanned at a set rate such that the interactive object exhibits a pose corresponding to each set of control parameters. That is, the control parameter sequence corresponding to each target data is always scanned at a constant speed.
When the number of phonemes corresponding to the target data is small and the control parameter sequence of the set action matching the target data is long, that is, when the interactive object takes only a short time to speak the target data but a long time to perform the action, the scanning of the control parameter sequence may be stopped when the output of the voice ends, thereby ending the set action. In addition, a smooth transition may be made between the pose at the end of the set action and the pose at the start of the next specified action, so that the motion of the interactive object is smooth and natural and the interactive experience of the target object is improved.
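A sketch of fixed-rate scanning with an early stop when the voice ends, reusing the ControlParameterSequence and apply_pose names assumed in the earlier sketches.

```python
import time

# Sketch with assumed names: scan the control parameter sequence at a constant set
# rate; if the voice output finishes before the sequence is exhausted, stop early
# so that speech and motion end together.
def play_at_fixed_rate(avatar, sequence, rate_hz: float, voice_ends_at: float) -> None:
    frame_interval = 1.0 / rate_hz
    start = time.monotonic()
    for params in sequence.frames:
        if time.monotonic() - start >= voice_ends_at:
            break                      # voice output finished: stop the set action
        apply_pose(avatar, params)     # show the pose for this group of control parameters
        time.sleep(frame_interval)
```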
In some embodiments, according to the execution time, a scanning rate of the control parameter sequence is determined, and each group of control parameters in the control parameter sequence is scanned at the scanning rate, so that the interactive object exhibits a posture corresponding to each group of control parameters.
When the execution time is short, the scanning rate of the control parameter sequence is relatively high; when the execution time is long, the scanning rate is correspondingly lower. The scanning rate of the control parameter sequence determines the rate at which the interactive object performs the action. For example, when the control parameter sequence is scanned at a higher rate, the posture of the interactive object changes correspondingly faster, so that the set action can be completed in a shorter time.
By determining the scanning rate of the control parameter sequence according to the execution time, the set execution time can be adjusted, for example compressed or expanded, according to the time taken to output the voice of the target data, so that the time in which the interactive object executes the set action matches the time in which the voice of the target data is output, and the voice and the action of the interactive object are synchronized and coordinated.
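A sketch of deriving the scanning rate from the execution time, again reusing the assumed ControlParameterSequence type from the earlier sketch.

```python
# Sketch: compress or stretch the set action so that it fills exactly the
# execution time derived from the voice output of the target data.
def scan_rate_for(sequence, execution_time_s: float) -> float:
    """Frames per second needed to finish the whole sequence within execution_time_s."""
    return len(sequence.frames) / max(execution_time_s, 1e-6)

# A 30-frame action squeezed into 0.6 s of speech is scanned at 50 frames/s.
print(scan_rate_for(ControlParameterSequence(frames=[ControlParameters({})] * 30), 0.6))
```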
In one example, scanning of the control parameter sequence corresponding to the target data may be started a set time before the voice corresponding to the phonemes of the target data is output, so that the interactive object starts to display the pose corresponding to the control parameters.
By starting to scan the control parameter sequence of the target data a very short time, for example 0.1 second, before the voice of the target data begins to be output, the interactive object starts to move slightly ahead of speaking, which better matches the speaking state of a real person, makes the speech of the interactive object more natural and vivid, and improves the interactive experience of the target object.
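A possible way to schedule the scan slightly ahead of the voice output, assuming a timer-based scheduler; the 0.1 second lead time is taken from the example above.

```python
import threading

# Sketch (assumed scheduler): start scanning the control parameter sequence a set
# time before the voice for the target data begins, so the action leads the speech.
LEAD_TIME_S = 0.1

def schedule_action(voice_start_s: float, now_s: float, start_scanning) -> None:
    delay = max(0.0, voice_start_s - LEAD_TIME_S - now_s)
    threading.Timer(delay, start_scanning).start()
```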
In some embodiments, when it is detected that adjacent target data in the plurality of target data overlap, the interactive object may be controlled to execute the setting action according to a control parameter corresponding to a previously arranged target data, and a subsequently arranged target data is ignored.
Each target data contained in the sound driving data is stored in the form of an array, and each target data is an element of the array. It should be noted that, because target data can be obtained by combining morphemes in different ways, two adjacent target data among the plurality of target data may overlap. For example, when the text corresponding to the sound driving data is "the weather is really good" (in Chinese, "天气真好"), the corresponding target data are: 1. "day" ("天"), 2. "weather" ("天气"), 3. "really good" ("真好"). The adjacent target data 1 and 2 share the common morpheme "day", and both may match the same specified action, for example pointing upward with a finger.
Which of the overlapping target data is executed may be determined according to priorities, by setting a priority for each target data.
In one example, the target data that occurs first may be given a higher priority than the following target data. For the "the weather is really good" example above, "day" has a higher priority than "weather", so the interactive object is controlled to execute the action with the control parameters of the set action corresponding to "day"; the remaining morpheme "qi" (the second morpheme of "weather") is ignored, and matching then continues directly with "really good".
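A sketch of the earlier-match-wins rule for overlapping target data; the (start, end, keyword) representation is an assumption for illustration.

```python
# Sketch: when adjacent target data overlap (share morphemes), keep the
# earlier-occurring target data and ignore the later, overlapping one.
def drop_overlapping(matches: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """matches: (start, end, keyword); earlier matches win over later overlapping ones."""
    kept, last_end = [], -1
    for start, end, keyword in sorted(matches):
        if start >= last_end:          # no overlap with the previously kept target data
            kept.append((start, end, keyword))
            last_end = end
    return kept

# "day" (0-1) overlaps "weather" (0-2): the earlier "day" is kept, "weather" ignored.
print(drop_overlapping([(0, 1, "day"), (0, 2, "weather"), (2, 4, "really good")]))
```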
In the embodiment of the disclosure, by setting the matching rule for the case that the adjacent target data are overlapped, the interaction object can be prevented from repeatedly executing the action.
In some embodiments, in the case that the control parameter sequences corresponding to the adjacent target data in the plurality of target data are detected to overlap in execution time, the overlapping control parameters may be fused.
In one embodiment, the overlapping control parameters may be averaged or weighted averaged to achieve fusion of the overlapping control parameters.
In one example, interpolation may be used: starting from the last frame of the previous action, the pose is interpolated toward the next action over the transition time until it coincides with a frame at the beginning of the next action, thereby fusing the overlapping control parameters.
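A minimal sketch of fusing overlapping control parameters by a weighted average (linear blend); the flat parameter dictionary is an assumed layout.

```python
# Sketch: fuse control parameters where two actions overlap in execution time,
# e.g. a linear blend from the tail of the previous action into the next one.
def blend(prev: dict[str, float], nxt: dict[str, float], t: float) -> dict[str, float]:
    """t runs from 0.0 (pure previous pose) to 1.0 (pure next pose)."""
    keys = set(prev) | set(nxt)
    return {k: (1.0 - t) * prev.get(k, 0.0) + t * nxt.get(k, 0.0) for k in keys}

# Halfway through the transition, overlapping parameters are simply averaged.
print(blend({"jaw_open": 0.8}, {"jaw_open": 0.2, "brow_raise": 1.0}, 0.5))
```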
By fusing the control parameters of the overlapped part, the actions of the interactive objects can be smoothly transited, so that the actions of the interactive objects are smooth and natural, and the interactive experience of the target object is improved.
In some embodiments, for the data in the sound driving data other than the target data, referred to here as first data, posture control parameters matching the acoustic features of the first data may be obtained, and the posture of the interactive object may be controlled according to the posture control parameters.
In response to the sound driving data including audio data, the speech frame sequence contained in the first data may be obtained, acoustic features corresponding to at least one speech frame are acquired, and the posture of the interactive object is controlled according to the posture control parameters, for example a posture control vector, of the interactive object corresponding to the acoustic features.
In response to the sound driving data including text data, acoustic features corresponding to the phonemes of the morphemes in the text data may be obtained, and the posture of the interactive object is controlled according to the posture control parameters, for example a posture control vector, of the interactive object corresponding to the acoustic features.
In the disclosed embodiment, the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel-Frequency Cepstral Coefficients (MFCC), and so on.
Because the values of the posture control parameters are matched to the speech frame sequence of the speech segment, when the voice and/or text output according to the first data is synchronized with the posture of the interactive object controlled according to those values, the posture made by the interactive object is synchronized with the output voice and/or text, giving the target object the impression that the interactive object is speaking. And because the posture control vector is related to the acoustic features of the output sound, driving the interactive object according to it gives its expressions and body movements an emotional character, making its speaking process more natural and vivid and improving the interactive experience of the target object.
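A sketch of extracting MFCC acoustic features and mapping them to posture control vectors; librosa is assumed to be available, and posture_model stands for a hypothetical pre-trained mapping that is not defined by the patent.

```python
import numpy as np
import librosa   # assumed available; any MFCC extractor would do

# Sketch: extract MFCC acoustic features for the first data (the non-target part of
# audio driving data) and map them to posture control vectors, one per frame.
def posture_vectors(wav_path: str, posture_model) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    return posture_model.predict(mfcc.T)                 # hypothetical mapping model
```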
In some embodiments, the sound driving data includes at least one target data and first data other than the target data. For the first data, posture control parameters are determined according to the acoustic features of the first data to control the posture of the interactive object; for the target data, the interactive object is controlled to make the set action according to the control parameters of the set action matching the target data.
Fig. 3 illustrates a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure. As shown in fig. 3, the apparatus may include: a first acquiring unit 301, configured to acquire sound driving data of an interactive object displayed by a display device; a second acquiring unit 302, configured to acquire, based on at least one target data contained in the sound driving data, control parameters of a set action of the interactive object matching the target data; and a driving unit 303, configured to control the action of the interactive object displayed by the display device according to the obtained control parameters.
In some embodiments, the apparatus further includes an output unit, configured to control the display device to output voice according to voice information corresponding to the sound driving data, and/or to display text according to text information corresponding to the sound driving data.
In some embodiments, the driving unit is specifically configured to: determine voice information corresponding to the target data; acquire time information for outputting the voice information; determine, according to the time information, the execution time of the set action corresponding to the target data; and control, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data.
In some embodiments, the control parameters of the set action include a control parameter sequence; when controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data, the driving unit is specifically configured to: scan each group of control parameters in the control parameter sequence at a set rate, so that the interactive object displays the action corresponding to each group of control parameters.
In some embodiments, the control parameters of the set action include a control parameter sequence; when controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data, the driving unit is specifically configured to: determine a scanning rate of the control parameter sequence according to the execution time; and scan each group of control parameters in the control parameter sequence at the scanning rate, so that the interactive object displays the action corresponding to each group of control parameters.
In some embodiments, the control parameters of the set action include a control parameter sequence; when controlling, according to the execution time, the interactive object to execute the set action with the control parameters corresponding to the target data, the driving unit is specifically configured to: start scanning the control parameter sequence corresponding to the target data a set time before the voice information corresponding to the target data is output, so that the interactive object starts to execute the set action.
In some embodiments, the sound driving data includes a plurality of target data, and the driving unit is specifically configured to: detect that adjacent target data among the plurality of target data overlap; and control the interactive object to execute the set action according to the control parameters corresponding to the earlier-arranged target data, while ignoring the later-arranged target data.
In some embodiments, the sound driving data includes a plurality of target data, and the driving unit is specifically configured to: detect that the control parameter sequences corresponding to adjacent target data among the plurality of target data overlap in execution time, and fuse the overlapping control parameters.
In some embodiments, the second acquiring unit is specifically configured to: in response to the sound driving data including audio data, perform speech recognition on the audio data and determine the target data contained in the audio data according to the speech content it contains; and in response to the sound driving data including text data, determine the target data contained in the text data according to the text content it contains.
In some embodiments, the target data includes target syllable data, and the control parameters include control parameters of a set mouth shape; the target syllable data belongs to a pre-divided syllable type, and the syllable type matches one set mouth shape; the second acquiring unit is specifically configured to: determine at least one target syllable data contained in the sound driving data; and acquire, based on the syllable type to which the at least one target syllable data belongs, the control parameters of the set mouth shape matching the target syllable data.
In some embodiments, the apparatus further includes a posture control unit, configured to: acquire first data other than the target data in the sound driving data; acquire posture control parameters matching acoustic features of the first data; and control the posture of the interactive object according to the posture control parameters.
At least one embodiment of the present specification further provides an electronic device, as shown in fig. 4, where the device includes a memory and a processor, the memory is used to store computer instructions executable on the processor, and the processor is used to implement the driving method of the interactive object according to any embodiment of the present disclosure when executing the computer instructions. At least one embodiment of the present specification also provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the driving method of the interactive object according to any one of the embodiments of the present disclosure.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above descriptions are merely preferred embodiments of one or more embodiments of the present disclosure and are not intended to limit the scope of one or more embodiments of the present disclosure. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of one or more embodiments of the present disclosure shall fall within the scope of protection of one or more embodiments of the present disclosure.

Claims (20)

1. A method of driving an interactive object, the method comprising:
acquiring sound driving data of an interactive object displayed by a display device;
acquiring, based on at least one target data contained in the sound driving data, control parameters of a set action of the interactive object matched with the target data;
and controlling the action of the interactive object displayed by the display device according to the obtained control parameters.
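As a minimal illustration of the flow recited in claim 1, the following Python sketch maps sound driving data to target data, looks up control parameters of a matching set action, and "drives" the object by printing those parameters. The table ACTION_LIBRARY and the names find_target_data and drive_interactive_object are hypothetical placeholders chosen for the sketch, not part of the disclosure.

    # Hypothetical sketch: sound driving data -> target data -> control parameters
    # of a matched set action -> driving the displayed interactive object.
    ACTION_LIBRARY = {
        # target data (keyword) -> control parameters of the matching set action
        "hello": {"action": "wave_hand", "params": [0.2, 0.6, 1.0]},
        "bye": {"action": "bow", "params": [0.1, 0.5, 0.9]},
    }

    def find_target_data(sound_driving_data):
        """Return the target data items contained in the sound driving data."""
        return [w for w in sound_driving_data.lower().split() if w in ACTION_LIBRARY]

    def drive_interactive_object(sound_driving_data):
        for target in find_target_data(sound_driving_data):
            ctrl = ACTION_LIBRARY[target]  # control parameters of the matched set action
            # A real system would feed these parameters to the avatar renderer on the
            # display device; here they are simply printed.
            print(f"target={target!r} -> action={ctrl['action']}, params={ctrl['params']}")

    drive_interactive_object("hello everyone, bye for now")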
2. The method of claim 1, further comprising:
controlling the display device to output voice according to voice information corresponding to the sound driving data, and/or to display a text according to text information corresponding to the sound driving data.
3. The method according to claim 1 or 2, wherein the controlling the action of the interactive object displayed by the display device according to the obtained control parameters comprises:
determining voice information corresponding to the target data;
acquiring time information for outputting the voice information;
determining the execution time of a set action corresponding to the target data according to the time information;
and controlling the interactive object to execute the set action according to the control parameter corresponding to the target data according to the execution time.
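One possible reading of the timing logic in claim 3 is sketched below in Python; the function name execution_window, the argument names, and the alignment policy (the set action spans exactly the voice segment) are assumptions made for illustration only.

    def execution_window(voice_start, voice_duration):
        """Derive the execution time of the set action from the time information
        of the voice output corresponding to the target data (seconds)."""
        # Illustrative policy: the set action spans the same interval as the voice.
        return voice_start, voice_start + voice_duration

    start, end = execution_window(voice_start=2.4, voice_duration=0.8)
    print(f"execute set action from t={start:.2f}s to t={end:.2f}s")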
4. The method of claim 3, wherein the control parameters of the set action comprise a control parameter sequence; and the controlling, according to the execution time, the interactive object to execute the set action according to the control parameter corresponding to the target data comprises:
and scanning each group of control parameters in the control parameter sequence at a set rate, so that the interactive object displays actions corresponding to each group of control parameters.
5. The method of claim 3, wherein the control parameters of the set action comprise a control parameter sequence; and the controlling, according to the execution time, the interactive object to execute the set action according to the control parameter corresponding to the target data comprises:
determining the scanning rate of the control parameter sequence according to the execution time;
and scanning each group of control parameters in the control parameter sequence at the scanning speed, so that the interactive object displays the action corresponding to each group of control parameters.
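By way of illustration of claims 4 and 5, the sketch below scans a control parameter sequence group by group; the rate is either fixed or derived from the execution time so that the sequence finishes together with the voice output. The names scan_rate_for and scan_sequence and all parameter values are assumptions for the sketch.

    import time

    def scan_rate_for(execution_time, sequence_length):
        """Groups per second needed to cover the whole sequence within execution_time."""
        return sequence_length / execution_time

    def scan_sequence(control_sequence, rate):
        for group in control_sequence:
            # Each group of control parameters drives one pose/frame of the avatar.
            print(f"apply control parameters: {group}")
            time.sleep(1.0 / rate)

    sequence = [[0.0, 0.1], [0.3, 0.4], [0.6, 0.7], [1.0, 0.9]]
    scan_sequence(sequence, rate=scan_rate_for(execution_time=0.8, sequence_length=len(sequence)))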
6. The method of claim 3, wherein the control parameters of the set action comprise a control parameter sequence; and the controlling, according to the execution time, the interactive object to execute the set action according to the control parameter corresponding to the target data comprises:
and starting to scan the control parameter sequence corresponding to the target data at a set time before the voice information corresponding to the target data is output, so that the interactive object starts to execute the set action.
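A small illustration of claim 6: scanning of the control parameter sequence starts a set time ahead of the corresponding voice output, so the avatar visibly begins the action slightly before the speech. The lead time of 0.15 s and the name action_start_time are assumptions, not values given by the disclosure.

    LEAD_TIME = 0.15  # seconds before the voice output; illustrative value only

    def action_start_time(voice_output_time):
        """Start scanning the control parameter sequence LEAD_TIME seconds early."""
        return max(0.0, voice_output_time - LEAD_TIME)

    print(f"{action_start_time(voice_output_time=2.4):.2f}")  # -> 2.25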
7. The method according to any one of claims 1 to 6, wherein the sound driving data comprises a plurality of target data, and the controlling the action of the interactive object displayed by the display device according to the obtained control parameters comprises:
detecting that adjacent target data in the plurality of target data have overlap;
and controlling the interactive object to execute the set action according to the control parameters corresponding to the target data arranged in front, and ignoring the target data arranged behind.
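To illustrate claim 7, the sketch below drops a later target data item whose character span overlaps the item arranged in front of it; the tuple layout (start, end, target) and the function name drop_overlapping are hypothetical.

    def drop_overlapping(matches):
        """matches: list of (start_index, end_index, target_data), sorted by start_index."""
        kept, last_end = [], -1
        for start, end, target in matches:
            if start < last_end:   # overlaps the previously kept target data
                continue           # ignore the target data arranged behind
            kept.append((start, end, target))
            last_end = end
        return kept

    print(drop_overlapping([(0, 5, "hello"), (3, 8, "lower"), (9, 12, "bye")]))
    # -> [(0, 5, 'hello'), (9, 12, 'bye')]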
8. The method according to any one of claims 1 to 6, wherein the sound driving data comprises a plurality of target data, and the controlling the interactive object displayed by the display device to execute the set action according to the control parameter corresponding to the target data comprises:
and detecting that the control parameter sequences corresponding to the adjacent target data in the plurality of target data are overlapped in execution time, and fusing the overlapped control parameters.
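For claim 8, one way to fuse control parameter groups that overlap in execution time is an element-wise average, sketched below; the claim does not prescribe a particular fusion rule, so the averaging policy and the variable names are assumptions.

    def fuse(group_a, group_b):
        """Fuse two overlapping groups of control parameters by element-wise averaging."""
        return [(a + b) / 2.0 for a, b in zip(group_a, group_b)]

    # Two adjacent set actions whose sequences overlap for two frames.
    tail_of_first = [[0.75, 0.5], [1.0, 0.75]]
    head_of_second = [[0.25, 0.0], [0.5, 0.25]]
    print([fuse(a, b) for a, b in zip(tail_of_first, head_of_second)])
    # -> [[0.5, 0.25], [0.75, 0.5]]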
9. The method according to any one of claims 1 to 8, wherein the acquiring, based on at least one target data contained in the sound driving data, the control parameters of the set action of the interactive object matched with the target data comprises:
in response to the sound driving data comprising audio data, performing voice recognition on the audio data, and determining the target data contained in the audio data according to voice content contained in the audio data;
and in response to the sound driving data comprising text data, determining target data contained in the text data according to text content contained in the text data.
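A sketch of the branch in claim 9: audio data is first passed through voice recognition, text data is used directly, and target data is then located in the resulting content. The keyword set ACTION_KEYWORDS and the stub recognize_speech (standing in for any real recognition engine) are assumptions.

    ACTION_KEYWORDS = {"hello", "bye", "welcome"}

    def recognize_speech(audio_bytes):
        # Placeholder for a real voice-recognition call on the audio data.
        return "hello and welcome"

    def extract_target_data(sound_driving_data):
        if isinstance(sound_driving_data, (bytes, bytearray)):  # audio data
            content = recognize_speech(sound_driving_data)
        else:                                                   # text data
            content = sound_driving_data
        return [w for w in content.lower().split() if w in ACTION_KEYWORDS]

    print(extract_target_data(b"\x00\x01"))    # audio -> ['hello', 'welcome']
    print(extract_target_data("bye for now"))  # text  -> ['bye']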
10. The method according to any one of claims 1 to 9, wherein the target data comprises target syllable data, and the control parameter comprises a control parameter of a set mouth shape; the target syllable data belongs to a pre-divided syllable type, and the syllable type is matched with a set mouth shape;
the acquiring, based on at least one target data contained in the sound driving data, the control parameters of the set action of the interactive object matched with the target data comprises:
determining at least one target syllable data contained in the sound driving data;
and acquiring, based on the syllable type to which the at least one target syllable data belongs, the control parameter of the set mouth shape matched with the target syllable data.
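An illustration of claim 10: each target syllable belongs to a pre-divided syllable type, and each type is matched with the control parameters of a set mouth shape. The type names, the example syllables, and the parameter values below are invented for the sketch.

    SYLLABLE_TYPE = {"ma": "open", "man": "open", "bi": "narrow", "pu": "rounded"}
    MOUTH_SHAPE_PARAMS = {"open": [0.9, 0.2], "narrow": [0.3, 0.1], "rounded": [0.5, 0.8]}

    def mouth_shape_for(syllable):
        """Return the set-mouth-shape control parameters for a target syllable, if any."""
        syllable_type = SYLLABLE_TYPE.get(syllable)
        if syllable_type is None:
            return None  # not target syllable data
        return MOUTH_SHAPE_PARAMS[syllable_type]

    for s in ["ma", "pu", "xyz"]:
        print(s, "->", mouth_shape_for(s))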
11. The method according to any one of claims 1 to 10, further comprising:
acquiring first data in the sound driving data other than the target data;
acquiring posture control parameters matched with acoustic features of the first data;
and controlling the posture of the interactive object according to the posture control parameters.
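To illustrate claim 11, the sketch below maps an acoustic feature of the first data (here, the RMS energy of an audio frame) to a posture control parameter; the energy-to-amplitude mapping and the names rms_energy and posture_from_energy are assumptions.

    import math

    def rms_energy(samples):
        """RMS energy of an audio frame of the first data (data other than target data)."""
        return math.sqrt(sum(x * x for x in samples) / len(samples))

    def posture_from_energy(energy):
        # Illustrative mapping: louder speech -> larger swaying of the interactive object.
        return {"sway_amplitude": min(1.0, energy * 2.0)}

    frame = [0.1, -0.2, 0.3, -0.1]
    print(posture_from_energy(rms_energy(frame)))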
12. An apparatus for driving an interactive object, the apparatus comprising:
a first obtaining unit, configured to obtain sound driving data of an interactive object displayed by a display device;
a second obtaining unit, configured to obtain, based on at least one target data contained in the sound driving data, control parameters of a set action of the interactive object matched with the target data;
and a driving unit, configured to control the action of the interactive object displayed by the display device according to the obtained control parameters.
13. The apparatus according to claim 12, further comprising an output unit, configured to control the display device to output voice according to the voice information corresponding to the sound driving data, and/or display text according to the text information corresponding to the sound driving data.
14. The apparatus according to claim 12 or 13, wherein the driving unit is specifically configured to:
determining voice information corresponding to the target data;
acquiring time information for outputting the voice information;
determining the execution time of a set action corresponding to the target data according to the time information;
and controlling the interactive object to execute the set action according to the control parameter corresponding to the target data according to the execution time.
15. The apparatus of claim 14, wherein the control parameters of the set action comprise a sequence of control parameters; when the driving unit is configured to control the interactive object to execute the set action according to the execution time and the control parameter corresponding to the target data, the driving unit is specifically configured to:
scanning each group of control parameters in the control parameter sequence at a set rate, so that the interactive object displays actions corresponding to each group of control parameters; or
determining the scanning rate of the control parameter sequence according to the execution time;
scanning each group of control parameters in the control parameter sequence at the scanning rate, so that the interactive object displays actions corresponding to each group of control parameters; or
and starting to scan the control parameter sequence corresponding to the target data at a set time before the voice information corresponding to the target data is output, so that the interactive object starts to execute the set action.
16. The apparatus according to any one of claims 12 to 15, wherein the sound driving data comprises a plurality of target data, and the driving unit is specifically configured to:
detecting that adjacent target data in the plurality of target data have overlap;
controlling the interactive object to execute the set action according to the control parameters corresponding to the target data arranged in front, and ignoring the target data arranged behind; or
and detecting that the control parameter sequences corresponding to the adjacent target data in the plurality of target data are overlapped in execution time, and fusing the overlapped control parameters.
17. The apparatus according to any one of claims 12 to 16, wherein the target data comprises target syllable data, and the control parameter comprises a control parameter of a set mouth shape; the target syllable data belongs to a pre-divided syllable type, and the syllable type is matched with a set mouth shape;
the second obtaining unit is specifically configured to:
determining at least one target syllable data contained in the sound driving data;
and acquiring, based on the syllable type to which the at least one target syllable data belongs, the control parameter of the set mouth shape matched with the target syllable data.
18. The apparatus according to any one of claims 12 to 17, further comprising a posture control unit for:
acquiring first data in the sound driving data other than the target data;
acquiring posture control parameters matched with acoustic features of the first data;
and controlling the posture of the interactive object according to the posture control parameters.
19. An electronic device, comprising a memory for storing computer instructions executable on a processor, the processor being configured to implement the method of any one of claims 1 to 11 when executing the computer instructions.
20. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 11.
CN202010245772.7A 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium Pending CN111459451A (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
CN202010245772.7A CN111459451A (en) 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium
SG11202109201XA SG11202109201XA (en) 2020-03-31 2020-11-18 Methods, apparatuses, electronic devices and storage media for driving an interactive object
JP2021549865A JP2022531056A (en) 2020-03-31 2020-11-18 Interactive target drive methods, devices, devices, and recording media
PCT/CN2020/129830 WO2021196647A1 (en) 2020-03-31 2020-11-18 Method and apparatus for driving interactive object, device, and storage medium
KR1020217027681A KR20210124306A (en) 2020-03-31 2020-11-18 Interactive object driving method, apparatus, device and recording medium
TW109146471A TWI759039B (en) 2020-03-31 2020-12-28 Methdos and apparatuses for driving interaction object, devices and storage media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010245772.7A CN111459451A (en) 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111459451A true CN111459451A (en) 2020-07-28

Family

ID=71683496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010245772.7A Pending CN111459451A (en) 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium

Country Status (6)

Country Link
JP (1) JP2022531056A (en)
KR (1) KR20210124306A (en)
CN (1) CN111459451A (en)
SG (1) SG11202109201XA (en)
TW (1) TWI759039B (en)
WO (1) WO2021196647A1 (en)


Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06348297A (en) * 1993-06-10 1994-12-22 Osaka Gas Co Ltd Pronunciation trainer
US7827034B1 (en) * 2002-11-27 2010-11-02 Totalsynch, Llc Text-derived speech animation tool
US10630751B2 (en) * 2016-12-30 2020-04-21 Google Llc Sequence dependent data message consolidation in a voice activated computer network environment
KR20140052155A (en) * 2012-10-19 2014-05-07 삼성전자주식회사 Display apparatus, method for controlling the display apparatus and processor for controlling the display apparatus
JP5936588B2 (en) * 2013-09-30 2016-06-22 Necパーソナルコンピュータ株式会社 Information processing apparatus, control method, and program
EP3100259A4 (en) * 2014-01-31 2017-08-30 Hewlett-Packard Development Company, L.P. Voice input command
US10489957B2 (en) * 2015-11-06 2019-11-26 Mursion, Inc. Control system for virtual characters
WO2018016095A1 (en) * 2016-07-19 2018-01-25 Gatebox株式会社 Image display device, topic selection method, topic selection program, image display method and image display program
CN106873773B (en) * 2017-01-09 2021-02-05 北京奇虎科技有限公司 Robot interaction control method, server and robot
CN107340859B (en) * 2017-06-14 2021-04-06 北京光年无限科技有限公司 Multi-modal interaction method and system of multi-modal virtual robot
TWI658377B (en) * 2018-02-08 2019-05-01 佳綸生技股份有限公司 Robot assisted interaction system and method thereof
CN110853614A (en) * 2018-08-03 2020-02-28 Tcl集团股份有限公司 Virtual object mouth shape driving method and device and terminal equipment
CN109599113A (en) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for handling information
CN110413841A (en) * 2019-06-13 2019-11-05 深圳追一科技有限公司 Polymorphic exchange method, device, system, electronic equipment and storage medium
JP2019212325A (en) * 2019-08-22 2019-12-12 株式会社Novera Information processing device, mirror device, and program
CN111459451A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150248161A1 (en) * 2014-03-03 2015-09-03 Sony Corporation Information processing apparatus, information processing system, information processing method, and program
CN106056989A (en) * 2016-06-23 2016-10-26 广东小天才科技有限公司 Language learning method and apparatus, and terminal device
CN107861626A (en) * 2017-12-06 2018-03-30 北京光年无限科技有限公司 The method and system that a kind of virtual image is waken up
CN108942919A (en) * 2018-05-28 2018-12-07 北京光年无限科技有限公司 A kind of exchange method and system based on visual human
CN110176284A (en) * 2019-05-21 2019-08-27 杭州师范大学 A kind of speech apraxia recovery training method based on virtual reality
CN110815258A (en) * 2019-10-30 2020-02-21 华南理工大学 Robot teleoperation system and method based on electromagnetic force feedback and augmented reality

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021196647A1 (en) * 2020-03-31 2021-10-07 北京市商汤科技开发有限公司 Method and apparatus for driving interactive object, device, and storage medium

Also Published As

Publication number Publication date
KR20210124306A (en) 2021-10-14
TWI759039B (en) 2022-03-21
JP2022531056A (en) 2022-07-06
SG11202109201XA (en) 2021-11-29
TW202138987A (en) 2021-10-16
WO2021196647A1 (en) 2021-10-07

Similar Documents

Publication Publication Date Title
TWI778477B (en) Interaction methods, apparatuses thereof, electronic devices and computer readable storage media
TWI766499B (en) Method and apparatus for driving interactive object, device and storage medium
CN111459454B (en) Interactive object driving method, device, equipment and storage medium
JP7227395B2 (en) Interactive object driving method, apparatus, device, and storage medium
WO2021196644A1 (en) Method, apparatus and device for driving interactive object, and storage medium
US11968433B2 (en) Systems and methods for generating synthetic videos based on audio contents
CN110794964A (en) Interaction method and device for virtual robot, electronic equipment and storage medium
CN113299312A (en) Image generation method, device, equipment and storage medium
JP2024513640A (en) Virtual object action processing method, device, and computer program
WO2022252890A1 (en) Interaction object driving and phoneme processing methods and apparatus, device and storage medium
CN110989900B (en) Interactive object driving method, device, equipment and storage medium
CN111459451A (en) Interactive object driving method, device, equipment and storage medium
CN113138765A (en) Interaction method, device, equipment and storage medium
CN113050859A (en) Interactive object driving method, device, equipment and storage medium
CN113689879A (en) Method, device, electronic equipment and medium for driving virtual human in real time
CN113704390A (en) Interaction method and device of virtual objects, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40026470

Country of ref document: HK