CN114818609A - Interaction method for virtual object, electronic device and computer storage medium - Google Patents


Info

Publication number
CN114818609A
CN114818609A (application CN202210745609.6A)
Authority
CN
China
Prior art keywords: interactive, text, image, virtual object, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210745609.6A
Other languages
Chinese (zh)
Other versions
CN114818609B (en)
Inventor
王雄威
林旭鸣
崔雨豪
赵中州
周伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alibaba Cloud Feitian Information Technology Co., Ltd.
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co., Ltd.
Priority to CN202210745609.6A
Publication of CN114818609A
Application granted
Publication of CN114818609B
Legal status: Active
Anticipated expiration


Classifications

    • G06F40/126 Character encoding (under G06F40/00 Handling natural language data; G06F40/10 Text processing; G06F40/12 Use of codes for handling textual entities)
    • G06F16/3329 Natural language query formulation or dialogue systems (under G06F16/00 Information retrieval; G06F16/33 Querying; G06F16/332 Query formulation)
    • G06F16/334 Query execution (under G06F16/33 Querying; G06F16/3331 Query processing)
    • G06F18/253 Fusion techniques of extracted features (under G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/25 Fusion techniques)
    • G06N20/00 Machine learning (under G06N Computing arrangements based on specific computational models)

Abstract

Embodiments of the present application provide an interaction method for a virtual object, an electronic device, and a computer storage medium. The interaction method for the virtual object includes the following steps: acquiring an interactive text to be processed during online interaction between the virtual object and a user; detecting whether the interactive text contains a part that does not match the avatar of the virtual object as set by the avatar setting information of the virtual object; and, if such a part exists, performing mask processing based on the features corresponding to the unmatched part, and modifying the masked interactive text based on the avatar setting information to obtain a modified interactive text, so that the virtual object interacts with the user based on the modified interactive text. Through the embodiments of the present application, the interactive text used by the virtual object better conforms to and matches the avatar setting of the virtual object, which in turn improves the user's interactive experience.

Description

Interaction method for virtual object, electronic device and computer storage medium
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to an interaction method for virtual objects, an electronic device, and a computer storage medium.
Background
With the continuous development of virtual technology, more and more industries and fields start to use digital virtual products, also referred to as virtual objects in this application, to interact with users, such as virtual hosts, virtual customer service, chat robots, and so on.
To interact better with users, each of these virtual objects is given certain attributes (also referred to as a "persona" or "setting"), for example a virtual anchor, virtual customer service agent, or chat robot with a particular tone and style of speech. However, as virtual-object interaction technology develops, the skills of virtual objects keep growing, such as script explanation, question answering, and casual chat in various styles. In practical applications it often happens that the text or script used by these extended skills does not conform to the attributes the virtual object was originally set with, which in turn results in a poor interaction experience for the user.
Disclosure of Invention
In view of the above, embodiments of the present application provide an interaction scheme for virtual objects to at least partially solve the above problems.
According to a first aspect of the embodiments of the present application, there is provided an interaction method for virtual objects, including: acquiring an interactive text to be processed during online interaction between the virtual object and a user; detecting whether the interactive text contains a part that does not match the avatar of the virtual object as set by the avatar setting information of the virtual object; and, if such a part exists, performing mask processing based on the features corresponding to the unmatched part and modifying the masked interactive text based on the avatar setting information to obtain a modified interactive text, so that the virtual object interacts with the user based on the modified interactive text.
According to a second aspect of the embodiments of the present application, there is provided an electronic device, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the method according to the first aspect.
According to a third aspect of the embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect.
According to a fourth aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to the method according to the first aspect.
According to the solutions provided by the embodiments of the present application, during online interaction between the virtual object and the user, unlike the conventional approach in which the virtual object directly uses the determined interactive text, the interactive text is first inspected to judge whether it contains parts that do not match the avatar setting of the virtual object, including but not limited to a mismatched expression style or mismatched character traits represented by the text; those unmatched parts are then modified to match the avatar setting of the virtual object. The modification uses masking followed by mask recovery (i.e., modification) based on the avatar setting information, so the unmatched parts can be modified precisely without modifying the whole interactive text, which improves modification accuracy, reduces the modification processing burden, and improves overall modification efficiency.
Therefore, with the solutions of the embodiments of the present application, the interactive text used by the virtual object better conforms to and matches the avatar setting of the virtual object, which in turn improves the user's interactive experience.
Drawings
To describe the embodiments of the present application or the prior-art technical solutions more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some of the embodiments described in this application, and those skilled in the art can derive other drawings from them.
FIG. 1 is a schematic diagram of an exemplary system for an interactive method for virtual objects to which embodiments of the present application are applicable;
FIG. 2 is a flow chart illustrating steps of an interaction method for virtual objects according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating creation of an interactive text style library according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a detection model according to an embodiment of the present application;
FIG. 5 is a diagram illustrating an example of a rewrite model according to an embodiment of the present application;
FIG. 6 is a process diagram of an interaction process for virtual objects according to an embodiment of the present application;
FIG. 7 is an exemplary diagram of an interactive system framework according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the technical solutions in the embodiments of the present application, the technical solutions are described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application shall fall within the protection scope of the embodiments of the present application.
The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.
Fig. 1 illustrates an exemplary system to which embodiments of the present application may be applied. As shown in fig. 1, the system 100 may include a cloud server 102, a communication network 104, and/or one or more user devices 106, illustrated in fig. 1 as multiple user devices. It should be noted that the solution of this embodiment may be deployed in the cloud server 102, or in a user device 106 with sufficiently high software and hardware capability.
Cloud server 102 may be any suitable device for storing information, data, programs, and/or any other suitable type of content, including but not limited to distributed storage system devices, server clusters, computing cloud server clusters, and the like. In some embodiments, cloud server 102 may perform any suitable functions. For example, when the solution of the embodiment of the present application is deployed in the cloud server 102, in some embodiments, the cloud server 102 may be configured to rewrite the interactive text based on the avatar setting of the virtual object so as to make the interactive text more matched with the avatar setting of the virtual object. As an optional example, in some embodiments, the cloud server 102 may be configured to detect whether there is a part that does not conform to the avatar setting of the virtual object in the interactive text to be used by the virtual object, and rewrite the part, so that the rewritten interactive text conforms to the avatar setting of the virtual object. As another example, in some embodiments, the cloud server 102 may rewrite based on a mask processing technique and avatar setting information of the virtual object when rewriting a portion that does not conform to the avatar setting of the virtual object, to improve rewriting accuracy and efficiency. In some embodiments, the cloud server 102 may send the rewritten interactive text to the user equipment 106, so that the virtual object at the user equipment 106 interacts with the user after performing audio conversion based on the interactive text; or, the cloud service side 102 may convert the rewritten interactive text into audio data and send the audio data to the user equipment 106, so that the virtual object at the user equipment 106 interacts with the user based on the audio data.
In some embodiments, the communication network 104 may be any suitable combination of one or more wired and/or wireless networks. For example, the communication network 104 may include, but is not limited to, the Internet, an intranet, a wide area network (WAN), a local area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, a virtual private network (VPN), and/or any other suitable communication network. The user device 106 can be connected to the communication network 104 via one or more communication links (e.g., communication link 112), and the communication network 104 can be linked to the cloud server 102 via one or more communication links (e.g., communication link 114). The communication link may be any communication link suitable for communicating data between the user device 106 and the cloud server 102, such as a network link, a dial-up link, a wireless link, a hardwired link, any other suitable communication link, or any suitable combination of such links.
User device 106 may include any one or more user devices suitable for interacting with a user. When the solution of this embodiment is deployed in the user device 106, the user device 106 can, after determining that an interactive text needs rewriting, rewrite it based on the avatar setting of the virtual object so that it better matches that setting, and then convert the rewritten interactive text into audio data for interaction with the user through the virtual object. As an optional example, in some embodiments, the user device 106 may detect whether the interactive text to be used by the virtual object contains parts that do not conform to the avatar setting of the virtual object, rewrite those parts so that the rewritten interactive text conforms to the avatar setting, and convert the rewritten interactive text into audio data for interaction with the user through the virtual object. As another example, in some embodiments, when rewriting parts that do not conform to the avatar setting, the user device 106 may rewrite based on masking techniques and the avatar setting information of the virtual object to improve rewriting accuracy and efficiency. In some embodiments, the user device 106 may upload the data from the above process (including the original and the rewritten interactive texts) to the cloud server 102 for storage and subsequent data processing. Further, in some embodiments, user device 106 may comprise any suitable type of device, for example a mobile device, a tablet computer, a laptop computer, a desktop computer, a wearable computer, a game console, a media player, a vehicle entertainment system, and/or any other suitable type of user device.
Based on the above system, the interaction scheme provided by the present application is described below by way of an embodiment.
Referring to FIG. 2, a flow chart of steps of an interaction method for virtual objects is shown, according to an embodiment of the present application.
As shown in fig. 2, the interaction method for virtual objects includes the following steps:
Step S202: acquiring the interactive text to be processed during online interaction between the virtual object and the user.
In the embodiments of the present application, the virtual object may be any suitable virtual object, including but not limited to a virtual object with a human image, an avatar of an anthropomorphic animal or plant, and the like. The virtual object can be applied in any human-computer interaction scene, such as a live-broadcast scene, a customer service scene, a chat robot scene, and so on.
In many cases, virtual objects need to interact with actual users based on text, for example product introductions or promotions, campaign publicity, casual chat, comment interactions, talk-show segments, question replies, and so on. Different situations call for different interactive texts: product introductions or promotions, campaign publicity, talk-show segments, and the like need corresponding scripts, while human-machine casual chat, comment replies, or question replies require determining the reply text from the user's input, and so on.
Therefore, in a feasible manner, all texts in the online interaction process of the virtual object and the user can be determined as the interactive texts to be processed without distinction so as to perform subsequent detection processing, thereby avoiding text omission and simplifying the specific technical implementation process for determining the interactive texts to be processed.
However, to reduce the data processing burden and improve the overall efficiency and performance of the human-computer interaction system, in a feasible manner an interactive text style library can be created in advance. When determining the interactive text to be processed, a candidate interactive text to be used during online interaction between the virtual object and the user is acquired; it is then judged whether the preset interactive text style library contains a text that corresponds to the candidate interactive text and matches the avatar of the virtual object; if not, the candidate interactive text is determined to be the interactive text to be processed. The interactive text style library stores texts matching various preset avatar styles; each such text is stored in correspondence with an original interactive text, with a 1-to-N correspondence between an original text and its style texts, for example an original interactive text X corresponding to style texts X1, X2, …, XN in different avatar styles. In this way, the avatar styling of interactive texts frequently used by the virtual object in different scenes is processed in advance and stored in the library, so that when needed they can be looked up quickly and the virtual object can interact with the user based on the styled text found, improving interaction efficiency.
Optionally, the interactive text style library may be created in advance as follows: acquiring a plurality of source interactive texts and a plurality of style texts carrying different avatar-setting styles; combining the source interactive texts and the style texts into different text-style pairs, each pair containing one source interactive text and one style text (a pairing step sketched below); for each text-style pair, obtaining the text features of the source interactive text and the style features of the style text; fusing the text features and the style features, and generating from the fusion result an interactive text with the avatar-setting style corresponding to the style features; and generating the interactive text style library from the interactive texts corresponding to the text-style pairs. Further optionally, the avatar-setting style may include at least one of: a language expression style, a character style, and a characteristic style, each determined from the avatar setting information of the virtual object. In this way, different style features can be effectively fused into interactive texts to form interactive texts with corresponding styles, meeting the needs of different virtual objects when interacting with users and improving interaction efficiency.
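A minimal Python sketch of this pairing step; the sample texts, style names, and the build_pairs helper are illustrative assumptions rather than anything specified in the patent:

    from itertools import product

    source_texts = ["Welcome to the live room!", "This jacket ships tomorrow."]
    style_texts = {
        "cute": "Hee hee, so happy you are here~",
        "formal": "We sincerely appreciate your visit.",
    }

    def build_pairs(sources, styles):
        """Combine every source interactive text with every style text,
        yielding the text-style pairs described above."""
        return [(src, name, sty)
                for src, (name, sty) in product(sources, styles.items())]

    pairs = build_pairs(source_texts, style_texts)
    # Each pair holds one source interactive text and one style exemplar.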
An example of the creation of an interactive text style library is shown in fig. 3, which uses a style text creation model with two branches: the first branch performs style feature extraction, and the second branch performs text feature extraction on the interactive text. As shown in fig. 3, a source interactive text and a style text are first obtained. The style text is encoded by an encoder to obtain the corresponding style-text code (i.e., style-text features), and a splitter then separates the text code and the style code within it, yielding the style code (i.e., style features) that chiefly characterizes the style. The source interactive text is encoded by an encoder to obtain the corresponding text code (i.e., text features). Once the two sets of coding features (the text features and the style features) are obtained, they can be combined through feature fusion, and the fused coding vector is output, i.e., a text vector carrying the style represented by the style features. This fused vector can then be decoded by a decoder, finally yielding the interactive text with the corresponding avatar-setting style. A hedged sketch of such a model follows.
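The following PyTorch sketch makes the two-branch structure of fig. 3 concrete. Layer sizes, head counts, the splitter design, and fusion by concatenation are assumptions; the patent names the components (encoders, splitter, fusion, decoder) but not their internals:

    import torch
    import torch.nn as nn

    class StyleTextModel(nn.Module):
        """Two-branch style model: encode the style text, split off the style
        code, encode the source text, fuse the two, then decode (cf. fig. 3)."""
        def __init__(self, vocab_size=30000, d_model=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.style_encoder = nn.TransformerEncoder(layer, num_layers=2)  # branch 1
            self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)   # branch 2
            self.splitter = nn.Linear(d_model, d_model)    # keeps the style part of the code
            self.fuse = nn.Linear(2 * d_model, d_model)    # feature fusion of the two codes
            self.decoder = nn.Linear(d_model, vocab_size)  # decodes fused vectors to tokens

        def forward(self, src_ids, style_ids):
            text_code = self.text_encoder(self.embed(src_ids))      # text features
            style_code = self.style_encoder(self.embed(style_ids))  # style-text features
            style_only = self.splitter(style_code.mean(dim=1))      # style features
            fused = self.fuse(torch.cat(
                [text_code, style_only.unsqueeze(1).expand_as(text_code)], dim=-1))
            return self.decoder(fused)  # per-token logits for the styled text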
Fig. 3 shows only one avatar-setting style transformation of one source interactive text. In practical applications, a source interactive text can be processed as in fig. 3 against style texts of multiple different avatar-setting styles, yielding interactive texts of multiple styles for that source text. Processing every source interactive text in this way produces, for each source text, a set of interactive texts in different avatar-setting styles, from which the interactive text style library can be constructed for subsequent use.
In the subsequent human-computer interaction process, the interactive text to be used by the virtual object can be matched against the interactive text style library to see whether a corresponding source interactive text exists; if the same or a similar source interactive text exists, the interactive text with the appropriate avatar-setting style is selected, according to the avatar setting information of the virtual object, from the style variants of that source text and provided to the virtual object. A simplified lookup is sketched below.
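An illustrative lookup against such a pre-built library, with the similarity test simplified to exact match (a real system would presumably use fuzzy or semantic matching); all names here are hypothetical:

    style_library = {
        "Welcome to the live room!": {
            "cute": "Hee hee, welcome welcome~",
            "formal": "We are honored to welcome you.",
        },
    }

    def lookup_styled_text(candidate, avatar_style):
        """Return the pre-styled text for this avatar style, or None so the
        caller falls through to the detect-and-rewrite pipeline below."""
        styled = style_library.get(candidate)  # exact match stands in for similarity
        return styled.get(avatar_style) if styled else None

    print(lookup_styled_text("Welcome to the live room!", "cute"))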
However, it should be clear to those skilled in the art that even if the above-mentioned interactive text style library is established, the actual human-computer interaction situation is complex and variable, and the interactive text style library cannot cover all possible interaction situations. Therefore, there is still a need for subsequent processing of interactive text in human-computer interaction, especially interactive text that fails to be matched from an interactive text style library.
In the embodiments of the present application, terms such as "plural" and "a plurality of" mean two or more.
Step S204: detecting whether the interactive text contains a part that does not match the avatar of the virtual object as set by the avatar setting information of the virtual object.
Assume that the avatar of a virtual anchor is set as an anchor of style A. If, when interacting with the user, it uses words, moods, or tones that do not conform to style A, the virtual anchor ends up interacting in a way that does not match its avatar setting; in an actual application scene, such non-conforming parts therefore need to be identified.
Therefore, in a feasible manner, the interactive features corresponding to the interactive text to be processed and the avatar features corresponding to the avatar setting information of the virtual object can be acquired respectively; whether the interactive text contains a part that does not match the avatar of the virtual object is then determined based on the result of self-attention processing over the interactive features and the avatar features. By obtaining the two sets of features and applying self-attention, the association relation and matching degree between them can be obtained effectively; and because self-attention takes context into account, the resulting association and matching degree are more objective and accurate, which helps guarantee the accuracy of identifying the unmatched parts.
Since the interactive text may come from different situations, such as dialog situations (including but not limited to casual chat, comments, and question replies) and explanation situations (e.g., script explanation or a talk-show segment), in a feasible manner the interactive text to be processed includes information indicating the interaction type of the interactive text as well as interactive content information. The interaction type information characterizes the situation in which the interactive text is used; the interactive content information carries the specific content of the interactive text. This further improves the efficiency of processing the interactive text (for example, feature extraction). A minimal container for this two-part structure is sketched below.
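A minimal container for the two-part interactive text; the field names are assumptions, while the "CLS01" policy tag echoes the label example used later in this embodiment:

    from dataclasses import dataclass

    @dataclass
    class InteractiveText:
        interaction_type: str  # e.g. a policy tag such as "CLS01" (question-answering)
        content: str           # the literal text the virtual object would speak

    sample = InteractiveText(interaction_type="CLS01",
                             content="Thank you, I am XX ...")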
In one specific example, the detection and determination of the unmatched part in the interactive text can be realized by adopting a machine learning model, which is called a detection model in the embodiment of the application. That is, whether there is a portion in the interactive text that does not match the avatar of the virtual object set by the avatar setting information of the virtual object is detected by the detection model.
An exemplary detection model is shown in fig. 4 and includes a first interactive feature extraction part, a first avatar feature extraction part, and a matching detection part.

The first interactive feature extraction part vectorizes the interactive text to generate an interaction vector, and extracts features from the interaction vector to obtain the first interactive feature.

As can be seen from fig. 4, the interactive text includes two parts: an interaction type information part and an interactive content information part. The interaction type information part is illustrated as "policy output" in the figure, because different types of interactive texts are generated or determined through different interaction policies (such as question-answering policies, explanation policies, and the like) based on the current interaction situation, so the corresponding type can be determined from the corresponding policy. In a specific implementation, the "policy output" information may be represented by the concrete policy output, such as "question-answering policy output", or by a label corresponding to each policy, such as the label CLS01 for "question-answering policy output", and so on. The interactive content information part is the specific content of the interactive text, illustrated in the figure as "thank you, I am XX …".

The interactive text is vectorized through a vectorization layer to generate the interaction vector, which is then input into an encoder for feature extraction; after the encoder, the first interactive feature corresponding to the interactive text is obtained.

The first avatar feature extraction part vectorizes the avatar setting information of the virtual object to generate an avatar vector, and extracts features from the avatar vector to obtain the first avatar feature.

In the example shown in fig. 4, the avatar setting information of the virtual object is set to include name information, age information, characteristic information, and expression style information. In practical applications the avatar setting information may be richer and more complex; the embodiment here is only a simple example. The avatar setting information is vectorized through the corresponding vectorization layer to generate the avatar vector, which is then input into an encoder for feature extraction; after the encoder, the first avatar feature corresponding to the avatar setting information is obtained.

The matching detection part splices the first interactive feature and the first avatar feature and performs self-attention processing on the spliced vector; after the result of the self-attention processing passes through a classification layer, it outputs information indicating whether the first interactive feature matches the first avatar feature, as the detection result of whether the interactive text contains a part that does not match the avatar of the virtual object.

As shown in fig. 4, the vectors output by the two encoders are first spliced (shown schematically as a circle with a "+" sign). The spliced vector is then input into an interaction layer implemented with a self-attention mechanism, so that it undergoes self-attention processing. After the interaction layer, the association probabilities between different parts of the first interactive feature and the first avatar feature are obtained, together with the similarities between the feature values of the associated parts. This information is input into an output layer, which, based on it and preset classification labels, outputs a matching classification between the first interactive feature and the first avatar feature; three classes are illustrated in the figure: "consistent" (matching), "inconsistent" (not matching), and "other" (other cases). The classification result serves as the detection result of whether the interactive text contains a part that does not match the avatar of the virtual object. A hedged sketch of such a model follows.
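A hedged PyTorch sketch of the fig. 4 structure: two encoders, vector splicing, a self-attention interaction layer, and a three-way classifier. Dimensions, layer counts, and head counts are assumptions; only the overall structure comes from the description above:

    import torch
    import torch.nn as nn

    class DetectionModel(nn.Module):
        """Detect whether the interactive text matches the avatar setting:
        outputs logits over consistent / inconsistent / other (cf. fig. 4)."""
        def __init__(self, vocab_size=30000, d_model=256, n_classes=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)  # vectorization layers
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.avatar_encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.interaction = nn.MultiheadAttention(d_model, num_heads=4,
                                                     batch_first=True)
            self.output = nn.Linear(d_model, n_classes)     # classification layer

        def forward(self, text_ids, avatar_ids):
            text_feat = self.text_encoder(self.embed(text_ids))        # 1st interactive feature
            avatar_feat = self.avatar_encoder(self.embed(avatar_ids))  # 1st avatar feature
            spliced = torch.cat([text_feat, avatar_feat], dim=1)       # splice the vectors
            attended, _ = self.interaction(spliced, spliced, spliced)  # self-attention
            return self.output(attended.mean(dim=1))                   # 3-way logits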
If the classification result is 'consistent', the interactive text does not need to be subjected to subsequent processing, and the virtual object can be directly interacted with the user based on the interactive text; if the classification result is "inconsistent", the interactive text needs to be processed as described in the following step S206; in addition, in the embodiment of the application, if the classification result is "other", subsequent processing on the interactive text is not needed, so that the virtual object can be directly interacted with the user based on the interactive text.
Step S206: if an unmatched part exists, performing mask processing based on the features corresponding to the unmatched part, and modifying the masked interactive text based on the avatar setting information of the virtual object to obtain a modified interactive text, so that the virtual object interacts with the user based on the modified interactive text.

After the part of the interactive text that does not match the avatar setting of the virtual object has been determined, that part is processed so that the processed interactive text conforms to the avatar setting. In the embodiments of the present application, this processing mainly consists of rewriting based on the avatar setting information: in a specific implementation, the unmatched part is masked, and mask recovery is then performed to modify the interactive text.

To this end, in a feasible manner, the mask processing based on the features corresponding to the unmatched part may be implemented as follows: respectively acquiring the interactive features corresponding to the interactive text and the avatar features corresponding to the avatar setting information; based on the result of self-attention processing over the interactive features and the avatar features, determining and marking the part of the interactive features that does not match the avatar features; and masking the marked unmatched part. In this way, the unmatched part between the two sets of features, chiefly the unmatched part within the interactive features, can be determined accurately and objectively, marked, and masked accordingly. The masked feature parts, together with the other parts of the interactive features and the avatar features, form a mask vector.

On this basis, modifying the masked interactive text based on the avatar setting information of the virtual object may be implemented as follows: acquiring, from the avatar features corresponding to the avatar setting information, the avatar feature part corresponding to the masked interactive feature part; performing mask recovery on the masked interactive feature part using that avatar feature part; and obtaining the modified interactive features from the result of the mask recovery. The recovery can copy the avatar feature part into the masked interactive feature part, or generate, based on the avatar feature part, new features for the masked positions. Either way, the masked features are restored effectively and quickly, so that the finally obtained interactive text conforms to the avatar setting of the virtual object. Copying the avatar features is simple and fast to implement, while generating new features lets the avatar feature part blend better with the unmasked interactive features, bringing the final interactive text closer to a natural context. A feature-level sketch of these two steps follows.
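A feature-level sketch of the mask and mask-recovery steps just described, using copy-based recovery (the simpler of the two modes above). The 0.5 match threshold and tensor shapes are illustrative assumptions:

    import torch

    def mask_unmatched(text_feat, match_score, threshold=0.5):
        """Mask (zero out) the positions of the interactive features whose
        match with the avatar features falls below the threshold; return the
        masked features and the mark of unmatched positions."""
        unmatched = match_score < threshold                           # (batch, seq) bool mark
        masked = text_feat.masked_fill(unmatched.unsqueeze(-1), 0.0)  # masked feature parts
        return masked, unmatched

    def recover_by_copy(masked_feat, avatar_part, unmatched):
        """Mask recovery by copying: write the position-aligned avatar
        feature part into every masked position."""
        return torch.where(unmatched.unsqueeze(-1), avatar_part, masked_feat)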
In a specific example, the modification of the interactive text may be implemented with a machine learning model, referred to as a rewrite model in the embodiments of the present application. That is, the rewrite model performs mask processing based on the features corresponding to the unmatched part, and modifies the masked interactive text based on the avatar setting information of the virtual object to obtain the modified interactive text.
An exemplary rewrite model is illustrated schematically in FIG. 5 and includes a second interactive feature extraction part, a second avatar feature extraction part, a mask part, and a rewriting part.
The second interactive feature extraction part is used for carrying out self-attention processing on the vector corresponding to the interactive text to obtain a second interactive feature.
Illustratively, as shown in fig. 5, the interactive text includes an interaction type information part and an interactive content information part. The interaction type information part is illustrated in the figure as "policy output", and the interactive content information part is the specific content of the interactive text, illustrated as "thank you, I am XX …". The interactive text is processed by a self-attention layer; the resulting vector is then processed by a forward propagation layer and output as a vector of a preset dimension, i.e., the second interactive feature. The preset dimension can be set by those skilled in the art according to the actual situation and applies to both the self-attended interactive features and the self-attended avatar features.
The second avatar feature extraction part performs self-attention processing on the vector corresponding to the avatar setting information of the virtual object to obtain a second avatar feature.

As shown in fig. 5, the avatar setting information of the virtual object is set to include name information, age information, characteristic information, and expression style information. This avatar setting information is processed by a self-attention layer; the resulting vector is then processed by a forward propagation layer and output as a vector of the same preset dimension, i.e., the second avatar feature.

The mask part determines the unmatched feature part between the second interactive feature and the second avatar feature and masks it.

As shown in fig. 5, the mask part includes a mask layer and a mask output layer. After the second interactive feature and the second avatar feature are obtained, the association relation and matching degree between the feature parts of the two features can be determined in the mask layer, for example through self-attention or attention computation. Based on these, the unmatched feature part between the second interactive feature and the second avatar feature, chiefly the part of the second interactive feature that does not match the second avatar feature, can be determined and marked.

The marked, i.e. unmatched, feature part is input into the mask output layer for mask processing, which outputs a mask vector comprising the masked feature part and the other feature parts (the second avatar feature and the unmasked part of the second interactive feature).

The rewriting part performs mask recovery on the masked unmatched feature part based on the second avatar feature, and outputs the result of the recovery as the modified interactive text.

In fig. 5, the rewriting part performs the recovery on the obtained mask vector, for example by copying the part of the second avatar feature corresponding to the masked feature part into the masked positions, or by generating a new feature based on that avatar feature part and the unmasked feature parts, so as to replace the masked feature part.
For example, in the example shown in fig. 5, the interactive content of the interactive text is "thank you, I am XX …", but the avatar setting information of the virtual object indicates that the name of the virtual object is "YY". The feature part corresponding to "XX" will therefore be masked; in the mask recovery phase, recovery is performed based on "YY" in the avatar setting information, and the output layer finally outputs "thank you, I am YY …". A usage sketch combining the pieces above follows.
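An end-to-end usage sketch tying together the hypothetical DetectionModel and mask/recovery helpers sketched earlier; the class names and control flow are illustrative, not the patent's own code:

    CLASSES = ["consistent", "inconsistent", "other"]

    def interact(text_ids, avatar_ids, detector, rewriter):
        """Run detection first; rewrite only when the text is inconsistent
        with the avatar setting, as in steps S204/S206."""
        verdict = CLASSES[int(detector(text_ids, avatar_ids).argmax(dim=-1)[0])]
        if verdict == "inconsistent":              # e.g. "I am XX" vs. persona name "YY"
            return rewriter(text_ids, avatar_ids)  # masked, then recovered to "I am YY"
        return text_ids                            # "consistent" / "other": use as-is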
Therefore, through the process, the modification or rewriting of the interactive text is effectively realized so as to be matched with the image setting of the virtual object.
Hereinafter, taking a scene in which the virtual object is a virtual anchor as an example, the above procedure is explained with reference to fig. 6 and 7. Fig. 6 shows a process diagram of an interaction process for virtual objects, and fig. 7 shows an example of an interactive system framework.
In fig. 7, the interactive system includes a user layer, a perception understanding layer, a multi-round state tracking layer, an interactive strategy layer, an algorithm layer, and a skill layer.
The user layer is a human-computer interaction interface, and information input by a user, such as behavior information of the user in a live broadcast room, comment information input by the user and the like, can be received and collected through the user layer.
The perception understanding layer mainly performs natural language understanding (NLU) on the information input by the user, chiefly comment information, including but not limited to: identifying the type of question entered by the user, recognizing/linking the product the user refers to, recognizing/aligning the product attributes the user mentions, and so on.

The multi-round state tracking layer includes scene state tracking based directly on the user's behavior data, such as tracking the user's real-time state and actions, and multi-round DST (dialog state tracking) based on the NLU results for the user's input, covering DST between a single user and the virtual anchor as well as DST between multiple users and the virtual anchor.

In the skill layer, a plurality of skills are preset for the virtual anchor, including but not limited to broadcast explanation skills (for structured scripts and for anthropomorphic scripts), general skills (such as talk-show segments and persona-based casual chat), and question-and-answer skills (such as anchor explanation and assistant-anchor interactive Q&A). After these skills are correspondingly processed by the algorithm layer, they provide the basis for determining the strategies in the interaction strategy layer.

For example, the broadcast explanation skill, after being processed by the structured generation algorithm of the algorithm layer, provides a basis for determining the explanation strategy; the general skills directly provide a basis for the explanation strategy; the persona-based casual chat directly provides a basis for the dialog strategy; and the question-and-answer skills, after being processed by the matching algorithm of the algorithm layer, provide a basis for the dialog strategy.

The interactive system further includes, in the algorithm layer, a persona center for the virtual anchor, which realizes the final persona of the virtual anchor based on various preset avatar setting information (i.e., persona information). Specifically: based on the appearance information of the virtual anchor, a 2D/3D modeling algorithm keeps the virtual anchor's appearance consistent with that information; based on the interaction image information, an action/voice modeling algorithm keeps the virtual anchor's interaction image consistent with that information when interacting with users; and based on the professional image information, an expressiveness modeling algorithm keeps the virtual anchor's interaction image and interaction content consistent with the professional image information. The persona center completes the persona setting of the virtual anchor based on this information and the algorithm results, forming a persona map.
In the interaction strategy layer, the corresponding dialog strategy is determined based on the multi-round DST, the persona-based casual chat, and the matching-algorithm results for the question-and-answer skills, for example which dialog style and which BOT (dialog interaction platform) solution to use. The corresponding explanation strategy, for example which script and which actions/intonation to use, is determined based on the broadcast explanation skills preset in the skill layer, the results of the structured generation algorithm of the algorithm layer, and the general skills. Combining the explanation strategy, the dialog strategy, and the scene-state tracking results, the interaction strategy with which the virtual object interacts with the user can be determined, including but not limited to dancing, greeting, opening, talking, and so on.
Based on the structure of the above-described interactive system shown in fig. 7, the interactive process shown in fig. 6 includes:
First, determine the persona of the virtual anchor (i.e., the avatar setting of the virtual anchor) and set up the persona center.

Determining the persona of the virtual anchor includes:

1. Modeling the relevant image/actions/voice according to the persona of the virtual anchor.

As in fig. 7, the algorithm layer models the virtual anchor through a 2D/3D modeling algorithm based on its appearance information, so that the appearance of the virtual anchor is consistent with that information, including but not limited to the "2D/3D image", "gender and age", "hair style and clothes", and so on.

Based on the interaction image information of the virtual anchor, an action/voice modeling algorithm models the virtual anchor so that its interaction image during interaction is consistent with that information, including but not limited to "voice expression", "facial expression", "body movement", and so on.

2. Generating related scripts according to the persona of the virtual anchor.
Related scripts can be generated from material about the products presented by the virtual anchor. Illustratively, through the structured generation algorithm of the algorithm layer applied to the broadcast explanation skills of the skill layer, the generated structured script is kept consistent with the persona of the virtual anchor.
3. Generating related talk-show segments according to the persona of the virtual anchor.

This part, which generates related talk-show segments according to the persona of the virtual anchor, can be divided into three levels:

Entry level: general type; the generated or mined talk-show content comes in assorted styles. This level can be implemented with existing related techniques.

Intermediate level: stylized type; the generated or mined talk-show segments fall into several preset styles, such as cute, funny, and the like. This level can be implemented based on the style text creation model described previously in the embodiments of the present application.

Advanced level: persona-themed type; segments are generated from data such as the virtual anchor's character and carry the background information of the virtual anchor's persona, for example "a homebody woman who knows how to live and how to save money". This level can likewise be implemented based on the style text creation model described previously in the embodiments of the present application.
4. Generating related casual chat according to the persona of the virtual anchor.

This part generates related casual-chat content according to the persona of the virtual anchor, and can likewise be divided into three levels:

Entry level: generation with a general-purpose chat tool, which can give corresponding answers to user input, but the answer content comes in assorted styles. This level can be implemented with existing related techniques.

Intermediate level: stylized generation, where the answer content falls into several styles. This level can be implemented based on the style text creation model described previously in the embodiments of the present application.

Advanced level: a persona-based chat library, generated from the virtual anchor's character and other data, again for example "a homebody woman who knows how to live and how to save money". This level can be implemented based on the style text creation model described previously in the embodiments of the present application.
5. Mining dialog content according to the persona of the virtual anchor and rewriting it.

This part rewrites answer content, based on the persona of the virtual anchor, so that it fits the character. For example, QA (question-and-answer) pairs may be mined from "live-room trending questions", "store operation configuration", "user comments", and so on; part of the answer content can then be abstracted and rewritten.

In setting up the persona center, in addition to the above functions, the persona center can further be configured with: persona consistency determination (i.e., detecting whether the interactive text contains parts that do not match the avatar of the virtual object) and persona consistency rewriting (i.e., rewriting the unmatched parts). These two functions can be implemented in the manner described in the foregoing embodiments.

After the above setup is completed, the virtual anchor can be used in an anchor scene, performing the operations of the second step below.

Second, during actual interaction between the virtual anchor and the user, maintain the persona consistency of the virtual anchor.
Illustratively:
2.1 If a user comment is received, determine the dialog strategy and output reply content for the user comment based on it.
For example, user intent recognition may be performed on user comments, and answers to the user may be found through a dialog strategy.
2.2 If no user comment is received, determine the explanation strategy and output the corresponding explanation script based on it.

For example, the content of the explanation can be decided by the explanation strategy.
2.3 Through the persona consistency determination, judge whether the content output by the dialog strategy or the explanation strategy is consistent with the persona of the virtual anchor.

For example, this determination can be implemented with the detection of unmatched parts in the interactive text described in step S204 above.

2.4 If it is inconsistent, rewrite the content output by the dialog strategy or the explanation strategy, through the persona consistency rewriting, into content consistent with the persona of the virtual anchor.

For example, this rewriting can be implemented with the interactive-text rewriting described in step S206 above, so that the content agrees with the persona of the virtual anchor. A control-flow sketch of steps 2.1 to 2.4 follows.
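A control-flow sketch of steps 2.1 to 2.4; the strategy callables and helper names are assumptions standing in for the dialog strategy, explanation strategy, and persona-consistency functions described above:

    def anchor_turn(user_comment, dialog_strategy, explain_strategy,
                    is_consistent, rewrite_to_persona):
        """One turn of the virtual anchor's loop, steps 2.1-2.4."""
        if user_comment is not None:
            content = dialog_strategy(user_comment)  # 2.1 reply to the comment
        else:
            content = explain_strategy()             # 2.2 pick an explanation script
        if not is_consistent(content):               # 2.3 persona consistency check
            content = rewrite_to_persona(content)    # 2.4 persona-consistent rewrite
        return content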
In addition, the solution of the embodiments of the present application is also applicable to virtual anchors with multiple different personas, and to virtual anchors whose personas have been extended and refined. For example, a virtual anchor originally set with "name: YY; age: 25; specialty: beauty-makeup expert" may have "specialty: dress-and-fashion expert" added later. As long as the extended or refined persona does not contradict the original one, the effectiveness of the solution of the embodiments of the present application is preserved.
According to the embodiments of the present application, during online interaction between the virtual object and the user, unlike the conventional approach in which the virtual object directly uses the determined interactive text, the interactive text is first inspected to judge whether it contains parts that do not match the avatar setting of the virtual object, including but not limited to a mismatched expression style or mismatched character traits represented by the text; those unmatched parts are then modified to match the avatar setting of the virtual object. The modification uses masking followed by mask recovery (i.e., modification) based on the avatar setting information, so the unmatched parts can be modified precisely without modifying the whole interactive text, which improves modification accuracy, reduces the modification processing burden, and improves overall modification efficiency.

Therefore, with the solutions of the embodiments of the present application, the interactive text used by the virtual object better conforms to and matches the avatar setting of the virtual object, which in turn improves the user's interactive experience.
Referring to fig. 8, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown, and the specific embodiment of the present application does not limit a specific implementation of the electronic device.
As shown in fig. 8, the electronic device may include: a processor (processor)302, a communication Interface 304, a memory 306, and a communication bus 308.
Wherein:
the processor 302, communication interface 304, and memory 306 communicate with each other via a communication bus 308.
A communication interface 304 for communicating with other electronic devices or servers.
The processor 302 is configured to execute the program 310, and may specifically perform the relevant steps in the above method embodiments.
In particular, program 310 may include program code comprising computer operating instructions.
The processor 302 may be a CPU, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The electronic device may include one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs together with one or more ASICs.
The memory 306 is configured to store the program 310. The memory 306 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory.
The program 310 may be specifically configured to enable the processor 302 to execute operations corresponding to the method described in any of the method embodiments.
For the specific implementation of each step in the program 310, reference may be made to the corresponding steps and the descriptions of the corresponding units in the foregoing method embodiments, which provide corresponding beneficial effects. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, and are not repeated here.
The embodiments of the present application further provide a computer program product, which includes computer instructions that instruct a computing device to perform operations corresponding to any one of the interaction methods for a virtual object in the foregoing method embodiments.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to the embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk), or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded through a network to be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or an FPGA. It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code, and when the software or computer code is accessed and executed by the computer, the processor, or the hardware, the methods described herein are implemented. Further, when a general-purpose computer accesses code for implementing the methods shown herein, the execution of the code transforms the general-purpose computer into a special-purpose computer for performing the methods shown herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (14)

1. An interaction method for virtual objects, comprising:
acquiring an interactive text to be processed in the online interaction process of the virtual object and a user;
detecting whether a part which is not matched with the image of the virtual object set by the image setting information of the virtual object exists in the interactive text;
if such a portion exists, performing mask processing based on the features corresponding to the unmatched portion, and modifying the mask-processed interactive text based on the image setting information to obtain a modified interactive text, so that the virtual object interacts with the user based on the modified interactive text.
2. The method according to claim 1, wherein the detecting whether there is a portion in the interactive text that does not match with the avatar of the virtual object set by the avatar setting information of the virtual object includes:
respectively acquiring interactive features corresponding to the interactive texts and image features corresponding to the image setting information;
determining whether there is a portion in the interactive text that does not match the avatar of the virtual object set by the avatar setting information of the virtual object, based on a result of the self-attention processing on the interactive feature and the avatar feature.
3. The method of claim 1, wherein the interactive text includes information indicating an interaction type of the interactive text and interactive content information.
4. The method according to claim 2 or 3, wherein the detecting whether there is a portion in the interactive text that does not match with the avatar of the virtual object set by the avatar setting information of the virtual object comprises:
detecting whether a part which is not matched with the image of the virtual object set by the image setting information of the virtual object exists in the interactive text or not through a detection model;
the detection model comprises a first interactive feature extraction part, a first image feature extraction part and a matching detection part;
the first interactive feature extraction part is used for vectorizing the interactive text to generate an interactive vector; extracting features of the interaction vector to obtain first interaction features;
the first image feature extraction part is used for vectorizing the image setting information to generate an image vector; extracting the features of the image vector to obtain a first image feature;
the matching detection part is used for splicing the first interactive feature and the first image feature and performing self-attention processing on the spliced vector; and after the result of the self-attention processing is processed by a classification layer, outputting information indicating whether the first interaction feature is matched with the first image feature as a detection result of whether a part which is not matched with the image of the virtual object exists in the interaction text.
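Purely as a non-authoritative illustration of the detection model structure recited in this claim (two feature extraction parts, splicing, self-attention, and a classification layer), one possible PyTorch rendering is sketched below; the vocabulary size, feature dimension, layer types, and all names are assumptions, not the claimed implementation:

```python
# Illustrative PyTorch sketch of the detection model in this claim: two
# feature extraction parts, splicing, self-attention, and a classification
# layer. Vocabulary size, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn


class MatchDetector(nn.Module):
    def __init__(self, vocab_size: int = 30000, dim: int = 256):
        super().__init__()
        # First interactive feature extraction part.
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.text_encoder = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        # First image (avatar) feature extraction part.
        self.persona_embed = nn.Embedding(vocab_size, dim)
        self.persona_encoder = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        # Matching detection part: self-attention over the spliced vector,
        # followed by a classification layer.
        self.joint_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, 2)  # matched vs. unmatched

    def forward(self, text_ids: torch.Tensor, persona_ids: torch.Tensor) -> torch.Tensor:
        text_feat = self.text_encoder(self.text_embed(text_ids))
        persona_feat = self.persona_encoder(self.persona_embed(persona_ids))
        joint = torch.cat([text_feat, persona_feat], dim=1)  # splice the features
        attended, _ = self.joint_attn(joint, joint, joint)   # self-attention
        pooled = attended.mean(dim=1)                        # pool for classification
        return self.classifier(pooled)  # logits indicating match / mismatch
```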
5. The method of claim 1, wherein the masking based on the features corresponding to the unmatched portions comprises:
respectively acquiring interactive features corresponding to the interactive texts and image features corresponding to the image setting information;
determining and marking the part of the interactive feature which is not matched with the image feature based on the result of self-attention processing on the interactive feature and the image feature;
performing mask processing on the marked unmatched portion.
6. The method of claim 5, wherein the modifying the mask-processed interactive text based on the avatar setting information to obtain a modified interactive text comprises:
acquiring, from the image characteristics corresponding to the image setting information, an image characteristic part corresponding to the mask-processed interactive characteristic part;
and performing mask recovery processing on the interactive characteristic part subjected to mask processing by using the image characteristic part, and acquiring the modified interactive characteristic according to the result of the mask recovery processing.
7. The method of claim 6, wherein said mask restoring said masked interactive feature portion using said avatar feature portion comprises:
copying the image characteristic part to the mask-processed interactive characteristic part to perform the mask recovery processing;
or,
generating, based on the image characteristic part, a new feature corresponding to the image characteristic part in the mask-processed interactive characteristic part, so as to perform the mask recovery processing.
8. The method according to any one of claims 5 to 7, wherein the masking processing based on the features corresponding to the unmatched parts and modifying the masked interactive text based on the avatar setting information to obtain a modified interactive text comprises:
performing mask processing on the basis of the characteristics corresponding to the unmatched part through a rewriting model, and modifying the interactive text after the mask processing on the basis of the image setting information to obtain a modified interactive text;
wherein the rewriting model comprises a second interactive feature extraction part, a second image feature extraction part, a mask part and a rewriting part;
the second interactive feature extraction part is used for carrying out self-attention processing on the vector corresponding to the interactive text to obtain a second interactive feature;
the second image feature extraction part is used for performing self-attention processing on the vector corresponding to the image setting information to obtain second image features;
the mask part is used for determining a unmatched feature part between the second interactive feature and the second image feature and performing mask processing on the unmatched feature part;
the rewriting section is configured to perform mask restoration processing on the masked unmatched feature portion based on the second image feature, and output a result of the restoration processing as a modified interactive text.
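As a hedged sketch of the masking and mask recovery recited in claims 5 to 8 (not the claimed implementation itself), the function below masks the mismatched feature positions and recovers them from the avatar feature; the copy branch and the generation branch correspond to the two alternatives of claim 7, and the tensor shapes and the `generator` callable are illustrative assumptions:

```python
# Hedged sketch of the masking and mask recovery in claims 5-8. The copy
# branch and the generation branch correspond to the two alternatives of
# claim 7; tensor shapes and the `generator` callable are assumptions.
import torch


def mask_and_recover(text_feat: torch.Tensor,
                     persona_feat: torch.Tensor,
                     mismatch_mask: torch.Tensor,
                     generator=None) -> torch.Tensor:
    """text_feat: (seq, dim); persona_feat: (dim,); mismatch_mask: (seq,) bool."""
    recovered = text_feat.clone()
    recovered[mismatch_mask] = 0.0  # mask the unmatched feature positions
    if generator is None:
        # Mask recovery by copying the avatar feature into the masked slots.
        recovered[mismatch_mask] = persona_feat
    else:
        # Mask recovery by generating a new feature from the avatar feature
        # (generator is an assumed callable, e.g. a small MLP).
        recovered[mismatch_mask] = generator(persona_feat)
    return recovered
```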
9. The method of claim 1, wherein the obtaining of the interactive text to be processed in the process of the online interaction between the virtual object and the user comprises:
acquiring a candidate interactive text to be used in the online interaction process of the virtual object and the user;
judging whether a text which corresponds to the candidate interactive text and is matched with the image of the virtual object exists in a preset interactive text style library;
and if no such text exists, determining the candidate interactive text as the interactive text to be processed.
10. The method of claim 9, wherein the interactive text style library is pre-created by:
acquiring a plurality of source interactive texts and a plurality of style texts carrying different image setting styles;
combining the source interactive texts and the style texts to form different text style pairs, wherein each text style pair comprises one source interactive text and one style text;
respectively obtaining text features corresponding to the source interactive text and style features corresponding to the style text for each text style pair; fusing the text features and the style features, and generating an interactive text with an image setting style corresponding to the style features according to a fusion result;
and generating an interactive text style library based on the interactive text corresponding to each text style pair.
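For illustration only, the pre-creation of the interactive text style library recited in this claim might be organized as in the following sketch; `extract_features`, `fuse`, and `generate_text` are hypothetical stand-ins for the feature extraction, fusion, and generation steps named in the claim:

```python
# Illustrative sketch of pre-creating the interactive text style library.
# `extract_features`, `fuse`, and `generate_text` are hypothetical stand-ins
# for the feature extraction, fusion, and generation steps of this claim.
from itertools import product


def build_style_library(source_texts, style_texts,
                        extract_features, fuse, generate_text):
    library = {}
    # Combine every source interactive text with every style text to form
    # the different text style pairs.
    for src, style in product(source_texts, style_texts):
        text_feat = extract_features(src)      # text features of the source text
        style_feat = extract_features(style)   # style features of the style text
        fused = fuse(text_feat, style_feat)    # fuse the two kinds of features
        # Generate an interactive text carrying the avatar-setting style.
        library[(src, style)] = generate_text(fused)
    return library
```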
11. The method of claim 10, wherein the avatar-setting style includes at least one of:
a language expression style determined based on the avatar setting information of the virtual object, a character style determined based on the avatar setting information of the virtual object, and a characteristic style determined based on the avatar setting information of the virtual object.
12. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction which causes the processor to execute the corresponding operation of the method according to any one of claims 1-11.
13. A computer storage medium having stored thereon a computer program which, when executed by a processor, carries out the method of any one of claims 1 to 11.
14. A computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to the method of any one of claims 1 to 11.
CN202210745609.6A 2022-06-29 2022-06-29 Interaction method for virtual object, electronic device and computer storage medium Active CN114818609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210745609.6A CN114818609B (en) 2022-06-29 2022-06-29 Interaction method for virtual object, electronic device and computer storage medium

Publications (2)

Publication Number Publication Date
CN114818609A 2022-07-29
CN114818609B CN114818609B (en) 2022-09-23

Family

ID=82522382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210745609.6A Active CN114818609B (en) 2022-06-29 2022-06-29 Interaction method for virtual object, electronic device and computer storage medium

Country Status (1)

Country Link
CN (1) CN114818609B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116996703A (en) * 2023-08-23 2023-11-03 中科智宏(北京)科技有限公司 Digital live broadcast interaction method, system, equipment and storage medium

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130194164A1 (en) * 2012-01-27 2013-08-01 Ben Sugden Executable virtual objects associated with real objects
US9724824B1 (en) * 2015-07-08 2017-08-08 Sprint Communications Company L.P. Sensor use and analysis for dynamic update of interaction in a social robot
US20170243512A1 (en) * 2016-02-19 2017-08-24 Pai-Tsung Lee Situational programming teaching method and computer program product
CN106202288A (en) * 2016-06-30 2016-12-07 北京智能管家科技有限公司 The optimization method of a kind of man-machine interactive system knowledge base and system
CN106294726A (en) * 2016-08-09 2017-01-04 北京光年无限科技有限公司 Based on the processing method and processing device that robot role is mutual
CN110210449A (en) * 2019-06-13 2019-09-06 沈力 A kind of face identification system and method for virtual reality friend-making
CN110995812A (en) * 2019-11-26 2020-04-10 杜霄鹤 Cross-platform artificial intelligence assistant system application scheme
CN114139525A (en) * 2020-08-13 2022-03-04 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment and computer storage medium
CN114446304A (en) * 2020-10-19 2022-05-06 阿里巴巴集团控股有限公司 Voice interaction method, data processing method and device and electronic equipment
CN113375295A (en) * 2021-05-26 2021-09-10 青岛海尔空调器有限总公司 Method for generating virtual character interaction, interaction system, electronic device and medium
CN113569011A (en) * 2021-07-27 2021-10-29 马上消费金融股份有限公司 Training method, device and equipment of text matching model and storage medium
CN113449135A (en) * 2021-08-31 2021-09-28 阿里巴巴达摩院(杭州)科技有限公司 Image generation system and method
CN113923462A (en) * 2021-09-10 2022-01-11 阿里巴巴达摩院(杭州)科技有限公司 Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
CN113887169A (en) * 2021-09-28 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Text processing method, electronic device, computer storage medium, and program product
CN114140603A (en) * 2021-12-08 2022-03-04 北京百度网讯科技有限公司 Training method of virtual image generation model and virtual image generation method
CN114266230A (en) * 2021-12-30 2022-04-01 安徽科大讯飞医疗信息技术有限公司 Text structuring processing method and device, storage medium and computer equipment
CN114357135A (en) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 Interaction method, interaction device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAEKYUM KIM et al.: "Interactive virtual objects attract attention and induce exploratory behaviours in rats", Behavioural Brain Research *
LU Heng et al.: "Research on the Characteristics of User Interaction Behavior in Virtual Academic Communities from the Perspective of Conversation Analysis", Library and Information Service *

Also Published As

Publication number Publication date
CN114818609B (en) 2022-09-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240226

Address after: Room 553, 5th Floor, Building 3, No. 969 Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province, 311121

Patentee after: Hangzhou Alibaba Cloud Feitian Information Technology Co.,Ltd.

Country or region after: China

Address before: 310023 Room 516, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee before: Alibaba Dharma Institute (Hangzhou) Technology Co.,Ltd.

Country or region before: China