CN114546326A - Virtual human sign language generation method and system - Google Patents

Virtual human sign language generation method and system

Info

Publication number
CN114546326A
Authority
CN
China
Prior art keywords
sign language
text
determining
original text
simplified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210162257.1A
Other languages
Chinese (zh)
Inventor
易峥
王兆浪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hithink Royalflush Information Network Co Ltd
Original Assignee
Hithink Royalflush Information Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hithink Royalflush Information Network Co Ltd filed Critical Hithink Royalflush Information Network Co Ltd
Priority to CN202210162257.1A priority Critical patent/CN114546326A/en
Publication of CN114546326A publication Critical patent/CN114546326A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G06F3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition

Abstract

The present specification relates to the field of information technology, and in particular to a virtual human sign language generation method and system. The method comprises: identifying whether a preset response condition is met, and in response to the preset response condition being met, acquiring an original text based on the content of the preset response condition; determining keywords related to the original text based on a knowledge graph; classifying the original text and determining a simplified text based on the classification to which the original text belongs, the simplified text reflecting the intent of the original text; and determining a target sign language based on at least one of the keywords and the simplified text.

Description

Virtual human sign language generation method and system
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to a method and a system for generating a virtual human sign language.
Background
With the development of science and technology, virtual humans have become increasingly common in daily life and can be found in many industries, such as robots that provide consultation services in the service industry and 3D or two-dimensional virtual characters that perform in the entertainment industry. However, because the behavior of a virtual human is usually set in advance by a program, its behavior when talking with a user tends to be stiff and even unnatural, resulting in a poor user experience.
Therefore, it is desirable to provide a virtual human sign language generation method that enables the virtual human to make appropriate sign language during conversation, thereby optimizing the communication and expression effect of the virtual human and improving the user experience.
Disclosure of Invention
One embodiment of the present specification provides a virtual human sign language generation method. The method comprises: identifying whether a preset response condition is met, and in response to the preset response condition being met, acquiring an original text based on the content of the preset response condition; determining keywords related to the original text based on a knowledge graph; classifying the original text and determining a simplified text based on the classification to which the original text belongs, the simplified text reflecting an intent of the original text; and determining a target sign language based on at least one of the keywords and the simplified text.
One embodiment of the present specification provides a virtual human sign language generation system. The system comprises: a judging module configured to identify whether a preset response condition is met and, in response to the preset response condition being met, acquire an original text based on the content of the preset response condition; a keyword determination module configured to determine keywords related to the original text based on a knowledge graph; a simplified text determination module configured to classify the original text and determine a simplified text based on the classification to which the original text belongs, the simplified text reflecting an intent of the original text; and a target sign language determination module configured to determine a target sign language based on at least one of the keywords and the simplified text.
One embodiment of the present specification provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement any one of the above virtual human sign language generation methods.
Drawings
The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:
FIG. 1 is a schematic diagram of an application scenario of a virtual human sign language generation system according to some embodiments of the present description;
FIG. 2 is an exemplary flow diagram of a method for virtual human sign language generation, according to some embodiments of the present description;
FIG. 3 is yet another exemplary flow diagram of a method for virtual human sign language generation, according to some embodiments of the present description;
FIG. 4 is an exemplary flow diagram illustrating the determination of keywords based on a knowledge-graph according to some embodiments of the present description;
FIG. 5 is an exemplary block diagram of a virtual human sign language generation system, shown in some embodiments herein.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that "system", "apparatus", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, portions or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may include other steps or elements.
Flow charts are used in this specification to illustrate operations performed by a system according to embodiments of the present specification. It should be understood that these operations are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously; other operations may be added to the processes, or one or more steps may be removed from them.
With the widespread use of virtual humans, people expect them to become more lifelike and closer to humans in appearance and behavior. When talking with a virtual human, a user expects it to accompany its spoken explanation with body movements corresponding to the speech content, as a human would, which improves the user experience.
In some embodiments, the behaviors that the avatar should make in response to different conversation contents can be preset on the avatar, so that the avatar makes corresponding sign language while answering the user. However, because of the complexity of language, conversation contents cannot be exhausted, and storing a huge amount of conversation content together with the corresponding behavior data requires a large amount of memory. A typical virtual human therefore only has corresponding sign language set for a few fixed sentences. Setting sign language for only a few fixed sentences lacks systematic design, and only a small set of heuristic rules can be devised, so the sign language actions available to the virtual human when expressing content are very limited and cannot meet the user's need for highly free conversation.
In view of the above, in some embodiments, the communication expression effect of the virtual human can be optimized by extracting the speech expression content of the virtual human and determining the sign language to be made by the virtual human based on the speech expression content.
Fig. 1 is a schematic view of an application scenario of a virtual human sign language generation system according to some embodiments of the present specification. As shown in fig. 1, an application scenario of the avatar sign language generation system may include a server 110, an avatar 120, a database 130, a network 140, and the like.
Server 110 refers to a system having computing capabilities. A processing device may be included in the server 110 to determine the sign language to be made by the avatar for the speech expression content of the avatar. For example, the server 110 may acquire an original text of the avatar 120, process the acquired original text, and determine a target sign language to be made by the avatar based on a processing result, and the avatar talks with the user in conjunction with the processed target sign language. For more contents of the original text and the target sign language, refer to fig. 2 and the related description thereof, which are not described herein again.
The avatar 120 may be a program-controlled object capable of interacting with the user, and may be presented on a mobile phone, tablet computer, desktop computer, laptop computer, etc.; in some embodiments, it may be a robot device (e.g., a service robot providing consultation services to users), or a three-dimensional character or virtual idol implemented through VR/AR/MR technology. In some embodiments, the avatar 120 may include a sign language execution module, which is a module through which the avatar 120 performs sign language, such as a robotic arm; the avatar 120 may perform the sign language related to the speech expression content through the sign language execution module. For example, the avatar makes a praising sign language 120-1, a pointing sign language 120-2, a clapping sign language 120-3, etc. through the robotic arm.
Database 130 may be a device that provides data support. For example, server 110 may extract keywords in the original text based on the original text and data in database 130. As another example, a knowledge graph is stored in database 130 for use by server 110.
The network 140 may connect various components of the system and/or connect the system with external resource components. The network 140 allows communication between the various components, as well as with other components outside the system. For example, the virtual human 120 transmits the speech expression content to the server 110 through the network 140 for processing, the server 110 obtains the data in the database 130 through the network 140 to process the received speech expression content, and sends the sign language corresponding to the processed speech expression content to the virtual human 120 through the network 140.
Fig. 2 is an exemplary flowchart of a virtual human sign language generation method according to some embodiments of the present description. In some embodiments, one or more steps in flow 200 may be performed by server 110 in fig. 1. As shown in fig. 2, the process 200 may include the following steps:
Step 210: identify whether a preset response condition is met, and in response to the preset response condition being met, acquire the original text based on the content of the preset response condition. In some embodiments, step 210 may be performed by the judging module 510 in FIG. 5.
The preset response condition may be a preset condition that triggers the virtual human's communication and expression. For example, for a virtual human used for reception, the preset response condition may be that a door is opened; when it is recognized that the door has been opened, the preset response condition is determined to be met. For another example, the preset response condition may be whether the user is communicating with the avatar; when it is recognized that the user is communicating with the avatar, the preset response condition is determined to be met. In some embodiments, the preset response condition may be a preset wake-up word or a preset voiceprint.
The original text may be the content to be expressed by the virtual human. In some embodiments, the avatar may identify the content of the preset response condition and, based on that content, obtain the original text with which to respond. For example, when it is recognized that the door has been opened, a welcome phrase such as "welcome you to my home" may be used as the original text. For another example, when it is recognized that the user wants to purchase a certain product, a phrase prompting the user to pay, such as "place an order", may be used as the original text. In some embodiments, the original text may be constructed in the form of text.
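For example, the mapping from a recognized preset response condition to an original text can be sketched as follows. This is a minimal illustration in Python; the condition identifiers and texts are hypothetical assumptions, and in practice the mapping would be configured per deployment or produced by a dialogue model.

RESPONSE_TEXTS = {
    "door_opened": "Welcome you to my home",   # greeting for a reception virtual human
    "purchase_intent": "Place an order",       # prompt the user to pay
}

def acquire_original_text(detected_condition: str) -> str | None:
    """Return the original text to express when a preset response condition is met."""
    return RESPONSE_TEXTS.get(detected_condition)

print(acquire_original_text("door_opened"))  # -> Welcome you to my home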
In some embodiments, the avatar may also generate the original text through a machine learning model. For example, the machine learning model may process the acquired speech of a user talking with the avatar and output the original text used to answer the user. For instance, when the user says "hello" to the avatar, the avatar may determine that the original text is "welcome you to my home".
Step 220: determine keywords related to the original text based on the knowledge graph. In some embodiments, step 220 may be performed by the keyword determination module 520 in FIG. 5.
The knowledge-graph may include entity-to-entity relationships, attributes of entities, and the like.
Keywords may be words in the original text that describe its specific content. In some embodiments, a keyword may be a verb and/or a noun in the original text. For a piece of original text, a word segmenter may divide it into at least one word; for each word, its attribute is looked up in the knowledge graph, and when the attribute of the word is verb or noun, the word is taken as a keyword. For example, continuing the above example, for the original text "welcome you to my home", the knowledge graph shows that "you", "me", and "home" are nouns, so these three words can be used as keywords.
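The segmentation-plus-attribute-lookup step can be sketched as follows. This is a toy illustration assuming a whitespace segmenter and a knowledge graph reduced to a part-of-speech dictionary (KG_POS); a real system would use a proper segmenter and graph store.

KG_POS = {"welcome": "verb", "you": "noun", "me": "noun",
          "my": "pronoun", "to": "preposition", "home": "noun"}

def segment(text: str) -> list[str]:
    return text.lower().split()  # placeholder for a real word segmenter

def extract_keywords(original_text: str) -> list[str]:
    # Keep only words whose knowledge-graph attribute is verb or noun.
    return [w for w in segment(original_text) if KG_POS.get(w) in ("verb", "noun")]

print(extract_keywords("Welcome you to my home"))  # -> ['welcome', 'you', 'home']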
In some embodiments, entity information in the original text may also be obtained based on the knowledge-graph, and keywords may be determined based on the entity information. For more contents of the entity information and the determined keywords, refer to fig. 3 and the related description thereof, which are not repeated herein.
Step 230, classifying the original text, and determining a simplified text based on the classification to which the original text belongs. In some embodiments, step 230 may be performed by simplified text determination module 530 in fig. 5.
The simplified text can reflect the intent of the original text, which may be understood as an intent to achieve some purpose.
In some embodiments, the original text may be classified by intent, appropriate words determined based on the classification, and combined into the simplified text. For example, the intent expressed by the original text "you and me together" is a connection between you and me; therefore, it may be categorized into the "connection" class, and its simplified text may include "you connect me".
In some embodiments, repeated words in the original text can be removed, the resulting text classified, and the simplified text determined based on the classification. For example, "place an order" and "purchase" in "place an order to purchase" express similar meanings, so the text may be reduced to "place an order" or "purchase" and then classified into the "payment" class, whose simplified text may include "payment".
In some embodiments, the simplified text may also be determined based on image schema classification of the original text. For more details on classifying and determining the simplified text based on the image schema, refer to fig. 3 and the related description thereof, which are not repeated herein.
Step 240, determining the target sign language based on at least one of the keyword and the simplified text. In some embodiments, step 240 may be performed by target sign language determination module 540 in fig. 5.
In some embodiments, the target sign language may include hand actions made while the avatar speaks (if the avatar is provided with components similar to a human "hand" or "arm"); in some embodiments, the target sign language may also include other body actions made while the avatar speaks (if the avatar does not have such components). In some embodiments, the avatar may present the target sign language simultaneously with the speech and/or text corresponding to the original text. For example, while talking with the user, the avatar may point to the talker and then to itself in sequence to express the sign language for "you and me". In some embodiments, the medium through which the avatar performs the target sign language is not limited to hand-like parts; it may also be a display screen or other device, such as hand movements made by a hand shown on the display screen.
In some embodiments, when determining the target sign language based on the keywords and/or the simplified text, the target sign language may also be determined based on conditions such as user preference or duration of the candidate sign language. For more details about user preferences, duration of the candidate sign language and determination of the target sign language, refer to fig. 3 and its related description, which are not repeated herein.
In some embodiments, the target sign language may also be generated from the keywords or the simplified text by a fourth model. For more details about the fourth model and target sign language determination, refer to fig. 3 and its related description, which are not repeated herein.
Some embodiments in the description determine, at the textual level, keywords that express imagery and a simplified text that expresses intent based on the original text, and determine the target sign language to be made by the virtual human during expression based on the keywords or the simplified text, thereby optimizing the communication and expression effect of the virtual human and improving the user experience.
Fig. 3 is yet another exemplary flow diagram of a method for generating a virtual human sign language according to some embodiments of the present description. In some embodiments, one or more steps in flow 300 may be performed by server 110 in fig. 1. As shown in fig. 3, the process 300 may include the following steps:
step 310, entity information in the original text is obtained based on the first model.
Entity information may be words or phrases related to entities. For example, "you", "me", and "home" in the original text "welcome you to my home" may be extracted as entity information; optionally, if only one piece of entity information is selected, "home" may be selected as the entity information of the original text "welcome you to my home".
The first model may be used to obtain entity information. In some embodiments, the input of the first model is the original text and the output is the entity information corresponding to the original text. Continuing the example above, if "welcome you to my home" is input into the first model, the first model may output "you", "me", and "home", or simply "home".
In some embodiments, the first model may be a model trained with a keyword extraction algorithm, including but not limited to Named Entity Recognition (NER) or the TF-IDF algorithm, which is not limited in this specification.
In some embodiments, the first model may be trained using a plurality of labeled first training samples. For example, a plurality of first training samples with labels may be input into an initial first model, a loss function may be constructed from the labels and the outputs of the initial first model, and the parameters of the initial first model may be iteratively updated based on the loss function. Model training is completed when the loss function of the initial first model meets a preset condition, yielding the trained first model. The preset condition may be that the loss function converges, that the number of iterations reaches a threshold, and the like. It should be noted that in some other embodiments, the first model may also be trained by an unsupervised learning algorithm.
In some embodiments, the first training samples may include multiple segments of dialog content; the first training samples may be extracted from daily communication, and their labels may be the keywords in each dialog content. In some embodiments, the labels may be obtained by manual labeling, among other methods.
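The supervised training loop described above can be sketched as follows. This is a hedged, generic PyTorch-style sketch: the loss, optimizer, learning rate, and stopping thresholds are illustrative assumptions, not choices prescribed by this specification.

import torch
import torch.nn as nn

def train_first_model(model, data_loader, max_iters=10_000, loss_threshold=1e-3):
    # Iteratively update the initial first model until the loss meets the preset
    # condition (used here as a convergence proxy) or the iteration count reaches its threshold.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for step, (inputs, labels) in enumerate(data_loader):
        logits = model(inputs)
        loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold or step + 1 >= max_iters:
            break
    return model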
At step 320, keywords are determined based on the entity information.
For more about the keywords, refer to fig. 2 and the related description thereof, which are not repeated herein.
In some other embodiments, information related to an entity may be used as a keyword. For example, for the entity "place an order", the related entity "money" may be taken as its keyword.
In some embodiments, whether the entity information is the image description or not can be judged based on the knowledge graph, and the keyword is determined based on the judgment result. For more details of determining keywords based on knowledge graph, refer to fig. 4 and its related description, which are not repeated herein.
Some embodiments in this specification may improve extraction efficiency and accuracy of keywords by extracting entity information in an original text using a first model and determining keywords based on the entity information.
At step 330, image schema classification of the original text is obtained based on the second model.
An image schema classification is one of a plurality of classification results obtained by classifying the text. In some embodiments, multiple types of image schema classification may be defined based on cognitive linguistics, including but not limited to source-path-goal, center-periphery, part-whole, container, connection, force, linearity, and/or balance schemas, and the like.
In some embodiments, the imagery of the original text may be identified and the text then classified based on that imagery. For example, the original text "you and me together" expresses a connection between "you" and "me", and thus may be classified into the "connection" class. Likewise, "I am at home" may be classified into the "container" class.
The second model may be used to obtain the image schema classification. In some embodiments, the input of the second model is the original text and the output is the image schema classification of the original text. For example, the original text "I and you are together" is input into the second model, which produces probability values for all classification results corresponding to the original text, such as a probability of 47% for the "container" class, 80% for the "connection" class, and 3% for the "force" class.
In some embodiments, the second model may output the single classification with the highest probability. For example, for the original text "you and me together" above, the second model may output the "connection" class, which has the highest probability. In some embodiments, the second model may set a preset output threshold and output the image schema classifications whose probability exceeds that threshold. For example, if the preset threshold is 40%, both the "container" and "connection" classes would be output for the original text "you and me together". For another example, if the preset output threshold is set to 90%, no image schema classification may be output when no class meets the condition.
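The thresholded output strategy can be sketched as follows. This is a minimal illustration operating on an already computed probability distribution; the class names and probabilities follow the example above.

def select_schema_classes(probs: dict[str, float], threshold: float = 0.4) -> list[str]:
    """Return every image schema class whose probability exceeds the preset threshold;
    the result may be empty, e.g. with a 0.9 threshold and no qualifying class."""
    return [label for label, p in probs.items() if p > threshold]

probs = {"container": 0.47, "connection": 0.80, "force": 0.03}
print(select_schema_classes(probs))                  # -> ['container', 'connection']
print(select_schema_classes(probs, threshold=0.9))   # -> []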
In some embodiments, the second model may be an NLP (Natural Language Processing) pre-trained language model used for the classification task, such as a BERT model or an XLNet model.
In some embodiments, the second model may be obtained from a plurality of second training samples through supervised or unsupervised training. For example, a plurality of labeled second training samples may be input into an initial second model (e.g., a BERT pre-trained language model), a loss function may be constructed from the labels and the outputs of the initial second model, and the parameters of the initial second model may be iteratively updated based on the loss function. Model training is completed when the loss function of the initial second model meets a preset condition, yielding the trained second model. The preset condition may be that the loss function converges, that the number of iterations reaches a threshold, and the like. In some embodiments, similar to the first training samples, the second training samples may include multiple segments of dialog content. The second training samples can be extracted from daily communication, and the labels can be one or more image schema classifications corresponding to each dialog, determined by manual classification.
At step 340, simplified text is determined based at least on the image schema classification.
In some embodiments, the image schema classification may replace content in the original text whose meaning is similar to the classification, or new simplified content may be generated. For example, for the original text "you and me together" with the image schema classification "connection", the text may be simplified to "you connect me" based on the original text and its image schema classification.
In some embodiments, the image schema classification and the original text may also be input into a third model, through which the simplified text is output.
For example, given the original text "you and me together" and its corresponding image schema classification "connection", the third model may rewrite the sentence, yielding a simplified text such as "you and me" that expresses the content more simply.
In some embodiments, the third model may be a model with an encoder-decoder structure, including but not limited to a Seq2Seq model or a BERT model.
In some embodiments, the third model may be derived from a plurality of third training samples through supervised training or unsupervised training.
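One way to condition an encoder-decoder model on both inputs is to prepend the image schema classification to the original text, as in the sketch below. This is a hedged illustration: "simplifier-checkpoint" is a hypothetical fine-tuned checkpoint, and the control-token format is an assumption rather than a choice made by this specification.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("simplifier-checkpoint")   # hypothetical checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("simplifier-checkpoint")

def simplify(original_text: str, schema_class: str) -> str:
    # Prepend the schema class as a control token so the decoder conditions on it.
    inputs = tokenizer(f"[{schema_class}] {original_text}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=16)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# e.g. simplify("you and me together", "connection") might yield "you connect me"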
Some embodiments in this specification obtain a simplified text with a simpler structure by rewriting the original text, so that the text content is more focused, the structure of the original text is simplified, irrelevant words are reduced, and subsequent conversion of the text content into the target sign language is facilitated.
Step 350: generate the target sign language from at least one of the keywords and the simplified text through a fourth model.
The fourth model may be used to determine the target sign language. In some embodiments, because of the preset thresholds or strategies, the keywords or the simplified text may not be generated in steps 320 and 340; the input of the fourth model is therefore the keywords and/or the simplified text, and the output may be the target sign language determined from them. For example, for the original text "I and you together", the keywords may be "I", "you", and "together", and the simplified text may be "I connect you". The keywords or simplified text of the original text are input into the fourth model. From the keywords, the fourth model may output a first candidate target sign language: the finger points to the avatar itself to represent "I", then to the talker to represent "you", and a hug represents "together". From the simplified text "I connect you", it may output a second candidate target sign language: the finger points to the avatar itself to represent "I", then to the talker to represent "you", and then slides back and forth between the talker and itself to represent "connect".
In some embodiments, the fourth model may be a model that converts text into sign language, such as a GODIVA model or the like. In some embodiments, the fourth model may be trained using a fourth training sample. For example, the sign language and the corresponding text may be obtained through media information with sign language or a sign language course, and the obtained sign language and the corresponding text may be used as a fourth training sample to train the model.
In some embodiments, the fourth model may have no suitable sign language for the input keywords and/or simplified text, or the generated target sign language may not meet the user's preference. In this case, the keywords and/or simplified text may be re-determined based on the original text, following a more relaxed generation strategy (e.g., lowering the aforementioned preset threshold), so that the fourth model can output a target sign language based on the re-determined keywords and/or simplified text.
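The retry-with-relaxed-threshold strategy can be sketched as follows. This is a hedged illustration: derive_inputs and generate_sign are hypothetical wrappers around steps 320/340 and the fourth model, and the threshold schedule is an assumption.

def determine_target_sign(original_text, derive_inputs, generate_sign,
                          thresholds=(0.9, 0.6, 0.4)):
    # Try progressively more relaxed preset thresholds until a suitable sign
    # language is produced; fall back (e.g. to speech-only expression) otherwise.
    for threshold in thresholds:
        keywords, simplified = derive_inputs(original_text, threshold)
        sign = generate_sign(keywords, simplified)
        if sign is not None:
            return sign
    return None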
Some embodiments in this specification improve the efficiency and accuracy of generating the target sign language by determining the target sign language through the fourth model.
In some embodiments, the manner of determining the target sign language may include: at least one of the keywords and the simplified text is determined based on user preferences.
User preference may be a user's preference regarding the sign language made by the avatar. For example, an older user may prefer sign language with a longer duration.
In some embodiments, user preferences may be determined based on the user's historical behavior. For example, the user preference may be determined by counting the basis on which the target sign languages historically selected by the user were formed (e.g., whether the target sign language was generated based on keywords or based on the simplified text).
In some embodiments, the avatar may determine whether to use keywords or the simplified text to assist its verbal expression when communicating with the user, based on the user's preference between target sign language formed from keywords and target sign language formed from the simplified text.
In some embodiments, the server 110 may obtain the user preferences based on the recommendation information.
The recommendation information may come from a system that recommends target sign languages for the user. In some embodiments, the system may automatically recommend a target sign language for the user while the virtual human is speaking, and update the recommendation information based on the user's feedback on the recommended target sign language. For example, for the original text "place an order to purchase", the keywords "money" and "down" may be obtained; the system may randomly show user A a target sign language formed from "money" or from "down", and user A may feed back "like" or "dislike", forming a preference for user A. For another example, for a user group B, part of the group may tend to prefer the target sign language formed from "money" while another part prefers the one formed from "down". A threshold, for example 50%, may be set on the group's preference rates for the two target sign languages: if the preference rate of user group B for the target sign language based on "money" exceeds 50%, the sign language based on "money" is used when the virtual human expresses "place an order" to user group B, and "money" is recorded as user group B's preference for this kind of text (e.g., place an order, purchase, payment, etc.).
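The group-preference rule in the example can be sketched as follows. This is a minimal illustration: the feedback record format, the like-rate definition, and the 50% threshold are assumptions taken from the example above.

from collections import Counter

def group_preference(feedback: list[tuple[str, bool]], threshold: float = 0.5) -> str | None:
    """feedback: (keyword, liked) pairs collected from recommended target sign languages."""
    likes, totals = Counter(), Counter()
    for keyword, liked in feedback:
        totals[keyword] += 1
        likes[keyword] += int(liked)
    for keyword, total in totals.items():
        if likes[keyword] / total > threshold:
            return keyword        # e.g. "money" becomes the basis for "place an order"
    return None

print(group_preference([("money", True), ("money", True), ("money", False), ("down", False)]))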
In some embodiments, multiple keywords and simplified texts may be determined from the original text, and subsequently generated keywords and simplified texts may be chosen based on user feedback on the target sign languages formed from them. For example, for "place an order", the keyword may be "money" or "down"; a keyword (e.g., "money") may be selected at random to generate the target sign language, and if, when the user communicates with the virtual human, the user's feedback indicates dislike of that target sign language, the target sign language for "place an order" may subsequently be formed from "down".
A target sign language is determined based on at least one of the determined keywords or the simplified text.
In some embodiments, a user's preference for target sign language generated from keywords versus from the simplified text may differ; therefore, the avatar may also determine how to generate the target sign language for that user based on this preference. For example, user C prefers target sign language generated from keywords, so the avatar may always use target sign language generated from keywords when communicating with user C.
Some embodiments in this specification enhance the user experience by determining the target sign language based on user preferences.
In some embodiments, the manner of determining the target sign language may further include: candidate sign languages are generated based on the keywords and the simplified text, respectively.
The candidate sign language may be a sign language that is expected to be selected as the target sign language. In some embodiments, the candidate sign languages may include a first candidate target sign language generated based on the keyword and a second candidate target sign language generated based on the simplified text.
In some embodiments, the first candidate target sign language and the second candidate target sign language may be generated by a fourth model.
And determining the target sign language based on the duration of the candidate sign language.
The duration of the sign language may refer to the time consumed by the virtual human to make the sign language. The duration of the candidate sign language can be the time consumed by the virtual human to make the candidate sign language.
In some embodiments, to keep the actions concise, the candidate sign language with the shorter duration may be selected as the target sign language. In some embodiments, a candidate sign language with a suitable duration may also be selected as the target sign language based on the user's preference, such as choosing the sign language with the longer duration for a hearing-impaired user.
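Selecting between the two candidates by duration can be sketched as follows. This is a minimal illustration assuming each candidate carries a precomputed duration in seconds; the prefer_longer flag is an assumption standing in for the user profile.

def choose_target_sign(candidates: list[dict], prefer_longer: bool = False) -> dict:
    """candidates: e.g. [{"source": "keywords", "duration": 2.4}, ...]"""
    if prefer_longer:                                     # e.g. a hearing-impaired user
        return max(candidates, key=lambda c: c["duration"])
    return min(candidates, key=lambda c: c["duration"])   # default: keep actions concise

first = {"source": "keywords", "duration": 2.4}
second = {"source": "simplified_text", "duration": 3.1}
print(choose_target_sign([first, second]))                      # -> shorter candidate
print(choose_target_sign([first, second], prefer_longer=True))  # -> longer candidate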
FIG. 4 is an exemplary flow diagram for determining entity information and determining keywords based on a knowledge-graph, as shown in some embodiments of the present description. In some embodiments, one or more steps in flow 400 may be performed by server 110 in fig. 1. As shown in fig. 4, the process 400 may include the following steps:
Step 410: judge whether the entity information is an image description based on the knowledge graph.
The knowledge graph includes nodes and edges between the nodes. In some embodiments, the nodes correspond to entities and the edges between nodes reflect the relationships between entities, so the knowledge graph can reflect information about entities, the relationships between entities, the attributes of entities, and the like. For example, the knowledge graph may include the entities "place an order" and "money", and the edge between "place an order" and "money" may reflect the relationship between them, including but not limited to relationships such as "money is a tool for placing an order". In some embodiments, nodes in the knowledge graph may also include attribute values; the edge formed between an entity and an attribute value is the attribute corresponding to that value, and an entity's attributes can be reflected by the entity, the attribute, and the attribute value. For example, for the entity "home" in the knowledge graph, since "home" is an image expression, it has the attribute of image expression, so the attribute value of "home" for the attribute "whether image" is "yes". Similarly, for the entity "place an order", since "place an order" is not an image expression, it does not have the attribute of image expression, so its attribute value for the attribute "whether image" is "no". In some embodiments, the knowledge graph may be constructed from expert experience.
In some embodiments, the nodes and edges in the knowledge graph that reflect entities and the relationships between entities may be called relationship triples, denoted as (entity, relationship, entity). For example, a relationship triple in the knowledge graph reflecting "the tool for placing an order is money" may include the nodes "place an order" and "money" and the relationship "tool", and may be denoted as (place an order, tool, money). In some embodiments, the nodes and edges in the knowledge graph that reflect the attributes of an entity may be called attribute triples, denoted as (entity, attribute, attribute value). For example, the nodes "home" and "yes" and the edge "whether image", reflecting "home is an image expression", may be denoted as (home, whether image, yes).
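The two kinds of triples can be represented directly as tuples, as in the sketch below. This is a minimal illustration using the examples above; a production system would store them in a graph database instead.

RELATION_TRIPLES = [                      # (entity, relationship, entity)
    ("place an order", "tool", "money"),
    ("place an order", "tendency", "down"),
]
ATTRIBUTE_TRIPLES = [                     # (entity, attribute, attribute value)
    ("home", "whether image", "yes"),
    ("place an order", "whether image", "no"),
    ("money", "whether image", "yes"),
    ("down", "whether image", "yes"),
]

def is_image_description(entity: str) -> bool:
    # Graph query stand-in: look up the entity's "whether image" attribute value.
    return any(e == entity and a == "whether image" and v == "yes"
               for e, a, v in ATTRIBUTE_TRIPLES)

print(is_image_description("home"))             # -> True
print(is_image_description("place an order"))   # -> False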
An image description is a concrete, visualizable description, as opposed to an abstract one. For example, "home" is concrete and can therefore be regarded as an image description. By contrast, "place an order" is abstract and therefore may not be regarded as an image description.
In some embodiments, whether the entity information is an image description can be judged through a graph query.
Step 420: if the entity information is an image description, determine the entity information as a keyword. For example, for the entity "home", the attribute triple indicating whether it is an image, (home, whether image, yes), may be retrieved through a graph query, and "home" may be extracted as a keyword in response to the attribute value being "yes".
Step 430: if the entity information is a non-image description, determine other image description entities related to the entity information as keywords based on the knowledge graph. For example, for the entity "place an order", the attribute triple indicating whether it is an image, (place an order, whether image, no), can be retrieved from the knowledge graph. Since the attribute value is "no", "place an order" cannot be used directly as a keyword, and other image description entities related to "place an order" need to be determined based on the knowledge graph to serve as keywords.
Other image description entities are entity information of other, image-type entities related to the entity information, found, for example, by querying the "whether image" attribute values of the entities represented by the other nodes connected by edges.
In some embodiments, other image description entities related to the entity information may be determined through the relationship triples in the knowledge graph. For example, for the entity "place an order", the relationship triples (place an order, tool, money) and (place an order, tendency, down) can be retrieved through graph association, so "money" and "down" can be extracted. Their attribute triples, (money, whether image, yes) and (down, whether image, yes), can then be queried, so "money" and "down" can be used as other image description entities related to "place an order".
In some embodiments, other image description entities related to the entity information may not be found even after multiple queries. To avoid unbounded searching or an excessively long search time, an upper limit may be set on the search so that the virtual human can react within a certain time range. The upper limit may be expressed as the number of traversed edges, for example limiting the query to no more than 4 edges.
In some embodiments, no other image description entity related to the entity information may be found before the search upper limit is reached, i.e., the search fails. In this case, other entity information related to the original entity information may be taken as new entity information and other image description entities related to that new entity information searched for, or the system may choose not to output anything.
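The bounded search of steps 420-430 can be sketched as follows. This is a hedged breadth-first traversal reusing RELATION_TRIPLES and is_image_description from the earlier triple sketch; the 4-edge limit follows the description above, while treating edges as undirected and the queue-based traversal are assumptions.

from collections import deque

def find_image_entity(start, relation_triples, is_image, max_edges=4):
    # Breadth-first search over relationship triples, bounded by max_edges,
    # returning the first related image description entity, or None on failure.
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        entity, depth = queue.popleft()
        if entity != start and is_image(entity):
            return entity
        if depth >= max_edges:
            continue
        for head, _, tail in relation_triples:   # edges treated as undirected here
            if entity in (head, tail):
                neighbor = tail if entity == head else head
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, depth + 1))
    return None

print(find_image_entity("place an order", RELATION_TRIPLES, is_image_description))  # -> money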
Some embodiments in the present description may improve the accuracy and efficiency of querying other image description entities related to entity information by searching the knowledge graph for image description entities related to non-image entity information.
Some embodiments in this specification use the knowledge graph to judge whether entity information is an image description, take image description entities as keywords, and, for entities that are not image descriptions, take other related image description entities as keywords. The target sign language can then be generated from image description entities; because image description entities are more easily expressed in sign language, the vividness of the subsequently generated target sign language is improved and the generated target sign language is more easily accepted by the user. Some embodiments in the present specification look up the attributes of entities and other image description entities through the knowledge graph, so that complex calculation is not required and query efficiency is improved.
FIG. 5 is an exemplary block diagram of a virtual human sign language generation system, shown in some embodiments herein. In some embodiments, one or more modules in system 500 may be located in server 110 in FIG. 1. As shown in fig. 5, the system 500 may include the following modules:
the determining module 510 is configured to identify whether a preset response condition is met, and in response to that the preset response condition is met, obtain an original text based on the content of the preset response condition. For more contents of the answer preset condition, the original text and the obtained original text, refer to fig. 2 and the related description thereof, and are not described herein again.
A keyword determination module 520 for determining keywords related to the original text based on a knowledge graph. For more contents of the keywords and the determined keywords, refer to fig. 2 and the related description thereof, which are not repeated herein.
A simplified text determination module 530, configured to classify the original text, and determine a simplified text based on the classification to which the original text belongs; the simplified text can reflect the intent of the original text. For more details on the simplified text and determining the simplified text, refer to fig. 2 and the related description thereof, which are not repeated herein.
A target sign language determining module 540, configured to determine a target sign language based on at least one of the keyword and the simplified text. For more contents of the target sign language and the target sign language determination, refer to fig. 2 and the related description thereof, which are not described herein again.
The embodiment of the specification also provides a computer readable storage medium. The storage medium stores computer instructions, and after the computer reads the computer instructions in the storage medium, the computer realizes the virtual human sign language generation method.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, the description uses specific words to describe embodiments of the specification. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one or more embodiments. This method of disclosure, however, is not to be interpreted as implying that the claimed subject matter requires more features than are expressly recited in a claim. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
Numerals describing the number of components, attributes, etc. are used in some embodiments; it should be understood that such numerals used in the description of the embodiments are modified in some instances by the modifier "about", "approximately", or "substantially". Unless otherwise indicated, "about", "approximately", or "substantially" indicates that the number allows a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and employ a general digit-preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments are approximations, in specific examples such numerical values are set forth as precisely as practicable.
For each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this specification, the entire contents of each are hereby incorporated by reference into this specification. Except where the application history document does not conform to or conflict with the contents of the present specification, it is to be understood that the application history document, as used herein in the present specification or appended claims, is intended to define the broadest scope of the present specification (whether presently or later in the specification) rather than the broadest scope of the present specification. It is to be understood that the descriptions, definitions and/or uses of terms in the accompanying materials of this specification shall control if they are inconsistent or contrary to the descriptions and/or uses of terms in this specification.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments described herein. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims (10)

1. A virtual human sign language generation method comprises the following steps:
identifying whether a response preset condition is met, and acquiring an original text based on the content of the response preset condition in response to the response preset condition being met;
determining keywords related to the original text based on a knowledge graph;
classifying the original text, and determining a simplified text based on the classification to which the original text belongs; the simplified text can reflect an intent of the original text;
and determining a target sign language based on at least one of the keywords and the simplified text.
2. The method of claim 1, wherein determining keywords related to the original text based on a knowledge-graph comprises:
acquiring entity information in the original text based on a first model;
determining the keyword based on the entity information and the knowledge-graph.
3. The method of claim 2, the determining keywords based on the entity information and the knowledge-graph, comprising:
judging whether the entity information is image description or not based on a knowledge graph; nodes in the knowledge graph correspond to entities, and edges between the nodes reflect relationships between the entities;
if yes, determining the entity information as the keyword;
if not, determining other image description entities related to the entity information as the keywords based on the knowledge graph.
4. The method of claim 1, wherein classifying the original text, determining simplified text based on the classification to which the original text belongs, comprises:
acquiring image schema classification of the original text based on a second model;
the simplified text is determined based at least on the image schema classification to which the original text belongs.
5. The method of claim 4, wherein said determining the simplified text based at least on a image schema classification to which the original text belongs comprises: and inputting the image schema classification and the original text into a third model to obtain the simplified text.
6. The method of claim 1, wherein determining a target sign language based on at least one of the keywords and the simplified text comprises:
determining at least one of the keyword and the simplified text based on user preferences;
determining the target sign language based on the determined at least one of the keywords or the simplified text.
7. The method of claim 1, wherein determining a target sign language based on at least one of the keywords and the simplified text comprises:
generating candidate sign languages based on the keywords and the simplified texts respectively;
and determining the target sign language based on the duration of the candidate sign language.
8. The method of claim 1, wherein determining a target sign language based on at least one of the keywords and the simplified text comprises: generating, by a fourth model, the target sign language based on at least one of the keyword and simplified text.
9. A virtual human sign language generation system, comprising:
a judging module, configured to identify whether a response preset condition is met, and acquire an original text based on the content of the response preset condition in response to the response preset condition being met;
a keyword determination module for determining keywords related to the original text based on a knowledge graph;
the simplified text determination module is used for classifying the original text and determining the simplified text based on the classification to which the original text belongs; the simplified text can reflect an intent of the original text;
and the target sign language determining module is used for determining the target sign language based on at least one of the keywords and the simplified text.
10. A computer-readable storage medium storing computer instructions, wherein when the computer instructions in the storage medium are read by a computer, the computer executes the virtual human sign language generating method according to any one of claims 1 to 8.
CN202210162257.1A 2022-02-22 2022-02-22 Virtual human sign language generation method and system Pending CN114546326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210162257.1A CN114546326A (en) 2022-02-22 2022-02-22 Virtual human sign language generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210162257.1A CN114546326A (en) 2022-02-22 2022-02-22 Virtual human sign language generation method and system

Publications (1)

Publication Number Publication Date
CN114546326A true CN114546326A (en) 2022-05-27

Family

ID=81677122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210162257.1A Pending CN114546326A (en) 2022-02-22 2022-02-22 Virtual human sign language generation method and system

Country Status (1)

Country Link
CN (1) CN114546326A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024008047A1 (en) * 2022-07-04 2024-01-11 阿里巴巴(中国)有限公司 Digital human sign language broadcasting method and apparatus, device, and storage medium

Similar Documents

Publication Publication Date Title
AU2019200437B2 (en) A method to build an enterprise-specific knowledge graph
US9081411B2 (en) Rapid development of virtual personal assistant applications
US9489625B2 (en) Rapid development of virtual personal assistant applications
US11138212B2 (en) Natural language response recommendation clustering for rapid retrieval
US20180314689A1 (en) Multi-lingual virtual personal assistant
WO2018196684A1 (en) Method and device for generating conversational robot
US20150039292A1 (en) Method and system of classification in a natural language user interface
CN111708869B (en) Processing method and device for man-machine conversation
KR20190095333A (en) Anchor search
Aleedy et al. Generating and analyzing chatbot responses using natural language processing
US11080073B2 (en) Computerized task guidance across devices and applications
US10902209B2 (en) Method for content search and electronic device therefor
CN112579733A (en) Rule matching method, rule matching device, storage medium and electronic equipment
CN111368555A (en) Data identification method and device, storage medium and electronic equipment
WO2020139865A1 (en) Systems and methods for improved automated conversations
KR102430285B1 (en) Kiosk and its operation for the visually impaired
Abro et al. Natural language processing challenges and issues: A literature review
CN114546326A (en) Virtual human sign language generation method and system
US20230350929A1 (en) Method and system for generating intent responses through virtual agents
US10282417B2 (en) Conversational list management
CN109783677A (en) Answering method, return mechanism, electronic equipment and computer readable storage medium
CN109002498A (en) Interactive method, device, equipment and storage medium
CN111062207A (en) Expression image processing method and device, computer storage medium and electronic equipment
US11966570B2 (en) Automated processing and dynamic filtering of content for display
US20230351257A1 (en) Method and system for training virtual agents through fallback analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination