CN117540706A - Action insertion method and device, storage medium and electronic equipment - Google Patents

Action insertion method and device, storage medium and electronic equipment

Info

Publication number
CN117540706A
CN117540706A
Authority
CN
China
Prior art keywords
target
action
target word
digital person
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311056014.0A
Other languages
Chinese (zh)
Inventor
Name not to be published (at the inventor's request)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd filed Critical Moore Threads Technology Co Ltd
Priority to CN202311056014.0A
Publication of CN117540706A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/166: Editing, e.g. inserting or deleting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates

Abstract

This specification discloses an action insertion method and apparatus, a storage medium, and an electronic device. In the action insertion method provided herein, a text to be output by a digital person is acquired and segmented into target words; text features of each target word are determined; for each target word, the similarity between its text feature and each preset standard feature is determined; a target action matching each target word is determined according to the similarities; and the target action is inserted for the digital person when the digital person outputs the text to be output.

Description

Action insertion method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to an action insertion method and apparatus, a storage medium, and an electronic device.
Background
In recent years, with the continuous development of technology, digital persons have been used more and more widely in fields such as live streaming and lecturing. To make a digital person's performance more lively, the digital person often needs to make actions matching its speech while speaking.
Existing methods generate a digital person's actions by presetting keywords and a corresponding action for each keyword; when the digital person is detected to speak a keyword, it makes the corresponding action. However, this conventional insertion method does not consider the context in which the digital person is speaking, which can cause misrecognition and hence inappropriate insertions.
How to insert actions for a digital person more accurately by taking the speaking context into account is therefore a problem to be solved.
Disclosure of Invention
The present specification provides an action inserting method, apparatus, storage medium, and electronic device to at least partially solve the above-mentioned problems of the prior art.
The technical solutions adopted in this specification are as follows:
the present specification provides an action insertion method, including:
acquiring a text to be output by a digital person, and performing word segmentation on the text to be output to obtain target words;
determining text features of each target word;
for each target word, determining the similarity between the text feature of the target word and each preset standard feature;
determining a target action matching each target word according to the similarities;
and inserting the target action for the digital person when the digital person outputs the text to be output.
Optionally, determining text features of each target word includes:
inputting each target word into a pre-trained semantic recognition model to obtain the text feature of each target word output by the semantic recognition model.
Optionally, determining a target action matching each target word according to the similarities includes:
when the maximum similarity between the text feature of a target word and the standard features is determined to meet a preset condition, determining the target action matching that target word according to the corresponding standard feature.
Optionally, inserting the target action for the digital person when the digital person outputs the text to be output includes:
determining, for each target action, a duration of that target action;
inserting each target action for the digital person when the digital person outputs the target word matching that action, and having the digital person hold the action for its duration.
Optionally, inserting the target action for the digital person when the digital person outputs the text to be output includes:
in response to the duration of the target action matching a first target word being not greater than the interval between the digital person outputting the first target word and a second target word, inserting the target action matching the first target word when the digital person outputs the first target word, held for the duration of that action;
in response to the duration of the target action matching the first target word being greater than the interval between the digital person outputting the first target word and the second target word, inserting the target action matching the first target word for the digital person according to the priority of that action;
wherein the second target word is the target word following the first target word in the text to be output.
Optionally, inserting the target action matching the first target word for the digital person according to the priority of that action includes:
in response to the priority of the target action matching the first target word being greater than the priority of the target action matching the second target word, inserting the target action matching the first target word when the digital person outputs the first target word, held for its duration, and inserting no action when the digital person outputs the second target word;
and in response to the priority of the target action matching the first target word being less than the priority of the target action matching the second target word, inserting no action when the digital person outputs the first target word.
Optionally, inserting the target action for the digital person when the digital person outputs the text to be output includes:
for each target action, inserting the target action for the digital person a specified duration before the digital person outputs the target word matching that action.
This specification provides an action insertion apparatus, the apparatus comprising:
an acquisition module, configured to acquire a text to be output by a digital person and perform word segmentation on the text to obtain target words;
a feature determining module, configured to determine text features of each target word;
a similarity determining module, configured to determine, for each target word, the similarity between the text feature of the target word and each preset standard feature;
an action determining module, configured to determine a target action matching each target word according to the similarities;
and an inserting module, configured to insert the target action for the digital person when the digital person outputs the text to be output.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described action insertion method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described action insertion method when executing the program.
At least one of the technical solutions adopted in this specification can achieve the following beneficial effects:
In the action insertion method provided in this specification, a text to be output by a digital person is acquired and segmented into target words; text features of each target word are determined; for each target word, the similarity between its text feature and each preset standard feature is determined; a target action matching each target word is determined according to the similarities; and the target action is inserted for the digital person when the digital person outputs the text to be output.
When actions are inserted for a digital person with the action insertion method provided in this application, the text to be output is segmented according to semantics into target words that fit the speaking scene and context, and each target word is matched with a suitable target action according to its text features, so that the actions are inserted when the digital person outputs the text to be output. This avoids inserting actions inconsistent with the scene or context, as strict matching in conventional methods can. Meanwhile, with semantic similarity matching there is no need to enumerate every possible word for every scene; only a small number of standard words need to be listed for matching, which greatly reduces the up-front workload of building the standard word library.
Drawings
The accompanying drawings, which are included to provide a further understanding of this specification and constitute a part of it, illustrate exemplary embodiments of this specification and, together with the description, serve to explain it; they do not unduly limit this specification. In the drawings:
FIG. 1 is a flow chart of an action insertion method in this specification;
FIG. 2 is a schematic diagram of a library of standard actions and standard words with their durations and priorities in this specification;
FIG. 3 is a schematic diagram of an action insertion apparatus provided in this specification;
FIG. 4 is a schematic diagram of the electronic device corresponding to FIG. 1 provided in this specification.
Detailed Description
Existing methods for inserting actions for a digital person preset keywords and the action corresponding to each keyword, and insert the corresponding action whenever a keyword is detected in the digital person's output text. However, this matching strategy is too rigid: as long as a keyword is detected, the corresponding action is inserted no matter what scene and context the digital person is currently in. Such strictly matched action insertion can produce inappropriate insertions in many cases.
For example, the word "hello" usually expresses a greeting, so the keyword "hello" may be matched with a "waving" action. Consider the Chinese sentence "你好厉害" ("you are so amazing"), whose first two characters "你好" also spell "hello". With the existing method, when a digital person speaks this sentence, the adjacent characters "你好" are detected and recognized as the keyword, so the digital person makes the corresponding "waving" action. In fact, what the digital person intends to express with this sentence is not a greeting at all, and inserting the "waving" action is inappropriate.
To solve the above technical problems, this specification provides an action insertion method that can insert actions for a digital person more accurately by taking the context into account.
For the purposes of making the objects, technical solutions, and advantages of this specification clearer, the technical solutions of this specification will be clearly and completely described below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without creative effort shall fall within the protection scope of this application.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
FIG. 1 is a flow chart of an action insertion method in this specification, which specifically includes the following steps:
S100: acquiring a text to be output by a digital person, and performing word segmentation on the text to be output to obtain target words.
All steps in the action insertion method provided in the present specification may be implemented by any electronic device having a computing function, such as a terminal, a server, or the like.
In this method, the text to be output needs to be acquired before the digital person outputs it. The text to be output may be generated by an AI or preset in advance; this specification does not specifically limit this.
When segmenting the text to be output, a pre-trained word segmentation model can be used to obtain the target words. The word segmentation model may be any model capable of segmenting text according to semantics, such as jieba; this specification does not specifically limit the choice.
To avoid the incorrect insertions caused by conventional keyword detection that ignores scene and semantics, in this step the text to be output is first segmented according to its context by the word segmentation model to obtain the target words; in subsequent steps, actions are matched on the basis of these segmented words, which avoids misrecognition.
For example, suppose the text to be output is the Chinese sentence "这几天气温很低" ("the temperature has been very low these last few days"). Without word segmentation, the conventional method would detect the keyword "天气" ("weather") inside the text, although in the current context those characters should clearly not be read as the word "weather". After segmentation by the word segmentation model, the text becomes "这几天/气温/很低": the characters "天" and "气" are separated into different target words ("几天", "a few days", and "气温", "air temperature"), so the segmented text still expresses the correct meaning.
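As a minimal sketch of this step, the snippet below segments the example sentence with jieba, the tokenizer named above; the variable names and the exact segmentation shown in the comments are illustrative.

```python
# Word segmentation sketch for step S100, using the jieba tokenizer named above.
import jieba

text_to_output = "这几天气温很低"          # "The temperature has been very low these days."
target_words = jieba.lcut(text_to_output)  # cut the text into a list of target words
print(target_words)                        # e.g. ['这', '几天', '气温', '很', '低']
# "天" and "气" fall into different target words, so the keyword "天气" (weather)
# is not falsely detected across the word boundary.
```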
S102: and determining the text characteristics of each target word.
In this step, the text feature of each target word determined in step S100 may be extracted for use in a subsequent step.
S104: and determining the similarity between the text characteristics of each target word and preset standard characteristics according to each target word.
The matching strategy of conventional methods, which pairs keywords directly with actions, is too strict: for any action, its insertion is triggered only when the exact keyword corresponding to it is detected in the digital person's text. In practical applications, however, many actions apply to more than one scene. For example, a "bow" can be used both in a scene expressing thanks and in a scene expressing apology. Meanwhile, one scene may have many words of similar meaning: "thank you" and "thanks" can both express gratitude, and "sorry", "apologies", and "excuse me" can all express apology. With the strict matching strategy, covering all cases requires enumerating every such word, which creates a huge workload when building the vocabulary and a huge amount of computation when matching; at the same time, it is hard to guarantee that the enumeration is complete, and omissions are likely.
To solve these problems, this method replaces the direct text matching of conventional methods with similarity matching.
Before actual application, standard features corresponding to the actions can be preset. Specifically, each standard action that the digital person can perform is determined; for each standard action, at least one standard word corresponding to it is determined according to the scenes in which the action applies; and each standard word is input into a pre-trained semantic recognition model to obtain the standard feature of each standard word output by the model.
All standard actions the digital person can perform are determined first; these are available from the digital person's action library. Then, for each standard action, all standard words in each scene to which the action applies are determined. Since different standard actions apply to different scenes, and the words available in each scene differ, the number of standard words per standard action may differ; this method does not specifically limit that number.
Each determined standard word is then input into the pre-trained semantic recognition model, which outputs its standard feature. The semantic recognition model may be any natural language processing model, such as BERT; this method does not specifically limit the choice.
In addition, when determining the text features of the target words in step S102, the same semantic recognition model used for the standard features can be used: each target word is input into the pre-trained semantic recognition model, which outputs the text feature of each target word.
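As an illustration of these two feature-extraction steps, the sketch below encodes standard words and target words with a sentence-embedding model. The sentence-transformers package and the model name are assumptions here; the specification only requires some pre-trained semantic recognition model, such as BERT, that maps each word to a fixed-length feature vector.

```python
# Feature extraction sketch for standard words and target words. The
# sentence-transformers package and the model name are assumptions; any
# pre-trained semantic model (e.g. BERT) that yields fixed-length vectors works.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model

standard_words = ["你好", "谢谢", "抱歉"]         # one or more standard words per standard action
standard_features = model.encode(standard_words)  # standard feature of each standard word

target_words = ["你好", "我", "叫", "小明"]       # target words from step S100
text_features = model.encode(target_words)        # text feature of each target word (step S102)
```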
In this step, the similarity between the text feature of each target word and each preset standard feature is determined. There are various ways to measure similarity; this specification provides one embodiment for reference, in which the similarity between the text feature of a target word and a standard feature is computed as the cosine similarity:
s = (v_t · v_m) / (‖v_t‖ × ‖v_m‖)
where s is the similarity between the text feature of the target word and the standard feature of the standard word, v_t is the text feature vector, and v_m is the standard feature vector.
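In code, the formula above is a single expression; the sketch below assumes the feature vectors are numpy arrays.

```python
# Cosine similarity between a text feature v_t and a standard feature v_m,
# exactly as in the formula above.
import numpy as np

def cosine_similarity(v_t: np.ndarray, v_m: np.ndarray) -> float:
    return float(np.dot(v_t, v_m) / (np.linalg.norm(v_t) * np.linalg.norm(v_m)))
```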
S106: and determining target actions matched with each target word according to the similarity.
In this step, the target action matching each target word is determined according to the similarities between text features and standard features computed in step S104. The target actions are chosen from the standard actions. Specifically, when the maximum similarity between the text feature of a target word and the standard features is determined to meet a preset condition, the target action matching that target word is determined according to the corresponding standard feature.
The maximum similarity here is, for each target word, the largest of its similarities to all standard features. For example, with 30 standard features, the 30 similarities between a target word and the standard features are computed first, and the largest of them is taken as the maximum. The preset condition may be set according to specific requirements; this application does not specifically limit it, and a specific example is provided here for reference.
Specifically, the standard feature with the highest similarity to the text feature of a target word can be determined from the standard features as the candidate feature; when the similarity between the candidate feature and the target word's text feature is not smaller than a specified threshold, the standard action corresponding to the standard word to which the candidate feature belongs is determined as the target action matching that target word.
In other words, when determining the target action matching a target word, the standard word whose standard feature is most similar to the target word's text feature is found first, and the standard action corresponding to that standard word is taken as the matching target action. Since a digital person can only perform one action at a time, typically only one target action is determined per target word.
However, not every target word needs an accompanying action when spoken by the digital person. Therefore, after finding the standard feature with the highest similarity to a target word's text feature, it is checked whether that similarity is at least the specified threshold; only when it is, is the corresponding standard action taken as the matching target action. When even the highest similarity is below the threshold, no action is matched for that target word. The threshold may be set according to requirements; this application does not specifically limit it.
In this method, one standard action may correspond to several different standard words according to the scenes in which it applies. When the similarity between a target word's text feature and a standard word's standard feature is high, the target word has a meaning similar to the standard word and the two apply to the same or similar scenes, so the standard action corresponding to that standard word can be determined as the action matching the target word.
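Putting steps S104 and S106 together, the matching rule described above might be sketched as follows, reusing the cosine_similarity helper from the previous sketch; the threshold value of 0.7 and the data layout are illustrative assumptions.

```python
# Matching sketch for steps S104/S106: find the standard feature with the
# highest similarity (the candidate feature), then apply the specified
# threshold. The 0.7 threshold and the word-to-action mapping are assumptions.
from typing import Optional

def match_action(text_feature,
                 standard_features,          # list of (standard_word, feature_vector) pairs
                 word_to_action: dict,       # standard word -> corresponding standard action
                 threshold: float = 0.7) -> Optional[str]:
    best_word, best_sim = None, -1.0
    for word, feature in standard_features:
        sim = cosine_similarity(text_feature, feature)
        if sim > best_sim:
            best_word, best_sim = word, sim
    if best_sim < threshold:
        return None                          # highest similarity too low: no action matched
    return word_to_action[best_word]         # action of the standard word the candidate belongs to
```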
S108: and inserting the target action for the digital person when the digital person outputs the text to be output.
In this step, the target actions determined in step S106 are inserted for the digital person when it outputs the text to be output. Specifically, during the digital person's speech, for each target word with a matching target action, that target action is inserted as the digital person speaks the target word, so that the digital person performs the action.
Some preset parameters guide how the target actions are inserted. For example, a duration can be preset for each action. When inserting a target action for the digital person according to duration, the duration of each target action is determined; each target action is then inserted when the digital person outputs the matching target word, and the digital person holds the action for its duration.
Like a real person in daily communication, a digital person does not complete an action instantaneously: each action needs to last for a certain time, and different actions may need to last different times. This time is defined here as the action's duration. A different duration can be preset for each standard action; in actual application, the digital person performs each target action for the corresponding duration.
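One way to hold these preset parameters, mirroring the library of FIG. 2, is a small record per standard action; the field names below are illustrative, not taken from the specification.

```python
# Illustrative record for one entry of the library shown in FIG. 2.
from dataclasses import dataclass

@dataclass
class StandardAction:
    name: str         # e.g. "wave both hands"
    words: list[str]  # standard words for the scenes where the action applies
    duration: float   # seconds the digital person holds the action
    priority: int     # larger value wins when two insertions conflict
```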
Preferably, a priority can be preset for each target action in addition to its duration. In this case, inserting target actions for the digital person according to duration and priority is specifically as follows:
in response to the duration of the target action matching a first target word being not greater than the interval between the digital person outputting the first target word and a second target word, the target action matching the first target word is inserted when the digital person outputs the first target word and held for its duration; in response to the duration of the target action matching the first target word being greater than the interval between the digital person outputting the first and second target words, the target action matching the first target word is inserted according to its priority; the second target word is the target word following the first target word in the text to be output.
In the specific embodiment provided in this specification, the first target word and the second target word are two consecutive target words in the text to be output, the first being output before the second.
In practical applications it must be considered that, since each target action has a certain duration and the digital person generally speaks quickly, the duration of the action matching the first target word may not have ended by the time the digital person utters the second target word and the next target action is to be inserted. Colloquially, the two actions matching two consecutive target words may overlap in time.
To keep every inserted target action complete and natural, it is difficult to adjust the duration itself, and switching directly to another action before the current one finishes is too abrupt and looks bad. The best solution is therefore to set a priority for each action and, when a conflict occurs, insert the higher-priority action and discard the lower-priority one. Like durations, priorities are preset. Moreover, since the text to be output is preset or generated by an AI, the time at which each target word will be spoken can be determined before the text is output.
Based on this idea, when inserting actions for the digital person, for each first target word with a matching target action it is determined whether the duration of that action exceeds the interval between outputting the first target word and the second target word. If it does not, inserting the action matching the first target word does not affect the next insertion, and the action can be inserted directly.
If the duration of the target action matching the first target word is longer than the interval between outputting the first and second target words, its insertion affects the next one. In that case: in response to the priority of the action matching the first target word being greater than that of the action matching the second target word, the action matching the first target word is inserted when the digital person outputs the first target word and held for its duration, and no action is inserted when the second target word is output; in response to its priority being less than that of the action matching the second target word, no action is inserted when the first target word is output.
That is, when two adjacent target actions conflict in time, it must be determined which has the higher priority. If the action matching the first target word has the higher priority, it is still inserted normally and the insertion for the second target word is cancelled; if the action matching the second target word has the higher priority, the insertion at the first target word is cancelled.
It should be noted that, since the standard actions differ from one another, each is assigned a distinct priority; no two standard actions share the same priority, so whenever two target actions conflict, one can always be chosen by priority. If the first and second target words match the same target action, the two insertions can simply be merged into a single insertion of that action.
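A sketch of this conflict-resolution rule follows. Each planned insertion carries the time at which its target word is spoken, a duration, and a priority; when one action would still be running as the next one starts, only the higher-priority one is kept. The data structures and the function are illustrative assumptions.

```python
# Conflict resolution by priority, per the rule above. Equal priorities do not
# occur, since each standard action is assigned a distinct priority.
from dataclasses import dataclass

@dataclass
class PlannedInsertion:
    start: float     # time at which the matched target word is output
    duration: float  # how long the digital person holds the action
    priority: int    # distinct per standard action; larger wins
    action: str

def resolve_conflicts(plans: list[PlannedInsertion]) -> list[PlannedInsertion]:
    kept: list[PlannedInsertion] = []
    for plan in sorted(plans, key=lambda p: p.start):
        if kept and kept[-1].start + kept[-1].duration > plan.start:
            # Overlap with the previous kept action: keep the higher priority.
            if plan.priority > kept[-1].priority:
                kept[-1] = plan
        else:
            kept.append(plan)
    return kept
```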
For example, assume that before actual application a library as shown in FIG. 2 has been built, containing standard actions and standard words together with their durations and priorities; all target actions are inserted according to the data in this library. In actual application, assume the text to be output by the digital person is "你好，我叫小明，希望大家多多关照，谢谢。" ("Hello, my name is Xiaoming. I hope you will all take good care of me. Thank you."). After word segmentation by this method, the text becomes "你好|我|叫|小明|希望|大家|多多|关照|谢谢", where each "|" separates two adjacent target words. After the text feature of each target word is extracted and its similarity with the standard features of the standard words is computed, the target word "你好" ("hello") is matched with the target action "wave both hands", "我" ("I") with "point to oneself", and "谢谢" ("thank you") with "bow". Without further adjustment, the insertion logic for this text to be output is: "[wave both hands] 你好，[point to oneself] 我叫小明，希望大家多多关照，[bow] 谢谢。" Each "[ ]" contains a target action, its position marks where the action is inserted, and the target word following it is the one the action matches.
However, since the duration of each target action must be considered, it may not be possible in practice to insert every matched target action into the text to be output. Continuing the example: "你好" and "我" are matched with different target actions, but because the interval between these two target words is short, the action "wave both hands" matching the preceding "你好" would not be finished before the digital person utters "我". A trade-off must therefore be made between the two actions according to the priorities preset in the library. In the library shown in FIG. 2, the priority of "wave both hands" is 2 and the priority of "point to oneself" is 1; the former is greater, so in actual application "wave both hands" is kept, "point to oneself" is discarded, and the insertion logic becomes: "[wave both hands] 你好，我叫小明，希望大家多多关照，[bow] 谢谢。"
The above is one specific embodiment of the action insertion method provided here. In practical application there may be many other embodiments based on this method, which are not enumerated in detail.
Preferably, because of performance limitations of the device, network, and so on during actual operation, the insertion of actions is often unavoidably delayed. Actions can therefore be inserted early, to ensure that the digital person's speech and actions stay synchronized during the performance. Specifically, for each target action, the action may be inserted for the digital person a specified duration before the digital person outputs the matching target word. The specified duration can be set according to requirements; this method does not specifically limit it.
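This early insertion can be sketched as a simple shift of each planned insertion from the previous sketch; the 0.2-second default lead time is an illustrative assumption, since the specification leaves the specified duration open.

```python
# Early-insertion sketch: shift each planned insertion earlier by a lead time
# to compensate for device/network latency. The 0.2 s default is an assumption.
def apply_lead_time(plans: list[PlannedInsertion], lead: float = 0.2) -> list[PlannedInsertion]:
    for plan in plans:
        plan.start = max(0.0, plan.start - lead)
    return plans
```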
It is worth mentioning that the insertion can happen at different times. On the one hand, actions can be inserted while the digital person is speaking; that is, each action is played at the same time as it is inserted. On the other hand, the insertion can be completed in advance, before the digital person speaks, and the pre-inserted actions played while it speaks. Both timings can be realized in this application; neither is specifically limited.
When actions are inserted for a digital person with the action insertion method provided in this application, the text to be output is segmented according to semantics into target words that fit the speaking scene and context, and each target word is matched with a suitable target action according to its text features, so that the actions are inserted when the digital person outputs the text to be output. This avoids inserting actions inconsistent with the scene or context, as strict matching in conventional methods can. Meanwhile, with semantic similarity matching there is no need to enumerate every possible word for every scene; only a small number of standard words need to be listed for matching, which greatly reduces the up-front workload of building the standard word library.
The above is an action inserting method provided in the present specification, and based on the same concept, the present specification also provides a corresponding action inserting device, as shown in fig. 3.
Fig. 3 is a schematic diagram of an action insertion device provided in the present specification, including:
an acquisition module 200, configured to acquire a text to be output by a digital person and perform word segmentation on the text to obtain target words;
a feature determining module 202, configured to determine text features of each target word;
a similarity determining module 204, configured to determine, for each target word, the similarity between the text feature of the target word and each preset standard feature;
an action determining module 206, configured to determine a target action matching each target word according to the similarities;
and an inserting module 208, configured to insert the target action for the digital person when the digital person outputs the text to be output.
Optionally, the feature determining module 202 is specifically configured to input each target word into a pre-trained semantic recognition model to obtain the text feature of each target word output by the semantic recognition model.
Optionally, the action determining module 206 is specifically configured to determine, when the maximum similarity between the text feature of a target word and the standard features is determined to meet a preset condition, the target action matching that target word according to the corresponding standard feature.
Optionally, the inserting module 208 is specifically configured to determine, for each target action, a duration of the target action; insert each target action for the digital person when the digital person outputs the target word matching that action; and have the digital person hold the action for its duration.
Optionally, the inserting module 208 is specifically configured to, in response to the duration of the target action matching a first target word being not greater than the interval between the digital person outputting the first target word and a second target word, insert the target action matching the first target word when the digital person outputs the first target word, held for its duration; and, in response to that duration being greater than the interval, insert the target action matching the first target word according to its priority; the second target word is the target word following the first target word in the text to be output.
Optionally, the inserting module 208 is specifically configured to, in response to the priority of the target action matching the first target word being greater than that of the target action matching the second target word, insert the target action matching the first target word when the digital person outputs the first target word, held for its duration, and insert no action when the digital person outputs the second target word; and, in response to the priority of the target action matching the first target word being less than that of the target action matching the second target word, insert no action when the digital person outputs the first target word.
Optionally, the inserting module 208 is specifically configured to insert, for each target action, the target action for the digital person a specified duration before the digital person outputs the target word matching that action.
The present specification also provides a computer readable storage medium storing a computer program operable to perform the action insertion method provided in fig. 1 described above.
The present specification also provides a schematic structural diagram of the electronic device shown in FIG. 4. At the hardware level, as shown in FIG. 4, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and runs it to implement the action insertion method described above with respect to FIG. 1. Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded by this specification; that is, the execution subject of the processing flows is not limited to logic units and may also be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or in software (an improvement to a method flow). However, as technology develops, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD) (e.g., a field programmable gate array (FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code before compiling must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can easily be obtained by slightly logically programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for performing various functions can also be regarded as structures within the hardware component, or even as both software modules implementing the method and structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present application.

Claims (10)

1. An action insertion method, comprising:
acquiring a text to be output by a digital person, and performing word segmentation on the text to be output to obtain target words;
determining text features of each target word;
for each target word, determining the similarity between the text feature of the target word and each preset standard feature;
determining a target action matching each target word according to the similarity;
and inserting the target action for the digital person when the digital person outputs the text to be output.
2. The method of claim 1, wherein determining text features of each target word comprises:
inputting each target word into a pre-trained semantic recognition model to obtain the text feature of each target word output by the semantic recognition model.
3. The method of claim 2, wherein determining a target action matching each target word according to the similarity comprises:
when the maximum similarity between the text feature of a target word and the standard features is determined to meet a preset condition, determining the target action matching that target word according to the corresponding standard feature.
4. The method of claim 1, wherein inserting the target action for the digital person when the digital person outputs the text to be output comprises:
determining, for each target action, a duration of the target action;
inserting each target action for the digital person when the digital person outputs the target word matching the target action, and having the digital person hold the target action for its duration.
5. The method of claim 4, wherein inserting the target action for the digital person when the digital person outputs the text to be output comprises:
in response to the duration of the target action matching a first target word being not greater than the interval between the digital person outputting the first target word and a second target word, inserting the target action matching the first target word when the digital person outputs the first target word, held for the duration of that target action;
in response to the duration of the target action matching the first target word being greater than the interval between the digital person outputting the first target word and the second target word, inserting the target action matching the first target word for the digital person according to the priority of that target action;
wherein the second target word is the target word following the first target word in the text to be output.
6. The method of claim 5, wherein inserting the target action matching the first target word for the digital person according to the priority of that target action comprises:
in response to the priority of the target action matching the first target word being greater than the priority of the target action matching the second target word, inserting the target action matching the first target word when the digital person outputs the first target word, held for its duration, and inserting no action when the digital person outputs the second target word;
and in response to the priority of the target action matching the first target word being less than the priority of the target action matching the second target word, inserting no action when the digital person outputs the first target word.
7. The method of claim 1, wherein inserting the target action for the digital person when the digital person outputs the text to be output comprises:
for each target action, inserting the target action for the digital person a specified duration before the digital person outputs the target word matching the target action.
8. An action insertion device, comprising:
an acquisition module, configured to acquire a text to be output by a digital person and perform word segmentation on the text to obtain target words;
a feature determining module, configured to determine text features of each target word;
a similarity determining module, configured to determine, for each target word, the similarity between the text feature of the target word and each preset standard feature;
an action determining module, configured to determine a target action matching each target word according to the similarity;
and an inserting module, configured to insert the target action for the digital person when the digital person outputs the text to be output.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-7 when executing the program.
CN202311056014.0A 2023-08-21 2023-08-21 Action insertion method and device, storage medium and electronic equipment Pending CN117540706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311056014.0A CN117540706A (en) 2023-08-21 2023-08-21 Action insertion method and device, storage medium and electronic equipment


Publications (1)

Publication Number Publication Date
CN117540706A (en) 2024-02-09

Family

ID=89784869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311056014.0A Pending CN117540706A (en) 2023-08-21 2023-08-21 Action insertion method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117540706A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination