WO2021114682A1 - Conversation task generation method and apparatus, computer device, and storage medium - Google Patents

Conversation task generation method and apparatus, computer device, and storage medium

Info

Publication number: WO2021114682A1
Application number: PCT/CN2020/104671
Authority: WO (WIPO - PCT)
Prior art keywords: conversation, message, session, following, picture
Other languages: English (en), French (fr)
Inventor: 韩铃
Original Assignee: 平安国际智慧城市科技股份有限公司
Priority date: 2019-12-10 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 平安国际智慧城市科技股份有限公司
Publication of WO2021114682A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/02: User-to-user messaging using automatic reactions or user delegation, e.g. automatic replies or chatbot-generated messages
    • H04L51/52: User-to-user messaging for supporting social networking services
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048: Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484: Interaction techniques based on GUIs for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/0486: Drag-and-drop
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; text-to-speech systems
    • G10L13/02: Methods for producing synthetic speech; speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device, computer equipment, and storage medium for generating a conversation task.
  • A virtual user object is a virtual user object implemented through software that can respond to user demands and communicate with the user. For requirements such as skills training, it is sometimes necessary to configure conversation tasks. By completing a conversation task, a real user can communicate with a virtual user object acting as a user in a certain role, so as to practice and improve conversation skills.
  • The inventor realized that conversation tasks configured in the traditional way are mostly conversations between a user and a fixed virtual user object, so the form of conversation tasks is single and not flexible enough.
  • A method for generating a conversation task, comprising: determining a conversation background of a conversation task to be generated, the conversation task to be generated including a plurality of conversation pairs; when a conversation-component drag operation occurs, obtaining the above conversation message and the following conversation text of the corresponding conversation pair added via the dragged conversation component; converting the following conversation text into following conversation voice matching the conversation background; and splicing the plurality of conversation components containing the above conversation message and the following conversation text to obtain the conversation task.
  • A conversation task generating device, comprising: a background determining module, used to determine the conversation background of a conversation task to be generated, the conversation task to be generated including a plurality of conversation pairs; a component building module, used to obtain, when a conversation-component drag operation occurs, the above conversation message and the following conversation text of the corresponding conversation pair added via the dragged conversation component; a component conversion module, used to convert the following conversation text into following conversation voice matching the conversation background; and a component splicing module, used to splice the plurality of conversation components containing the above conversation message and the following conversation text to obtain the conversation task.
  • A computer device includes a memory and a processor. The memory stores a computer program. When the processor executes the computer program, the following steps are implemented: determining the conversation background of a conversation task to be generated, the conversation task to be generated including multiple conversation pairs; when a conversation-component drag operation occurs, obtaining the above conversation message and the following conversation text of the corresponding conversation pair added via the dragged conversation component; converting the following conversation text into following conversation voice matching the conversation background; and splicing the multiple conversation components containing the above conversation message and the following conversation text to obtain the conversation task.
  • A computer-readable storage medium has a computer program stored thereon. The computer program includes program instructions which, when executed by a processor, implement the following steps: determining the conversation background of a conversation task to be generated, the conversation task to be generated including multiple conversation pairs; when a conversation-component drag operation occurs, obtaining the above conversation message and the following conversation text of the corresponding conversation pair added via the dragged conversation component; converting the following conversation text into following conversation voice matching the conversation background; and splicing the multiple conversation components containing the above conversation message and the following conversation text to obtain the conversation task.
  • With the above conversation task generation method, device, computer equipment, and storage medium, when a conversation-component drag operation is triggered, the above conversation message and the following conversation text of each conversation pair can be configured and added based on the dragged conversation component; according to the conversation background of the conversation task to be generated, the following conversation text can be converted into following conversation voice; and the conversation task is obtained by splicing the multiple conversation components containing the above conversation message and the following conversation text. Because the following conversation text is converted, as required to fit the conversation background, into a following conversation message output by a virtual user object acting in a certain user role, different roles can be distinguished, the virtual user object's expression of text content is greatly expanded, and the effect of conversation task execution is improved.
  • In addition, the construction of a conversation task can be completed simply by dragging and dropping conversation components, which greatly improves the efficiency of conversation task generation.
  • Fig. 1 is an application scenario diagram of a method for generating a conversation task in an embodiment.
  • Fig. 2 is a schematic flowchart of a method for generating a conversation task in an embodiment.
  • Fig. 3 is a schematic diagram of an interface of a task configuration page that supports construction of a conversation task by dragging and dropping conversation components in an embodiment.
  • Fig. 4 is a structural block diagram of an apparatus for generating a conversation task in an embodiment.
  • Fig. 5 is an internal structure diagram of a computer device in an embodiment.
  • the session task generation method provided in this application can be applied to the application environment as shown in FIG. 1.
  • the terminal 102 and the server 104 communicate through the network.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • A session application runs on the terminal 102. Based on the conversation application, a user can configure a conversation task and publish the configured conversation task to the server 104.
  • the server 104 pushes the session task to other users. Other users can perform conversation tasks based on conversational applications and have conversations with virtual user objects.
  • a method for generating a session task is provided. Taking the method applied to the terminal 102 in FIG. 1 as an example for description, the method includes the following steps.
  • Step 202: Determine the conversation background of the conversation task to be generated; the conversation task to be generated includes multiple conversation pairs.
  • the session application is running on the terminal.
  • A conversational application refers to an application in which users can exchange conversation messages with other users or virtual user objects to achieve different social purposes.
  • the conversation application may specifically be an instant messaging application, an intelligent customer service application, a skill sparring application, etc.
  • A skill sparring application is an application in which a virtual user object acting as a user in a certain role conducts a simulated conversation with a user to be trained in another role, so as to improve the skills of the user to be trained.
  • For example, the virtual user object acts as a customer to converse with a salesperson to improve the salesperson's service ability; or a virtual user object acts as a student or a parent to converse with a teacher to improve the teacher's teaching level.
  • the task publisher can configure the conversational task. Specifically, when receiving a session task configuration instruction triggered based on a session application, the terminal displays a task configuration page.
  • the task configuration page includes the session background description area.
  • Conversational background refers to the background information that the task performer needs to know when performing the conversational task, such as the role played by the virtual user object with which it communicates and its user needs.
  • For example, the corresponding conversation background includes identity information such as the gender and age of the customer role that the virtual user object plays, as well as the business direction the virtual user object needs to consult about.
  • the terminal obtains the conversation background of the conversation task to be generated that the user inputs in the conversation background description area.
  • Step 204: When a conversation-component drag operation occurs, obtain the above conversation message and the following conversation text of the corresponding conversation pair added via the dragged conversation component.
  • The task configuration page also provides multiple session components, such as narration session, fixed session, fixed question-and-answer, intent session, and scoring session. Users can quickly create session tasks by freely dragging and dropping session components, and release the pre-configured session tasks to the users to be trained for practice. Specifically, one session task includes multiple session pairs; by dragging and dropping session components of different component types, session pairs with different session modes can be obtained. For example, the conversation component "intent conversation" realizes the session mode "intent recognition", and the conversation component "scoring conversation" realizes the session mode "professional scoring", and so on.
  • each conversation pair includes the following conversation message and the above conversation message.
  • After dragging a conversation component, the user configures the following conversation message and the above conversation message corresponding to that conversation component.
  • When the conversation task is executed, the following conversation message is output through the virtual user object, and the task performer inputs a conversation message as a reply after obtaining the following conversation message.
  • the above conversation message is the reference information used to judge the professionalism of the reply and the expressed intention of the input conversation message.
  • the following conversation message and the above conversation message preconfigured in this embodiment may be text, voice, etc., respectively.
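  • As an illustration only (the patent does not prescribe a data model), the pre-configured components and conversation pairs might be represented as follows; all class and field names here are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative component types named in this embodiment.
COMPONENT_TYPES = {"narration", "fixed", "fixed_qa", "intent", "scoring"}

@dataclass
class ConversationPair:
    """One conversation pair: the 'following' message is spoken by the
    virtual user object; the 'above' message is the reference used to
    judge the trainee's reply."""
    above_message: str                  # reference text (or voice transcript)
    following_text: str                 # text later converted to voice
    intent_label: Optional[str] = None  # used by intent-type components

@dataclass
class ConversationComponent:
    component_type: str                           # one of COMPONENT_TYPES
    pairs: List[ConversationPair] = field(default_factory=list)
    row: int = 0                                  # grid position on the page
    col: int = 0
```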
  • Step 206: Convert the following conversation text into following conversation voice matching the conversation background.
  • In this embodiment, the user can configure the user image, expression, timbre, etc. of the virtual user object that narrates each following conversation message.
  • the conversation application pre-stores various virtual user role information on the server, and different virtual user roles have different timbre characteristics.
  • the virtual user role information includes role identification and its timbre characteristics, facial images or videos in different expression states, and so on.
  • the terminal can read pre-stored virtual user role information from the server and display it on the task configuration page.
  • the user can select a suitable face image or call video of the virtual user object for outputting the following conversation message.
  • the terminal converts the following conversation text into the following conversation voice according to the timbre characteristics of the virtual user role selected by the user.
  • the user can also configure the input mode of the above conversation message in each group of conversation pairs, such as oral explanation, graphic explanation, and so on.
  • the user needs to configure the corresponding reference explanation diagram in advance.
  • the reference explanation diagram includes a step-by-step explanation diagram with multiple explanation steps.
  • Text-to-speech technology mainly converts text in the computer into continuous natural speech.
  • The traditional way of converting text into speech usually uses TTS (Text To Speech) technology to synthesize the corresponding speech from text. In that case, the entire task usually has only one voice, and it is most often a female voice.
  • the use of a single voice will limit the expression of text content.
  • In this embodiment, the timbre category of the required user role is determined in combination with the conversation scene, and the conversation text is converted into conversation voice according to that timbre category. Each role can be voiced in a timbre category close to the role, which distinguishes different roles, greatly expands the virtual user object's expression of text content, and improves the execution effect of conversation tasks.
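  • A minimal sketch of timbre-aware synthesis, assuming a hypothetical TTS client with a synthesize(text, timbre) method and an assumed role-to-timbre mapping; the patent names TTS technology but no concrete engine or API:

```python
# Hypothetical role-to-timbre mapping; not specified in the patent.
ROLE_TO_TIMBRE = {
    "boy": "male_child",
    "young_woman": "female_young",
    "middle_aged_man": "male_mature",
}

def to_following_voice(following_text: str, role_type: str, tts) -> bytes:
    """Convert following conversation text into voice in a timbre close to
    the role; `tts` is an assumed client object, not a named library."""
    timbre = ROLE_TO_TIMBRE.get(role_type, "female_young")  # fallback voice
    return tts.synthesize(text=following_text, timbre=timbre)
```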
  • Step 208: A plurality of conversation components containing the above conversation message and the following conversation text are spliced to obtain the conversation task.
  • In one embodiment, splicing the multiple conversation components containing the above conversation message and the following conversation text includes: determining the execution sequence of the corresponding multiple conversation pairs according to the relative positional relationship of the multiple conversation components, and splicing the multiple conversation components containing the above conversation message and the following conversation text according to that execution sequence.
  • the execution sequence of the multiple conversation pairs may be determined according to the display position of the conversation component on the task configuration page.
  • the terminal can scan one or more conversational components displayed on the task configuration page in a "Z"-shaped scanning manner to determine the relative positional relationship of the multiple conversational components.
  • Specifically, the terminal may divide the task configuration page into multiple configuration sub-areas based on a two-dimensional matrix table with multiple rows and columns. The terminal treats conversation pairs in the same row as having the same execution order, and treats conversation pairs in the previous row as preceding the conversation pairs of the current row, as sketched below.
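  • A minimal sketch of this row-first, left-to-right ("Z"-shaped) ordering, assuming each component carries its grid position; the dict keys are illustrative:

```python
def execution_order(components):
    """Rank components with a 'Z'-shaped scan of the configuration page:
    top row first, left to right within a row. Components in the same
    row share the same order rank (the row index), per the rule above."""
    ranked = sorted(components, key=lambda c: (c["row"], c["col"]))
    return [(c["row"], c["id"]) for c in ranked]   # (order rank, component)

# Example: a "fixed session" in row 0 runs before an "intent session" in row 1.
print(execution_order([{"id": "intent", "row": 1, "col": 0},
                       {"id": "fixed", "row": 0, "col": 0}]))
```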
  • FIG. 3 is a schematic diagram of an interface of a task configuration page that supports the construction of a session task by dragging and dropping a session component in an embodiment.
  • users can use directed edges to connect conversational components in adjacent execution sequences. The directed edge points from the previous conversational component to the subsequent conversational component.
  • For example, in the session pair corresponding to the session component whose component type is "fixed session", the above session message is "Hello, Xiaoli, I am Xiaoming, is it convenient to talk now?" and the following session message is "Oh, yes, it is convenient".
  • the directed edge points from the conversation message above to the conversation message below.
  • If the following session message of the "fixed session" session pair has a directed edge pointing to the "intent session" session pair, then the session pair next in sequence after the "fixed session" session pair is the "intent session" session pair.
  • In this way, when a conversation-component drag operation is triggered, the above conversation message and the following conversation text of each conversation pair can be configured based on the dragged conversation component; according to the conversation background of the conversation task to be generated, the added following conversation text is transformed into following conversation voice; and the conversation task is obtained by splicing the multiple conversation components containing the above conversation message and the following conversation text. Because the following conversation text is converted, to fit the conversation background, into a following conversation message output by a virtual user object acting in a certain user role, different roles can be distinguished, the virtual user object's expression of text content is greatly expanded, and the effect of conversation task execution is improved. In addition, the construction of a conversation task can be completed simply by dragging and dropping conversation components, which greatly improves the efficiency of conversation task generation.
  • In one embodiment, converting the following conversation text into following conversation voice matching the conversation background includes: determining, according to the conversation background, the number of virtual user objects required for the conversation task to be generated and the role type of each virtual user object; determining the target virtual user object specified for the current conversation; and converting the following conversation text into following conversation voice according to the timbre category matching the role type of the target virtual user object.
  • the same session task may require multiple virtual user objects to act as different user roles.
  • Specifically, the terminal determines, according to the conversation background, the number of virtual user objects required in the conversation task and the role type of each virtual user object. Depending on the age and gender of the user, the role types can include, for example, boy, girl, young man, young woman, middle-aged man, middle-aged woman, and so on.
  • The target virtual user object may be determined in a manner selected by the user.
  • Alternatively, the user does not need to specify a concrete virtual user object for each following conversation message; only the role type of the required virtual user object is specified, and the terminal automatically assigns a matching target virtual user object to the corresponding following conversation message according to the role type.
  • the target virtual user object refers to a virtual user object that matches the gender, age, etc. of the user required to output the corresponding following conversation message among the plurality of pre-stored virtual user objects.
  • the same conversation task supports the user to communicate with multiple virtual user objects, which further improves the flexibility of the conversation task.
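  • A sketch of the automatic assignment described above, assuming the pre-stored virtual user objects are available as dicts with a "role_type" key; the schema and the random choice among matching candidates are assumptions:

```python
import random

def assign_target_object(role_type: str, virtual_users: list) -> dict:
    """Pick a pre-stored virtual user object whose role type matches the
    one the task author specified; the author never names a concrete
    object. Items in `virtual_users` are assumed to carry 'role_type',
    'timbre', and 'face' keys (illustrative schema)."""
    candidates = [u for u in virtual_users if u["role_type"] == role_type]
    if not candidates:
        raise LookupError(f"no virtual user object for role {role_type!r}")
    return random.choice(candidates)
```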
  • In one embodiment, the above conversation task generation method further includes: when a conversation task processing instruction is obtained, displaying the conversation page corresponding to the conversation task; obtaining an input conversation message generated based on the conversation page; determining the target virtual user object designated for the conversation pair to which the input conversation message belongs; triggering the target virtual user object to perform business processing based on the input conversation message to obtain a response conversation voice; and replying with the response conversation voice through the target virtual user object.
  • the terminal displays the session page and displays the session background on the session page.
  • the user can enter the conversation message (recorded as the input conversation message) required to execute the current sequence of the conversation pair steps on the conversation page.
  • The terminal determines the target virtual user object specified for the conversation pair of the current sequence, triggers the target virtual user object to determine the corresponding response conversation text according to the input conversation message, converts the response conversation text into response conversation voice according to the timbre category corresponding to the target virtual user object, and outputs the response conversation voice through the target virtual user object to reply to the input conversation message.
  • In this way, the same conversation task supports the user communicating with multiple virtual user objects, achieving the effect of a multi-person conversation. In a multi-person conversation, different virtual user objects can reply to the conversation messages they receive according to the simulated conversation demands, realizing flexible and intelligent interaction between the user and the virtual user objects; virtual user objects can thus respond to the user in a targeted manner, making the interaction more flexible and convenient.
  • In one embodiment, triggering the target virtual user object to perform business processing based on the input conversation message to obtain the response conversation voice includes: determining the component type of the conversation component; when the component type is the first type and the input conversation message includes a conversation picture, triggering the target virtual user object to extract the graphic feature of the conversation picture; determining the category label text corresponding to the conversation picture according to the graphic feature; fusing the graphic feature and the corresponding category label text to obtain a comprehensive feature; determining the conversation intent of the input conversation message based on the comprehensive feature; obtaining the intent tag corresponding to each above conversation message in the conversation pair; and using the following conversation voice corresponding to the above conversation message whose intent tag matches the conversation intent as the response conversation message.
  • the target type includes the first type and the second type.
  • the first type refers to the type of intent session component.
  • the second type refers to the scoring session component type.
  • A conversation pair of the target type includes multiple above conversation messages and a following conversation message corresponding to each above conversation message. Each above conversation message and its corresponding following conversation message form a branch conversation pair, so a conversation pair of the target type includes a plurality of branch conversation pairs. Subsequently, according to which above conversation message the user's input conversation message is similar to, the flow jumps to the corresponding branch conversation pair.
  • the terminal recognizes the conversation intention of the input conversation message according to a preset intention recognition strategy. Or, the terminal sends the input conversation message to the server, and the server recognizes the conversation intention of the input conversation message according to a preset intention recognition strategy.
  • intent recognition strategies such as rule matching and model recognition are preset, and different intent recognition strategies can be used in different situations to recognize the intent of the input session message according to requirements.
  • Rule matching can be a way of intent identification by identifying whether there are preset keywords that can represent a certain session intent in the input session message.
  • Model recognition can be a way of intent recognition by a pre-trained machine learning model.
  • Each intent recognition strategy has corresponding usage conditions.
  • the use condition may be that one or more indicators of the input conversation message reach the threshold respectively.
  • the indicators specifically include the amount of message data, the intent level of the current conversation pair, and the business scenario to which it belongs.
  • the amount of message data can be determined according to the length of the included text or the size of the picture involved. For example, when the amount of message data of the input conversation message is large, or the intention level is relatively low, rule matching can be used first.
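  • A sketch of such strategy selection under assumed thresholds; the patent only states that usage conditions depend on indicators such as message data volume and intent level:

```python
def pick_intent_strategy(message_text: str, picture_bytes: int,
                         intent_level: int) -> str:
    """Choose between rule matching and model recognition. The threshold
    values are illustrative assumptions; the rule direction follows the
    example above (large message or low intent level -> rule matching)."""
    data_volume = len(message_text) + picture_bytes
    if data_volume > 2000 or intent_level <= 1:
        return "rule_matching"
    return "model_recognition"
```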
  • the conversation picture feature may be extracted based on the pre-trained first model.
  • The first model may specifically be a convolutional neural network model, such as ResNet-80. The conversation picture is convolved through the convolutional layers of the convolutional neural network to extract its feature map, that is, the graphic feature in this embodiment.
  • the computer device determines the category label text corresponding to the conversation picture according to the graphic characteristics.
  • the category label text is the label text corresponding to the category to which the conversation picture belongs.
  • the computer device may extract graphic features through the first model, and then classify the extracted graphic features to obtain the category of the conversation picture, and then determine the category label text corresponding to the conversation picture.
  • the first model may specifically be a convolutional neural network model.
  • Specifically, the computer device can input the conversation picture into the convolutional neural network model to extract the graphic features of the conversation picture, then process the graphic features through the pooling layer and the fully connected layer to obtain the probability values of the categories to which the conversation picture may belong, and use the category label corresponding to the maximum probability value as the category label of the conversation picture.
  • the computer equipment merges the graphic features and the corresponding category label text to obtain comprehensive features.
  • the terminal extracts the text features of the category label text based on the pre-trained natural language model, and performs cross-modal fusion of the graphic features and the text features.
  • cross-modal fusion is the fusion of data with different modalities.
  • the data of different modalities specifically refers to the graphic features corresponding to the conversation pictures and the text data corresponding to the category label text.
  • the computer device can map the extracted graphic features and the corresponding category label text to data in the same space, and then perform fusion processing on the mapped data to obtain comprehensive features.
  • the graphic feature of the conversation picture is extracted through the first model.
  • the computer equipment can extract the text features of the category label text through the cyclic neural network.
  • the form of expression of both graphic features and text features can be in vector form.
  • the computer equipment can convert the graphic feature and the text feature into a standard form respectively, so that the feature vectors of both are in the same range.
  • For example, the graphic feature and the text feature can be normalized separately. Commonly used normalization algorithms include the function method and the probability density method. Function methods include, for example, the maximum-minimum function, the mean-variance function (normalizing features to a consistent interval, such as one with mean 0 and variance 1), or the sigmoid (S-shaped growth curve) function.
  • the computer device can perform a fusion operation on the normalized graphic feature and the corresponding text feature of the corresponding category label text to obtain a comprehensive feature.
  • the algorithm for fusing graphic features and text features can specifically adopt algorithms based on Bayesian decision theory, algorithms based on sparse representation theory, or algorithms based on deep learning theory.
  • For example, the computer device can perform a weighted summation of the two normalized vectors, so that the graphic feature and the text feature are merged into a comprehensive feature, as sketched below.
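  • A sketch of this mean-variance normalization followed by weighted summation, using NumPy; the fusion weight is an assumed hyperparameter:

```python
import numpy as np

def fuse_features(graphic_feat: np.ndarray, text_feat: np.ndarray,
                  w_graphic: float = 0.5) -> np.ndarray:
    """Mean-variance normalize both feature vectors (mean 0, variance 1),
    then fuse them by weighted summation. Both vectors are assumed to
    share one dimension, e.g. after projection into a common space."""
    def normalize(v: np.ndarray) -> np.ndarray:
        return (v - v.mean()) / (v.std() + 1e-8)   # avoid division by zero
    g, t = normalize(graphic_feat), normalize(text_feat)
    return w_graphic * g + (1.0 - w_graphic) * t
```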
  • The computer device then recognizes the intent of the input conversation message based on the comprehensive feature. Specifically, it processes the comprehensive feature through the second model and outputs the conversation intention of the conversation picture, such as recognizing the objects in the conversation picture and understanding the relationships between the objects.
  • the conversational intention can be represented in the form of a word, a whole sentence, or a paragraph text.
  • the second model may specifically be a recurrent neural network model, such as an LSTM model.
  • In one embodiment, recognizing the intent of the conversation message based on the comprehensive feature includes: obtaining the intent pre-description text corresponding to the conversation picture; generating the predicted feature of the conversation picture based on each word vector of the intent pre-description text; and inputting the comprehensive feature and the predicted feature into the pre-trained model to output the conversation intention of the conversation picture.
  • the intention pre-description text is the text that describes the conversation picture in advance.
  • The intent pre-description text can be regarded as an initial, rough description text obtained after understanding the conversation picture.
  • the computer device may obtain the intent pre-description text corresponding to the conversation picture, and obtain each word vector of the intent pre-description text.
  • Specifically, the computer device can adopt an encoder-decoder approach: the comprehensive feature is input at the first time step, each word vector is input at the subsequent time steps, and the second model processes the sequentially input comprehensive feature and word vectors to output the conversation intention of the conversation message.
  • the second model can combine the comprehensive features and the intention pre-description text, so that the output conversation intention is more in line with the real intention expressed in the conversation picture, and the accuracy of the graphic understanding information is greatly improved.
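  • A minimal PyTorch sketch of such an encoder-decoder style second model, assuming the word vectors of the intent pre-description text have already been projected to the same dimension as the comprehensive feature; all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class IntentDecoder(nn.Module):
    """Hypothetical second model: an LSTM fed the fused comprehensive
    feature at step 0 and the pre-description word vectors afterwards."""
    def __init__(self, feat_dim: int = 256, vocab_size: int = 5000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 256, batch_first=True)
        self.out = nn.Linear(256, vocab_size)

    def forward(self, fused_feature, word_vectors):
        # fused_feature: (batch, feat_dim); word_vectors: (batch, T, feat_dim)
        seq = torch.cat([fused_feature.unsqueeze(1), word_vectors], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)   # token logits for the intention text
```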
  • the above-mentioned conversation intention recognition method can quickly and accurately obtain the corresponding category label text of the conversation picture according to the graphic characteristics of the extracted conversation picture.
  • the graphic feature and the corresponding category label text are cross-modally fused to obtain a comprehensive feature, and then based on the comprehensive feature, the conversational intention of the conversation message is identified.
  • In this way, the features of the conversation picture are used fully and in detail, and the dual guidance of the graphic features and the category label text greatly improves the accuracy of the conversation picture comprehension information.
  • each of the above conversation messages pre-configured in the conversation task has a corresponding intent tag.
  • By comparing the intent tag with the conversation intention of the input conversation message, the above conversation message matching the input conversation message can be determined.
  • The terminal obtains the following conversation voice corresponding to the matched above conversation message as the response conversation message.
  • the "intent conversation" conversation pair includes two conversation branches.
  • In this way, intent recognition can make full use of the graphic features of the conversation picture itself while also incorporating, during recognition, the category information to which the conversation picture belongs, obtaining the dual guidance of graphic features and category label text and greatly improving the accuracy of intent recognition for the input conversation message.
  • In one embodiment, the above conversation task generation method further includes: when the component type is the second type and the input conversation message includes a conversation picture, triggering the target virtual user object to recognize the drawing trajectory of the conversation picture; setting the pixel value of the pixels the drawing trajectory passes through in the conversation picture to a first pixel value and the pixel value of the pixels the drawing trajectory does not pass through to a second pixel value; extracting the graphic feature of each drawing stroke in the conversation picture whose pixel values have been updated; fusing the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture; calculating the similarity between the sequence feature of the conversation picture and the sequence feature of the reference explanation diagram corresponding to the above conversation message in the current conversation branch; and using the following conversation voice corresponding to the above conversation message with the highest similarity as the response conversation message.
  • the pre-trained natural language processing model is used to extract the semantic feature of the input conversation message, and the semantic feature is compared with the semantic feature of each of the above conversation messages preset in the current sequential conversation pair to obtain Semantic similarity.
  • the following conversation voice corresponding to the above conversation message with the highest semantic similarity is selected as the response conversation message.
  • the terminal displays the drawing explanation prompt on the conversation page and displays the drawing page. Users can draw conversation pictures on the drawing page.
  • the drawing page can be the conversation message entry area in the conversation window, or it can be another page different from the conversation window.
  • the terminal tracks the drawing process of the conversation picture.
  • During drawing, the conversation application prompts the explanation step, that is, it prompts the user as to which explanation step the partial step diagram currently being drawn corresponds to.
  • Each step diagram may correspond to multiple drawing strokes and annotation text. Drawing strokes can be distinguished by pause time and by whether the stylus or finger leaves the screen.
  • After the step diagram of the current sequence is drawn, a trigger operation on the "Next" button advances to the explanation step of the next sequence.
  • the types of conversation pictures that need to be drawn can be different.
  • the types of conversation pictures can be straw hat diagrams, climbing diagrams, wire diagrams, and so on.
  • the terminal determines the pixel value of the pixel through which the drawing trajectory passes in the conversation picture as the first pixel value, and the pixel value of the pixel through which the drawing trajectory has not passed as the second pixel value.
  • the terminal extracts the graphic feature of each drawing stroke in the conversation picture after the pixel value update is completed.
  • In one embodiment, determining the graphic feature of each drawing stroke in the conversation picture according to the drawing trajectory includes: scaling the conversation picture to a standard size; updating the pixel value of each pixel in the standard-size conversation picture according to the drawing trajectory; and extracting the graphic feature of each drawing stroke in the conversation picture after the pixel value update is completed.
  • Whenever the drawing of a step diagram for an explanation step is detected, the terminal extracts the graphic features of that step diagram and scores the conversation picture according to the extracted graphic features; or it sends each step diagram to the server, and the server performs the graphic feature extraction and scores the conversation picture according to the extracted features. Alternatively, when the entire conversation picture has been drawn, the terminal or the server extracts the graphic features of each step diagram in the above manner and scores the conversation picture accordingly.
  • The screen size of the terminal used by different users may be different, so the canvas size of the drawn conversation picture differs.
  • the computer equipment scales the current step diagram to a standard size, so that each compressed step diagram has the same number of pixels.
  • the standard size refers to the specified image size.
  • After the computer device scales each obtained step diagram to the standard size, it updates the pixel value of each pixel in the standard-size step diagram according to the drawing trajectory and filters out the pixels (excess points) in the step diagram that the drawing trajectory does not pass through. Scaling and pixel value updating thus realize coordinate normalization and redraw the step diagram.
  • updating the pixel value of each pixel in the following conversation picture of standard size according to the drawing trajectory includes: updating the pixel value of the pixel through which the drawing trajectory passes in the conversation picture of the standard size to the first pixel value; Update the pixel value of the pixel that has not passed the drawing track in the standard-size conversation picture to the second pixel value.
  • Specifically, the computer device updates the pixel values of the pixels that the drawing trajectory passes through in the current step diagram scaled to the standard size to the first pixel value, and updates the pixel values of the pixels that the drawing trajectory does not pass through to the second pixel value.
  • the first pixel value and the second pixel value are different pixel values, and the pixels that the drawing trajectory has passed and those that have not passed by are distinguished by the different pixel values.
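  • A NumPy sketch of the scaling and pixel-value update described above; the standard size and the concrete pixel values are assumptions:

```python
import numpy as np

STANDARD_SIZE = (256, 256)            # assumed standard size (h, w)
FIRST_PIXEL, SECOND_PIXEL = 255, 0    # trajectory vs. background values

def redraw_step_diagram(strokes, canvas_w: int, canvas_h: int) -> np.ndarray:
    """Coordinate-normalize a step diagram: scale stroke coordinates from
    the user's canvas to the standard size, then set pixels the trajectory
    passes through to the first pixel value and all others to the second.
    `strokes` is a list of (x, y) point lists, as in the drawing field."""
    h, w = STANDARD_SIZE
    img = np.full((h, w), SECOND_PIXEL, dtype=np.uint8)
    for stroke in strokes:
        for x, y in stroke:
            sx = min(int(x * w / canvas_w), w - 1)
            sy = min(int(y * h / canvas_h), h - 1)
            img[sy, sx] = FIRST_PIXEL
    return img
```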
  • the computer extracts the graphic vector information of the step diagram that has been scaled to a standard size and updated with pixel values.
  • The graphic vector information can be a piece of JSON (JavaScript Object Notation) data.
  • The JSON data includes a text field and a drawing field.
  • The step diagram is composed of one or more drawing strokes, and each drawing stroke is composed of multiple pixels with continuous coordinates. The drawing field therefore includes the abscissa x and ordinate y of each pixel of each drawing stroke in the corresponding step diagram; for example, (x1, y1) may denote the coordinates of the pixels in one drawing stroke and (x2, y2) the coordinates of the pixels in another drawing stroke, as in the sketch below.
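  • An illustrative example of such graphic vector information; the exact field names and coordinate layout are assumptions based on the text and drawing fields mentioned above:

```python
import json

# Hypothetical graphic vector information for one step diagram.
vector_info = {
    "text": "Step 1: draw the brim of the straw-hat diagram",
    "drawing": [
        [[10, 12], [11, 12], [12, 13]],   # (x, y) pixels of one stroke
        [[40, 50], [41, 51], [42, 52]],   # pixels of another stroke
    ],
}
payload = json.dumps(vector_info)          # JSON handed to the feature model
strokes = json.loads(payload)["drawing"]   # recover per-stroke coordinates
```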
  • the computer equipment inputs the graphic vector information into the graphic feature extraction model to obtain the sequence feature corresponding to the corresponding step diagram.
  • In this embodiment, the graphic feature extraction model includes a LeNet model (a convolutional neural network model) and a sequence model.
  • the lenet model includes a convolutional layer, a pooling layer and a fully connected layer.
  • The computer device inputs the graphic vector information into the convolutional layer for the convolution operation, and inputs the first feature matrix output by the convolutional layer into the pooling layer for the normalization operation, obtaining a second feature matrix formed from the maximum-weight projections of the feature vectors in the first feature matrix.
  • the computer device inputs the second feature matrix into the fully connected layer to perform classification operations to obtain graphic features corresponding to each classification.
  • the graphic feature can specifically be the data extracted from the conversation picture by the computer device that can represent the shape or spatial relationship of the picture, and obtain the representation or description of the "non-picture" of the picture, such as a value, a vector, or a symbol.
  • the computer equipment integrates the graphic features of multiple drawing strokes to obtain the sequence feature of the conversational picture.
  • the computer device calls the sequence model to encode the graphic features, and obtains the sequence features of the corresponding step diagram.
  • the sequence model can be a recurrent neural network model, including a 3-layer convolutional layer, a 2-layer LSTM layer, and a Softmax classification layer. It is easy to understand that the number of convolutional layers and LSTM layers can be dynamically determined according to requirements.
  • the convolutional layer is used to reduce the amount of graphic feature data while ensuring the integrity of the graphic feature information.
  • the LSTM layer is used to calculate the sequence features of the current stroke by combining the graphical features of the previous stroke and the graphical features of the current stroke.
  • the LSTM layer includes forget gates, input gates and output gates.
  • Through the forget gate, the graphic features of the previous drawing stroke in the sequence are partially forgotten; the input gate updates the graphic features of the current drawing stroke; and the output gate operates on the forgotten features and the updated features to obtain the sequence feature corresponding to the current drawing stroke.
  • the Softmax classification layer is used to perform feature fusion of the sequence features of multiple drawing strokes to obtain the sequence features of the corresponding step diagram.
  • the computer device may map the sequence features of multiple drawing strokes with the same dimension to data in the same space, and then perform fusion processing on the mapped data to obtain a comprehensive feature.
  • the feature fusion algorithm can specifically adopt the method of vector splicing. It is easy to understand that computer equipment can also fuse multiple sequence features based on Bayesian decision theory algorithms, sparse representation theory based algorithms, or deep learning theory algorithms to obtain the sequence features of the entire conversation picture.
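  • A minimal PyTorch sketch of a sequence model in this spirit (3 convolutional layers, 2 LSTM layers); mean pooling stands in here for the fusion performed by the Softmax classification layer, and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class StrokeSequenceModel(nn.Module):
    """Hypothetical sequence model: 3 conv layers compress per-stroke
    graphic features while keeping their information, and 2 LSTM layers
    relate each stroke to the previous one."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 64, 3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, 64, num_layers=2, batch_first=True)

    def forward(self, stroke_feats):
        # stroke_feats: (batch, n_strokes, feat_dim)
        x = self.conv(stroke_feats.transpose(1, 2)).transpose(1, 2)
        seq, _ = self.lstm(x)      # per-stroke sequence features
        return seq.mean(dim=1)     # fused sequence feature of the picture
```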
  • each conversation pair in the conversation task is preset with a variety of above conversation messages as a reference.
  • When the input mode of a certain group of conversation pairs is "graphic explanation", the corresponding above conversation message is the reference explanation diagram.
  • the sequence feature of the reference chart can be dynamically calculated every time it is needed, reducing the occupation of computer equipment storage resources.
  • the sequence features of the reference explanation graph can also be pre-calculated and stored in the computer device, which improves the efficiency of sequence feature acquisition, and further improves the efficiency of scoring the corresponding explanation graph.
  • the computer device calculates the similarity between the sequence feature of the conversation picture and the sequence feature of the corresponding reference explanation picture based on the similarity calculation model.
  • The similarity calculation model may be a siamese network model. It is easy to understand that the computer device can also use other methods to calculate the similarity between the sequence feature of the conversation picture and the sequence feature of the corresponding reference explanation diagram, which is not limited here.
  • the computer device uses the similarity as the score of the conversation picture, or performs numerical conversion on the similarity according to a preset logic to obtain the score of the conversation picture.
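  • A sketch of turning similarity into a score, using cosine similarity as a stand-in for the siamese-network output and an assumed 0-100 conversion rule:

```python
import numpy as np

def score_conversation_picture(seq_feat: np.ndarray,
                               ref_feat: np.ndarray) -> float:
    """Score a conversation picture from the similarity between its
    sequence feature and the reference explanation diagram's. The
    mapping to a 0-100 score is an assumed preset conversion logic."""
    cos = float(seq_feat @ ref_feat /
                (np.linalg.norm(seq_feat) * np.linalg.norm(ref_feat) + 1e-8))
    return round(max(cos, 0.0) * 100, 1)
```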
  • In one embodiment, the conversation picture is obtained by splicing multiple step diagrams in the order of drawing time. Fusing the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture includes: fusing the graphic features of the drawing strokes of the current-sequence step diagram to obtain the sequence feature of the current-sequence step diagram; when a next-sequence step diagram is detected, taking the next-sequence step diagram as the current-sequence step diagram and iterating until the last-sequence step diagram; and fusing the sequence features of the multiple step diagrams to obtain the sequence feature of the conversation picture.
  • After the computer device extracts the sequence feature of each step diagram in the conversation picture in the above manner, it merges the sequence features of the multiple step diagrams to obtain the sequence feature of the conversation picture, and scores the conversation picture according to the similarity between the sequence feature of the conversation picture and the sequence feature of the reference explanation diagram.
  • In other embodiments, after extracting the sequence feature of each step diagram, the computer device may also compare it in time with the sequence feature of the corresponding step in the reference explanation diagram, score the current step diagram according to the similarity between the sequence feature of the step diagram and that of the corresponding partial diagram in the reference explanation diagram, and finally calculate the score of the entire conversation picture based on the scores of all the step diagrams.
  • It should be understood that although the steps in the flowchart of FIG. 2 are displayed in the sequence indicated by the arrows, these steps are not necessarily performed in that order. Unless specifically stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with at least part of the other steps, or of the sub-steps or stages of other steps.
  • a conversation task generation device which includes: a background determination module 402, a component construction module 404, a component conversion module 406, and a component splicing module 408.
  • the background determining module 402 is used to determine the conversation background of the conversation task to be generated; the conversation task to be generated includes multiple conversation pairs.
  • the component construction module 404 is configured to obtain the above conversation message and the following conversation text based on the corresponding conversation pair added by the dragged conversation component when the conversation component drag operation occurs.
  • the component conversion module 406 is used to convert the following conversation text into the following conversation voice that matches the context of the conversation.
  • the component splicing module 408 is used for splicing a plurality of conversation components including the above conversation message and the following conversation text to obtain a conversation task.
  • the component conversion module 406 is further configured to determine the number of virtual user objects required for the session task to be generated and the role type of each virtual user object according to the session context; determine the target virtual user object specified for the current session; According to the timbre category matched by the role type of the target virtual user object, the following conversational text is converted into the following conversational voice.
  • In one embodiment, the component splicing module 408 is used to determine the execution order of the corresponding multiple conversation pairs according to the relative positional relationship of the multiple conversation components, and to splice the multiple conversation components containing the above conversation message and the following conversation text according to that execution order.
  • In one embodiment, the above apparatus further includes a task execution module 410, which is used to: display the session page corresponding to the session task when a session task processing instruction is obtained; obtain the input session message generated based on the session page; determine the target virtual user object designated for the session pair to which the input session message belongs; trigger the target virtual user object to perform business processing based on the input session message to obtain a response session voice; and reply with the response session voice through the target virtual user object.
  • In one embodiment, the task execution module 410 is also used to determine the component type of the conversation component, where a conversation pair whose component type is the target type includes multiple above conversation messages and the following conversation voice corresponding to each above conversation message. When the component type is the first type and the input conversation message includes a conversation picture, the target virtual user object is triggered to extract the graphic feature of the conversation picture; the category label text corresponding to the conversation picture is determined according to the graphic feature; the graphic feature and the corresponding category label text are fused to obtain a comprehensive feature; the conversation intent of the input conversation message is determined based on the comprehensive feature; the intent label corresponding to each above conversation message in the conversation pair is obtained; and the following conversation voice corresponding to the above conversation message whose intent label matches the conversation intent is used as the response conversation message.
  • In one embodiment, the task execution module 410 is further configured to: trigger the target virtual user object to recognize the drawing trajectory of the conversation picture when the component type is the second type and the input conversation message includes a conversation picture; determine the pixel value of the pixels that the drawing trajectory passes through as the first pixel value and the pixel value of the pixels that the drawing trajectory does not pass through as the second pixel value; extract the graphic feature of each drawing stroke in the conversation picture whose pixel values have been updated; fuse the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture; calculate the similarity between the sequence feature of the conversation picture and the sequence feature of the reference explanation diagram corresponding to the above conversation message in the current conversation branch; and use the following conversation voice corresponding to the above conversation message with the highest similarity as the response conversation message.
  • Each module in the above-mentioned conversation task generating device may be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 5.
  • the computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • The display screen of the computer device can be a liquid crystal display or an electronic ink display. The input device of the computer device can be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • Those skilled in the art can understand that FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • a computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the steps of the method for generating a session task in any one of the embodiments of the present application are implemented.
  • the computer-readable storage medium may be non-volatile or volatile.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

This application relates to a conversation task generation method and apparatus, a computer device, and a storage medium, which can be implemented within artificial intelligence. The method includes: determining the conversation background of a conversation task to be generated, the conversation task to be generated comprising multiple conversation pairs; when a conversation-component drag operation occurs, obtaining the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component; converting the following conversation text into following conversation speech matching the conversation background; and splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task. This method improves the flexibility of interaction with virtual user objects.

Description

Conversation task generation method and apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 201911257891.8, filed with the Chinese Patent Office on December 10, 2019 and entitled "Conversation task generation method and apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a conversation task generation method and apparatus, a computer device, and a storage medium.
Background
With the development of communication technology, many applications capable of initiating conversations have emerged, through which users can communicate with real users or with virtual user objects. A virtual user object is a software-implemented virtual user that can respond to users' requests and communicate with them. Driven by needs such as skills training, it is sometimes necessary to configure conversation tasks. By completing a conversation task, a real user can communicate with a virtual user object acting as a user in a certain role, so as to practice and improve conversation skills. The inventor realized that conversation tasks configured in the traditional way are mostly conversations between a user and a fixed virtual user object; such tasks take a single form and are not flexible enough.
Technical Problem
On this basis, in view of the above technical problem, it is necessary to provide a conversation task generation method and apparatus, a computer device, and a storage medium capable of improving the flexibility of interaction with virtual user objects.
Technical Solution
A conversation task generation method, the method comprising: determining the conversation background of a conversation task to be generated, the conversation task to be generated comprising multiple conversation pairs; when a conversation-component drag operation occurs, obtaining the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component; converting the following conversation text into following conversation speech matching the conversation background; and splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
A conversation task generation apparatus, the apparatus comprising: a background determination module, configured to determine the conversation background of a conversation task to be generated, the conversation task to be generated comprising multiple conversation pairs; a component construction module, configured to obtain, when a conversation-component drag operation occurs, the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component; a component conversion module, configured to convert the following conversation text into following conversation speech matching the conversation background; and a component splicing module, configured to splice the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, is configured to implement the following steps.
Determining the conversation background of a conversation task to be generated; the conversation task to be generated comprises multiple conversation pairs.
When a conversation-component drag operation occurs, obtaining the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component.
Converting the following conversation text into following conversation speech matching the conversation background.
Splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
A computer-readable storage medium, on which a computer program is stored, wherein the computer program comprises program instructions which, when executed by a processor, are configured to implement the following steps.
Determining the conversation background of a conversation task to be generated; the conversation task to be generated comprises multiple conversation pairs.
When a conversation-component drag operation occurs, obtaining the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component.
Converting the following conversation text into following conversation speech matching the conversation background.
Splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
Beneficial Effects
With the above conversation task generation method and apparatus, computer device, and storage medium, when a conversation-component drag operation is triggered, the preceding conversation message and the following conversation text of each conversation pair can be configured and added via the dragged conversation component; according to the conversation background of the conversation task to be generated, the following conversation text can be converted into following conversation speech; and the conversation task can be obtained by splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts. Because the following conversation text is converted, as the conversation background requires, into a following conversation message output by a virtual user object acting in a certain user role, different roles can be distinguished, the virtual user object's expression of text content is greatly expanded, and the execution effect of the conversation task is improved. In addition, a conversation task can be built simply by dragging conversation components, which greatly improves the efficiency of conversation task generation.
Brief Description of the Drawings
FIG. 1 is a diagram of an application scenario of the conversation task generation method in one embodiment.
FIG. 2 is a schematic flowchart of the conversation task generation method in one embodiment.
FIG. 3 is a schematic diagram of a task configuration page supporting the building of a conversation task by dragging conversation components in one embodiment.
FIG. 4 is a structural block diagram of a conversation task generation apparatus in one embodiment.
FIG. 5 is an internal structure diagram of a computer device in one embodiment.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain this application and not to limit it.
The conversation task generation method provided by this application can be applied in the application environment shown in FIG. 1, where a terminal 102 communicates with a server 104 over a network. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers. A conversation application runs on the terminal 102. Based on the conversation application, a user can configure a conversation task and publish the configured task to the server 104. The server 104 pushes the conversation task to other users, who can then execute it based on the conversation application and converse with virtual user objects.
In one embodiment, as shown in FIG. 2, a conversation task generation method is provided. Taking the application of the method to the terminal 102 in FIG. 1 as an example, the method includes the following steps.
Step 202: determine the conversation background of a conversation task to be generated; the conversation task to be generated includes multiple conversation pairs.
A conversation application runs on the terminal. A conversation application is an application through which a user can send conversation messages to other users or to virtual user objects for various social purposes. It may specifically be an instant messaging application, an intelligent customer service application, a skill-sparring application, or the like. A skill-sparring application is one in which a virtual user object, acting as a user in one role, holds a simulated conversation with a user in another role who is to be trained, so as to improve the trainee's skills. For example, the virtual user object may act as a customer conversing with a salesperson to improve the salesperson's service ability, or act as a student or parent conversing with a teacher to improve the teacher's teaching level.
Based on the conversation application, a task publisher can configure a conversation task. Specifically, when a conversation task configuration instruction triggered in the conversation application is received, the terminal displays a task configuration page, which includes a conversation background description area. The conversation background is the background information a task executor needs to know when executing the conversation task, such as the role played by the virtual user object with which the executor communicates and that object's user needs. For example, in a conversation task used to improve a salesperson's service ability, the corresponding conversation background includes identity information such as the gender and age of the customer played by the virtual user object, as well as the business direction the virtual user object wants to consult on. The terminal obtains the conversation background of the conversation task to be generated that the user enters in the conversation background description area.
Step 204: when a conversation-component drag operation occurs, obtain the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component.
The task configuration page also provides multiple types of conversation components, such as narration conversation, fixed conversation, fixed question-and-answer, intent conversation, and scoring conversation. A user can quickly create a conversation task by freely dragging conversation components and publish the preconfigured conversation task to trainees for practice. Specifically, a conversation task includes multiple conversation pairs, and dragging conversation components of different component types yields conversation pairs with different conversation modes. For example, the 'intent conversation' component enables the conversation mode 'intent recognition', and the 'scoring conversation' component enables the conversation mode 'professional scoring'.
Further, each conversation pair includes a following conversation message and a preceding conversation message. After dragging a conversation component, the user configures the following conversation message and the preceding conversation message corresponding to that component. When the conversation task is executed, the following conversation message is output through the virtual user object, and the task executor inputs a conversation message as a reply after receiving it. The preceding conversation message is reference information used to judge the professionalism of the reply and the intent expressed by the input conversation message. In this embodiment, the preconfigured following conversation message and preceding conversation message may each be text, speech, or the like.
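As a rough illustration of the data model just described, a conversation pair can be sketched as a small record holding the reference preceding message and the virtual user object's following reply. This is a minimal sketch only; the names (ConversationPair, ConversationTask, and the field names) are illustrative assumptions, not identifiers from this application.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ConversationPair:
    """One conversation pair in a conversation task (illustrative sketch)."""
    component_type: str                  # e.g. "fixed", "intent", "scoring"
    preceding_message: str               # reference used to judge the trainee's input
    following_text: str                  # reply text, later converted to speech
    intent_label: Optional[str] = None   # set for intent-conversation branches
    following_speech: Optional[bytes] = None  # filled in by text-to-speech

@dataclass
class ConversationTask:
    background: str                      # conversation background shown to the executor
    pairs: List[ConversationPair] = field(default_factory=list)
```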
Step 206: convert the following conversation text into following conversation speech matching the conversation background.
The user can configure the user image, expression, timbre, and so on of the virtual user object that speaks each following conversation message. Specifically, the conversation application pre-stores multiple kinds of virtual user role information on the server, and different virtual user roles have different timbre characteristics. The virtual user role information includes the role identifier, its timbre characteristics, and face images or videos under different expression states. During conversation task configuration, the terminal can read the pre-stored virtual user role information from the server and display it on the task configuration page. After configuring and entering a following conversation message, the user can select a suitable face image or call video of the virtual user object used to output that message. When the following conversation message is following conversation text, the terminal converts the text into following conversation speech according to the timbre characteristics of the virtual user role selected by the user.
In one embodiment, the user can also configure the input mode of the preceding conversation message in each conversation pair, such as oral explanation or graphic explanation. When configuring the preceding conversation message of a following conversation message whose input mode is 'graphic explanation', the user needs to preconfigure the corresponding reference explanation graph, which includes step explanation graphs for multiple explanation steps.
Text-to-speech technology mainly converts text in a computer into continuous, natural speech. The traditional way of converting text into speech usually uses TTS (Text To Speech) technology to synthesize the corresponding speech from the text. However, in the traditional approach an entire task usually has only one voice, and it is mostly a female voice; in some scenarios, a single voice limits the expression of the text content. In this embodiment, the timbre category of the required user role is determined in combination with the conversation scenario, and the conversation text is converted into conversation speech according to that timbre category. Different roles can thus be output with timbre categories close to those roles, which distinguishes the roles, greatly expands the virtual user object's expression of text content, and improves the execution effect of the conversation task.
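A minimal sketch of this role-matched conversion, assuming a role-to-timbre lookup table and a generic synthesize(text, voice_id) backend; the role names, voice identifiers, and the synthesize function are all placeholders rather than part of this application.

```python
# Illustrative only: the table and synthesize() are assumed stand-ins.
ROLE_TO_VOICE = {
    "male_child": "voice_m_child_01",
    "female_child": "voice_f_child_01",
    "young_man": "voice_m_youth_01",
    "young_woman": "voice_f_youth_01",
}

def synthesize(text: str, voice_id: str) -> bytes:
    """Stand-in for any TTS engine that supports per-voice synthesis."""
    raise NotImplementedError

def text_to_role_speech(following_text: str, role_type: str) -> bytes:
    # Pick the timbre category matched to the virtual user role, then synthesize.
    voice_id = ROLE_TO_VOICE.get(role_type, "voice_default")
    return synthesize(following_text, voice_id)
```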
Step 208: splice the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
In one embodiment, splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts includes: determining the execution order among the corresponding multiple conversation pairs according to the relative positional relationship of the multiple conversation components; and splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts according to the execution order.
The execution order of the multiple conversation pairs may be determined from the display positions of the conversation components on the task configuration page. For example, the terminal can scan the one or more conversation components displayed on the task configuration page in a zigzag ('Z'-shaped) manner to determine their relative positional relationship. In a specific embodiment, the terminal can divide the task configuration page into multiple configuration subareas based on a two-dimensional matrix of multiple rows and columns, treat conversation pairs in the same row as conversation pairs of the same execution order, and treat the conversation pairs in the previous row as the predecessors of the conversation pairs in the current row, as sketched below.
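A small sketch of that row-major ordering, assuming each dragged component carries the (x, y) position of its top-left corner on the page; the Component shape and the cell height are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Component = Tuple[str, int, int]  # (component_id, x, y) — assumed shape

def execution_order(components: List[Component], cell_h: int = 120) -> List[List[str]]:
    """Group components into grid rows, order rows top-to-bottom and
    components left-to-right within a row; one row = one execution slot."""
    rows: Dict[int, List[Component]] = {}
    for comp in components:
        rows.setdefault(comp[2] // cell_h, []).append(comp)
    return [
        [cid for cid, _, _ in sorted(rows[r], key=lambda c: c[1])]
        for r in sorted(rows)
    ]
```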
In one embodiment, the multiple conversation components containing the preceding conversation messages and the following conversation texts may also be spliced according to a splicing operation on the conversation components triggered by the user on the task configuration page. Referring to FIG. 3, FIG. 3 is a schematic diagram of a task configuration page supporting the building of a conversation task by dragging conversation components. When configuring a conversation task, the user can connect conversation components of adjacent execution orders with directed edges, each pointing from the preceding component to the subsequent component. As shown in FIG. 3, in the conversation pair of the component whose type is 'fixed conversation', the preceding conversation message is 'Hello, Xiaoli, this is Xiaoming. Is now a good time to chat?' and the following conversation message is 'Oh, sure.' The directed edge points from the preceding conversation message to the following conversation message. The following conversation message of this 'fixed conversation' pair points to an 'intent conversation' pair, so the next conversation pair in order after the 'fixed conversation' pair is the 'intent conversation' pair.
In the above conversation task generation method, when a conversation-component drag operation is triggered, the preceding conversation message and the following conversation text of each conversation pair can be configured and added via the dragged conversation component; according to the conversation background of the conversation task to be generated, the following conversation text can be converted into following conversation speech; and the conversation task can be obtained by splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts. Because the following conversation text is converted, as the conversation background requires, into a following conversation message output by a virtual user object acting in a certain user role, different roles can be distinguished, the virtual user object's expression of text content is greatly expanded, and the execution effect of the conversation task is improved. In addition, a conversation task can be built simply by dragging conversation components, which greatly improves the efficiency of conversation task generation.
In one embodiment, converting the following conversation text into following conversation speech matching the conversation background includes: determining, according to the conversation background, the number of virtual user objects required for the conversation task to be generated and the role type of each virtual user object; determining the target virtual user object designated for the current conversation pair; and converting the following conversation text into following conversation speech according to the timbre category matched to the role type of the target virtual user object.
The same conversation task may need multiple virtual user objects acting in different user roles. The terminal determines, according to the conversation background, the number of virtual user objects required in the conversation task and the role type of each virtual user object. Depending on the age, gender, and so on of the user being played, the role types may include male child, female child, male teenager, female teenager, young man, young woman, and the like.
Which virtual user object outputs the following conversation message in each conversation pair can follow the user's selection described above. In this embodiment, the user does not need to designate a specific virtual user object for each following conversation message; the user only specifies the role type of the required virtual user object, and the terminal automatically assigns the corresponding target virtual user object according to the role type. The target virtual user object is the one among the pre-stored virtual user objects whose gender, age, and so on match the user required to output the corresponding following conversation message.
In the above embodiment, the same conversation task supports the user communicating with multiple virtual user objects, which further improves the flexibility of the conversation task.
In one embodiment, the above conversation task generation method further includes: when a conversation task processing instruction is obtained, displaying the conversation page corresponding to the conversation task; obtaining an input conversation message generated on the conversation page; determining the target virtual user object designated for the conversation pair to which the input conversation message belongs; triggering the target virtual user object to perform business processing based on the input conversation message to obtain response conversation speech; and replying with the response conversation speech through the target virtual user object.
When the conversation task is executed, the terminal displays the conversation page and shows the conversation background on it. After learning the conversation background, the user can enter on the conversation page the conversation message required for the conversation pair of the current order (denoted the input conversation message). The terminal determines the target virtual user object designated for the current conversation pair, triggers it to determine the corresponding response conversation text according to the input conversation message, converts the response conversation text into response conversation speech according to the timbre category corresponding to that target virtual user object, and outputs the response conversation speech through the target virtual user object to reply to the input conversation message.
In the above embodiment, the same conversation task supports the user communicating with multiple virtual user objects, achieving a multi-party conversation effect. In a multi-party conversation, the obtained input conversation messages can be replied to through different virtual user objects, which responds to the simulated-conversation demand and realizes flexible, intelligent interaction between the user and the virtual user objects. The virtual user objects can thus respond to the user in a targeted way within a multi-party conversation, making the interaction more flexible and convenient.
In one embodiment, triggering the target virtual user object to perform business processing based on the input conversation message to obtain response conversation speech includes: determining the component type of the conversation component; when the component type is the first type and the input conversation message includes a conversation picture, triggering the target virtual user object to extract the graphic feature of the conversation picture; determining the category label text corresponding to the conversation picture according to the graphic feature; fusing the graphic feature and the corresponding category label text to obtain a comprehensive feature; determining the conversation intent of the input conversation message based on the comprehensive feature; obtaining the intent label corresponding to each preceding conversation message in the conversation pair; and using the following conversation speech corresponding to the preceding conversation message whose intent label matches the conversation intent as the response conversation message.
The target type includes a first type and a second type. The first type is the intent-conversation component type; the second type is the scoring-conversation component type. A conversation pair of the target type includes multiple preceding conversation messages and a following conversation message corresponding to each preceding conversation message. Each preceding conversation message and its corresponding following conversation message form a branch conversation pair, so a conversation pair of the target type includes multiple branch conversation pairs. Subsequently, depending on which preceding conversation message the user's input conversation message most resembles, the flow jumps to the corresponding branch conversation pair.
Specifically, if the current conversation pair is of the intent-conversation type, the terminal recognizes the conversation intent of the input conversation message according to a preset intent recognition strategy; alternatively, the terminal sends the input conversation message to the server, which recognizes the intent according to the preset strategy. This embodiment presets multiple intent recognition strategies, such as rule matching and model recognition, and different strategies can be used in different situations as required.
Rule matching may recognize intent by checking whether the input conversation message contains preset keywords that characterize a certain conversation intent. Model recognition may use a pretrained machine learning model for intent recognition. Each intent recognition strategy has corresponding usage conditions; a usage condition may be that one or more indicators of the input conversation message each reach a threshold, where the indicators specifically include the message data volume, the intent level of the current conversation pair, and the business scenario, among others. The message data volume can be determined from the length of the contained text or the size of the involved pictures. For example, when the message data volume of the input conversation message is large, or the intent level is relatively low, rule matching may be preferred.
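A sketch of that strategy dispatch under assumed thresholds; the threshold values, the keyword table, and the classify_with_model stub are illustrative, not values from this application.

```python
# Assumed keyword table mirroring the intents in the FIG. 3 example.
INTENT_KEYWORDS = {"recommend_product": ["推荐"], "give_gift": ["送", "礼物"]}

def classify_with_model(message_text: str) -> str:
    """Stand-in for a pretrained intent-recognition model."""
    raise NotImplementedError

def recognize_intent(message_text: str, intent_level: int,
                     size_threshold: int = 200, level_threshold: int = 2) -> str:
    # Prefer rule matching when the message is large or the intent level is low.
    if len(message_text) >= size_threshold or intent_level <= level_threshold:
        for intent, keywords in INTENT_KEYWORDS.items():
            if any(kw in message_text for kw in keywords):
                return intent
    return classify_with_model(message_text)
```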
When the input conversation message contains an input picture, conversation picture features can be extracted based on a pretrained first model. The first model may specifically be a convolutional neural network model, such as ResNet-80. The convolutional layers of the network convolve the conversation picture to extract its feature map, i.e., the picture feature in this embodiment.
The computer device determines the category label text corresponding to the conversation picture according to the graphic feature. The category label text is the label text of the category to which the conversation picture belongs. Specifically, the computer device can extract the graphic feature through the first model and then classify the extracted graphic feature to obtain the category of the conversation picture, thereby determining the corresponding category label text. In one embodiment, the first model may specifically be a convolutional neural network model: the computer device inputs the conversation picture into the model to extract its graphic feature, processes the graphic feature through a pooling layer and a fully connected layer to obtain the probability of each category the conversation picture may belong to, and takes the category label corresponding to the maximum probability as the category label of the conversation picture.
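A rough PyTorch sketch of this extract-then-classify step. The application names ResNet-80, which torchvision does not provide, so ResNet-50 stands in here; the category labels are assumed, and the linear head would still need to be trained on the task's own picture categories.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

CATEGORY_LABELS = ["straw_hat_chart", "uphill_chart", "steel_wire_chart"]  # assumed

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
features = torch.nn.Sequential(*list(backbone.children())[:-1])  # conv + pooling, no fc
classifier = torch.nn.Linear(2048, len(CATEGORY_LABELS))         # untrained head (sketch)

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def category_label(picture) -> str:
    x = preprocess(picture).unsqueeze(0)         # (1, 3, 224, 224)
    with torch.no_grad():
        feat = features(x).flatten(1)            # graphic feature, (1, 2048)
        probs = classifier(feat).softmax(dim=1)  # per-category probabilities
    return CATEGORY_LABELS[int(probs.argmax())]  # label with the maximum probability
```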
The computer device fuses the graphic feature and the corresponding category label text to obtain a comprehensive feature. The terminal extracts the text feature of the category label text based on a pretrained natural language model and cross-modally fuses the graphic feature with the text feature. Cross-modal fusion is the fusion of data of different modalities; in this embodiment, the data of different modalities are the graphic feature corresponding to the conversation picture and the text data corresponding to the category label text. Specifically, the computer device can map the extracted graphic feature and the corresponding category label text into data in the same space and then fuse the mapped data to obtain the comprehensive feature.
In one embodiment, the graphic feature of the conversation picture is extracted through the first model, and the computer device can extract the text feature of the category label text through a recurrent neural network. Both features can be represented as vectors. Before fusing them, the computer device can convert the graphic feature and the text feature into standard forms so that both feature vectors fall within the same range; for example, each can be normalized. Common normalization algorithms include function-based methods and probability-density methods. Function-based methods include, for example, the max-min function, the mean-variance function (which normalizes features into a consistent interval, such as one with mean 0 and variance 1), or the hyperbolic sigmoid (S-shaped growth curve) function.
Further, the computer device can perform a fusion operation on the normalized graphic feature and the text feature of the category label text to obtain the comprehensive feature. The fusion algorithm may specifically be one based on Bayesian decision theory, sparse representation theory, or deep learning theory. Alternatively, the computer device can take a weighted sum of the two normalized vectors to fuse the graphic feature and the text feature into the comprehensive feature.
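The weighted-sum variant is simple enough to sketch directly; the equal weights are an assumption, and the two vectors must already share a dimension (in practice a projection layer on one side maps both modalities into the same space).

```python
import numpy as np

def zscore(v: np.ndarray) -> np.ndarray:
    """Mean-variance normalization: map a feature vector to mean 0, variance 1."""
    return (v - v.mean()) / (v.std() + 1e-8)

def fuse(graphic: np.ndarray, text: np.ndarray, w: float = 0.5) -> np.ndarray:
    # Normalize both modalities into the same range, then take a weighted sum.
    assert graphic.shape == text.shape, "project to a common dimension first"
    return w * zscore(graphic) + (1.0 - w) * zscore(text)
```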
The computer device recognizes the intent of the input conversation message based on the comprehensive feature. Specifically, it processes the comprehensive feature through a second model and outputs the conversation intent of the conversation picture, for example by recognizing the objects in the picture and understanding the relationships between them. The conversation intent may be expressed as a word, a full sentence, a paragraph of text, or the like. The second model may specifically be a recurrent neural network model, such as an LSTM model.
In one embodiment, recognizing the intent of the conversation message based on the comprehensive feature includes: obtaining the intent pre-description text corresponding to the conversation picture; generating the predicted feature of the conversation picture based on the word vectors of the intent pre-description text; and inputting the comprehensive feature and the predicted feature into a pretrained model to output the conversation intent of the conversation picture.
The intent pre-description text is text that describes the conversation picture in advance; specifically, it may be an initial, relatively rough description obtained after a human interpretation of the conversation picture.
In one embodiment, the computer device can obtain the intent pre-description text corresponding to the conversation picture and the word vectors of that text. Using an encoder-decoder approach, the computer device can feed the comprehensive feature as the input at the first time step and each word vector as the input at the subsequent time steps, process the sequentially input comprehensive feature and word vectors through the second model, and output the conversation intent of the conversation message. In this way, the second model combines the comprehensive feature with the intent pre-description text, so that the output conversation intent fits the real intent expressed by the conversation picture more closely, greatly improving the accuracy of graphic understanding.
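A compact sketch of that decoding scheme in PyTorch: the fused comprehensive feature is fed at the first time step and the pre-description word vectors at the following steps. The dimensions, vocabulary size, and single-layer LSTM are assumptions; a trained model would be required for meaningful output.

```python
import torch
import torch.nn as nn

class IntentDecoder(nn.Module):
    """Feed the fused feature at step 0, then the pre-description word ids."""
    def __init__(self, feat_dim: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.out = nn.Linear(feat_dim, vocab_size)

    def forward(self, fused: torch.Tensor, word_ids: torch.Tensor) -> torch.Tensor:
        words = self.embed(word_ids)                         # (B, T, D)
        seq = torch.cat([fused.unsqueeze(1), words], dim=1)  # fused feature first
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                              # per-step vocabulary logits
```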
With the above conversation intent recognition method, the category label text corresponding to the conversation picture can be obtained quickly and accurately from the extracted graphic feature. Cross-modal fusion of the graphic feature and the corresponding category label text yields the comprehensive feature, from which the conversation intent of the conversation message is recognized. In this way, the intent recognition process makes full use of the graphic features of the conversation picture itself while also incorporating the category information of the picture. Such fine-grained, thorough use of the picture's features provides dual guidance from the graphic feature and the category label text when understanding the picture, greatly improving the accuracy of conversation picture understanding.
Further, each preceding conversation message preconfigured in the conversation task has a corresponding intent label. By comparing the intent labels with the conversation intent of the input conversation message, the preceding conversation message matching the input can be determined. The terminal takes the following conversation speech corresponding to the matched preceding conversation message as the response conversation message. As shown in FIG. 3, the 'intent conversation' pair includes two conversation branches. When the intent expressed by the user's input conversation message is 'recommend a product', the flow goes to one conversation branch and the virtual user object outputs the response conversation message 'We're such close friends, it's fine, you don't need to tell me all this.' When the intent expressed by the user's input conversation message is 'give a gift', the flow goes to the other conversation branch and the virtual user object outputs the response conversation message 'Sure!'
In the above embodiment, intent recognition based on the comprehensive feature obtained by cross-modal fusion of the graphic and text features allows the recognition process to make full use of the graphic features of the conversation picture itself while also incorporating the picture's category information, obtaining dual guidance from the graphic feature and the category label text and greatly improving the accuracy of intent recognition for the input conversation message.
In one embodiment, the above conversation task generation method further includes: when the component type is the second type and the input conversation message includes a conversation picture, triggering the target virtual user object to recognize the drawing trajectory of the conversation picture; determining the pixel value of the pixel points in the conversation picture that the drawing trajectory passes through as a first pixel value, and the pixel value of the pixel points that the drawing trajectory does not pass through as a second pixel value; extracting the graphic feature of each drawing stroke in the conversation picture whose pixel values have been updated; fusing the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture; calculating the similarity between the sequence feature of the conversation picture and the sequence feature of the reference explanation graph corresponding to each preceding conversation message in the current conversation branch; and using the following conversation speech corresponding to the preceding conversation message with the highest similarity as the response conversation message.
If the current conversation pair is a scoring conversation, a pretrained natural language processing model is used to extract the semantic feature of the input conversation message, which is compared with the semantic feature of each preceding conversation message preset in the current conversation pair to obtain semantic similarities. The following conversation speech corresponding to the preceding conversation message with the highest semantic similarity is selected as the response conversation message.
If the input mode of the current conversation pair is 'graphic explanation', the user needs to input a conversation picture and explain it. The terminal displays a drawing-explanation prompt on the conversation page and shows a drawing page on which the user can draw the conversation picture. The drawing page may be the conversation message entry area in the conversation window or another page separate from the conversation window.
The terminal tracks the drawing process of the conversation picture. Specifically, the conversation application prompts the explanation steps, i.e., it prompts the user which explanation step's partial step graph should currently be drawn; each step graph may correspond to multiple drawing strokes and annotation text. A drawing stroke can be delimited by pause time and by whether the stylus or finger leaves the screen. After the step graph of the current order is drawn, the prompt for the next explanation step is given in response to a trigger operation on the 'Next' button. In different business scenarios, the types of conversation pictures to be drawn may differ; for example, in a product sales scenario, the picture types may include straw-hat charts, uphill charts, and steel-wire charts.
The terminal determines the pixel value of the pixel points in the conversation picture that the drawing trajectory passes through as the first pixel value, and the pixel value of the pixel points that the drawing trajectory does not pass through as the second pixel value. The terminal then extracts the graphic feature of each drawing stroke in the conversation picture whose pixel values have been updated.
In one embodiment, determining the graphic feature of each drawing stroke in the conversation picture according to the drawing trajectory includes: scaling the conversation picture to a standard size; updating the pixel value of each pixel point in the standard-size conversation picture according to the drawing trajectory; and extracting the graphic feature of each drawing stroke in the conversation picture whose pixel values have been updated.
Whenever the drawing of one explanation step's step graph is detected to be complete, the terminal extracts the graphic feature of that step graph and scores the conversation picture according to the extracted feature; or the terminal sends each step graph to the server, which extracts the graphic features and scores the conversation picture accordingly. Alternatively, when the entire conversation picture is finished, the terminal or the server extracts the graphic feature of each step graph in the above manner and scores the conversation picture according to the extracted features.
Different users may use first terminals of different sizes, so the canvas sizes of the drawn conversation pictures differ. The computer device scales the current step graph to the standard size so that each compressed step graph has the same number of pixel points. The standard size is a specified picture size.
After scaling each obtained step graph to the standard size, the computer device updates the pixel value of each pixel point in the standard-size step graph according to the drawing trajectory and filters out the pixel points (redundant points) that the drawing trajectory does not pass through. Through scaling and pixel value updating, coordinate normalization and step-graph redrawing can be achieved.
In one embodiment, updating the pixel value of each pixel point in the standard-size conversation picture according to the drawing trajectory includes: updating the pixel value of the pixel points in the standard-size conversation picture that the drawing trajectory passes through to the first pixel value; and updating the pixel value of the pixel points that the drawing trajectory does not pass through to the second pixel value.
The computer device updates the pixel value of the pixel points that the drawing trajectory passes through in the current step graph, which has been scaled to the standard size, to the first pixel value, and updates the pixel value of the pixel points that the drawing trajectory does not pass through to the second pixel value. The first and second pixel values are different values, so pixel points the trajectory passes through and those it does not are distinguished by their pixel values.
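A sketch of that binarization with NumPy; the 256×256 standard size and the 255/0 choice of first and second pixel values are assumptions.

```python
from typing import List, Tuple

import numpy as np

STANDARD_SIZE = (256, 256)          # assumed standard canvas size
FIRST_PIXEL, SECOND_PIXEL = 255, 0  # assumed on-trajectory / off-trajectory values

def redraw(trajectory: List[Tuple[int, int]]) -> np.ndarray:
    """Rebuild a standard-size step graph from the (x, y) points of the trajectory."""
    canvas = np.full(STANDARD_SIZE, SECOND_PIXEL, dtype=np.uint8)
    for x, y in trajectory:
        if 0 <= y < STANDARD_SIZE[0] and 0 <= x < STANDARD_SIZE[1]:
            canvas[y, x] = FIRST_PIXEL  # pixel the drawing trajectory passes through
    return canvas
```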
Further, the computer device extracts the graphic vector information of the step graph that has been scaled to the standard size and had its pixel values updated. The graphic vector information may be a piece of JSON (JavaScript Object Notation) data that includes a text field 'testing' and a drawing field 'drawing'.
A step graph consists of one or more drawing strokes, and each drawing stroke consists of multiple pixel points with consecutive coordinates. The drawing field therefore includes the horizontal coordinates x and vertical coordinates y of the pixel points of each drawing stroke in the step graph. For example, (x1, y1) are the coordinates of the pixel points of one drawing stroke, and (x2, y2) are the coordinates of the pixel points of another drawing stroke.
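A sketch of what such a record might look like when built and serialized from Python; the field names 'testing' and 'drawing' follow the description above, while the coordinate values are made up.

```python
import json

# One stroke = parallel lists of x and y coordinates; two strokes shown here.
vector_info = {
    "testing": "annotation text for this step graph",
    "drawing": [
        {"x": [12, 13, 14, 15], "y": [40, 41, 42, 43]},  # first drawing stroke
        {"x": [60, 61, 62], "y": [10, 10, 11]},          # second drawing stroke
    ],
}
print(json.dumps(vector_info, ensure_ascii=False))
```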
The computer device inputs the graphic vector information into a graphic feature extraction model to obtain the sequence feature of the corresponding step graph. The graphic feature extraction model includes a LeNet model (a convolutional neural network model) and a sequence model. The LeNet model includes convolutional layers, pooling layers, and fully connected layers: the computer device inputs the graphic vector information into the convolutional layers for convolution, inputs the first feature matrix output by the convolutional layers into the pooling layer for normalization to obtain a second feature matrix formed by projecting the maximum weight of each feature vector in the first feature matrix, and inputs the second feature matrix into the fully connected layer for classification to obtain the graphic feature of each class. A graphic feature may specifically be data extracted from the conversation picture that represents the shape or spatial relationships of the picture, yielding a 'non-picture' representation or description of the picture, such as numbers, vectors, or symbols.
The computer device fuses the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture. Specifically, it calls the sequence model to encode the graphic features into the sequence feature of the corresponding step graph. The sequence model may be a recurrent neural network model with three convolutional layers, two LSTM layers, and a Softmax classification layer; the numbers of convolutional and LSTM layers can easily be adjusted as needed. The convolutional layers reduce the amount of graphic feature data while keeping the feature information complete. The LSTM layers compute the sequence feature of the current stroke by combining the graphic feature of the previous stroke with that of the current stroke. An LSTM layer includes a forget gate, an input gate, and an output gate: the forget gate discards part of the graphic feature of the previous drawing stroke, the input gate updates the graphic feature of the current drawing stroke, and the output gate operates on the forgotten and updated graphic features to produce the sequence feature of the current drawing stroke.
The Softmax classification layer fuses the sequence features of the multiple drawing strokes into the sequence feature of the corresponding step graph. Specifically, the computer device can map the same-dimensional sequence features of multiple drawing strokes into data in the same space and fuse the mapped data, for example by vector concatenation. The computer device can also fuse the multiple sequence features using algorithms based on Bayesian decision theory, sparse representation theory, or deep learning theory to obtain the sequence feature of the entire conversation picture.
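A rough PyTorch sketch of such a stroke-sequence encoder: three convolutional layers, two LSTM layers, and mean pooling standing in for the fusion step. All sizes are assumptions, and a trained model would be required in practice.

```python
import torch
import torch.nn as nn

class StrokeSequenceEncoder(nn.Module):
    """Encode per-stroke graphic features into one sequence feature per picture."""
    def __init__(self, in_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)

    def forward(self, strokes: torch.Tensor) -> torch.Tensor:
        # strokes: (batch, n_strokes, in_dim), one graphic feature per drawing stroke
        x = self.conv(strokes.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(x)   # previous-stroke context flows through the LSTM
        return out.mean(dim=1)  # fuse stroke features into one picture feature
```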
As described above, each conversation pair in the conversation task is preset with multiple preceding conversation messages serving as references. When the input mode of a conversation pair is 'graphic explanation', the corresponding preceding conversation messages are reference explanation graphs. The sequence feature of a reference explanation graph may be computed dynamically each time it is needed, reducing the storage footprint on the computer device, or it may be precomputed and stored, improving the efficiency of obtaining the sequence feature and thus of scoring the corresponding explanation graph.
The computer device calculates the similarity between the sequence feature of the conversation picture and that of the corresponding reference explanation graph based on a similarity calculation model, which may be a siamese neural network. The computer device can also calculate this similarity in other ways, which is not limited here. The computer device uses the similarity as the score of the conversation picture, or converts the similarity numerically according to preset logic to obtain the score.
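As one stand-in for the similarity model, a plain cosine similarity between the two sequence features, rescaled to a 0-100 score; the linear rescaling is one assumed instance of the 'preset logic' mentioned above.

```python
import numpy as np

def picture_score(seq_feat: np.ndarray, ref_feat: np.ndarray) -> float:
    # Cosine similarity in [-1, 1], linearly rescaled to a [0, 100] score.
    cos = float(seq_feat @ ref_feat /
                (np.linalg.norm(seq_feat) * np.linalg.norm(ref_feat) + 1e-8))
    return (cos + 1.0) * 50.0
```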
In one embodiment, the conversation picture is obtained by splicing multiple step graphs in drawing-time order, and fusing the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture includes: fusing the graphic features of the multiple drawing strokes in the step graph of the current order to obtain the sequence feature of that step graph; when the step graph of the next order is detected, taking it as the step graph of the current order and iterating until the step graph of the last order; and fusing the sequence features of the multiple step graphs to obtain the sequence feature of the conversation picture.
After extracting the sequence feature of each step graph constituting the conversation picture in the above manner, the computer device fuses the sequence features of the multiple step graphs to obtain the sequence feature of the conversation picture, and scores the conversation picture according to the similarity between that sequence feature and the sequence feature of the reference explanation graph.
In another embodiment, after extracting the sequence feature of each step graph, the computer device can also promptly score the current step graph according to the similarity between that step graph's sequence feature and the sequence feature of the partial graph of the corresponding explanation step in the reference explanation graph, and finally compute the score of the entire conversation picture from the scores of all the step graphs.
By tracking the drawing trajectory, graphic features can be extracted stroke by stroke, which not only enables monitoring of conversation messages in drawing format but also refines the granularity of graphic feature extraction. This helps improve the accuracy of the extracted graphic features and, in turn, helps the conversation task execute stably according to the preset score-based jump rules.
In the above embodiment, by tracking the drawing trajectory of the conversation picture and fusing the graphic features of the multiple drawing strokes, conversation messages in picture format can be scored and the granularity of graphic feature extraction is refined, which helps improve the accuracy of the extracted graphic features and, in turn, the accuracy of the conversation scoring results.
It should be understood that although the steps in the flowchart of FIG. 2 are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 4, a conversation task generation apparatus is provided, including a background determination module 402, a component construction module 404, a component conversion module 406, and a component splicing module 408, wherein:
the background determination module 402 is configured to determine the conversation background of a conversation task to be generated; the conversation task to be generated includes multiple conversation pairs;
the component construction module 404 is configured to obtain, when a conversation-component drag operation occurs, the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component;
the component conversion module 406 is configured to convert the following conversation text into following conversation speech matching the conversation background; and
the component splicing module 408 is configured to splice the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
In one embodiment, the component conversion module 406 is further configured to determine, according to the conversation background, the number of virtual user objects required for the conversation task to be generated and the role type of each virtual user object; determine the target virtual user object designated for the current conversation pair; and convert the following conversation text into following conversation speech according to the timbre category matched to the role type of the target virtual user object.
In one embodiment, the component splicing module 408 is configured to determine the execution order among the corresponding multiple conversation pairs according to the relative positional relationship of the multiple conversation components, and splice the multiple conversation components containing the preceding conversation messages and the following conversation texts according to the execution order.
In one embodiment, the above apparatus further includes a task execution module 410, configured to: when a conversation task processing instruction is obtained, display the conversation page corresponding to the conversation task; obtain an input conversation message generated on the conversation page; determine the target virtual user object designated for the conversation pair to which the input conversation message belongs; trigger the target virtual user object to perform business processing based on the input conversation message to obtain response conversation speech; and reply with the response conversation speech through the target virtual user object.
In one embodiment, the task execution module 410 is further configured to determine the component type of the conversation component, where a conversation pair whose component type is the target type includes multiple preceding conversation messages and the following conversation speech corresponding to each preceding conversation message; when the component type is the first type and the input conversation message includes a conversation picture, trigger the target virtual user object to extract the graphic feature of the conversation picture; determine the category label text corresponding to the conversation picture according to the graphic feature; fuse the graphic feature and the corresponding category label text to obtain a comprehensive feature; determine the conversation intent of the input conversation message based on the comprehensive feature; obtain the intent label corresponding to each preceding conversation message in the conversation pair; and use the following conversation speech corresponding to the preceding conversation message whose intent label matches the conversation intent as the response conversation message.
In one embodiment, the task execution module 410 is further configured to: when the component type is the second type and the input conversation message includes a conversation picture, trigger the target virtual user object to recognize the drawing trajectory of the conversation picture; determine the pixel value of the pixel points in the conversation picture that the drawing trajectory passes through as the first pixel value, and the pixel value of the pixel points that the drawing trajectory does not pass through as the second pixel value; extract the graphic feature of each drawing stroke in the conversation picture whose pixel values have been updated; fuse the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture; calculate the similarity between the sequence feature of the conversation picture and the sequence feature of the reference explanation graph corresponding to each preceding conversation message in the current conversation branch; and use the following conversation speech corresponding to the preceding conversation message with the highest similarity as the response conversation message.
For specific limitations on the conversation task generation apparatus, reference may be made to the limitations on the conversation task generation method above, which are not repeated here. Each module in the above conversation task generation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or independent of, the processor of the computer device, or stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The network interface of the computer device is used to communicate with external terminals over a network connection. The computer program, when executed by the processor, implements a conversation task generation method. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, a button, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
A computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the steps of the conversation task generation method provided in any embodiment of this application are implemented. The computer-readable storage medium may be non-volatile or volatile.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement that can readily be conceived by those skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A conversation task generation method, the method comprising:
    determining the conversation background of a conversation task to be generated; the conversation task to be generated comprises multiple conversation pairs;
    when a conversation-component drag operation occurs, obtaining the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component;
    converting the following conversation text into following conversation speech matching the conversation background; and
    splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
  2. The method according to claim 1, wherein converting the following conversation text into following conversation speech matching the conversation background comprises:
    determining, according to the conversation background, the number of virtual user objects required for the conversation task to be generated and the role type of each virtual user object;
    determining the target virtual user object designated for the current conversation pair; and
    converting the following conversation text into following conversation speech according to the timbre category matched to the role type of the target virtual user object.
  3. The method according to claim 2, wherein splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts comprises:
    determining the execution order among the corresponding multiple conversation pairs according to the relative positional relationship of the multiple conversation components; and
    splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts according to the execution order.
  4. The method according to claim 3, wherein, after the conversation task is obtained, the method further comprises:
    when a conversation task processing instruction is obtained, displaying the conversation page corresponding to the conversation task;
    obtaining an input conversation message generated on the conversation page;
    determining the target virtual user object designated for the conversation pair to which the input conversation message belongs;
    triggering the target virtual user object to perform business processing based on the input conversation message to obtain response conversation speech; and
    replying with the response conversation speech through the target virtual user object.
  5. The method according to claim 3, wherein triggering the target virtual user object to perform business processing based on the input conversation message to obtain response conversation speech comprises:
    determining the component type of the conversation component;
    when the component type is a first type and the input conversation message includes a conversation picture, triggering the target virtual user object to extract the graphic feature of the conversation picture;
    determining the category label text corresponding to the conversation picture according to the graphic feature;
    fusing the graphic feature and the corresponding category label text to obtain a comprehensive feature;
    determining the conversation intent of the input conversation message based on the comprehensive feature;
    obtaining the intent label corresponding to each preceding conversation message in the conversation pair to which the input conversation message belongs; and
    using the following conversation speech corresponding to the preceding conversation message whose intent label matches the conversation intent as the response conversation message.
  6. The method according to claim 5, wherein the method further comprises:
    when the component type is a second type and the input conversation message includes a conversation picture, triggering the target virtual user object to recognize the drawing trajectory of the conversation picture;
    determining the pixel value of the pixel points in the conversation picture that the drawing trajectory passes through as a first pixel value, and the pixel value of the pixel points that the drawing trajectory does not pass through as a second pixel value;
    extracting the graphic feature of each drawing stroke in the conversation picture whose pixel values have been updated;
    fusing the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture;
    calculating the similarity between the sequence feature of the conversation picture and the sequence feature of the reference explanation graph corresponding to each preceding conversation message in the current conversation branch; and
    using the following conversation speech corresponding to the preceding conversation message with the highest similarity as the response conversation message.
  7. A conversation task generation apparatus, wherein the apparatus comprises:
    a background determination module, configured to determine the conversation background of a conversation task to be generated; the conversation task to be generated comprises multiple conversation pairs;
    a component construction module, configured to obtain, when a conversation-component drag operation occurs, the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component;
    a component conversion module, configured to convert the following conversation text into following conversation speech matching the conversation background; and
    a component splicing module, configured to splice the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
  8. The apparatus according to claim 7, wherein the component conversion module is further configured to determine, according to the conversation background, the number of virtual user objects required for the conversation task to be generated and the role type of each virtual user object; determine the target virtual user object designated for the current conversation pair; and convert the following conversation text into following conversation speech according to the timbre category matched to the role type of the target virtual user object.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, is configured to implement the following steps:
    determining the conversation background of a conversation task to be generated; the conversation task to be generated comprises multiple conversation pairs;
    when a conversation-component drag operation occurs, obtaining the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component;
    converting the following conversation text into following conversation speech matching the conversation background; and
    splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
  10. The computer device according to claim 9, wherein the processor is configured to:
    determine, according to the conversation background, the number of virtual user objects required for the conversation task to be generated and the role type of each virtual user object;
    determine the target virtual user object designated for the current conversation pair; and
    convert the following conversation text into following conversation speech according to the timbre category matched to the role type of the target virtual user object.
  11. The computer device according to claim 10, wherein the processor is configured to:
    determine the execution order among the corresponding multiple conversation pairs according to the relative positional relationship of the multiple conversation components; and
    splice the multiple conversation components containing the preceding conversation messages and the following conversation texts according to the execution order.
  12. The computer device according to claim 11, wherein the processor is configured to:
    when a conversation task processing instruction is obtained, display the conversation page corresponding to the conversation task;
    obtain an input conversation message generated on the conversation page;
    determine the target virtual user object designated for the conversation pair to which the input conversation message belongs;
    trigger the target virtual user object to perform business processing based on the input conversation message to obtain response conversation speech; and
    reply with the response conversation speech through the target virtual user object.
  13. The computer device according to claim 11, wherein the processor is configured to:
    determine the component type of the conversation component;
    when the component type is a first type and the input conversation message includes a conversation picture, trigger the target virtual user object to extract the graphic feature of the conversation picture;
    determine the category label text corresponding to the conversation picture according to the graphic feature;
    fuse the graphic feature and the corresponding category label text to obtain a comprehensive feature;
    determine the conversation intent of the input conversation message based on the comprehensive feature;
    obtain the intent label corresponding to each preceding conversation message in the conversation pair to which the input conversation message belongs; and
    use the following conversation speech corresponding to the preceding conversation message whose intent label matches the conversation intent as the response conversation message.
  14. The computer device according to claim 13, wherein the processor is configured to:
    when the component type is a second type and the input conversation message includes a conversation picture, trigger the target virtual user object to recognize the drawing trajectory of the conversation picture;
    determine the pixel value of the pixel points in the conversation picture that the drawing trajectory passes through as a first pixel value, and the pixel value of the pixel points that the drawing trajectory does not pass through as a second pixel value;
    extract the graphic feature of each drawing stroke in the conversation picture whose pixel values have been updated;
    fuse the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture;
    calculate the similarity between the sequence feature of the conversation picture and the sequence feature of the reference explanation graph corresponding to each preceding conversation message in the current conversation branch; and
    use the following conversation speech corresponding to the preceding conversation message with the highest similarity as the response conversation message.
  15. A computer-readable storage medium, on which a computer program is stored, wherein the computer program comprises program instructions which, when executed by a processor, are configured to implement the following steps:
    determining the conversation background of a conversation task to be generated; the conversation task to be generated comprises multiple conversation pairs;
    when a conversation-component drag operation occurs, obtaining the preceding conversation message and the following conversation text added for the corresponding conversation pair via the dragged conversation component;
    converting the following conversation text into following conversation speech matching the conversation background; and
    splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts to obtain the conversation task.
  16. The computer-readable storage medium according to claim 15, wherein the program instructions, when executed by the processor, are further configured to implement the following steps:
    determining, according to the conversation background, the number of virtual user objects required for the conversation task to be generated and the role type of each virtual user object;
    determining the target virtual user object designated for the current conversation pair; and
    converting the following conversation text into following conversation speech according to the timbre category matched to the role type of the target virtual user object.
  17. The computer-readable storage medium according to claim 16, wherein the program instructions, when executed by the processor, are further configured to implement the following steps:
    determining the execution order among the corresponding multiple conversation pairs according to the relative positional relationship of the multiple conversation components; and
    splicing the multiple conversation components containing the preceding conversation messages and the following conversation texts according to the execution order.
  18. The computer-readable storage medium according to claim 17, wherein the program instructions, when executed by the processor, are further configured to implement the following steps:
    when a conversation task processing instruction is obtained, displaying the conversation page corresponding to the conversation task;
    obtaining an input conversation message generated on the conversation page;
    determining the target virtual user object designated for the conversation pair to which the input conversation message belongs;
    triggering the target virtual user object to perform business processing based on the input conversation message to obtain response conversation speech; and
    replying with the response conversation speech through the target virtual user object.
  19. The computer-readable storage medium according to claim 17, wherein the program instructions, when executed by the processor, are further configured to implement the following steps:
    determining the component type of the conversation component;
    when the component type is a first type and the input conversation message includes a conversation picture, triggering the target virtual user object to extract the graphic feature of the conversation picture;
    determining the category label text corresponding to the conversation picture according to the graphic feature;
    fusing the graphic feature and the corresponding category label text to obtain a comprehensive feature;
    determining the conversation intent of the input conversation message based on the comprehensive feature;
    obtaining the intent label corresponding to each preceding conversation message in the conversation pair to which the input conversation message belongs; and
    using the following conversation speech corresponding to the preceding conversation message whose intent label matches the conversation intent as the response conversation message.
  20. The computer-readable storage medium according to claim 19, wherein the program instructions, when executed by the processor, are further configured to implement the following steps:
    when the component type is a second type and the input conversation message includes a conversation picture, triggering the target virtual user object to recognize the drawing trajectory of the conversation picture;
    determining the pixel value of the pixel points in the conversation picture that the drawing trajectory passes through as a first pixel value, and the pixel value of the pixel points that the drawing trajectory does not pass through as a second pixel value;
    extracting the graphic feature of each drawing stroke in the conversation picture whose pixel values have been updated;
    fusing the graphic features of the multiple drawing strokes to obtain the sequence feature of the conversation picture;
    calculating the similarity between the sequence feature of the conversation picture and the sequence feature of the reference explanation graph corresponding to each preceding conversation message in the current conversation branch; and
    using the following conversation speech corresponding to the preceding conversation message with the highest similarity as the response conversation message.
PCT/CN2020/104671 2019-12-10 2020-07-25 Conversation task generation method and apparatus, computer device and storage medium WO2021114682A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911257891.8A CN111224863B (zh) 2019-12-10 2019-12-10 Conversation task generation method and apparatus, computer device and storage medium
CN201911257891.8 2019-12-10

Publications (1)

Publication Number Publication Date
WO2021114682A1 true WO2021114682A1 (zh) 2021-06-17

Family

ID=70829767

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/104671 WO2021114682A1 (zh) 2019-12-10 2020-07-25 Conversation task generation method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN111224863B (zh)
WO (1) WO2021114682A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626573A * 2021-08-11 2021-11-09 北京深维智信科技有限公司 Method and system for extracting objections and responses in sales conversations
CN118132731A * 2024-05-06 2024-06-04 杭州数云信息技术有限公司 Dialogue method and apparatus, storage medium, terminal, and computer program product

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111224863B (zh) * 2019-12-10 2021-06-22 平安国际智慧城市科技股份有限公司 Conversation task generation method and apparatus, computer device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704530A * 2017-09-19 2018-02-16 百度在线网络技术(北京)有限公司 Voice device interaction method, apparatus, and device
US20180052664A1 * 2016-08-16 2018-02-22 Rulai, Inc. Method and system for developing, training, and deploying effective intelligent virtual agent
CN108766561A * 2018-05-31 2018-11-06 平安医疗科技有限公司 Disease information processing method and apparatus, computer device, and storage medium
CN109739605A * 2018-12-29 2019-05-10 北京百度网讯科技有限公司 Method and apparatus for generating information
CN109857910A * 2019-01-07 2019-06-07 平安科技(深圳)有限公司 XML file generation method and apparatus, computer device, and storage medium
CN111224863A * 2019-12-10 2020-06-02 平安国际智慧城市科技股份有限公司 Conversation task generation method and apparatus, computer device, and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6719739B2 * 2016-05-20 2020-07-08 日本電信電話株式会社 Dialogue method, dialogue system, dialogue apparatus, and program
JP6719747B2 * 2016-05-20 2020-07-08 日本電信電話株式会社 Dialogue method, dialogue system, dialogue apparatus, and program
WO2017200080A1 * 2016-05-20 2017-11-23 日本電信電話株式会社 Dialogue method, dialogue apparatus, and program
CN107610550A * 2017-09-12 2018-01-19 北京银河润泰科技有限公司 Method and apparatus for implementing an online classroom
CN107861951A * 2017-11-17 2018-03-30 康成投资(中国)有限公司 Conversation topic recognition method in intelligent customer service
CN108415706B * 2018-03-14 2021-07-06 上海携程商务有限公司 Method, system, device, and storage medium for visual web page generation
CN108429953A * 2018-04-11 2018-08-21 四川斐讯信息技术有限公司 Smart earphone for spoken foreign-language practice and human-computer interaction method thereof
CN110059182A * 2019-03-21 2019-07-26 阿里巴巴集团控股有限公司 Customer-service-oriented script recommendation method and apparatus
CN110162610A * 2019-04-16 2019-08-23 平安科技(深圳)有限公司 Robot intelligent response method and apparatus, computer device, and storage medium
CN110309279A * 2019-05-23 2019-10-08 平安国际智慧城市科技股份有限公司 Language-model-based speech training method and apparatus, and computer device
CN110413961B * 2019-06-21 2021-02-09 平安国际智慧城市科技股份有限公司 Method, apparatus, and computer device for scoring text based on a classification model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180052664A1 * 2016-08-16 2018-02-22 Rulai, Inc. Method and system for developing, training, and deploying effective intelligent virtual agent
CN107704530A * 2017-09-19 2018-02-16 百度在线网络技术(北京)有限公司 Voice device interaction method, apparatus, and device
CN108766561A * 2018-05-31 2018-11-06 平安医疗科技有限公司 Disease information processing method and apparatus, computer device, and storage medium
CN109739605A * 2018-12-29 2019-05-10 北京百度网讯科技有限公司 Method and apparatus for generating information
CN109857910A * 2019-01-07 2019-06-07 平安科技(深圳)有限公司 XML file generation method and apparatus, computer device, and storage medium
CN111224863A * 2019-12-10 2020-06-02 平安国际智慧城市科技股份有限公司 Conversation task generation method and apparatus, computer device, and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626573A * 2021-08-11 2021-11-09 北京深维智信科技有限公司 Method and system for extracting objections and responses in sales conversations
CN113626573B (zh) * 2021-08-11 2022-09-27 北京深维智信科技有限公司 Method and system for extracting objections and responses in sales conversations
CN118132731A * 2024-05-06 2024-06-04 杭州数云信息技术有限公司 Dialogue method and apparatus, storage medium, terminal, and computer program product

Also Published As

Publication number Publication date
CN111224863A (zh) 2020-06-02
CN111224863B (zh) 2021-06-22

Similar Documents

Publication Publication Date Title
WO2021233112A1 (zh) Translation method, apparatus, device, and storage medium based on multimodal machine learning
JP7432556B2 (ja) Method, apparatus, device, and medium for man-machine interaction
WO2021042904A1 (zh) Conversation intent recognition method and apparatus, computer device, and storage medium
US9875445B2 (en) Dynamic hybrid models for multimodal analysis
US20190034814A1 (en) Deep multi-task representation learning
WO2021114682A1 (zh) Conversation task generation method and apparatus, computer device, and storage medium
CN110688008A (zh) Virtual avatar interaction method and apparatus
US20240029436A1 (en) Action classification in video clips using attention-based neural networks
US20220083153A1 (en) System and method of determining input characters based on swipe input
CN109711356B (zh) Facial expression recognition method and system
US11544886B2 (en) Generating digital avatar
WO2021134417A1 (zh) Interaction behavior prediction method, intelligent device, and computer-readable storage medium
CN114173188B (zh) Video generation method, electronic device, storage medium, and digital human server
CN114495927A (zh) Method and apparatus for generating a multimodal interactive virtual digital human, storage medium, and terminal
US11322151B2 (en) Method, apparatus, and medium for processing speech signal
CN113590078A (zh) Virtual avatar synthesis method, apparatus, computing device, and storage medium
Siddique et al. Deep learning-based bangla sign language detection with an edge device
CN111222854B (zh) Interview method, apparatus, device, and storage medium based on an interview robot
CN117251057A (zh) Method and system for building an AI digital human based on AIGC
CN112910761B (zh) Instant messaging method, apparatus, device, storage medium, and program product
US20210407504A1 (en) Generation and operation of artificial intelligence based conversation systems
CN110689052B (zh) Conversation message processing method and apparatus, computer device, and storage medium
US11188158B2 (en) System and method of determining input characters based on swipe input
JP2017182261A (ja) Information processing apparatus, information processing method, and program
Gamage et al. Sinhala Sign Language Translation through Immersive 3D Avatars and Adaptive Learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20900443

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12/10/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20900443

Country of ref document: EP

Kind code of ref document: A1