CN113590769A

CN113590769A - State tracking method and device in task-driven multi-turn dialogue system

Info

Publication number: CN113590769A
Application number: CN202010366918.3A
Authority: CN
Inventors: 陈谦
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-04-30
Filing date: 2020-04-30
Publication date: 2021-11-02

Abstract

The embodiment of the application discloses a state tracking method and device in a task-driven multi-turn dialog system, wherein the method comprises the following steps: after the input text information corresponding to the current turn is determined, splicing the input text information corresponding to the current turn with the input text information corresponding to the received historical turn to obtain target text information; feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information are obtained by extracting features of the target text information; and inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model to obtain the dialog state information at the current moment. According to the embodiment of the application, the state tracking of the task-driven multi-turn conversation can be realized more simply and effectively.

Description

State tracking method and device in task-driven multi-turn dialogue system

Technical Field

The present invention relates to the field of state tracking processing in a task-driven multi-turn dialog system, and more particularly, to a state tracking method and apparatus in a task-driven multi-turn dialog system.

Background

Task-driven dialog systems are increasingly used in real-world scenarios. In a conventional task-driven dialog process, as shown in FIG. 1, speech input by a user is converted into text by an ASR (Automatic Speech Recognition) system; the text is converted into a triplet (field, semantic slot, value) describing the current state by an NLU (Natural Language Understanding) module and a DST (Dialog State Tracking) module; a DP (Dialog Policy) module generates the action to respond with; an NLG (Natural Language Generation) module converts that action into readable, understandable text; and a TTS (Text To Speech) system converts the text into speech, which is played to the user.

NLU and DST processing (i.e., converting the input text into triplets of the current dialog state) is one of the key steps. In the prior art, each round of NLU and DST processing works as follows: the NLU module uses JSGF (Java Speech Grammar Format) grammars to extract the state of the current round, and the DST module then determines whether the current round needs to inherit the state of the previous round or rounds. If so, the state of the current round is pieced together with the state of the previous round to obtain the state output of the current round. Whether to inherit the state of a previous round is decided mainly by predetermined rules, which are usually written by hand by experienced people such as experts. For example, a rule may specify that if the semantic slot in the state of the current round is "music" and the semantic slot in the state of the previous round is "artist", the state of the previous round needs to be inherited, and so on. Writing rules by hand is time-consuming and labor-intensive; as the number of rules grows, the maintenance cost increases exponentially, and good generalization is difficult to achieve.
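To make the maintenance burden of the prior-art approach concrete, the following is a minimal sketch of a hand-written inheritance rule of the kind described above. The names `INHERIT_RULES` and `merge_state`, and the representation of a round's state as a slot-to-value dictionary, are assumptions made for illustration and do not come from the source.

```python
# Hypothetical rule table: (slot in current round, slot in previous round)
# pairs for which the previous round's state is inherited.
INHERIT_RULES = {("music", "artist")}


def merge_state(previous: dict, current: dict) -> dict:
    """Piece the previous round's slots into the current state when a rule fires."""
    merged = dict(current)
    for prev_slot, prev_value in previous.items():
        if any((cur_slot, prev_slot) in INHERIT_RULES for cur_slot in current):
            # Only inherit slots that the current round did not fill itself.
            merged.setdefault(prev_slot, prev_value)
    return merged


previous_round = {"artist": "some artist"}
current_round = {"music": "some song"}
print(merge_state(previous_round, current_round))
```

Every new slot combination needs another entry in the rule table, which is the scaling problem the paragraph above points to.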

Therefore, how to more easily and effectively track the state of a task-driven multi-turn dialog becomes a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The application provides a state tracking method and a state tracking device in a task-driven multi-turn dialog system, which can more simply and effectively realize the state tracking of the task-driven multi-turn dialog.

The application provides the following scheme:

a method of state tracking in a task-driven multi-turn dialog system, comprising:

after the input text information corresponding to the current turn is determined, splicing the input text information corresponding to the current turn with the input text information corresponding to the received historical turn to obtain target text information;

feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information are obtained by extracting features of the target text information;

and inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model, wherein the deep learning model is used for generating context feature information for the modeling units by combining the feature vectors corresponding to the modeling units at the positions, predicting the field, the semantic slot and the slot value according to the context feature information, and determining whether to inherit the dialog state information in the history round so as to obtain the dialog state information at the current moment.
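The splicing step can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the separator identifier `[SEP]`, the function name `splice_turns`, the choice of one character per modeling unit, and the 0/1 segment ids are all hypothetical.

```python
SEP = "[SEP]"  # assumed separator identifier; the source does not name one


def splice_turns(history: list[str], current: str) -> tuple[list[str], list[int]]:
    """Return modeling units and segment ids (0 = historical turn, 1 = current turn)."""
    units: list[str] = []
    segments: list[int] = []
    for turn in history:
        # One character per modeling unit, followed by a separator identifier.
        units += list(turn) + [SEP]
        segments += [0] * (len(turn) + 1)
    units += list(current)
    segments += [1] * len(current)
    return units, segments


units, segments = splice_turns(["ab"], "cd")
print(units)
print(segments)
```

The segment ids record which turn each modeling unit came from, which is the information a model would need in order to decide whether historical state should be inherited.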

A method of building a deep learning model, comprising:

obtaining a training sample set, wherein the training sample set comprises a plurality of pieces of text information and corresponding marking information, the text information is obtained by splicing the text information in a plurality of rounds of conversations and inserting an identifier, and the marking information comprises domain or semantic slot information corresponding to modeling units at a plurality of positions of the text information;

inputting the plurality of pieces of text information into a deep learning model to carry out iteration for a plurality of times until the deep learning model is trained after the algorithm is converged; and in each iteration process, adjusting the weights of a plurality of layers of the deep learning model according to the difference between the output result of the deep learning model and the labeling information.
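The iterative training described above can be illustrated with a deliberately small stand-in. The sketch below trains a single softmax layer with gradient descent on toy data; the source describes a multi-layer deep learning model, so the single layer, the learning rate, the convergence threshold, and all function names are simplifying assumptions.

```python
import math
import random


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples, num_features, num_labels, lr=0.5, max_iters=500, tol=1e-3):
    """samples: list of (feature_vector, label_index) pairs."""
    rng = random.Random(0)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(num_features)]
               for _ in range(num_labels)]
    loss = float("inf")
    for _ in range(max_iters):
        loss = 0.0
        for x, y in samples:
            probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])
            loss -= math.log(probs[y])
            for k, row in enumerate(weights):
                # Difference between the model output and the one-hot label.
                diff = probs[k] - (1.0 if k == y else 0.0)
                for j in range(num_features):
                    row[j] -= lr * diff * x[j]
        loss /= len(samples)
        if loss < tol:  # treat a small average loss as convergence
            break
    return weights, loss


def predict(weights, x):
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return scores.index(max(scores))


toy_samples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
weights, final_loss = train(toy_samples, num_features=2, num_labels=2)
print(predict(weights, [1.0, 0.0]), predict(weights, [0.0, 1.0]))
```

In a real system each position of the spliced text would carry its own label, and the update would be propagated through every layer of the model rather than one weight matrix.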

An information processing method in a task-driven multi-turn dialog system, comprising:

the method comprises the steps that a client receives input information of a current turn and submits the input information to a server, so that the server determines text information corresponding to the input information of the current turn, the text information corresponding to the current turn is spliced with the received text information corresponding to a historical turn to obtain target text information, and feature vectors corresponding to modeling units in multiple positions in the target text information are obtained by extracting features of the target text information; inputting the feature vectors corresponding to the modeling units in the positions into a deep learning model to obtain the dialog state information at the current moment, and providing the dialog state information for a dialog decision module to generate response information of the current turn;

and receiving and outputting response information which is returned by the server and aims at the current round.

the server receives input information of the current turn submitted by the client;

determining text information corresponding to input information of the current round, and splicing the text information corresponding to the current round with the received text information corresponding to the historical round to obtain target text information;

inputting the feature vectors corresponding to the modeling units in the positions into a deep learning model to obtain the dialog state information at the current moment, and providing the dialog state information for a dialog decision module to generate response information of the current turn;

and returning the response information to the client.

An audio and video voice searching method comprises the following steps:

receiving an audio and video search request in a multi-round voice conversation mode;

inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model, wherein the deep learning model is used for generating context feature information for the modeling units by combining the feature vectors corresponding to the modeling units at the positions, predicting a field, a semantic slot and a slot value according to the context feature information, and determining whether to inherit the dialog state information in a historical turn so as to obtain the dialog state information at the current moment;

and generating a conversation strategy according to the conversation state information at the current moment so as to return voice response information corresponding to the voice input of the current round and corresponding audio/video search results.

A method of providing merchandise object information, comprising:

receiving a request for acquiring commodity object information in a multi-round voice conversation mode;

and generating a conversation strategy according to the conversation state information at the current moment so as to return the voice shopping guide information corresponding to the voice input of the current turn and the corresponding commodity object information.

the intelligent communication system receives the voice information of the current turn;

converting the voice information into text information, and splicing the text information corresponding to the current turn with the received text information corresponding to the historical turn to obtain target text information;

inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model to obtain the dialog state information at the current moment;

providing the conversation state information of the current moment to a conversation decision module to generate response information of the current turn;

and converting the response information into natural language and carrying out voice broadcasting.

the first equipment receives input information of the current turn input by the second equipment;

determining text information corresponding to the input information of the current round, and splicing the text information corresponding to the current round with the received text information corresponding to the historical round to obtain target text information;

and providing the conversation state information of the current moment to a conversation decision module, generating response information of the current turn, and providing the response information to the second equipment.

the self-service ticket vending machine equipment receives the voice information of the current turn;

A terminal device upgrading method comprises the following steps:

providing upgrade suggestion information to the terminal equipment;

after an upgrading request submitted by a terminal device is received, the terminal device is endowed with the authority of state tracking in a multi-round conversation process through a deep learning model; the deep learning model is used for generating context feature information for a modeling unit by combining feature vectors corresponding to the modeling unit at a plurality of positions in target text information, predicting a field, a semantic slot and a slot value according to the context feature information, and determining whether to inherit the dialog state information in a historical turn to obtain the dialog state information at the current moment; the target text information is obtained by splicing the input text information corresponding to the current turn with the input text information corresponding to the received historical turn.

A state tracking device in a task-driven multi-turn dialog system, comprising:

the text splicing unit is used for splicing the input text information corresponding to the current turn with the received input text information corresponding to the historical turn after the input text information corresponding to the current turn is determined, so that target text information is obtained;

the feature extraction unit is used for extracting features of the target text information to obtain feature vectors corresponding to the modeling units at a plurality of positions in the target text information;

and the input unit is used for inputting the feature vectors corresponding to the modeling units in the positions into a deep learning model, and the deep learning model is used for generating context feature information for the modeling units by combining the feature vectors corresponding to the modeling units in the positions, predicting a field, a semantic slot and a slot value according to the context feature information, and determining whether to inherit the dialog state information in the history round so as to obtain the dialog state information at the current moment.

An apparatus for building a deep learning model, comprising:

the training sample acquisition unit is used for acquiring a training sample set, the training sample set comprises a plurality of pieces of text information and corresponding marking information, the text information is obtained by splicing the text information in a plurality of rounds of conversations and inserting identifiers, and the marking information comprises domain or semantic slot information corresponding to modeling units at a plurality of positions of the text information;

the training unit is used for inputting the text messages into a deep learning model to carry out iteration for multiple times until the deep learning model is trained after the algorithm is converged; and in each iteration process, adjusting the weights of a plurality of layers of the deep learning model according to the difference between the output result of the deep learning model and the labeling information.

An information processing device in a task-driven multi-turn dialog system, applied to a client, comprises:

the system comprises an input information receiving unit, a service end and a modeling unit, wherein the input information receiving unit is used for receiving input information of a current round and submitting the input information to the service end so that the service end can determine text information corresponding to the input information of the current round, and obtains target text information by splicing the text information corresponding to the current round with the received text information corresponding to a historical round; inputting the feature vectors corresponding to the modeling units in the positions into a deep learning model to obtain the dialog state information at the current moment, and providing the dialog state information for a dialog decision module to generate response information of the current turn;

and the response output unit is used for receiving and outputting the response information which is returned by the server and aims at the current round.

An information processing device in a task-driven multi-turn dialog system, which is applied to a server side, comprises:

the input information receiving unit is used for receiving the input information of the current turn submitted by the client;

the text splicing unit is used for determining text information corresponding to the input information of the current round and splicing the text information corresponding to the current round with the received text information corresponding to the historical round to obtain target text information;

the dialogue state information determining unit is used for inputting the feature vectors corresponding to the modeling units in the positions into the deep learning model to obtain the dialogue state information at the current moment, and the dialogue state information is used for being provided for the dialogue decision-making module to generate response information of the current turn;

and the response information returning unit is used for returning the response information to the client.

An audio-video voice search apparatus, comprising:

the search request receiving unit is used for receiving an audio and video search request in a multi-round voice conversation mode;

the dialog state information determining unit is used for inputting the feature vectors corresponding to the modeling units in the positions into a deep learning model, the deep learning model is used for generating context feature information for the modeling units by combining the feature vectors corresponding to the modeling units in the positions, predicting the field, the semantic slot and the slot value according to the context feature information, and determining whether to inherit the dialog state information in the history round to obtain the dialog state information at the current moment;

and the conversation strategy generating unit is used for generating a conversation strategy according to the conversation state information at the current moment so as to return the voice response information corresponding to the voice input of the current round and the corresponding audio and video searching result.

An apparatus for providing commodity object information, comprising:

the request receiving unit is used for receiving a request for acquiring the commodity object information in a multi-round voice conversation mode;

and the conversation strategy generating unit is used for generating a conversation strategy according to the conversation state information at the current moment so as to return the voice shopping guide information corresponding to the voice input of the current round and the corresponding commodity object information.

An information processing device in a task-driven multi-turn conversation system, applied to an intelligent conversation system, comprises:

the voice information receiving unit is used for receiving the voice information of the current turn;

the text processing unit is used for converting the voice information into text information and splicing the text information corresponding to the current turn with the received text information corresponding to the historical turn to obtain target text information;

the dialogue state information determining unit is used for inputting the feature vectors corresponding to the modeling units in the positions into the deep learning model to obtain the dialogue state information at the current moment;

the response information generating unit is used for providing the conversation state information of the current moment to the conversation decision module and generating response information of the current turn;

and the response information conversion unit is used for converting the response information into natural language and carrying out voice broadcasting.

An information processing apparatus in a task-driven multi-turn dialog system, applied to a first device, comprising:

the input information receiving unit is used for receiving input information of the current turn input by the second equipment;

the text processing unit is used for determining the text information corresponding to the input information of the current round and splicing the text information corresponding to the current round with the received text information corresponding to the historical round to obtain target text information;

and the response information generating unit is used for providing the conversation state information of the current moment to the conversation decision module, generating response information of the current turn and providing the response information to the second equipment.

An information processing device in a task-driven multi-turn dialogue system applied to self-service ticket vending machine equipment comprises:

the voice receiving unit is used for receiving the voice information of the current turn;

the voice recognition unit is used for converting the voice information into text information and splicing the text information corresponding to the current turn with the received text information corresponding to the historical turn to obtain target text information;

and the response information output unit is used for converting the response information into natural language and carrying out voice broadcast.

A terminal device upgrading apparatus includes:

a recommendation information providing unit for providing upgrade recommendation information to the terminal device;

the authority granting unit is used for granting the authority for the terminal equipment to perform state tracking in a multi-round conversation process through a deep learning model after receiving an upgrading request submitted by the terminal equipment; the deep learning model is used for generating context feature information for a modeling unit by combining feature vectors corresponding to the modeling unit at a plurality of positions in target text information, predicting a field, a semantic slot and a slot value according to the context feature information, and determining whether to inherit the dialog state information in a historical turn to obtain the dialog state information at the current moment; the target text information is obtained by splicing the input text information corresponding to the current turn with the input text information corresponding to the received historical turn.

According to the specific embodiments provided herein, the present application discloses the following technical effects:

According to the method and the device, the field of the current dialog state, the semantic slots, and the slot values can be predicted through a deep learning model, which also judges whether the states in the historical rounds need to be inherited, and the triplet expressing the current dialog state is generated on that basis. The process is entirely data-driven, and judgment rules do not need to be written by hand, so efficiency and generalization capability can be improved. In addition, because the state tracking result is obtained directly from the input text information without passing through multiple modules such as NLU and DST, end-to-end dialog state tracking can be realized and the accumulation of errors is avoided.

Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic process flow diagram of the prior art;

FIG. 2 is a schematic process flow diagram provided by an embodiment of the present application;

FIGS. 3-1 to 3-3 are schematic diagrams of system architectures provided by embodiments of the present application;

FIG. 4 is a flow chart of a first method provided by an embodiment of the present application;

FIG. 5 is a schematic process flow diagram provided by an embodiment of the present application;

FIG. 6 is a flow chart of a second method provided by embodiments of the present application;

FIG. 7 is a flow chart of a third method provided by embodiments of the present application;

FIG. 8 is a flow chart of a fourth method provided by embodiments of the present application;

FIG. 9 is a flow chart of a fifth method provided by embodiments of the present application;

FIG. 10 is a flow chart of a sixth method provided by embodiments of the present application;

FIG. 11 is a flow chart of a seventh method provided by embodiments of the present application;

FIG. 12 is a flow chart of an eighth method provided by embodiments of the present application;

FIG. 13 is a flow chart of a ninth method provided by embodiments of the present application;

FIG. 14 is a flow chart of a tenth method provided by embodiments of the present application;

FIG. 15 is a schematic diagram of a first apparatus provided by an embodiment of the present application;

FIG. 16 is a schematic diagram of a second apparatus provided by an embodiment of the present application;

FIG. 17 is a schematic diagram of a third apparatus provided by an embodiment of the present application;

FIG. 18 is a schematic diagram of a fourth apparatus provided by an embodiment of the present application;

FIG. 19 is a schematic diagram of a fifth apparatus provided by an embodiment of the present application;

FIG. 20 is a schematic view of a sixth apparatus provided by an embodiment of the present application;

FIG. 21 is a schematic diagram of a seventh apparatus provided by an embodiment of the present application;

FIG. 22 is a schematic diagram of an eighth apparatus provided by an embodiment of the present application;

FIG. 23 is a schematic view of a ninth apparatus provided by an embodiment of the present application;

FIG. 24 is a schematic view of a tenth apparatus provided by an embodiment of the present application;

FIG. 25 is a schematic diagram of an electronic device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.

To facilitate understanding of the solution provided by the embodiments of the present application, a conventional task-driven dialog process is first described in detail below. As shown in FIG. 1, a conventional task-driven dialog system is composed of five modules: speech recognition, natural language understanding, dialog management, natural language generation, and speech synthesis. More and more products now also incorporate a knowledge base, which is mainly introduced in the dialog management module. The main functions of each module are as follows:

Speech Recognition (ASR): responsible for converting speech signals into text information.

Natural Language Understanding (NLU): processes the sentence input by the user or the result of speech recognition, and extracts the user's dialog intention and the information the user conveys.

Dialog Management (DM): divided into two submodules, Dialog State Tracking (DST) and dialog policy learning (DP), whose main function is to update the state of the system according to the result of the NLU and generate the corresponding system action.

Natural Language Generation (NLG): converts the system action output by the DM into text, so that the system action is expressed in text form.

The NLU mainly implements intent recognition and slot filling. For example, when the user inputs "play Jay Chou's Rice Fragrance", the NLU module first identifies the "music" field through field identification, then identifies the user intention as "play_music" through user intention detection, and finally assigns each word to its corresponding slot through slot filling: "play [O] / Jay Chou [B-singer] / 's [O] / Rice Fragrance [B-song]".
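The slot-filling output in the example above can be turned into slot values with a short decoding routine. The following is a minimal sketch assuming the standard BIO tagging convention; the `I-` continuation tags, the word-level tokenization, and the function names are assumptions, since the source only shows `B-` and `O` tags.

```python
def _flush(slots, name, words):
    """Store the slot collected so far, if any."""
    if name is not None:
        slots[name] = " ".join(words)


def decode_slots(tokens, tags):
    """Collect BIO-tagged tokens into a {slot_name: value} mapping."""
    slots = {}
    name, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new slot begins; close the previous one first.
            _flush(slots, name, words)
            name, words = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            words.append(token)
        else:
            # "O" (or an inconsistent tag) ends the current slot.
            _flush(slots, name, words)
            name, words = None, []
    _flush(slots, name, words)
    return slots


tokens = ["play", "Jay", "Chou", "'s", "Rice", "Fragrance"]
tags = ["O", "B-singer", "I-singer", "O", "B-song", "I-song"]
print(decode_slots(tokens, tags))
```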

However, because errors may occur in both the ASR and NLU stages, the content input into the DST is typically an N-best list (for ASR, not a single sentence but N candidate sentences, each with a confidence level). The DST likewise often outputs a probability distribution over the individual dialog states, a representation that also makes it easier to revise the state over multiple rounds of dialog.

The dialog state and DST may be defined as follows:

Dialog state: at time t, the probability distribution over the values of each current semantic slot, obtained by combining the dialog history with the current user input and used as the input of the DP module. The dialog state at this time is denoted St.

DST (Dialog State Tracking): inferring the current dialog state St and the user goals from all dialog history information.
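As an illustration of a dialog state expressed as a probability distribution over slot values, the sketch below folds an N-best list into per-slot distributions by normalizing the confidence levels. The normalization scheme and the name `belief_from_nbest` are assumptions for illustration; the source does not specify how the distribution is computed.

```python
from collections import defaultdict


def belief_from_nbest(nbest):
    """nbest: list of (confidence, {slot: value}); returns {slot: {value: probability}}."""
    total = sum(confidence for confidence, _ in nbest)
    belief = defaultdict(lambda: defaultdict(float))
    for confidence, slots in nbest:
        for slot, value in slots.items():
            # Each hypothesis contributes its share of the total confidence.
            belief[slot][value] += confidence / total
    return {slot: dict(values) for slot, values in belief.items()}


nbest = [
    (0.6, {"singer": "A"}),
    (0.3, {"singer": "B"}),
    (0.1, {"singer": "A"}),
]
print(belief_from_nbest(nbest))
```

If a hypothesis does not mention a slot, the probabilities for that slot sum to less than one; the remaining mass can be read as "slot not filled".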

However, as described in the background section, the rule-based method requires a large amount of manual work and expert knowledge, and is therefore poorly suited to complex scenes. Other DST methods include generative models, discriminative models, and the like.

In summary, in the conventional scheme, after the user's voice input has been recognized as text, intent recognition and slot filling are performed by the NLU module, and the result is then converted by the DST module into a triplet of the current state, which serves as the input of the DP module.

In the embodiment of the application, a single deep learning model can complete the tasks of both the NLU module and the DST module, realizing end-to-end multi-turn dialog state tracking. "End-to-end" is a concept from the field of machine learning: the raw information is input directly and a usable result is obtained directly, without attention to intermediate products. Specifically, in the embodiment of the present application, the tasks performed by the NLU and the DST in the conventional scheme are performed by one deep learning model. The advantage is that the current dialog state can be tracked directly from the user's input information (text information or a speech recognition result), and the triplet representing the current dialog state is output. Because only one deep learning model needs to be trained in advance, rather than separate models for the NLU module and the DST module, the accumulation of errors is avoided. In addition, the whole state tracking process (including the judgment of whether the states in the historical rounds need to be inherited) is entirely data-driven and requires no hand-written rules, so more efficient dynamic tracking can be realized and the generalization capability of the scheme is improved.

Specifically, in the "end-to-end" multi-turn dialog state tracking method provided in the embodiment of the present application, the current dialog state can be tracked in each dialog turn. After the input information of the current dialog turn is received (voice input may first be converted into text), referring to FIG. 2, the text information corresponding to the current turn may be spliced with the text information corresponding to the received historical turns (one or more) to obtain a spliced text. Next, features may be extracted from the spliced text; these may include, for each modeling unit (for Chinese, for example, one Chinese character may correspond to one modeling unit), a position feature, a segmentation feature (whether the modeling unit belongs to the current round or a historical round), a semantic label feature, and the like. A feature vector is then generated for each modeling unit position in the spliced text from the extracted features (for example, by adding together the position feature, the segmentation feature, and the semantic label feature of the same modeling unit), and the feature vectors of the modeling units are input into a deep learning model. In the deep learning model, the feature vectors of the modeling units are encoded so that the feature information of modeling units at different positions is fused and each modeling unit obtains feature information about its context. This context feature information is then input into specific classifiers to obtain the dialog state information at the current moment, including a field prediction result and semantic slot and slot value prediction results.
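The feature-vector construction described above, in which the features of one modeling unit are added together, can be sketched as follows. The embedding tables here are random stand-ins for learned parameters, and the dimension `DIM`, the table names, and the treatment of the semantic label feature as a per-unit lookup are assumptions for illustration.

```python
import random

DIM = 4  # illustrative embedding size
_rng = random.Random(0)


def make_table(keys):
    """Random stand-in for a learned embedding table."""
    return {key: [_rng.uniform(-1.0, 1.0) for _ in range(DIM)] for key in keys}


def feature_vectors(units, segments, label_table, segment_table, position_table):
    """Add the three feature embeddings of each modeling unit element-wise."""
    vectors = []
    for position, (unit, segment) in enumerate(zip(units, segments)):
        parts = (label_table[unit], segment_table[segment], position_table[position])
        vectors.append([sum(values) for values in zip(*parts)])
    return vectors


units = ["a", "b", "[SEP]", "c"]
segments = [0, 0, 0, 1]  # 0 = historical round, 1 = current round
label_table = make_table(set(units))
segment_table = make_table([0, 1])
position_table = make_table(range(len(units)))
vectors = feature_vectors(units, segments, label_table, segment_table, position_table)
print(len(vectors), len(vectors[0]))
```

Adding the embeddings keeps the vector size fixed regardless of how many feature types are used, at the cost of mixing them into one representation.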

It should be noted that the deep learning model may include an encoding sub-module and a classifier sub-module, where the classifier sub-module may further be divided into a classifier for performing domain prediction, a classifier for performing semantic slot and slot value prediction (specifically, a classifier capable of solving a sequence labeling problem), and the like. Although the encoder and the classifier may involve different models, the joint training can be performed in the same deep learning model, and therefore, the problem of error accumulation is not caused.

In specific implementation, the specific technical scheme provided by the embodiment of the application can be used in various application scenarios. For example, as shown in fig. 3-1, in an application system of a music service class, a "voice assistant" function may be provided, so that a user may play music by inputting a voice command, in the process, multiple rounds of conversations between the user and the voice assistant may be involved, and at this time, the method provided by the embodiment of the present application may be used to perform state tracking of the multiple rounds of conversations, thereby implementing a conversation with the user. Alternatively, as shown in fig. 3-2, the above functions may also be implemented in a product such as a smart speaker, so that in the process of a conversation between a user and a smart speaker device, state tracking of multiple rounds of conversations is performed by the method of the embodiment of the present application. The intelligent sound box can comprise a household intelligent sound box, and can also comprise a vehicle-mounted intelligent sound box, and the like. In addition, the user can have a conversation with the system by inputting a voice signal, and can also have a conversation with the system by inputting text information. For the latter, there may be various specific scenarios, for example, as shown in fig. 3-3, a customer service system is provided in a certain commodity object information service system, in order to increase the response speed to the user's question, a "question and answer robot" function is provided in the customer service system, and the user may input text information in a dialog box to implement a dialog with the "question and answer robot". 
In the process of the dialogue, the dialogue state tracking function provided by the embodiment of the application can be realized in the question-answering robot module so as to identify a specific question field and the intention of the user, further make an accurate response to the question posed by the user, and the like. Of course, other customer service systems may also provide similar "question and answer robot" functions, such as intelligent call services provided by customer service systems in telephone banking systems, and so on. In addition, the specific input information may be not only user input but also machine input, that is, in a system in which a machine and a machine have a conversation, the scheme provided in the embodiment of the present application may also be used for tracking the conversation state. For example, different IOT (Internet of Things) devices may interact with each other through multiple rounds of conversations, and during this process, a machine on the responding party may track the conversation state through the solution provided in the embodiment of the present application. In addition, the scheme provided by the embodiment can also be used in equipment such as a self-service ticket machine, for example, a subway station or a railway station is equipped with the self-service ticket machine. Specifically, when a user arrives at a certain place, the user often needs to know information such as local weather, traffic and the like, and a subway station or a railway station is usually the starting point of the user arriving at the place, so that the user can travel conveniently by providing a query function of the information such as the relevant weather, traffic and the like in the self-service ticket vending machine. 
Specifically, when the self-service ticket vending machine provides the information, the intention of the user can be known in a multi-turn conversation mode with the user so as to provide the information required by the user for the user. In the process, the tracking of the user dialog state can be realized in the manner provided in the embodiment of the application.

It should be noted that the system in a specific scenario may include a client and a server, where the client is mainly used for interacting with a user, including receiving input information and outputting response information, and a specific machine learning model and the like may be deployed at the server, that is, processing such as session state tracking may be completed by the server, and the server generates specific response information and returns the generated response information to the client for outputting. Of course, in practical applications, the client may complete various processes including session state tracking, generate specific response information, and output the response information, which may be determined according to the performance of the terminal device.

The following describes in detail specific implementations provided in embodiments of the present application.

Example one

First, the embodiment provides a method for tracking states in a task-driven multi-turn dialog system, and referring to fig. 4, the method may specifically include:

S401: after the text information corresponding to the current turn is determined, splicing the input text information corresponding to the current turn with the received input text information corresponding to the historical turn to obtain target text information;

Here, a dialog turn means that the user has completed one input and the system can respond based on the current and previously received inputs. For example, the user first inputs "music" (音乐) (the user may input text information directly, or may input information by voice, in which case speech recognition is first performed to convert the voice signal into text information); after obtaining the recognition result, the system may play a piece of music at random. The user then inputs "Jay Chou" (周杰伦); at this time, the system can combine the previous recognition result "music" and play a piece of music whose singer is Jay Chou. The user then inputs "Hair Like Snow" (发如雪); at this time, combining the previous recognition results "music" and "Jay Chou", the system can play Jay Chou's song "Hair Like Snow", and so on. The above process includes three dialog turns.

It can be seen that the user may enter a portion of the information in each conversation turn, and that together, multiple conversation turns may express the user's full intent. In the embodiment of the application, field prediction and slot value filling are not performed separately for the input of the current round, but the text information corresponding to the current round and the historical round can be spliced at first, and subsequent feature extraction and prediction processes are performed on the basis of the spliced text information.

In the specific implementation, the subsequent feature extraction process may involve extracting the segmented features of each modeling unit, that is, determining whether each modeling unit belongs to the current round or the historical round. Therefore, when the text information corresponding to a plurality of rounds is spliced according to the time sequence, identifiers are respectively inserted between the text information corresponding to different rounds, the sentence head of the text information corresponding to the first round and the sentence tail of the text information corresponding to the current round so as to extract the segmentation characteristics of the modeling unit.

For example, assume that the current turn is the third turn, where the user inputs "music" (音乐) in the first turn, "Jay Chou" (周杰伦) in the second turn, and "Hair Like Snow" (发如雪) in the third turn. When the text information is spliced, the three turns of input can be spliced in time order into "[CLS] 音乐 [SEP] 周杰伦 [SEP] 发如雪 [SEP]". Here [CLS] and [SEP] are the inserted identifiers, and when feature extraction is performed, each identifier can also be regarded as a modeling unit. In addition, each Chinese character corresponds to one modeling unit, so the spliced text information contains 12 modeling units in total (4 identifiers and 8 characters). It should be noted that, for other languages such as English, one word may be used as one modeling unit, or a relatively long word may be divided into a plurality of modeling units, and so on.
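The splicing step can be illustrated with a minimal Python sketch. The function name `splice_turns` is hypothetical, and the Chinese strings are assumed to be the original inputs behind the translated example; only the identifiers [CLS] and [SEP] and the count of 12 modeling units come from the text.

```python
from typing import List

CLS = "[CLS]"  # identifier inserted at the head of the first turn
SEP = "[SEP]"  # identifier inserted after every turn


def splice_turns(turns: List[str]) -> List[str]:
    """Splice turn texts in time order into a list of modeling units.

    Each Chinese character and each inserted identifier is one modeling unit.
    """
    units = [CLS]
    for text in turns:
        units.extend(text)  # one modeling unit per character
        units.append(SEP)
    return units


history = ["音乐", "周杰伦"]  # historical turns (assumed original inputs)
current = "发如雪"            # current turn
units = splice_turns(history + [current])
print(" ".join(units))
print(len(units))
```

For a language such as English, the per-character split would have to be replaced by a word or sub-word split, which this sketch does not attempt.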

S402: feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information are obtained by extracting features of the target text information;

After the splicing of the text information is completed, feature extraction can be performed on the obtained target text information. Specifically, feature extraction may be performed on the modeling units at the plurality of positions included in the target text information. For example, in the example in S401, the spliced target text "[CLS] 音乐 [SEP] 周杰伦 [SEP] 发如雪 [SEP]" includes 12 modeling units, so feature extraction may be performed for each of the 12 modeling units. In a specific implementation, in order to obtain richer information, multiple types of features may be extracted for each modeling unit, and the different types of features corresponding to the same modeling unit may be added to obtain the feature vector of that modeling unit.

In particular implementations, the multiple types of features may include position features, segmentation features, and semantic tag features. The position feature is the position order information of the modeling unit in the target text. The segmentation feature indicates whether the modeling unit belongs to the current turn or a historical turn. The semantic tag feature indicates whether the modeling unit belongs to an entity word. Of course, in practical applications, other types of features may also be extracted, for example, text features, and so on.

It should be noted that, in order to fuse the different types of feature information of the same modeling unit, the different types of features may be converted into a uniform form before they are added. For example, each of the three types of features corresponding to the same modeling unit may be converted into a 128-dimensional vector, and the three 128-dimensional vectors are then added to obtain the feature vector corresponding to the modeling unit. The resulting feature vector is also 128-dimensional, except that each dimension now fuses several different types of feature information.
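The conversion and addition described above can be sketched as follows. This is only an illustration under stated assumptions: the lookup tables are hypothetical and randomly initialised (in a trained model they would be learned), and the key names such as "history" and "entity" are chosen here for readability. Only the 128-dimensional size and the element-wise addition come from the text.

```python
import random
from typing import Dict, Hashable, Iterable, List

DIM = 128  # dimensionality named in the text


def make_table(keys: Iterable[Hashable], dim: int, rng: random.Random) -> Dict[Hashable, List[float]]:
    """Hypothetical embedding table: one randomly initialised vector per key."""
    return {k: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for k in keys}


def fuse(position: int, segment: str, tag: str, tables: Dict[str, dict]) -> List[float]:
    """Convert each feature to a DIM-dimensional vector and add them element-wise."""
    pos_vec = tables["position"][position]
    seg_vec = tables["segment"][segment]
    tag_vec = tables["tag"][tag]
    return [p + s + t for p, s, t in zip(pos_vec, seg_vec, tag_vec)]


rng = random.Random(0)
tables = {
    "position": make_table(range(12), DIM, rng),           # one vector per position
    "segment": make_table(["history", "current"], DIM, rng),
    "tag": make_table(["entity", "non_entity"], DIM, rng),
}
vec = fuse(position=1, segment="history", tag="entity", tables=tables)
print(len(vec))
```

Because the three vectors share one dimensionality, the sum keeps the size fixed no matter how many feature types are fused.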

S403: and inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model, wherein the deep learning model is used for generating context information for the modeling units by combining the feature vectors corresponding to the modeling units at the positions, predicting the field, the semantic slot and the slot value according to the context feature information, and determining whether to inherit the dialog state information in the history round so as to obtain the dialog state information at the current moment.

After the feature vectors corresponding to the modeling units at the multiple positions are obtained, the feature vectors can be simultaneously input into a pre-trained deep learning model. In the deep learning model, context information can be generated for the modeling units by combining the feature vectors corresponding to the modeling units at the positions, and then dialog state information at the current moment, namely < field, semantic slot, slot value > triplet, can be obtained according to the context information corresponding to the modeling units.

In a specific implementation, the deep learning model may include an encoder, a first classifier, and a second classifier, where the encoder is configured to encode the feature vectors corresponding to the modeling units at the multiple positions to obtain the context information. After that, the hidden layer state information corresponding to the identifier inserted at the beginning of the sentence (e.g., [CLS] in the foregoing example) may be input into the first classifier to predict the domain information of the dialog state at the current time, and the hidden layer state information corresponding to the other modeling units may be input into the second classifier to predict the semantic slots and slot values of the dialog state at the current time and to determine whether to inherit the dialog state information corresponding to the historical turns. Of course, in a specific implementation, the prediction of the domain or the slot value may also be performed in other manners; for example, the mean of the hidden layer state information corresponding to the modeling units at the multiple positions may be input into the first classifier to perform the domain prediction, and so on.

In particular, the encoder may be implemented using a Transformer or another model. In addition, since domain recognition can be viewed as a text classification problem and semantic slot filling as a sequence tagging problem, i.e., each word in a continuous sequence is assigned a corresponding semantic class label, the first classifier can be a classifier that can solve the text classification problem and the second classifier can be a classifier that can solve the sequence tagging problem. There may be various types of second classifiers, including, for example, generative models such as HMM/CFG and HVS (hidden vector state), and discriminative models such as CRF and SVM, and so on.
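The division into an encoder, a first classifier, and a second classifier can be sketched in a few lines of Python. This is a toy under stated assumptions, not the model of the embodiment: the encoder is a single attention layer with identity projections rather than a full Transformer, the weights are random and untrained, the second classifier is a per-unit argmax rather than a CRF or similar sequence model, and the "weather" domain and the tag list are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(inputs: List[Vector]) -> List[Vector]:
    """Toy encoder: one attention layer with identity projections.

    Each output mixes all input positions, so every modeling unit
    receives information from its context.
    """
    dim = len(inputs[0])
    outputs = []
    for q in inputs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in inputs]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, inputs)) for d in range(dim)])
    return outputs


def linear_classify(hidden: Vector, weights: List[Vector], labels: List[str]) -> str:
    """Hypothetical linear classifier: pick the label with the highest score."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in weights]
    return labels[scores.index(max(scores))]


rng = random.Random(0)
dim, n_units = 8, 12
features = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_units)]
domains = ["play_music", "weather"]          # "weather" is a made-up second domain
slot_tags = ["PAD", "B-artist", "I-artist"]  # illustrative tag set
w_domain = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in domains]
w_slot = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in slot_tags]

hidden = self_attention(features)
# First classifier: hidden state of the sentence-head identifier (position 0).
domain = linear_classify(hidden[0], w_domain, domains)
# Second classifier: one tag per remaining modeling unit (sequence labeling).
tags = [linear_classify(h, w_slot, slot_tags) for h in hidden[1:]]
print(domain, tags)
```

Since the weights are untrained, the printed predictions are arbitrary; the sketch only shows which hidden states feed which classifier.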

For example, again assume that the current time is the third turn of user input, where the first-turn input is "music" (音乐), the second-turn input is "Jay Chou" (周杰伦), and the third-turn input is "Hair Like Snow" (发如雪). The input text information of the current turn is "发如雪", and the input text information of the historical turns is "音乐" and "周杰伦", respectively. In this embodiment of the application, as shown in fig. 5, the three turns of input text information are first spliced in time order and identifiers are added, giving the splicing result "[CLS] 音乐 [SEP] 周杰伦 [SEP] 发如雪 [SEP]". Then, the feature information of the modeling unit at each position is extracted, specifically including a position feature (Position), a segmentation feature (Segment), and a semantic tag feature (Semantic Tag); the feature extraction result is shown in table 1:

TABLE 1

Modeling unit	Position feature	Segmentation feature	Semantic tag feature
[CLS]	0	A	PAD
音	1	A	Tag
乐	2	A	Tag
[SEP]	3	A	PAD
周	4	A	Tag
杰	5	A	Tag
伦	6	A	Tag
[SEP]	7	A	PAD
发	8	B	Tag
如	9	B	Tag
雪	10	B	Tag
[SEP]	11	B	PAD

The position feature may be represented by a number that gives the position of the modeling unit in the spliced text information. For example, the position feature corresponding to [CLS] is 0, which indicates that [CLS] is located at the first position of the spliced text information; the position feature corresponding to "音" is 1, which indicates that "音" is located at the second position; and so on. The segmentation feature may be represented by the letter A or B, where A indicates that the corresponding modeling unit belongs to a historical turn and B indicates that it belongs to the current turn. For example, in the above example, the modeling units in "[CLS] 音乐 [SEP] 周杰伦 [SEP]" all belong to historical turns, so their segmentation features are all A; the modeling units in "发如雪 [SEP]" belong to the current turn, so their segmentation features are all B. The semantic tag feature can be represented by PAD or Tag, where PAD indicates that the corresponding modeling unit is a non-entity word and Tag indicates that it is an entity word. An "entity word" is a word having an actual meaning. For example, in the above example, the modeling units in "音乐", "周杰伦", and "发如雪" all belong to entity words, so their semantic tag features are Tag; [CLS] and [SEP] do not belong to entity words, so their semantic tag feature is PAD.
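The three features of table 1 can be reproduced with a short Python sketch. The function name `extract_features` is hypothetical, and the Chinese strings are assumed to be the original inputs of the example. The tagging rule used here (every non-identifier unit is an entity word) holds for this example only; how entity words are recognised in general is not specified in the text.

```python
from typing import List, Tuple

IDENTIFIERS = {"[CLS]", "[SEP]"}


def extract_features(history: List[str], current: str) -> List[Tuple[str, int, str, str]]:
    """Return (modeling unit, position, segment, semantic tag) rows.

    Segment is "A" for historical turns and "B" for the current turn.
    The tag is "PAD" for identifiers and "Tag" for every other unit, which
    matches this example only; a real system would need an entity lexicon
    or tagger (not specified in the source).
    """
    units = [("[CLS]", "A")]
    for text in history:
        units += [(ch, "A") for ch in text] + [("[SEP]", "A")]
    units += [(ch, "B") for ch in current] + [("[SEP]", "B")]
    return [
        (unit, pos, seg, "PAD" if unit in IDENTIFIERS else "Tag")
        for pos, (unit, seg) in enumerate(units)
    ]


rows = extract_features(["音乐", "周杰伦"], "发如雪")
for row in rows:
    print(*row, sep="\t")
```

Each printed row corresponds to one modeling unit, in the same column order as table 1.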

After the features are extracted, the different types of features of each modeling unit can be converted and added to obtain the feature vector corresponding to the modeling unit. The feature vectors can then be input into the deep learning model, which outputs the domain recognition result and the semantic slot and slot value filling results. For example, in the foregoing example, the domain of the current dialog state is predicted to be "play_music" according to the hidden layer state corresponding to "[CLS]". The semantic slot corresponding to "音" is "B-music_type", that is, "音" belongs to the semantic slot "music_type", and "B" indicates that "音" is located at the first position of the semantic slot. The semantic slot corresponding to "乐" is "I-music_type", that is, "乐" also belongs to the semantic slot "music_type", and "I" indicates that "乐" is not at the first position of the semantic slot. The recognition result corresponding to [SEP] is "PAD", that is, it does not belong to any semantic slot. Similarly, the semantic slot corresponding to "周" is "B-artist", that is, "周" belongs to the semantic slot "artist" and is located at its first position; the semantic slot corresponding to "杰" is "I-artist", that is, "杰" also belongs to the semantic slot "artist" but is not located at its first position; and so on.

After the prediction of the above domain, semantic slots, and slot values is completed, the result may be converted into a triplet corresponding to the current dialog state, which may be expressed as: (play_music, [(music_type, 音乐), (artist, 周杰伦), (music, 发如雪)]). The triplet may then be used as the input of the subsequent DP module to make a dialog decision and generate the response information corresponding to the user input information.
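The conversion from per-unit tagging results to the dialog state triplet can be sketched as below. The function name `decode_state` is hypothetical. The tags for the first two turns follow the description; the tags "B-music" and "I-music" for the third turn are inferred from the slot name in the triplet and are therefore an assumption.

```python
from typing import List, Optional, Tuple


def decode_state(domain: str, units: List[str], tags: List[str]) -> Tuple[str, List[Tuple[str, str]]]:
    """Merge B-/I- tagged modeling units into (semantic slot, slot value) pairs."""
    slots: List[Tuple[str, str]] = []
    name: Optional[str] = None
    value = ""
    for unit, tag in zip(units, tags):
        if tag.startswith("B-"):
            if name is not None:
                slots.append((name, value))
            name, value = tag[2:], unit
        elif tag.startswith("I-") and name == tag[2:]:
            value += unit
        else:  # "PAD" or an inconsistent I- tag closes the open slot
            if name is not None:
                slots.append((name, value))
            name, value = None, ""
    if name is not None:
        slots.append((name, value))
    return domain, slots


units = ["[CLS]", "音", "乐", "[SEP]", "周", "杰", "伦", "[SEP]", "发", "如", "雪", "[SEP]"]
tags = ["PAD", "B-music_type", "I-music_type", "PAD",
        "B-artist", "I-artist", "I-artist", "PAD",
        "B-music", "I-music", "I-music", "PAD"]
state = decode_state("play_music", units, tags)
print(state)
```

The slot values of historical turns appear in the result because their units were tagged with slot labels; a historical unit tagged "PAD" would simply not be inherited.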

Specifically, when the deep learning model in the embodiment of the present application is trained, the encoder, the first classifier, and the second classifier may be jointly trained. Specifically, a specific model may be selected first, and a training sample set and labeling information are obtained. The training sample set may specifically include a plurality of pieces of text information after splicing, and each piece of text information may be a splicing result obtained by splicing the text information in multiple rounds of conversations and inserting an identifier. In addition, the text information can be labeled, and the specific labeled information can be the information of the domain or semantic slot corresponding to the modeling unit at each position. And after the output result is obtained by inputting the training sample set into the deep learning model, the weights of a plurality of layers in the deep learning model are finely adjusted through the difference between the output result and the labeling result. After multiple iterations, the deep learning model can be trained.

In addition, the training process of the deep learning model can also be entirely data-driven. For example, it may include self-supervised pre-training (e.g., ALBERT, BERT) and supervised fine-tuning. The pre-training data can consist of a large amount of unlabeled general text and/or in-domain text, and the pre-training result can be used to determine the initial weight values of a plurality of layers in the deep learning model, thereby improving training efficiency. The fine-tuning data may be collected through M2M (Machine to Machine), H2H (Human to Human), H2M (Human to Machine), and the like.

Therefore, in the scheme provided by the embodiments of the application, the domain prediction and the prediction of semantic slots and slot values for the current dialog state, as well as the decision on whether states in historical turns need to be inherited, can be realized by one deep learning model, and the triplet expressing the current dialog state is generated on this basis. The process is entirely data-driven and no judgment rules need to be written manually, so that efficiency and generalization capability can be improved. In addition, because the state tracking result can be obtained directly from the input text information without passing through a plurality of modules such as NLU and DST, end-to-end dialog state tracking can be realized and the accumulation of errors is avoided.

Example two

The second embodiment addresses the training process of the deep learning model and provides a method for building the deep learning model. Referring to fig. 6, the method may specifically include:

S601: obtaining a training sample set, wherein the training sample set comprises a plurality of pieces of text information and corresponding labeling information, each piece of text information is obtained by splicing the text information in a plurality of dialog turns and inserting identifiers, and the labeling information comprises the domain or semantic slot information corresponding to the modeling units at a plurality of positions of the text information;

S602: inputting the plurality of pieces of text information into a deep learning model for multiple iterations until the algorithm converges, at which point the training of the deep learning model is complete; in each iteration, adjusting the weights of a plurality of layers of the deep learning model according to the difference between the output result of the deep learning model and the labeling information.
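One way to measure the "difference between the output result and the labeling information" for joint training is a sum of two cross-entropy terms, one for the domain and one for the slot tags. The text does not specify the loss function, so the following Python sketch is an assumption throughout, including the equal weighting of the two terms and the made-up probability values.

```python
import math
from typing import List


def cross_entropy(probs: List[float], gold_index: int) -> float:
    """Negative log-likelihood of the labeled class."""
    return -math.log(probs[gold_index])


def joint_loss(domain_probs: List[float], gold_domain: int,
               slot_probs: List[List[float]], gold_slots: List[int]) -> float:
    """Domain loss plus the mean per-unit slot loss (equal weighting is an assumption)."""
    domain_term = cross_entropy(domain_probs, gold_domain)
    slot_term = sum(cross_entropy(p, g) for p, g in zip(slot_probs, gold_slots)) / len(gold_slots)
    return domain_term + slot_term


# Hypothetical model outputs for a 3-unit sample: 2 domains, 3 slot tags.
domain_probs = [0.9, 0.1]
slot_probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
loss = joint_loss(domain_probs, 0, slot_probs, [0, 1, 2])
print(round(loss, 4))
```

Because both terms depend on the same encoder output, minimising their sum updates the encoder and both classifiers together in each iteration.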

In a specific implementation, before the deep learning model is trained, the initial weight values of a plurality of layers in the deep learning model can be determined through self-supervised pre-training, so that a completely data-driven model training process is realized.

Example three

The third embodiment is introduced with respect to the application of the scheme provided in the embodiment of the present application in a specific application scenario. Specifically, referring to fig. 7, a third embodiment provides an information processing method in a task-driven multi-turn dialog system, where the method specifically includes:

S701: a client receives input information of the current turn and submits the input information to a server, so that the server determines text information corresponding to the input information of the current turn, splices the text information corresponding to the current turn with the received text information corresponding to historical turns to obtain target text information, obtains feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information by extracting features of the target text information, inputs the feature vectors corresponding to the modeling units at the plurality of positions into a deep learning model to obtain the dialog state information at the current moment, and provides the dialog state information to a dialog decision module to generate response information for the current turn;

S702: receiving and outputting the response information returned by the server for the current turn.

The input information of the current round comprises voice information, and the server converts the voice information into text information in a voice recognition mode when determining the text information corresponding to the current round; and after the response information is generated, converting the response information into natural language so as to carry out voice playing through the client.

Specifically, the client includes a client of a music service application running in the terminal device.

Or the client comprises a client associated with the smart sound box device.

In addition, the input information of the current round may also include text information. At this time, the client may include a customer service module in the goods object information service class application.

Example four

The fourth embodiment corresponds to the third embodiment, and from the perspective of the server, provides an information processing method in the task-driven multi-turn dialog system, and referring to fig. 8, the method may specifically include:

S801: the server receives input information of the current turn submitted by the client;

S802: determining text information corresponding to the input information of the current turn, and splicing the text information corresponding to the current turn with the received text information corresponding to the historical turn to obtain target text information;

S803: feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information are obtained by extracting features of the target text information;

S804: inputting the feature vectors corresponding to the modeling units at the plurality of positions into a deep learning model to obtain the dialog state information at the current moment, and providing the dialog state information to a dialog decision module to generate response information for the current turn;

S805: returning the response information to the client.
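The server-side flow of S801 to S805 can be sketched as below. All names here (`DialogServer`, `handle_turn`, `fake_tracker`, `fake_policy`, the session identifier) are hypothetical, and the tracker and policy are stubs standing in for the trained deep learning model and the dialog decision module; the sketch only shows how the history of a session is kept and passed along with each new turn.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[str, List[Tuple[str, str]]]


class DialogServer:
    """Hypothetical server-side flow for S801 to S805; names are illustrative."""

    def __init__(self, track_state: Callable[[List[str], str], State],
                 decide: Callable[[State], str]) -> None:
        self.track_state = track_state  # stands in for splicing, features and the model
        self.decide = decide            # stands in for the dialog decision module
        self.history: Dict[str, List[str]] = {}  # session id -> texts of earlier turns

    def handle_turn(self, session_id: str, text: str) -> str:
        earlier = self.history.setdefault(session_id, [])
        state = self.track_state(earlier, text)  # S802 to S804
        earlier.append(text)                     # current turn becomes history
        return self.decide(state)                # response returned in S805


def fake_tracker(history: List[str], current: str) -> State:
    # Stub in place of the trained model: every turn becomes one slot value.
    return "play_music", [(f"slot{i}", t) for i, t in enumerate(history + [current])]


def fake_policy(state: State) -> str:
    domain, slots = state
    return f"{domain}: " + ", ".join(value for _, value in slots)


server = DialogServer(fake_tracker, fake_policy)
for turn in ["音乐", "周杰伦", "发如雪"]:
    reply = server.handle_turn("session-1", turn)
print(reply)
```

Keeping the history per session on the server lets the client submit only the input of the current turn, as in S801.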

Example five

The fifth embodiment is introduced to an application of the scheme provided in the embodiment of the present application in a specific application scenario. Specifically, the scene may be an audio/video search scene provided in an audio/video application, and in the search process, a specific search request may be received in a manner of performing multiple rounds of voice conversations with the user. After receiving each round of user voice input, the dialog state tracking can be performed in the manner described in the first embodiment. Specifically, referring to fig. 9, a fifth embodiment provides an audio/video speech search method, which may specifically include:

S901: receiving an audio and video search request in a multi-turn voice dialog mode;

S902: after the input text information corresponding to the current turn is determined, splicing the input text information corresponding to the current turn with the input text information corresponding to the received historical turn to obtain target text information;

S903: feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information are obtained by extracting features of the target text information;

S904: inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model, wherein the deep learning model is used for generating context feature information for the modeling units by combining the feature vectors corresponding to the modeling units at the positions, predicting a domain, a semantic slot and a slot value according to the context feature information, and determining whether to inherit the dialog state information in a historical turn so as to obtain the dialog state information at the current moment;

S905: generating a dialog policy according to the dialog state information at the current moment so as to return voice response information corresponding to the voice input of the current turn and the corresponding audio/video search results.

Example six

The sixth embodiment is introduced with respect to an application of the scheme provided in the embodiment of the present application in another specific application scenario. Specifically, the scene may be a commodity object shopping guide scene provided in an application associated with the commodity object information service system. In the shopping guide process, a video shopping guide mode and the like can be specifically adopted, and a specific user request can be received in a mode of carrying out multiple rounds of voice conversations with the user through the shopping guide robot. After receiving each round of user voice input, the dialog state tracking can be performed in the same manner as described in the first embodiment. Specifically, referring to fig. 10, a sixth embodiment provides a method for providing information of a commodity object, where the method may specifically include:

S1001: receiving a request for acquiring commodity object information in a multi-turn voice dialog mode;

S1002: after the input text information corresponding to the current turn is determined, splicing the input text information corresponding to the current turn with the input text information corresponding to the received historical turn to obtain target text information;

S1003: feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information are obtained by extracting features of the target text information;

S1004: inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model, wherein the deep learning model is used for generating context feature information for the modeling units by combining the feature vectors corresponding to the modeling units at the positions, predicting a domain, a semantic slot and a slot value according to the context feature information, and determining whether to inherit the dialog state information in a historical turn so as to obtain the dialog state information at the current moment;

S1005: generating a dialog policy according to the dialog state information at the current moment so as to return the voice shopping guide information corresponding to the voice input of the current turn and the corresponding commodity object information.

Example seven

The seventh embodiment introduces an application of the scheme provided in the embodiments of the present application in another specific application scenario. Specifically, referring to fig. 11, the seventh embodiment provides an information processing method in a task-driven multi-turn dialog system, where the method specifically includes:

s1101: the intelligent communication system receives the voice information of the current turn;

s1102: converting the voice information into text information, and splicing the text information corresponding to the current turn with the received text information corresponding to the historical turn to obtain target text information;

s1103: feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information are obtained by extracting features of the target text information;

s1104: inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model to obtain the dialog state information at the current moment;

s1105: providing the conversation state information of the current moment to a conversation decision module to generate response information of the current turn;

s1106: and converting the response information into natural language and carrying out voice broadcasting.
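The flow of steps S1101 to S1106 can be sketched as a per-turn pipeline. This is a minimal illustration, not the implementation described in the application: the class name `DialogPipeline` and the four callables (`asr`, `tracker`, `policy`, `nlg`) are hypothetical stand-ins for the speech recognition, state tracking, dialog decision and natural language generation modules, and the final voice broadcast step is left out.

```python
from typing import Callable, Dict, List

State = Dict[str, str]


class DialogPipeline:
    """Hypothetical per-turn pipeline; every component is injected as a callable."""

    def __init__(
        self,
        asr: Callable[[bytes], str],
        tracker: Callable[[List[str]], State],
        policy: Callable[[State], str],
        nlg: Callable[[str], str],
    ) -> None:
        self.asr = asr
        self.tracker = tracker
        self.policy = policy
        self.nlg = nlg
        self.history: List[str] = []  # text of the historical turns

    def handle_turn(self, audio: bytes) -> str:
        text = self.asr(audio)              # S1102: voice -> text
        target = self.history + [text]      # S1102: splice history with current turn
        state = self.tracker(target)        # S1103-S1104: features + model -> dialog state
        action = self.policy(state)         # S1105: dialog decision module
        self.history.append(text)           # current turn becomes history for the next turn
        return self.nlg(action)             # S1106: response in natural language
```

Keeping the turn history inside the pipeline object means the tracker always receives the full spliced input, which is what lets a single model call produce the dialog state at the current moment.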

Example eight

The eighth embodiment is introduced with respect to an application of the scheme provided in the embodiment of the present application in another specific application scenario. Specifically, referring to fig. 12, an eighth embodiment provides an information processing method in a task-driven multi-turn dialog system, where the method specifically includes:

s1201: the first equipment receives input information of the current turn input by the second equipment;

s1202: determining text information corresponding to the input information of the current round, and splicing the text information corresponding to the current round with the received text information corresponding to the historical round to obtain target text information;

s1203: feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information are obtained by extracting features of the target text information;

s1204: inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model to obtain the dialog state information at the current moment;

s1205: and providing the conversation state information of the current moment to a conversation decision module, generating response information of the current turn, and providing the response information to the second equipment.

Example nine

The ninth embodiment is introduced with respect to application of the scheme in the embodiment of the present application in another scenario, and specifically, the scenario may be a self-service ticket machine scenario at a subway station, a railway station, or the like. Referring to fig. 13, this embodiment provides an information processing method in a task-driven multi-turn dialog system, where the method may include:

s1301: the self-service ticket vending machine equipment receives the voice information of the current turn;

s1302: converting the voice information into text information, and splicing the text information corresponding to the current turn with the received text information corresponding to the historical turn to obtain target text information;

s1303: feature vectors respectively corresponding to modeling units at a plurality of positions in the target text information are obtained by extracting features of the target text information;

s1304: inputting the feature vectors corresponding to the modeling units at the positions into a deep learning model to obtain the dialog state information at the current moment;

s1305: providing the conversation state information of the current moment to a conversation decision module to generate response information of the current turn;

s1306: and converting the response information into natural language and carrying out voice broadcasting.

Example ten

The foregoing embodiments describe the dialog state tracking method provided by the embodiments of the present application and its applications in various scenarios. In specific implementation, for application scenarios in hardware devices such as a smart speaker, the functions provided in the embodiments of the present application may not yet have been implemented when the user purchased the hardware device, so such an "old" hardware device can only perform processing such as natural language understanding and dialog state tracking in the conventional manner. In the embodiments of the present application, to enable such "old" hardware devices to also perform dialog state tracking through the new deep learning model and thereby improve the user experience, an upgrade scheme may be provided for the terminal device. For example, the processing flow such as state tracking may be provided at the server, and the hardware device only needs to submit the collected user voice input to the server and output, as voice, the result returned by the server. In this case, the models needed for processing such as state tracking only need to be saved at the server, and the terminal device can be upgraded without any hardware improvement. Of course, dialog state tracking usually involves collecting user data, so in specific implementation the server may first push a suggestion that an upgrade is available to the hardware device; if the user wants to upgrade the device, the user can express this by voice input or the like, after which an upgrade request is submitted to the server and the server processes the request.
During specific implementation, the server may further check the status of the hardware device, for example, whether the relevant user has paid the corresponding resources for the upgraded service; if so, the server may grant the permission to perform state tracking in multi-round dialogs through the deep learning model. The hardware device can then track state through the deep learning model in its subsequent multi-round dialogs with the user. Specifically, the deep learning model may be stored at the server, or, where the hardware resources of the hardware device allow, the deep learning model may be pushed directly to the hardware device, which then completes processing such as state tracking locally.

In addition, where the deep learning model is stored at the server, a "switch" function can be provided so that the user uses the function only when necessary, which saves resources. For example, when the user only needs a single round of dialog with the hardware device and does not need multiple rounds, the user can submit a request to turn off the function by issuing a voice instruction or the like; the server may then temporarily turn off the function for that user, and if billing is involved, billing may also be stopped. If the user later needs multiple rounds of dialog with the hardware device, the function can be turned on again.

Specifically, this embodiment provides a method for upgrading a terminal device from the perspective of a server. Referring to fig. 14, the method may specifically include:

s1401: providing upgrade suggestion information to the terminal equipment;

s1402: after an upgrade request submitted by the terminal device is received, granting the terminal device the permission to perform state tracking in the multi-round dialog process through a deep learning model; the deep learning model is used for generating context feature information for the modeling units by combining the feature vectors corresponding to the modeling units at a plurality of positions in target text information, predicting a field, a semantic slot and a slot value according to the context feature information, and determining whether to inherit the dialog state information in a historical turn, so as to obtain the dialog state information at the current moment; the target text information is obtained by splicing the input text information corresponding to the current turn with the received input text information corresponding to the historical turn.

During specific implementation, the permission to perform state tracking in the multi-round dialog process through the deep learning model can also be revoked for the terminal device according to a downgrade request submitted by the terminal device.
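The upgrade, switch and downgrade handling described in this embodiment can be sketched on the server side as a small permission registry. This is a simplified illustration under stated assumptions: the class `UpgradeServer`, its method names and the `has_paid` flag are hypothetical, and real billing, authentication and model serving are omitted.

```python
from typing import Set


class UpgradeServer:
    """Hypothetical server-side registry of devices allowed to use the model."""

    def __init__(self) -> None:
        self._enabled: Set[str] = set()

    def suggest_upgrade(self, device_id: str) -> str:
        # S1401: upgrade suggestion information pushed to the terminal device.
        return f"Upgrade available for device {device_id}"

    def handle_upgrade(self, device_id: str, has_paid: bool = True) -> bool:
        # S1402: grant the permission only if the device status check passes.
        if has_paid:
            self._enabled.add(device_id)
        return device_id in self._enabled

    def handle_downgrade(self, device_id: str) -> None:
        # Downgrade (or "switch off") request: revoke the permission.
        self._enabled.discard(device_id)

    def can_use_model(self, device_id: str) -> bool:
        # Checked before routing a turn to the deep learning state tracker.
        return device_id in self._enabled
```

A device that has been switched off can simply submit another upgrade request to turn the function back on.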

For the parts of the second to tenth embodiments that are not described in detail, reference may be made to the description of the first embodiment, which is not repeated herein.

It should be noted that the embodiments of the present application may involve the use of user data. In practical applications, user-specific personal data may be used in the scheme described herein only within the scope permitted by the applicable laws and regulations of the country concerned and in compliance with their requirements (for example, with the user's explicit consent and after the user has been informed).

Corresponding to the first embodiment, the present application further provides a state tracking apparatus in a task-driven multi-turn dialog system. Referring to fig. 15, the apparatus may include:

the text splicing unit 1501 is configured to splice the input text information corresponding to the current round with the received input text information corresponding to the historical round after determining the input text information corresponding to the current round, so as to obtain target text information;

a feature extraction unit 1502, configured to perform feature extraction on the target text information to obtain feature vectors corresponding to modeling units at multiple positions in the target text information, respectively;

the input unit 1503 is configured to input feature vectors corresponding to the modeling units at the multiple positions into a deep learning model, where the deep learning model is configured to generate context feature information for the modeling units by combining the feature vectors corresponding to the modeling units at the multiple positions, predict a domain, a semantic slot, and a slot value according to the context feature information, and determine whether to inherit the dialog state information in the history round, so as to obtain the dialog state information at the current time.

Wherein the modeling unit at each position corresponds to a plurality of different types of features;

the feature extraction unit may be specifically configured to:

and adding the characteristics of different types corresponding to the same modeling unit to obtain the characteristic information of the corresponding modeling unit.

Wherein the plurality of different types of features includes: a position feature, a segmentation feature, and a word sense tag feature; the position feature is the position sequence information of the modeling unit in the target text information, the segmentation feature indicates whether the modeling unit belongs to the current turn or a historical turn, and the word sense tag feature indicates whether the modeling unit belongs to an entity word.
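The element-wise addition of the different feature types for one modeling unit can be illustrated as follows. This is only a sketch: the lookup tables, their two-dimensional toy values and the separate `token_vec` argument are assumptions made for the example, whereas in practice each table would be a learned embedding matrix.

```python
from typing import List

Vector = List[float]

# Hypothetical toy lookup tables (real systems would learn these values).
POSITION_TABLE = {0: [0.0, 0.25], 1: [0.25, 0.0]}
SEGMENT_TABLE = {0: [1.0, 0.0], 1: [0.0, 1.0]}        # 0 = historical turn, 1 = current turn
ENTITY_TABLE = {False: [0.0, 0.0], True: [0.5, 0.5]}  # word sense tag: entity word or not


def add_features(*features: Vector) -> Vector:
    """Add same-length feature vectors element by element."""
    if not features:
        raise ValueError("at least one feature vector is required")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same dimension")
    return [sum(values) for values in zip(*features)]


def unit_feature(token_vec: Vector, position: int, segment: int, is_entity: bool) -> Vector:
    """Feature vector of one modeling unit: sum of all of its feature types."""
    return add_features(
        token_vec,
        POSITION_TABLE[position],
        SEGMENT_TABLE[segment],
        ENTITY_TABLE[is_entity],
    )
```

Because the features are added rather than concatenated, every feature type must share the same dimension, which is why `add_features` checks the lengths first.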

Specifically, the text splicing unit may be specifically configured to:

and splicing the input text information corresponding to the plurality of turns in chronological order, and inserting identifiers between the input text information corresponding to different turns, at the sentence head of the input text information corresponding to the first turn, and at the sentence tail of the input text information corresponding to the current turn, so that the segmentation features of the modeling units can be extracted.
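The splicing step can be sketched as below. The identifier strings `[CLS]` and `[SEP]`, the whitespace tokenization and the 0/1 segment encoding are assumptions chosen for illustration; the application only states that identifiers are inserted and that they allow the segmentation features to be extracted.

```python
from typing import List, Tuple

CLS = "[CLS]"  # hypothetical identifier at the sentence head of the first turn
SEP = "[SEP]"  # hypothetical identifier between turns and at the final sentence tail


def splice_turns(history: List[str], current: str) -> Tuple[List[str], List[int]]:
    """Return (modeling units, segment ids) for the spliced dialog.

    Segment id 0 marks a historical turn and 1 marks the current turn.
    Turns are expected in chronological order; whitespace tokenization
    is a simplification.
    """
    turns = history + [current]
    units = [CLS]
    segments = [0 if history else 1]  # the head identifier belongs to the first turn
    for index, turn in enumerate(turns):
        segment = 1 if index == len(turns) - 1 else 0
        for word in turn.split():
            units.append(word)
            segments.append(segment)
        units.append(SEP)  # separates this turn from the next, or closes the input
        segments.append(segment)
    return units, segments
```

For example, `splice_turns(["book a flight"], "to Beijing")` yields eight modeling units, of which the last three (`to`, `Beijing` and the closing identifier) carry segment id 1.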

Specifically, the deep learning model includes an encoder, a first classifier, and a second classifier. The encoder is configured to encode the feature vectors corresponding to the plurality of modeling units to obtain the context feature information. The hidden layer state information corresponding to the identifier inserted at the sentence head is input into the first classifier to predict the field information of the dialog state at the current moment, and the hidden layer state information corresponding to the other modeling units is input into the second classifier to predict the semantic slot and slot value of the dialog state at the current moment and to determine whether to inherit the dialog state information corresponding to the historical turn.
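How the two classifier outputs might be combined into the dialog state at the current moment can be sketched as follows. This is an assumption-laden illustration: the application says the model decides whether to inherit the historical dialog state, but the merge rule used here (copy the previous state when inheriting, then overwrite with the newly predicted slots) and the names `DialogState` and `update_state` are hypothetical.

```python
from typing import Dict

# domain (field) -> {semantic slot: slot value}
DialogState = Dict[str, Dict[str, str]]


def update_state(
    previous: DialogState,
    domain: str,            # output of the first classifier
    slots: Dict[str, str],  # slot/value pairs from the second classifier
    inherit: bool,          # inheritance decision from the second classifier
) -> DialogState:
    """Build the current dialog state without mutating the previous one."""
    if inherit:
        state = {d: dict(s) for d, s in previous.items()}
    else:
        state = {}
    state.setdefault(domain, {}).update(slots)
    return state
```

Returning a fresh dictionary keeps the state of each turn independent, so an earlier state can still be inspected after later turns have been processed.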

Corresponding to the second embodiment, the embodiment of the present application further provides an apparatus for building a deep learning model, referring to fig. 16, where the apparatus may include:

a training sample obtaining unit 1601, configured to obtain a training sample set, where the training sample set includes multiple pieces of text information and corresponding label information, where the text information is obtained by splicing text information in multiple rounds of conversations and inserting an identifier, and the label information includes field or semantic slot information corresponding to modeling units at multiple positions of the text information;

a training unit 1602, configured to input the multiple pieces of text information into a deep learning model and perform multiple iterations until the algorithm converges, at which point training of the deep learning model is complete; in each iteration, the weights of a plurality of layers of the deep learning model are adjusted according to the difference between the output result of the deep learning model and the label information.
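The iterate-until-convergence loop can be illustrated with a deliberately tiny stand-in model. The sketch below trains a single-layer logistic classifier by gradient descent; it is not the deep learning model of the application, and the learning rate, the convergence tolerance and the function names are assumptions chosen for the example. What it shares with the described training unit is the structure: compare the output with the label, adjust the weights by the difference, and stop once the loss no longer changes.

```python
import math
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, label 0 or 1)


def predict(weights: List[float], bias: float, x: List[float]) -> float:
    """Probability of label 1 under the logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_until_converged(
    samples: List[Sample], lr: float = 0.5, tol: float = 1e-4, max_iters: int = 10000
) -> Tuple[List[float], float]:
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    prev_loss = float("inf")
    n = len(samples)
    for _ in range(max_iters):
        loss, grad_w, grad_b = 0.0, [0.0] * dim, 0.0
        for x, y in samples:
            p = predict(weights, bias, x)
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            diff = p - y  # difference between model output and label
            for i, xi in enumerate(x):
                grad_w[i] += diff * xi
            grad_b += diff
        # Adjust the weights against the averaged gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        loss /= n
        if abs(prev_loss - loss) < tol:  # convergence: the loss has stopped changing
            break
        prev_loss = loss
    return weights, bias
```

A usage example: training on `[([0.0], 0), ([0.2], 0), ([0.8], 1), ([1.0], 1)]` produces weights for which `predict` returns a value below 0.5 for `[0.1]` and above 0.5 for `[0.9]`.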

In a specific implementation, the apparatus may further include:

and the pre-training unit is used for determining weight initial values of a plurality of layers in the deep learning model through self-supervised pre-training before the deep learning model is trained.

Corresponding to the embodiment, the embodiment of the present application further provides an information processing apparatus in a task-driven multi-turn dialog system, which is applied to a client, and includes:

an input information receiving unit 1701, configured to receive input information of a current round, and submit the input information to a server, so that the server determines text information corresponding to the input information of the current round, and obtains target text information by splicing the text information corresponding to the current round with text information corresponding to a received historical round, and obtains feature vectors corresponding to modeling units at multiple positions in the target text information by performing feature extraction on the target text information; inputting the feature vectors corresponding to the modeling units in the positions into a deep learning model to obtain the dialog state information at the current moment, and providing the dialog state information for a dialog decision module to generate response information of the current turn;

a response output unit 1702, configured to receive and output response information for the current round returned by the server.

The client comprises a client of the music service application program running in the terminal equipment.

Or the client comprises a client associated with the smart sound box device.

Or the input information of the current turn comprises text information.

At this time, the client includes a customer service module in the commodity object information service class application.

In correspondence with the fourth embodiment, an embodiment of the present application further provides an information processing apparatus in a task-driven multi-turn dialog system, and referring to fig. 18, the apparatus is applied to a server and includes:

an input information receiving unit 1801, configured to receive input information of a current turn submitted by a client;

the text splicing unit 1802 is configured to determine text information corresponding to input information of a current round, and splice the text information corresponding to the current round with the received text information corresponding to a historical round to obtain target text information;

a feature extraction unit 1803, configured to perform feature extraction on the target text information to obtain feature vectors corresponding to the modeling units at multiple positions in the target text information, respectively;

the dialog state information determining unit 1804 is configured to input the feature vectors corresponding to the modeling units at the multiple positions into the deep learning model, and obtain dialog state information at the current moment, so as to provide the dialog state information to the dialog decision module and generate response information of the current turn;

a response information returning unit 1805, configured to return the response information to the client.

Corresponding to the fourth embodiment, an embodiment of the present application further provides an audio/video speech search apparatus, with reference to fig. 19, where the apparatus includes:

a search request receiving unit 1901, configured to receive an audio/video search request in a multi-round voice conversation manner;

a text splicing unit 1902, configured to splice the input text information corresponding to the current round with the input text information corresponding to the received historical round after determining the input text information corresponding to the current round, so as to obtain target text information;

a feature extraction unit 1903, configured to perform feature extraction on the target text information to obtain feature vectors corresponding to modeling units at multiple positions in the target text information, respectively;

a dialog state information determining unit 1904, configured to input feature vectors corresponding to the modeling units at the multiple positions into a deep learning model, where the deep learning model is configured to generate context feature information for the modeling units by combining the feature vectors corresponding to the modeling units at the multiple positions, predict a domain, a semantic slot, and a slot value according to the context feature information, and determine whether to inherit the dialog state information in a history turn, so as to obtain dialog state information at a current time;

a dialog strategy generating unit 1905, configured to generate a dialog strategy according to the dialog state information at the current time, so as to return voice response information corresponding to the current round of voice input, and a corresponding audio/video search result.

Corresponding to the fourth embodiment, the embodiment of the present application further provides an apparatus for providing information of a commodity object, referring to fig. 20, the apparatus includes:

a request receiving unit 2001, configured to receive a request for acquiring commodity object information by means of a multi-round voice dialog;

the text splicing unit 2002 is configured to, after the input text information corresponding to the current turn is determined, splice the input text information corresponding to the current turn with the received input text information corresponding to the historical turn to obtain target text information;

a feature extraction unit 2003, configured to perform feature extraction on the target text information to obtain feature vectors corresponding to the modeling units at multiple positions in the target text information, respectively;

a dialog state information determining unit 2004, configured to input feature vectors corresponding to the modeling units at the multiple positions into a deep learning model, where the deep learning model is configured to generate context feature information for the modeling units by combining the feature vectors corresponding to the modeling units at the multiple positions, perform prediction of a domain, a semantic slot, and a slot value according to the context feature information, and determine whether to inherit the dialog state information in a history turn, so as to obtain dialog state information at a current time;

the conversation strategy generating unit 2005 is configured to generate a conversation strategy according to the conversation state information at the current time, so as to return the voice shopping guide information corresponding to the voice input of the current turn and the corresponding commodity object information.

Corresponding to the fifth embodiment, the embodiment of the present application further provides an information processing apparatus in a task-driven multi-turn dialog system, referring to fig. 21, where the apparatus is applied to an intelligent call system, and the apparatus includes:

a voice information receiving unit 2101 configured to receive voice information of a current round;

a text processing unit 2102 configured to convert the voice information into text information, and obtain target text information by splicing the text information corresponding to the current round with the text information corresponding to the received historical round;

a feature extraction unit 2103, configured to perform feature extraction on the target text information to obtain feature vectors corresponding to the modeling units in multiple positions in the target text information, respectively;

a dialog state information determining unit 2104 configured to input the feature vectors corresponding to the modeling units at the multiple positions into the deep learning model, so as to obtain dialog state information at the current time;

a response information generating unit 2105, configured to provide the dialog state information of the current time to the dialog decision module, and generate response information of the current round;

a response information conversion unit 2106, configured to convert the response information into a natural language, and perform voice broadcast.

In accordance with a sixth embodiment, an embodiment of the present application further provides an information processing apparatus in a task-driven multi-turn dialog system, which is applied to a first device and includes:

an input information receiving unit 2201 for receiving input information of the current round input by the second device;

a text processing unit 2202, configured to determine text information corresponding to the input information of the current round, and obtain target text information by splicing the text information corresponding to the current round and the received text information corresponding to the historical round;

a feature extraction unit 2203, configured to perform feature extraction on the target text information to obtain feature vectors corresponding to modeling units at multiple positions in the target text information, respectively;

a dialog state information determination unit 2204, configured to input the feature vectors corresponding to the modeling units at the multiple positions into the deep learning model, and obtain dialog state information at the current time;

a response information generating unit 2205, configured to provide the dialog state information of the current time to the dialog decision module, generate response information of the current turn, and provide the response information to the second device.

In accordance with a seventh embodiment, an information processing apparatus in a task-driven multi-turn dialog system is also provided in an embodiment of the present application, and referring to fig. 23, the apparatus is applied to a self-service ticket vending machine device, and includes:

a voice receiving unit 2301, configured to receive voice information of a current round;

the voice recognition unit 2302 is used for converting the voice information into text information and splicing the text information corresponding to the current turn with the received text information corresponding to the historical turn to obtain target text information;

a feature extraction unit 2303, configured to perform feature extraction on the target text information to obtain feature vectors corresponding to modeling units at multiple positions in the target text information, respectively;

a dialog state information determining unit 2304, configured to input the feature vectors corresponding to the modeling units at the multiple positions into a deep learning model, and obtain dialog state information at the current time;

a response information generating unit 2305, configured to provide the dialog state information of the current time to a dialog decision module, and generate response information of the current round;

a response information output unit 2306, configured to convert the response information into a natural language, and perform voice broadcast.

Corresponding to the eighth embodiment, an embodiment of the present application further provides a terminal device upgrading apparatus, referring to fig. 24, where the apparatus may include:

an advice information providing unit 2401 for providing upgrade advice information to the terminal device;

the permission granting unit 2402 is configured to, after receiving an upgrade request submitted by a terminal device, grant permission for state tracking in a multi-round conversation process through a deep learning model to the terminal device; the deep learning model is used for generating context feature information for a modeling unit by combining feature vectors corresponding to the modeling unit at a plurality of positions in target text information, predicting a field, a semantic slot and a slot value according to the context feature information, and determining whether to inherit the dialog state information in a historical turn to obtain the dialog state information at the current moment; the target text information is obtained by splicing the input text information corresponding to the current turn with the input text information corresponding to the received historical turn.

In addition, the present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method described in any of the preceding method embodiments.

The present application also provides an electronic device, comprising:

one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding method embodiments.

Fig. 25 illustratively shows the architecture of an electronic device. For example, device 2500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, an aircraft, or the like.

Referring to fig. 25, device 2500 may include one or more of the following components: processing component 2502, memory 2504, power component 2506, multimedia component 2508, audio component 2510, input/output (I/O) interface 2512, sensor component 2514, and communications component 2516.

The processing component 2502 generally controls the overall operation of the device 2500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 2502 may include one or more processors 2520 to execute instructions to perform all or some of the steps of the methods provided by the disclosed subject matter. Further, the processing component 2502 may include one or more modules that facilitate interaction between the processing component 2502 and other components. For example, the processing component 2502 can include a multimedia module to facilitate interaction between the multimedia component 2508 and the processing component 2502.

The memory 2504 is configured to store various types of data to support operation at the device 2500. Examples of such data include instructions for any application or method operating on device 2500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 2504 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

Power components 2506 provide power to the various components of device 2500. The power components 2506 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 2500.

The multimedia component 2508 includes a screen that provides an output interface between the device 2500 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 2508 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the back-facing camera may receive external multimedia data when the device 2500 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 2510 is configured to output and/or input audio signals. For example, audio component 2510 can include a Microphone (MIC) configured to receive external audio signals when device 2500 is in an operating mode, such as a call mode, a record mode, and a voice recognition mode. The received audio signals may further be stored in memory 2504 or transmitted via communications component 2516. In some embodiments, audio component 2510 also includes a speaker for outputting audio signals.

I/O interface 2512 provides an interface between processing component 2502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 2514 includes one or more sensors for providing status assessments of various aspects of the device 2500. For example, the sensor component 2514 may detect the open/closed status of the device 2500 and the relative positioning of components, such as the display and keypad of the device 2500. The sensor component 2514 may also detect a change in the position of the device 2500 or of a component of the device 2500, the presence or absence of user contact with the device 2500, the orientation or acceleration/deceleration of the device 2500, and a change in the temperature of the device 2500. The sensor component 2514 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 2514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 2514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Communication component 2516 is configured to facilitate communications between device 2500 and other devices in a wired or wireless manner. The device 2500 may access a wireless network based on a communication standard, such as WiFi, or a mobile communication network such as 2G, 3G, 4G/LTE, or 5G. In an exemplary embodiment, the communication component 2516 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 2516 further comprises a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the device 2500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 2504 including instructions, which are executable by the processor 2520 of the device 2500 to perform the methods provided by the aspects of the present disclosure. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application may be implemented, in essence or in part, in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method according to the embodiments, or some parts of the embodiments, of the present application.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The above-described system embodiments are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The foregoing describes the present application in detail, using specific examples to explain its principles and implementations; these examples are provided only to facilitate an understanding of the methods and their core concepts. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and to the scope of application. In view of the above, this description should not be taken as limiting the application.
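As a rough illustration of the input construction described in this application, the following Python sketch splices the current turn with the historical turns and attaches to each modeling unit the three feature types recited in the claims (position, segmentation, and word sense tag). It is a minimal sketch under stated assumptions, not the implementation of this application: the whitespace tokenization, the `[CLS]` and `[SEP]` identifiers, the chronological ordering of turns, the `ENTITY_WORDS` lexicon, and the function name `build_features` are all hypothetical choices made for illustration.

```python
from typing import Dict, List

CLS = "[CLS]"  # hypothetical sentence-start identifier
SEP = "[SEP]"  # hypothetical separator inserted after each turn

# Hypothetical entity lexicon; a real system would use an entity recognizer.
ENTITY_WORDS = {"beijing", "shanghai"}


def build_features(history: List[str], current: str) -> List[Dict[str, object]]:
    """Splice the turns and return one feature record per modeling unit.

    segment: 0 = historical turn, 1 = current turn (assumed encoding).
    is_entity: 1 if the unit appears in the assumed entity lexicon, else 0.
    """
    units: List[str] = [CLS]
    segments: List[int] = [0]  # assumption: the start identifier is tagged 0

    # Assumption: historical turns come first, in chronological order.
    for turn in history:
        for word in turn.lower().split():
            units.append(word)
            segments.append(0)
        units.append(SEP)
        segments.append(0)

    # The current turn is appended last and tagged with segment 1.
    for word in current.lower().split():
        units.append(word)
        segments.append(1)
    units.append(SEP)
    segments.append(1)

    return [
        {
            "unit": unit,
            "position": index,  # order of the unit in the spliced text
            "segment": segment,
            "is_entity": int(unit in ENTITY_WORDS),
        }
        for index, (unit, segment) in enumerate(zip(units, segments))
    ]


features = build_features(["book a flight to Beijing"], "make it Shanghai instead")
```

In a trained system, each of these discrete features would typically be mapped to an embedding vector and combined with the unit's own embedding before being fed to the deep learning model; that step is omitted here.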

Claims

1. A method for state tracking in a task-driven multi-turn dialog system, comprising:

2. The method of claim 1,

the modeling unit at each position corresponds to a plurality of different types of features;

the obtaining of the feature vectors corresponding to the modeling units at the plurality of positions in the target text information includes:

3. The method of claim 2,

the plurality of different types of features includes: a position feature, a segmentation feature, and a word sense tag feature; the position feature is the position order information of the modeling unit in the target text information, the segmentation feature indicates whether the modeling unit belongs to the current turn or a historical turn, and the word sense tag feature indicates whether the modeling unit belongs to an entity word.

4. The method of claim 3,

the splicing of the input text information corresponding to the current turn and the input text information corresponding to the received historical turn includes:

5. The method of claim 4,

the deep learning model comprises an encoder, a first classifier and a second classifier, wherein the encoder is used for encoding the feature vectors corresponding to the modeling units to obtain context information; hidden layer state information corresponding to the identifier inserted at the sentence start is input into the first classifier to predict field information of the dialog state at the current moment; hidden layer state information corresponding to the other modeling units is input into the second classifier to predict semantic slots and slot values of the dialog state at the current moment; and whether to inherit the dialog state information corresponding to the historical turns is judged.

6. A method of building a deep learning model, comprising:

7. The method of claim 6, further comprising:

before the deep learning model is trained, determining weight initial values of a plurality of layers in the deep learning model through self-supervision pre-training.

8. An information processing method in a task-driven multi-turn dialog system, comprising:

9. The method of claim 8,

10. The method of claim 9,

11. The method of claim 9,

the client comprises a client associated with a smart speaker device.

12. The method of claim 8,

the input information of the current turn includes text information.

13. The method of claim 12,

the client comprises a customer service module in the commodity object information service application program.

14. An information processing method in a task-driven multi-turn dialog system, comprising:

and returning the response information to the client.

15. An audio/video voice search method, comprising:

16. A method of providing merchandise object information, comprising:

17. An information processing method in a task-driven multi-turn dialog system, comprising:

18. An information processing method in a task-driven multi-turn dialog system, comprising:

19. An information processing method in a task-driven multi-turn dialog system, comprising:

20. A method for upgrading terminal equipment, comprising:

providing upgrade suggestion information to the terminal equipment;

21. The method of claim 20, further comprising:

and according to a downgrade request submitted by the terminal equipment, closing the authority of the terminal equipment to perform state tracking in the multi-turn dialog process through the deep learning model.

22. A state tracking apparatus in a task-driven multi-turn dialog system, comprising:

23. An apparatus for building a deep learning model, comprising:

24. An information processing apparatus in a task-driven multi-turn dialog system, applied to a client, comprising:

25. An information processing apparatus in a task-driven multi-turn dialog system, applied to a server, comprising:

26. An audio/video voice search apparatus, comprising:

27. An apparatus for providing commodity object information, comprising:

28. An information processing apparatus in a task-driven multi-turn dialog system, applied to an intelligent call system, comprising:

29. An information processing apparatus in a task-driven multi-turn dialog system, applied to a first device, comprising:

30. An information processing apparatus in a task-driven multi-turn dialogue system, applied to a self-service ticket vending machine device, comprising:

31. A terminal device upgrading apparatus, comprising:

32. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 21.

33. An electronic device, comprising:

one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 21.
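The routing recited in claim 5, in which the hidden layer state at the sentence-start identifier feeds a first classifier (field) and the hidden layer states of the other modeling units feed a second classifier, can be sketched in Python as follows. This is a hedged toy sketch only: the encoder is omitted, each classifier is reduced to a single linear layer followed by an argmax, and the names `linear_argmax` and `route_hidden_states` and the weight matrices are hypothetical; the claims do not specify these details.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def linear_argmax(hidden: Vector, weights: Matrix) -> int:
    """Score each class with a dot product and return the best class index."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return max(range(len(scores)), key=lambda i: scores[i])


def route_hidden_states(
    hidden_states: List[Vector],
    field_weights: Matrix,
    slot_weights: Matrix,
) -> Tuple[int, List[int]]:
    """Route encoder outputs to the two classifiers.

    hidden_states[0] is assumed to belong to the sentence-start identifier;
    all remaining states belong to the other modeling units.
    """
    # First classifier: one field prediction for the whole dialog state.
    field = linear_argmax(hidden_states[0], field_weights)
    # Second classifier: one label per remaining modeling unit.
    slot_tags = [linear_argmax(h, slot_weights) for h in hidden_states[1:]]
    return field, slot_tags


# Toy values standing in for real encoder outputs and trained weights.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
field_w = [[0.0, 1.0], [1.0, 0.0]]
slot_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
field_id, slot_ids = route_hidden_states(hidden, field_w, slot_w)
```

The per-unit labels returned by the second classifier would still need to be decoded into semantic slots and slot values; the label scheme for that decoding is not assumed here.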