CN111125326A - Method, device, medium and electronic equipment for realizing man-machine conversation

Method, device, medium and electronic equipment for realizing man-machine conversation

Info

Publication number
CN111125326A
Authority
CN
China
Prior art keywords: state information, time sequence, conversation, current time, semantic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911241788.4A
Other languages
Chinese (zh)
Inventor
支康仪
孙拔群
郝梦圆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seashell Housing Beijing Technology Co Ltd
Original Assignee
Beike Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beike Technology Co Ltd filed Critical Beike Technology Co Ltd
Priority to CN201911241788.4A
Publication of CN111125326A

Classifications

    • G06F16/3329 — Natural language query formulation or dialogue systems (information retrieval; querying; query formulation)
    • G06F16/3344 — Query execution using natural language analysis (information retrieval; querying; query processing)
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks (computing arrangements based on biological models; neural networks; architecture)
    • G06N3/045 — Combinations of networks (neural networks; architecture)
    • G06N3/08 — Learning methods (neural networks)

Abstract

A method, apparatus, medium, and electronic device for implementing a human-machine conversation are disclosed. The method comprises the following steps: performing multi-head self-attention-based semantic feature extraction processing on a conversation sentence of a current time sequence of a conversation party to obtain a semantic feature vector of the conversation sentence; acquiring hidden layer state information of the current time sequence according to the semantic feature vector and the hidden layer state information of the time sequence previous to the current time sequence; and generating dialogue state information of the conversation sentence according to the hidden layer state information of the current time sequence. The technical scheme provided by the present disclosure helps improve the real-time performance of human-machine conversation and the user's experience of it.

Description

Method, device, medium and electronic equipment for realizing man-machine conversation
Technical Field
The present disclosure relates to a human-machine conversation technology, and more particularly, to a method for implementing a human-machine conversation, an apparatus for implementing a human-machine conversation, a storage medium, and an electronic device.
Background
A task-based human-machine dialog system may also be referred to as a task-oriented dialog system. Task-based human-machine dialog systems generally comprise three major components: NLU (Natural Language Understanding), DM (Dialog Management), and NLG (Natural Language Generation). The NLU is used to recognize the intention of a conversation party and provide the recognition result to the DM. The DM generally includes: DST (Dialog State Tracking) and DPL (Dialog Policy Learning). The DST is used for predicting the current dialog state of the conversation party according to the information output by the NLU, and the DPL is used for making a strategy according to the current dialog state predicted by the DST, so as to determine the action to be taken next. The NLG is used for generating natural language according to the action taken by the DPL, completing the interaction with the conversation party.
How to accurately track the current conversation state of the conversation party so as to take proper action according to the current conversation state is a technical problem of great concern.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. Embodiments of the present disclosure provide a method for implementing a human-machine conversation, an apparatus for implementing a human-machine conversation, a storage medium, and an electronic device.
According to an aspect of an embodiment of the present disclosure, there is provided a method for implementing a man-machine conversation, the method including: performing multi-head self-attention-based semantic feature extraction processing on a conversation statement of a current time sequence of a conversation party to obtain a semantic feature vector of the conversation statement; acquiring hidden layer state information of the current time sequence according to the semantic feature vector and the hidden layer state information of the previous time sequence of the current time sequence; and generating the dialogue state information of the dialogue statement according to the hidden layer state information of the current time sequence.
In an embodiment of the present disclosure, the performing a semantic feature extraction process based on multi-head self-attention on a current time-series conversational sentence of a conversation party includes: acquiring word segmentation vectors of all the word segments contained in the current time sequence conversation statement of the conversation party; forming a vector matrix of the conversation statement according to the word segmentation vector of each word segmentation and the position coding information of each word segmentation; and performing multi-head self-attention-based semantic feature extraction processing on the vector matrix of the conversation statement to obtain a semantic feature vector of the conversation statement.
In another embodiment of the present disclosure, the performing a multi-head self-attention-based semantic feature extraction process on the vector matrix of the conversational sentence includes: performing multi-head self-attention processing, first-layer normalization processing, feedforward processing and second-layer normalization processing on the vector matrix of the conversational statement at least once; and obtaining the semantic feature vector of the conversation statement according to the result of the last second-layer normalized processing.
In another embodiment of the present disclosure, the acquiring hidden layer state information of the current time sequence according to the semantic feature vector and hidden layer state information of a time sequence previous to the current time sequence includes: and performing long-term and short-term memory processing on the semantic feature vector and the hidden layer state information of the previous time sequence of the current time sequence to obtain the hidden layer state information of the current time sequence.
In still another embodiment of the present disclosure, the hidden layer state information includes: long-short term memory hidden state information and long-short term memory cell state information.
In another embodiment of the present disclosure, the generating dialog state information of the dialog statement according to the hidden layer state information of the current time sequence includes: classifying the hidden layer state information of the current time sequence; and obtaining the dialog state information of the dialog statement according to the classification processing result.
In another embodiment of the present disclosure, the dialog state information of the dialog statement includes: slot value probability distribution for all slots.
According to another aspect of the embodiments of the present disclosure, there is provided an apparatus for implementing a man-machine conversation, the apparatus including: the semantic feature extraction module is used for extracting and processing the semantic features of the conversation sentences of the current time sequence of the conversation party based on multi-head self-attention to obtain semantic feature vectors of the conversation sentences; a hidden layer state obtaining module, configured to obtain hidden layer state information of the current time sequence according to the semantic feature vector and hidden layer state information of a time sequence previous to the current time sequence; and the dialog state information generating module is used for generating the dialog state information of the conversation statement according to the hidden layer state information of the current time sequence.
In an embodiment of the present disclosure, the semantic feature extraction module includes: the first sub-module is used for acquiring word segmentation vectors of all word segmentations contained in the conversation sentences of the current time sequence of the conversation party; the second sub-module is used for forming a vector matrix of the conversation statement according to the word segmentation vector of each word segmentation and the position coding information of each word segmentation; and the third sub-module is used for extracting and processing the semantic features of the vector matrix of the conversational statement based on multi-head self-attention to obtain the semantic feature vector of the conversational statement.
In yet another embodiment of the present disclosure, the third sub-module is further configured to: performing multi-head self-attention processing, first-layer normalization processing, feedforward processing and second-layer normalization processing on the vector matrix of the conversational statement at least once; and obtaining the semantic feature vector of the conversation statement according to the result of the last second-layer normalized processing.
In yet another embodiment of the present disclosure, the hidden layer state obtaining module is further configured to: and performing long-term and short-term memory processing on the semantic feature vector and the hidden layer state information of the previous time sequence of the current time sequence to obtain the hidden layer state information of the current time sequence.
In still another embodiment of the present disclosure, the hidden layer state information includes: long-short term memory hidden state information and long-short term memory cell state information.
In yet another embodiment of the present disclosure, the generating dialog state information module is further configured to: classifying the hidden layer state information of the current time sequence; and obtaining the dialog state information of the dialog statement according to the classification processing result.
In another embodiment of the present disclosure, the dialog state information of the dialog statement includes: slot value probability distribution for all slots.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-mentioned method for implementing a man-machine conversation.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; and the processor is used for reading the executable instructions from the memory and executing the instructions to realize the method for realizing the man-machine conversation.
Based on the method and the device for realizing the man-machine conversation provided by the embodiment of the disclosure, the semantic feature vector of the conversational sentence is obtained by utilizing a semantic feature extraction processing mode based on multi-head self-attention, and more accurate semantic feature vector can be obtained on the basis of parallel processing of all participles in the conversational sentence; by utilizing the hidden layer state information of the previous time sequence of the current time sequence to obtain the hidden layer state information of the current time sequence and utilizing the hidden layer state information of the current time sequence to generate the conversation state information of the conversation statement, the conversation state information of the conversation statement of the current time sequence can be formed on the basis of considering the historical conversation statement, and therefore the accuracy of the conversation state information is improved. Therefore, the technical scheme provided by the disclosure is beneficial to improving the real-time performance of the man-machine conversation and improving the experience of the user on the man-machine conversation.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of one embodiment of a suitable scenario for use with the present disclosure;
FIG. 2 is a flow diagram of one embodiment of a method for implementing a human-machine dialog according to the present disclosure;
FIG. 3 is a flow diagram for one embodiment of obtaining semantic feature vectors for conversational utterances according to the present disclosure;
FIG. 4 is a schematic diagram illustrating one embodiment of obtaining semantic feature vectors for conversational utterances according to the present disclosure;
FIG. 5 is a flow chart of one embodiment of training a Word2vec model of the present disclosure;
FIG. 6 is a flow chart of another embodiment of the present disclosure for training a Word2vec model;
FIG. 7 is a schematic diagram illustrating an embodiment of an apparatus for implementing a human-machine conversation according to the present disclosure;
FIG. 8 is a schematic structural diagram of another embodiment of an apparatus for implementing a human-machine conversation according to the present disclosure;
FIG. 9 is a schematic block diagram of one embodiment of an encoder of the present disclosure;
fig. 10 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning or any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more than two and "at least one" may refer to one, two or more than two.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the present disclosure may be implemented in electronic devices such as terminal devices, computer systems, and servers, which are operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with an electronic device such as a terminal device, computer system, or server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In the process of implementing the present disclosure, the inventors found that semantic feature extraction is currently generally implemented by using an RNN (Recurrent Neural Network) or LSTM (Long Short-Term Memory) neural network on the basis of N-gram features. Using the N-gram mode to provide input for the RNN/LSTM neural network causes a data sparsity problem in the extraction of semantic features, wastes computing resources, and is not conducive to improving the real-time performance of human-machine conversation. Moreover, because the processing the RNN/LSTM neural network performs on a participle in a conversation sentence can begin only after the processing of all preceding participles has completed, the participles in a conversation sentence cannot be processed in parallel, which is likewise not conducive to improving the real-time performance of human-machine conversation. In addition, although RNN/LSTM can capture long-distance dependencies, its semantic feature extraction capability is limited, which may affect the accuracy of the dialog state and thus the accuracy of the conversation sentences replied to the user.
If the dialog state information can be obtained quickly and accurately, the real-time performance of the human-machine conversation and the user's experience of the human-machine conversation can be improved.
Exemplary application scenario
One example of an application scenario of the techniques for implementing a human-machine dialog provided by the present disclosure is shown in fig. 1.
In fig. 1, a task-type human-machine dialog system is installed in a smartphone 100 of a user 101. For example, the task-type human-machine dialog system is installed in an APP in the smartphone 100, and the APP can provide functions such as medical inquiry, house renting and selling, and ticketing. The following describes the application scenario by taking an APP that implements a house renting and selling function as an example.
When the user 101 has a house renting or selling demand, the user opens the APP in the smartphone 100 and can input the demand in the APP by voice or text. Assuming that the user 101 currently inputs "I want to rent a house", the task-type man-machine dialog system in the APP extracts the semantic feature vector of "I want to rent a house" and obtains the dialog state information of the current time sequence based on the extracted semantic feature vector, so that the system can form a corresponding reply sentence according to the dialog state information of the current time sequence, for example, "Do you have a requirement on the location of the house?". Assume that the user 101 then continues to enter "The house should be as close as possible to AAA" in the APP. The task-based man-machine dialog system extracts the semantic feature vector of "The house should be as close as possible to AAA" and obtains the dialog state information of the current time sequence based on it, so that the system can form a reply sentence again, for example, "Do you want to share a rental or rent on your own?". Through multiple rounds of dialog with the user 101, the task-type man-machine dialog system in the APP can finally recommend houses meeting the requirements of the user 101.
Exemplary method
Fig. 2 is a flowchart illustrating an embodiment of a method for implementing a human-machine conversation according to the present disclosure. As shown in fig. 2, the method of this embodiment includes the steps of: s200, S201, and S202. The following describes each step.
S200, performing multi-head self-attention-based semantic feature extraction processing on the conversation sentences of the current time sequence of the conversation party to obtain semantic feature vectors of the conversation sentences.
A conversation party in the present disclosure may refer to: the user side that converses with the machine in the process of implementing the man-machine conversation. A time sequence in the present disclosure may refer to the temporal order among the multiple conversation sentences successively expressed by a conversation party. The current time sequence in this disclosure may be referred to as the current conversation turn. A conversation sentence of the current time sequence of a conversation party may refer to the conversation sentence that the conversation party expresses in the current conversation turn. The conversation sentence of the current time sequence may include: a single sentence or multiple sentences.
The semantic feature extraction process based on multi-head self-attention in the present disclosure may refer to: a semantic feature extraction processing mode of a multi-head self-attention mechanism is adopted. Namely, in the semantic feature extraction process, at least a multi-head self-attention mechanism is adopted to process the conversation statement. A semantic feature vector in the present disclosure may refer to a feature vector used to describe semantics.
S201, acquiring hidden layer state information of the current time sequence according to the semantic feature vector and the hidden layer state information of the previous time sequence of the current time sequence.
The time sequence previous to the current time sequence in this disclosure may refer to: the dialog turn previous to the current dialog turn. Hidden layer state information in this disclosure may refer to LSTM-based hidden layer state information.
And S202, generating the dialogue state information of the dialogue statement according to the hidden layer state information of the current time sequence.
Dialog state information of a conversation sentence in the present disclosure may refer to probability information of slot values in slots. In task-based human-machine conversation techniques, there are typically multiple slots, each slot typically has multiple slot values, and all slots and all slot values are typically preset. A slot may be considered an intent category, and a slot value may be considered a specific intent under that category; all slots and slot values can be preset according to the specific field to which the task-type man-machine conversation is applied. For example, for the house renting and selling field, house location, house area, house orientation, etc. may be used as slots: each specific house location (e.g., Xierqi, etc.) may be a slot value under the house location slot, each specific house area range (e.g., 80-90 square meters, etc.) may be a slot value under the house area slot, and each specific orientation (e.g., south, north, etc.) may be a slot value under the house orientation slot. The dialog state information generated by the present disclosure may be: probability values generated for all slot values of all slots, where the probability values of all slot values under each slot form that slot's slot value probability distribution.
The dialog state information in the present disclosure may be provided to the DPL so that the DPL may perform policy selection operations based on the received dialog state information, and formulate corresponding policies so that actions to be taken next may be determined. In the case that the generated dialog state information is the slot value probability distributions of all the slot positions, the present disclosure may provide the slot value probability distributions of all the slot positions to the DPL, or may filter the slot value probability distributions of all the slot positions by using a predetermined probability threshold, and provide the slot value information, of which the probability value obtained by the filtering exceeds the predetermined probability threshold, to the DPL.
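For illustration only, the sketch below shows what such slot value probability distributions and the threshold filtering just described might look like in Python; the slot names, slot values, and threshold are hypothetical and not taken from the patent text.

```python
# Hypothetical dialog state: one probability distribution per slot.
dialog_state = {
    "house_location": {"Xierqi": 0.82, "Wangjing": 0.11, "none": 0.07},
    "house_area":     {"80-90 sqm": 0.34, "90-100 sqm": 0.30, "none": 0.36},
    "orientation":    {"south": 0.15, "north": 0.05, "none": 0.80},
}

def filter_state(state, threshold=0.5):
    """Keep only slot values whose probability exceeds the threshold (for the DPL)."""
    return {
        slot: {value: p for value, p in dist.items() if p > threshold}
        for slot, dist in state.items()
    }

print(filter_state(dialog_state))
# {'house_location': {'Xierqi': 0.82}, 'house_area': {}, 'orientation': {'none': 0.8}}
```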
Obtaining the semantic feature vector of a conversation sentence through multi-head self-attention-based semantic feature extraction has several benefits. First, it avoids the waste of computing resources caused by the N-gram mode. Second, it avoids the phenomenon that a participle can be processed only after all of its preceding participles in the conversation sentence have been processed. Third, it makes full use of the semantic feature extraction capability of multi-head self-attention, so that accurate semantic feature vectors can be obtained while all participles in the conversation sentence are processed in parallel. Furthermore, by using the hidden layer state information of the previous time sequence to obtain the hidden layer state information of the current time sequence, and using the latter to generate the dialog state information of the conversation sentence, the dialog state information of the current time sequence's conversation sentence is formed with the historical conversation sentences taken into account, which improves the accuracy of the dialog state information. Therefore, the technical scheme provided by the disclosure is beneficial to improving the real-time performance of the man-machine conversation and the user's experience of it.
In an alternative example, the present disclosure may form a vector matrix of the conversational sentence first, and then process the vector matrix by using a semantic feature extraction processing manner based on multi-head self-attention to obtain a semantic feature vector of the conversational sentence. Specifically, an example of obtaining semantic feature vectors of conversational utterances by the present disclosure is shown in fig. 3.
In fig. 3, in S300, word segmentation processing is performed on the current time-series conversational sentence of the conversation party, and all the word segments included in the conversational sentence are obtained.
Optionally, the present disclosure may employ a word segmentation tool to perform word segmentation processing on the conversational sentence. For example, when the conversational sentence is chinese, the present disclosure may employ a chinese word segmentation tool (e.g., jieba word segmentation tool, etc.) to perform word segmentation processing on the conversational sentence.
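As a minimal illustration of this step, assuming the jieba tokenizer mentioned above (the sample sentence and its segmentation are illustrative):

```python
# A minimal sketch of the word segmentation step using the jieba tokenizer.
import jieba

sentence = "我想租一套房子"          # "I want to rent a house"
tokens = jieba.lcut(sentence)       # e.g., ['我', '想', '租', '一套', '房子']
print(tokens)
```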
S301, acquiring a word segmentation vector of each word segmentation and position coding information of each word segmentation.
Alternatively, the participle vector in this disclosure may be represented using a multidimensional real number vector, for example, a 128-dimensional or 200-dimensional real number vector. A participle vector in the present disclosure may represent a single character (e.g., "on," "good," "bad," etc.) or a word (e.g., "house," "renovation," etc.).
Optionally, the Word segmentation vector of each Word in the conversational sentence can be obtained by using a successfully trained Word2vec model. Specifically, each participle is provided to the Word2vec model, a real number vector (e.g., a 128-dimensional real number vector) is generated for each inputted participle through the Word2vec model, and the generated real number vector is output, so that the present disclosure obtains the participle vector of each participle. An example of training a Word2vec model can be found in the description below with respect to fig. 5 and 6.
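A hedged sketch of this lookup using the gensim library's Word2vec implementation (gensim itself, the model path, and the tokens are assumptions; the patent does not name a specific toolkit):

```python
# Look up 128-dimensional participle vectors from a previously trained model.
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec_dialog.model")  # hypothetical trained model
for token in ["租", "房子"]:
    vec = model.wv[token]        # one real number vector per participle
    print(token, vec.shape)      # -> (128,) if vector_size=128 was used
```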
It should be noted that all participles included in a conversation sentence in the present disclosure generally do not include stop words. Stop words may include, but are not limited to: subject and object pronouns, modal particles, and the like. As one example, the word segmentation process in this disclosure may include a stop-word filtering process to remove stop words from conversation sentences.
Optionally, the position coding information of the participle in the present disclosure may be in the form of a vector, that is, the position coding information of the participle may be a position coding vector, and the dimension of the position coding vector may be the same as the dimension of the participle vector. For example, the dimension of the word segmentation vector and the position coding vector are both 128 dimensions.
S302, generating a vector matrix of the conversation statement according to the word segmentation vector of each word segmentation and the position coding information of each word segmentation.
Optionally, the present disclosure may add the participle vector of each participle to the position coding vector of that participle, and form the vector matrix of the current time sequence's conversation sentence using the summed vectors. For example, for any participle, the present disclosure may add each element of the participle's 128-dimensional participle vector to the element at the same position in its 128-dimensional position coding vector, forming a summed 128-dimensional vector per participle; the summed 128-dimensional vectors of all participles then form the vector matrix of the current time sequence's conversation sentence. The width of the vector matrix may be a predetermined value, and the height of the vector matrix may be the dimension of the participle vector (e.g., 128 dimensions). The width of the vector matrix represents the maximum length of conversation sentence that the matrix can accommodate; for example, when the predetermined value is 120, the vector matrix can accommodate up to 120 participles of the conversation sentence.
Optionally, the present disclosure may splice the word segmentation vectors of each word segmentation and the position coding vectors of each word segmentation, and form a vector matrix of the conversational sentence by using the spliced vectors. For example, for any participle, the present disclosure may splice a 128-dimensional participle vector and a 128-dimensional position coding vector of the participle into one 256-dimensional vector, so that the present disclosure may obtain a spliced 256-dimensional vector of each participle, and the present disclosure may form a vector matrix of a conversational sentence using the spliced 256-dimensional vectors of all participles. Likewise, the width of the vector matrix may be a predetermined value, and the height of the vector matrix may be a dimension (e.g., 256 dimensions) of the concatenated vector of the participle. The width of the vector matrix represents the maximum length of the conversational sentence that the vector matrix can accommodate, for example, when the predetermined value is 120, the vector matrix can accommodate up to 120 participles in the conversational sentence.
Optionally, the predetermined value in the present disclosure may be set according to actual requirements. When the number of participles in the current time sequence's conversation sentence is less than the predetermined value, the present disclosure may pad the vector matrix, for example, by filling with PAD symbols. When the number of participles exceeds the predetermined value, the conversation sentence of the current time sequence may be truncated: only the first predetermined-value number of participles are used to form the vector matrix, and the participles beyond the predetermined value are discarded, thereby ensuring that the width of the vector matrix equals the predetermined value.
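The following sketch illustrates S302 under the sizes used in the examples above (128-dimensional vectors, a predetermined width of 120): participle vectors and position coding vectors are summed element-wise, the result is padded up to the predetermined width, and longer sentences are truncated. The helper and its zero-padding are illustrative assumptions.

```python
import numpy as np

D_MODEL, MAX_LEN = 128, 120

def build_matrix(token_vecs, pos_vecs):
    """token_vecs, pos_vecs: lists of (D_MODEL,) arrays, one per participle."""
    summed = [t + p for t, p in zip(token_vecs, pos_vecs)][:MAX_LEN]  # truncate
    matrix = np.zeros((MAX_LEN, D_MODEL), dtype=np.float32)           # PAD with zeros
    if summed:
        matrix[:len(summed)] = np.stack(summed)
    return matrix  # one row per participle position, shape (120, 128)
```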
S303, performing multi-head self-attention-based semantic feature extraction processing on the vector matrix of the conversation statement to obtain a semantic feature vector of the conversation statement.
Optionally, the semantic feature extraction process based on multi-head self-attention in the present disclosure may include: multi-head self-attention (Multi-Head Self-Attention) processing, layer normalization (Layer Normalization) processing, and feed-forward (Feed-Forward) processing. Self-attention may mean that attention weights are calculated between each participle in the conversation sentence and every participle in the conversation sentence, and multi-head may mean that the attention-weighted calculation results of the several heads are spliced together, so that the final result of the multi-head self-attention processing contains richer context information. The number of heads for the multi-head self-attention processing in this disclosure may be 4.
Optionally, the present disclosure may perform at least one multi-head self-attention processing, a first-layer normalization processing, a feed-forward processing, and a second-layer normalization processing on the vector matrix, and obtain the semantic feature vector of the conversational sentence according to a result of the last second-layer normalization processing. An example of obtaining a semantic feature vector of a conversational sentence using a multi-headed self-attention based semantic feature extraction process is shown in fig. 4.
In fig. 4, it is assumed that the current conversation turn is the Nth conversation turn. The participle vector 400 of each participle of the conversation sentence in the Nth conversation turn is added to the position coding vector 401 of each participle to form the vector matrix of the conversation sentence in the Nth conversation turn. This vector matrix is taken as the matrix to be processed: multi-head self-attention processing is performed on it, and layer normalization processing (i.e., the first layer normalization processing) is performed on a first result formed by adding (e.g., element-wise adding) the multi-head self-attention processing result and the vector matrix, obtaining a second result. Feed-forward processing is performed on the second result, and layer normalization processing (i.e., the second layer normalization processing) is performed on a third result formed by adding (e.g., element-wise adding) the result of the feed-forward processing and the second result, forming a fourth result. The present disclosure may take the fourth result as the matrix to be processed again and obtain a new fourth result in the manner described above. After repeating the above processing multiple times (e.g., 4 times in fig. 4), the present disclosure may use the fourth result obtained last as the semantic feature vector of the conversation sentence.
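A minimal PyTorch sketch of the block just described: multi-head self-attention with a residual addition and first layer normalization, then feed-forward with a residual addition and second layer normalization, stacked four times. The feed-forward width and the ReLU are assumptions the patent does not specify.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention
        x = self.norm1(x + attn_out)           # add & first layer normalization
        x = self.norm2(x + self.ff(x))         # feed-forward, add & second layer norm
        return x

encoder = nn.Sequential(*[EncoderBlock() for _ in range(4)])  # repeated 4 times
out = encoder(torch.randn(1, 120, 128))        # -> (1, 120, 128)
```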
Because the vector matrix in the disclosure contains the position coding information of each participle, reference information is provided for the semantic feature extraction processing, which helps improve the accuracy of the semantic feature vector of the conversation sentence. Because the conversation sentence is formed into a vector matrix, and the multi-head self-attention-based semantic feature extraction is executed taking the vector matrix, rather than the individual participle, as its unit, the phenomenon that a participle can be processed only after all of its preceding participles have been processed is avoided, which helps improve the efficiency of extracting the semantic feature vectors of conversation sentences. In addition, the size of the vector matrix can be flexibly set according to actual requirements; for example, it can be set reasonably by considering the habitual length of users' conversation sentences and the available computing resources, which helps avoid the data sparsity problem and the waste of computing resources. Performing the multi-head self-attention processing, first layer normalization processing, feed-forward processing, and second layer normalization processing multiple times helps improve the accuracy of the semantic feature vector of the conversation sentence.
In an optional example, the manner of acquiring hidden layer state information of a current time sequence in the present disclosure may be: and performing long-short term memory (LSTM) processing on the semantic feature vector and the hidden layer state information of the previous time sequence of the current time sequence, and acquiring the hidden layer state information of the current time sequence according to the result of the long-short term memory processing. Hidden layer state information in the present disclosure may include: long-short term memory hidden state information (e.g., h in LSTM) and long-short term memory cell state information (e.g., c in LSTM). That is, the present disclosure may perform long-short term memory (LSTM) processing on the semantic feature vector, h and c of a time sequence previous to the current time sequence, and obtain h and c of the current time sequence according to a result of the long-short term memory processing. Because the hidden layer state information of the previous time sequence of the current time sequence comprises the historical conversation statement information of the conversation statement of the current time sequence, the hidden layer state information of the current time sequence can be obtained on the basis of fully considering the historical conversation statement by carrying out long-term and short-term memory processing on the semantic feature vector and the hidden layer state information of the previous time sequence of the current time sequence, and therefore the accuracy of conversation state tracking is improved.
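A sketch of this per-turn recurrence using torch.nn.LSTMCell, where each turn's semantic feature vector and the previous turn's (h, c) yield the current turn's (h, c); the sizes and the dummy inputs are illustrative:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=128, hidden_size=128)
h = torch.zeros(1, 128)   # long-short term memory hidden state, initial value
c = torch.zeros(1, 128)   # long-short term memory cell state, initial value

turn_vectors = [torch.randn(1, 128) for _ in range(3)]  # dummy per-turn features
for sentence_vec in turn_vectors:
    h, c = cell(sentence_vec, (h, c))  # current turn's (h, c) from previous turn's
```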
In an alternative example, the manner of generating dialog state information of a conversational statement in the present disclosure may be: the hidden layer state information of the current time sequence is classified, and the conversation state information of the conversation statement can be obtained according to the classification result. For example, a slot value probability distribution is obtained for all slots. One example of the present disclosure that classifies the hidden layer state information of the current time sequence may be: and performing full-connection-layer-based operation on the hidden layer state information of the current time sequence, and processing an operation result based on the full connection layer by using a normalized exponential function (namely, softmax function), so as to obtain the dialogue state information of the dialogue statement.
Alternatively, one example of the present disclosure for training a Word2vec model may be as shown in fig. 5.
In fig. 5, S500, the model training process is started.
S501, setting model parameters of a Word2vec model. For example, a set of specified parameter values is used as model parameters for the Word2vec model.
S502, respectively providing a plurality of participles in the participle training set to a Word2vec model, and respectively generating a real number vector (such as a 128-dimensional real number vector) for each participle through the Word2vec model.
And S503, storing the real number vector of each participle output by the Word2vec model at this time.
S504, the plurality of preset participles and synonyms thereof are respectively provided for the Word2vec model, a real number vector (such as a 128-dimensional real number vector) is respectively generated for each preset participle and synonym thereof through the Word2vec model, and a plurality of groups of real number vectors are obtained. A set of real vectors includes a real vector of a predetermined participle and a real vector of its synonym. The number of synonyms for a predetermined segment in the present disclosure may be one or more.
Alternatively, the predetermined participles may be participles from the field in which the technology of the present disclosure is used. The number of predetermined participles is usually not large; for example, there may be 20 predetermined participles. The present disclosure may obtain synonyms of the predetermined participles using a corresponding dictionary; for example, synonyms may be obtained using a synonym lexicon such as the Harbin Institute of Technology's Tongyici Cilin (Extended) or the like.
And S505, respectively calculating the similarity between the real number vector of the preset participle and the real number vector of the synonym thereof aiming at each group of real number vectors. For example, the cosine distance between the real number vector of a predetermined participle and the real number vector of its synonym is calculated.
S506, judging the calculated similarity of each group of real number vectors; if the similarity does not meet the predetermined requirement, proceeding to S507; and if the similarity meets the predetermined requirement, proceeding to S508.
Optionally, the present disclosure may calculate a mean value of similarity degrees of all sets of real number vectors, determine that the similarity degree satisfies a predetermined requirement when the mean value reaches a predetermined similarity degree, and determine that the similarity degree does not satisfy the predetermined requirement when the mean value does not reach the predetermined similarity degree.
And S507, adjusting model parameters of the Word2vec model, and returning to the S502. For example, another set of specified parameter values is used as model parameters for the Word2vec model.
And S508, taking the current model parameters of the Word2vec model as the final model parameters of the Word2vec model.
And S509, ending the model training process.
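A hedged sketch of the Fig. 5 loop using gensim (an assumption; the patent names no toolkit). Note that it scores synonym pairs with gensim's cosine similarity, so higher means more similar; the corpus, synonym pairs, candidate parameter sets, and threshold are all placeholders.

```python
from gensim.models import Word2Vec

corpus = [["我", "想", "租", "房子"], ["我", "想", "承租", "住房"]]  # toy corpus
synonym_pairs = [("房子", "住房"), ("租", "承租")]  # predetermined participles + synonyms

def train_and_score(params):
    model = Word2Vec(corpus, vector_size=128, min_count=1, **params)
    sims = [model.wv.similarity(a, b) for a, b in synonym_pairs]  # S505
    return model, sum(sims) / len(sims)          # mean similarity over all pairs

for params in [{"window": 3}, {"window": 5}]:    # candidate parameter sets (S501/S507)
    model, score = train_and_score(params)
    if score >= 0.6:                             # predetermined requirement (S506)
        break                                    # keep current parameters (S508)
```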
Alternatively, another example of the present disclosure for training a Word2vec model may be as shown in fig. 6.
In fig. 6, S600, the model training process is started.
S601, setting model parameters of a Word2vec model. For example, a set of specified parameter values is used as model parameters for the Word2vec model.
S602, respectively providing a plurality of participles in the participle training set to a Word2vec model, and respectively generating a real number vector (such as a 128-dimensional real number vector) for each participle through the Word2vec model.
And S603, storing real number vectors of all participles output by the Word2vec model at this time.
S604, respectively providing the plurality of preset participles and synonyms thereof to a Word2vec model, respectively generating a real number vector (such as a 128-dimensional real number vector) for each preset participle and synonym thereof through the Word2vec model, and obtaining a plurality of groups of real number vectors. A set of real vectors includes a real vector of a predetermined participle and a real vector of its synonym. The number of synonyms for a predetermined segment in the present disclosure may be one or more.
Alternatively, the predetermined participles may be participles from the field in which the technology of the present disclosure is used. The number of predetermined participles is usually not large; for example, there may be 20 predetermined participles. The present disclosure may obtain synonyms of the predetermined participles using a corresponding dictionary; for example, synonyms may be obtained using a synonym lexicon such as the Harbin Institute of Technology's Tongyici Cilin (Extended) or the like.
S605, for each group of real vectors obtained this time, respectively calculating a similarity (e.g., cosine distance) between the real vector of each predetermined participle and the real vector of the synonym thereof, and calculating a similarity mean of all the groups of real vectors.
S606, judging whether the number of times of currently setting the model parameters of the Word2vec model reaches the preset number, and if so, going to S608; if the predetermined number of times is not reached, go to S607.
S607, adjusting the model parameters of the Word2vec model. For example, another set of specified parameter values is used as model parameters for the Word2vec model. Returning to S602.
S608, selecting the minimum similarity mean value (i.e., the minimum mean cosine distance, which indicates the greatest similarity) from all the similarity mean values, and taking the model parameters corresponding to the minimum similarity mean value as the final model parameters of the Word2vec model.
And S609, ending the model training process.
According to the method and the device, the Word2vec model is trained, so that the successfully trained Word2vec model can be used for obtaining the accurate Word segmentation vector, and the accuracy of the dialogue state information is improved.
Exemplary devices
Fig. 7 is a schematic structural diagram of an embodiment of an apparatus for implementing a man-machine conversation according to the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above.
As shown in fig. 7, the apparatus of the present embodiment includes: a semantic feature extraction module 700, a hidden layer state acquisition module 701, and a dialog state information generation module 702.
The semantic feature extraction module 700 is configured to perform multi-head self-attention-based semantic feature extraction processing on a conversation sentence of a current time sequence of a conversation party to obtain a semantic feature vector of the conversation sentence.
Optionally, the semantic feature extraction module 700 may include: a first sub-module 7001, a second sub-module 7002, and a third sub-module 7003. The first sub-module 7001 is used for acquiring a participle vector of each participle contained in the conversation sentence of the current time sequence of the conversation party. The second sub-module 7002 is configured to form a vector matrix of the conversational sentence according to the word segmentation vector of each word segmentation and the position coding information of each word segmentation. The third sub-module 7003 is configured to perform multi-head self-attention-based semantic feature extraction processing on the vector matrix of the conversational statement to obtain a semantic feature vector of the conversational statement. For example, the third sub-module 7003 may perform at least one multi-head self-attention processing, first-layer normalization processing, feed-forward processing, and second-layer normalization processing on the vector matrix of the conversational sentence, and obtain a semantic feature vector of the conversational sentence according to a result of the last second-layer normalization processing.
The hidden layer state obtaining module 701 is configured to obtain hidden layer state information of a current time sequence according to the semantic feature vector and hidden layer state information of a time sequence previous to the current time sequence. For example, the hidden layer state obtaining module 701 may perform long-term and short-term memory processing on the semantic feature vector and the hidden layer state information of the previous time sequence of the current time sequence to obtain the hidden layer state information of the current time sequence. For example, long-short term memory hidden state information and long-short term memory cell state information are obtained.
The dialog state information generating module 702 is configured to generate the dialog state information of the conversation sentence according to the hidden layer state information of the current time sequence. For example, the module 702 may perform classification processing on the hidden layer state information of the current time sequence and obtain the dialog state information of the conversation sentence according to the result of the classification processing. The dialog state information obtained by the dialog state information generating module 702 may be the slot value probability distributions of all slots.
The modules, sub-modules, units and operations specifically executed by the units included in the apparatus of the present disclosure may be referred to in the description of the above method embodiments, and are not described in detail here.
In an alternative example, the semantic feature extraction module 700 in the apparatus for implementing a human-computer conversation in the present disclosure may include: an input layer and a Transformer (a feature extractor); the Transformer can be considered a semantic feature extraction layer (as shown in fig. 8). The hidden layer state obtaining module 701 in the present disclosure may include: an LSTM neural network. The dialog state information generating module 702 may include: a fully connected layer and a Softmax (activation) layer. The LSTM neural network, the fully connected layer, and the Softmax layer may together be considered a dialog state extraction layer (as shown in fig. 8).
In fig. 8, each participle in the conversation sentence of the conversation party in each turn is subjected to position coding processing by the input layer to obtain the position coding vector of each participle; the participle vector and the position coding vector of each participle are added or spliced by the input layer, and the vector matrix formed by the added or spliced vectors is output by the input layer. In fig. 8, input_s1 represents the vector matrix of the conversation sentence of the first round, input_s2 represents that of the second round, and input_sN represents that of the Nth round.
The vectors output by the input layer are provided to the Transformer, and the semantic feature vectors of the conversation sentences of each turn are extracted through the Transformer. The Transformer may include a plurality of encoders connected in series, e.g., 4 encoders connected in series. The structure of each encoder can be seen in the subsequent description of fig. 9.
The output of the Transformer in fig. 8 may be denoted roundi_vec, e.g., round1_vec, round2_vec, ..., roundN_vec in fig. 8. Each roundi_vec is provided to the LSTM neural network (i.e., LSTMCell in fig. 8). The LSTM neural network generates the h and c of the current time sequence from the h and c of the previous time sequence and the Transformer output of the current time sequence. For example, in the first round, the LSTM neural network generates h1 and c1 of the first round from the initial values of h and c (i.e., init in fig. 8) and the Transformer's output for the conversation sentence of the first round; in the second round, the LSTM neural network generates h2 and c2 of the second round according to h1, c1, and the Transformer's output for the conversation sentence of the second round; and by analogy, in the Nth round, the LSTM neural network generates hN and cN of the Nth round according to h(N-1), c(N-1), and the Transformer's output for the conversation sentence of the Nth round.
The h generated by the LSTM neural network in each turn (h1, h2, …, hn) is provided to the fully-connected layer and the Softmax layer, and the slot value probability distribution of each slot is generated through the processing of these two layers. The c generated by the LSTM neural network in each turn (c1, c2, …, cn) is stored so that it can be provided to the LSTM neural network in the next turn.
It should be noted that the present disclosure generally provides a separate Transformer, LSTM neural network, fully-connected layer, and Softmax layer for each slot. A Transformer, LSTM neural network, fully-connected layer, and Softmax layer corresponding to the same slot generate the slot value probability distribution only for that slot; that is, the network parameters of the Transformer, LSTM neural network, fully-connected layer, and Softmax layer differ from slot to slot. When training the Transformer, the LSTM neural network, the fully-connected layer, and the Softmax layer, the network parameters corresponding to each slot should therefore be obtained separately.
The input layer 900 in fig. 9 is configured to perform position coding processing on the word segments in the conversation sentence, and to add or splice the word-segment vector and the position coding vector of each word segment to form the input of the first encoder 901. The position coding processing can be expressed by the following equations (1) and (2):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    Formula (1)

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    Formula (2)

In the above equations (1) and (2), PE is a vector with the same dimension as the word-segment vector; pos represents the position of the word segment in the conversation sentence; d_model represents the dimension of the word-segment vector (e.g., 128); and i denotes the i-th element in the word-segment vector, i.e., a position within the word-segment vector.
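As a hedged illustration of formulas (1) and (2), the following Python (NumPy) sketch computes the sinusoidal position coding vectors and performs the add-or-splice step of the input layer; d_model = 128 follows the example above, while the sentence length and the word-segment vectors themselves are stand-in assumptions.

    import numpy as np

    def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
        """Formulas (1) and (2): one position coding vector per word segment."""
        pe = np.zeros((seq_len, d_model))
        pos = np.arange(seq_len)[:, None]           # position of each word segment
        two_i = np.arange(0, d_model, 2)            # even element indices 2i
        angle = pos / np.power(10000.0, two_i / d_model)
        pe[:, 0::2] = np.sin(angle)                 # formula (1): even elements
        pe[:, 1::2] = np.cos(angle)                 # formula (2): odd elements
        return pe

    rng = np.random.default_rng(0)
    seq_len, d_model = 12, 128                      # 12 word segments (assumed)
    word_vecs = rng.normal(size=(seq_len, d_model)) # stand-in word-segment vectors
    pe = positional_encoding(seq_len, d_model)
    added = word_vecs + pe                              # "add" variant, shape (12, 128)
    spliced = np.concatenate([word_vecs, pe], axis=1)   # "splice" variant, shape (12, 256)

Either result can serve as the vector matrix fed to the first encoder 901.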
Each encoder 901 may include two units. The first unit includes a multi-head self-attention layer 9011 and a layer normalization layer 9012. The second unit includes a feedforward neural network layer 9013 and a layer normalization layer 9014. Fig. 9 shows 4 encoders 901 connected in series, i.e. the output of the first encoder 901 is connected to the input of the second encoder 901, the output of the second encoder 901 is connected to the input of the third encoder 901, the output of the third encoder 901 is connected to the input of the fourth encoder 901, and the output of the fourth encoder 901 is connected to the LSTM neural network.
Alternatively, the multi-head self-attention processing performed by the multi-head self-attention layer 9011 in the present disclosure may be expressed in the form of the following formula (3):
MultiHeadAttention = Concat(head_1, head_2, ..., head_m)    Formula (3)

In the above formula (3), MultiHeadAttention represents the output of the multi-head self-attention layer 9011; head_1, head_2, ..., head_m respectively represent the output of the first head, the output of the second head, ..., and the output of the m-th head in the multi-head self-attention layer; Concat(·) represents connecting the outputs of the m heads, i.e., splicing the outputs of the m heads together.
Each of head_1, head_2, ..., head_m in the above formula (3) can be expressed in the form of the following formula (4):

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    Formula (4)

In the above formula (4), Q = K = V, each being the vector matrix of the conversation sentence of the current turn; W_i^Q represents the parameter matrix applied to Q in the i-th calculation (i.e., the i-th of the m heads) for the current conversation turn; W_i^K represents the parameter matrix applied to K in the i-th calculation for the current conversation turn; W_i^V represents the parameter matrix applied to V in the i-th calculation for the current conversation turn; and Attention(·) denotes the operation performed by the self-attention layer on the vector matrices input to it. For example, Attention(·) may be expressed in the form of the following formula (5):
Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V    Formula (5)

In the above formula (5), Softmax is the function used for classification; Softmax(x) represents performing the classification operation on the input vector matrix based on the parameter x; K^T represents the transpose matrix of K; and sqrt(d_k) represents the square root of the vector dimension d_k (e.g., 128), which scales the dot products before the Softmax.
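The following Python (NumPy) sketch puts formulas (3) to (5) together; the sequence length, the number of heads m = 8, the per-head dimension, and the random parameter matrices are illustrative assumptions. Consistent with formula (3), the head outputs are simply spliced, with no additional output projection.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        """Formula (5): Softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.shape[-1]
        return softmax(Q @ K.T / np.sqrt(d_k)) @ V

    def multi_head_self_attention(X, W_q, W_k, W_v):
        """Formulas (3)-(4) with Q = K = V = X; one (W_q, W_k, W_v) triple per head."""
        heads = [attention(X @ wq, X @ wk, X @ wv)     # head_i, formula (4)
                 for wq, wk, wv in zip(W_q, W_k, W_v)]
        return np.concatenate(heads, axis=-1)          # Concat(...), formula (3)

    rng = np.random.default_rng(0)
    seq_len, d_model, m = 12, 128, 8                   # assumed sizes; m heads
    d_head = d_model // m
    X = rng.normal(size=(seq_len, d_model))            # vector matrix of the current turn
    W_q = [rng.normal(0, 0.02, (d_model, d_head)) for _ in range(m)]
    W_k = [rng.normal(0, 0.02, (d_model, d_head)) for _ in range(m)]
    W_v = [rng.normal(0, 0.02, (d_model, d_head)) for _ in range(m)]
    out = multi_head_self_attention(X, W_q, W_k, W_v)  # shape (12, 128)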
The feedforward neural network layer 9013 may include two layers, and the operation performed by the feedforward neural network layer 9013 may be expressed in the form of the following equation (6):
FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2    Formula (6)

In the above formula (6), Z represents the input of the feedforward neural network layer 9013, i.e., the output of the layer normalization layer 9012; W_1 and b_1 represent the parameters of the first layer in the feedforward neural network layer 9013; and W_2 and b_2 represent the parameters of the second layer in the feedforward neural network layer 9013.
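A minimal Python (NumPy) sketch of formula (6) together with a per-position layer normalization follows; the inner width d_ff = 512 and the random weights are assumptions, and the layer normalization omits the usual learned gain and bias for brevity.

    import numpy as np

    def ffn(Z, W1, b1, W2, b2):
        """Formula (6): FFN(Z) = max(0, Z W1 + b1) W2 + b2."""
        return np.maximum(0.0, Z @ W1 + b1) @ W2 + b2

    def layer_norm(Z, eps=1e-6):
        """Per-position normalization, as in layer normalization layers 9012 and 9014."""
        mu = Z.mean(axis=-1, keepdims=True)
        var = Z.var(axis=-1, keepdims=True)
        return (Z - mu) / np.sqrt(var + eps)

    rng = np.random.default_rng(0)
    seq_len, d_model, d_ff = 12, 128, 512
    Z = rng.normal(size=(seq_len, d_model))      # output of layer normalization layer 9012
    W1, b1 = rng.normal(0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(0, 0.02, (d_ff, d_model)), np.zeros(d_model)
    out = layer_norm(ffn(Z, W1, b1, W2, b2))     # input to the next encoder 901

Stacking four such units in series, each preceded by the multi-head self-attention and layer normalization sketched above, reproduces the serial arrangement of encoders 901 shown in fig. 9.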
The manner in which the LSTM neural network in this disclosure generates the h of the nth turn (i.e., h^(n)) and the c of the nth turn (i.e., c^(n)) can be expressed in the form of the following formulas (7) and (8):

h^(n) = o^(n) ⊙ tanh(c^(n))    Formula (7)

c^(n) = c^(n-1) ⊙ f^(n) + i^(n) ⊙ a^(n)    Formula (8)
In the above formulas (7) and (8), ⊙ represents the element-wise product of two vectors; o^(n) can be expressed by the following formula (9); f^(n) can be expressed by the following formula (10); i^(n) can be expressed by the following formula (11); and a^(n) can be expressed by the following formula (12).
o^(n) = σ(W_o h^(n-1) + U_o x^(n) + b_o)    Formula (9)

f^(n) = σ(W_f h^(n-1) + U_f x^(n) + b_f)    Formula (10)

i^(n) = σ(W_i h^(n-1) + U_i x^(n) + b_i)    Formula (11)

a^(n) = tanh(W_a h^(n-1) + U_a x^(n) + b_a)    Formula (12)
In the above formulas (9) to (12), n represents the nth turn of the dialog; σ is the sigmoid activation function; W_o, U_o, b_o, W_f, U_f, b_f, W_i, U_i, b_i, W_a, U_a, and b_a are network parameters of the LSTM neural network (some of these network parameters are specifically parameter matrices); x^(n) represents the output of the Transformer for the nth turn; and h^(n-1) and c^(n-1) are the h and c of the (n-1)th turn.
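To tie formulas (7) to (12) back to the data flow of fig. 8, the following Python (NumPy) sketch unrolls the LSTM cell over several turns. The dimensions, the number of turns, and the randomly initialized parameters are assumptions (hidden_size = 512 follows the hyper-parameter example below), and in the disclosed apparatus one such parameter set would exist per slot.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    d, hidden = 128, 512                           # round_vec dimension and hidden_size

    # Stand-ins for the trained LSTM parameters of formulas (9)-(12), one set per gate.
    W = {g: rng.normal(0, 0.02, (hidden, hidden)) for g in "ofia"}
    U = {g: rng.normal(0, 0.02, (hidden, d)) for g in "ofia"}
    b = {g: np.zeros(hidden) for g in "ofia"}

    def lstm_cell(x_n, h_prev, c_prev):
        """One turn: x_n is the Transformer output (round_vec) for the nth turn."""
        o = sigmoid(W["o"] @ h_prev + U["o"] @ x_n + b["o"])   # formula (9)
        f = sigmoid(W["f"] @ h_prev + U["f"] @ x_n + b["f"])   # formula (10)
        i = sigmoid(W["i"] @ h_prev + U["i"] @ x_n + b["i"])   # formula (11)
        a = np.tanh(W["a"] @ h_prev + U["a"] @ x_n + b["a"])   # formula (12)
        c = c_prev * f + i * a                                 # formula (8), ⊙ is elementwise
        h = o * np.tanh(c)                                     # formula (7)
        return h, c

    h, c = np.zeros(hidden), np.zeros(hidden)      # "init" in fig. 8
    for n in range(1, 4):                          # three turns, for illustration
        round_vec = rng.normal(size=d)             # stand-in for roundn_vec
        h, c = lstm_cell(round_vec, h, c)          # h feeds the fully-connected + Softmax readout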
The Transformer, the LSTM neural network, the fully-connected layer, and the Softmax layer in the present disclosure should be trained together, and the labels of the training data may be the intention labels of historical conversation sentences. The present disclosure may determine the predicted intention of the apparatus of the present disclosure from the slot value probability distribution output by the Softmax layer, and adjust the network parameters of the Transformer, LSTM neural network, fully-connected layer, and Softmax layer based on the loss resulting from the difference between the predicted intention and the intention label. In addition, during training, the number of hidden nodes (hidden_size) in the hyper-parameters used by the LSTM neural network may be 512, the learning rate may be 0.01, and the dropout probability may be 0.5.
Training the Transformer, the LSTM neural network, the fully-connected layer, and the Softmax layer together avoids a problem that arises when the NLU and the DST are trained separately: if the NLU recognizes the intention of a conversation sentence inaccurately, the DST is then trained on that inaccurate intention. Joint training therefore improves the training effect.
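The disclosure above does not name a specific loss function; as one hedged example, a cross-entropy (negative log-likelihood) between the Softmax output and the labeled value is a common choice for such a classification readout and is sketched below in Python.

    import numpy as np

    def cross_entropy(pred_dist: np.ndarray, label_idx: int, eps: float = 1e-12) -> float:
        """Negative log-likelihood of the labeled value under the predicted distribution."""
        return -float(np.log(pred_dist[label_idx] + eps))

    # Example: a predicted slot value probability distribution vs. its label.
    pred = np.array([0.05, 0.70, 0.10, 0.15])
    loss = cross_entropy(pred, label_idx=1)   # ~0.357
    # Backpropagating this loss through the Softmax layer, fully-connected layer,
    # LSTM neural network, and Transformer updates all four components jointly.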
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 10. FIG. 10 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 10, the electronic device 101 includes one or more processors 1011 and memory 1012.
The processor 1011 may be a Central Processing Unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 101 to perform desired functions.
Memory 1012 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory, for example, may include: random Access Memory (RAM) and/or cache memory (cache), etc. The nonvolatile memory, for example, may include: read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 1011 to implement the methods for implementing human-machine dialogs of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 101 may further include an input device 1013, an output device 1014, and the like, which are interconnected by a bus system and/or another form of connection mechanism (not shown). The input device 1013 may include, for example, a keyboard and a mouse. The output device 1014 may output various kinds of information to the outside and may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 101 relevant to the present disclosure are shown in fig. 10, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 101 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method for implementing human-machine dialogs according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a method for implementing human-machine dialog according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium may include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," and "having" are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects, and the like, will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method for implementing a human-machine dialog, comprising:
performing multi-head self-attention-based semantic feature extraction processing on a conversation statement of a current time sequence of a conversation party to obtain a semantic feature vector of the conversation statement;
acquiring hidden layer state information of the current time sequence according to the semantic feature vector and the hidden layer state information of the previous time sequence of the current time sequence;
and generating the dialogue state information of the dialogue statement according to the hidden layer state information of the current time sequence.
2. The method of claim 1, wherein the performing a multi-headed self-attention based semantic feature extraction process on the current time-series conversational sentence of the conversing party comprises:
acquiring word segmentation vectors of all the word segments contained in the current time sequence conversation statement of the conversation party;
forming a vector matrix of the conversation statement according to the word segmentation vector of each word segmentation and the position coding information of each word segmentation;
and performing multi-head self-attention-based semantic feature extraction processing on the vector matrix of the conversation statement to obtain a semantic feature vector of the conversation statement.
3. The method of claim 2, wherein the performing a multi-headed self-attention based semantic feature extraction process on the vector matrix of conversational utterances comprises:
performing multi-head self-attention processing, first-layer normalization processing, feedforward processing and second-layer normalization processing on the vector matrix of the conversational statement at least once;
and obtaining the semantic feature vector of the conversation statement according to the result of the last second-layer normalized processing.
4. The method according to any one of claims 1 to 3, wherein the obtaining hidden layer state information of the current time sequence according to the semantic feature vector and hidden layer state information of a time sequence previous to the current time sequence comprises:
and performing long-term and short-term memory processing on the semantic feature vector and the hidden layer state information of the previous time sequence of the current time sequence to obtain the hidden layer state information of the current time sequence.
5. The method of any of claims 1-4, wherein the hidden layer state information comprises: long-short term memory hidden state information and long-short term memory cell state information.
6. The method of claim 5, wherein the generating dialog state information for the conversational sentence from the hidden layer state information for the current time sequence comprises:
classifying the hidden layer state information of the current time sequence;
and obtaining the dialog state information of the dialog statement according to the classification processing result.
7. The method of any of claims 1 to 6, wherein the dialog state information of the conversational sentence comprises: slot value probability distribution for all slots.
8. An apparatus for implementing a human-machine conversation, wherein the apparatus comprises:
the semantic feature extraction module is used for extracting and processing the semantic features of the conversation sentences of the current time sequence of the conversation party based on multi-head self-attention to obtain semantic feature vectors of the conversation sentences;
a hidden layer state obtaining module, configured to obtain hidden layer state information of the current time sequence according to the semantic feature vector and hidden layer state information of a time sequence previous to the current time sequence;
and the dialog state information generating module is used for generating the dialog state information of the conversation statement according to the hidden layer state information of the current time sequence.
9. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any of the preceding claims 1-7.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-7.
CN201911241788.4A 2019-12-06 2019-12-06 Method, device, medium and electronic equipment for realizing man-machine conversation Pending CN111125326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911241788.4A CN111125326A (en) 2019-12-06 2019-12-06 Method, device, medium and electronic equipment for realizing man-machine conversation


Publications (1)

Publication Number Publication Date
CN111125326A true CN111125326A (en) 2020-05-08

Family

ID=70496295




Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190109670A (en) * 2018-03-09 2019-09-26 강원대학교산학협력단 User intention analysis system and method using neural network
CN108874782A (en) * 2018-06-29 2018-11-23 北京寻领科技有限公司 A kind of more wheel dialogue management methods of level attention LSTM and knowledge mapping
CN109036380A (en) * 2018-07-04 2018-12-18 苏州思必驰信息科技有限公司 Dialogue state tracking, system, electronic equipment and storage medium
CN110188167A (en) * 2019-05-17 2019-08-30 北京邮电大学 A kind of end-to-end session method and system incorporating external knowledge

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DU XIAOYU et al.: "Dialogue State Tracking Based on Cascaded Neural Networks", China Science and Technology Papers Online (《中国科技论文在线》) *
YANG GUOHUA: "Research and Implementation of Dialogue State Tracking Technology Based on Cascaded Neural Networks", China Doctoral Dissertations Full-text Database, Information Science and Technology (《中国博士学位论文全文数据库 信息科技辑》) *
Xiduoshi NLP (西多士NLP): "Three Major Feature Extractors (RNN/CNN/Transformer)", Cnblogs (《博客园》) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666400A (en) * 2020-07-10 2020-09-15 腾讯科技(深圳)有限公司 Message acquisition method and device, computer equipment and storage medium
CN111666400B (en) * 2020-07-10 2023-10-13 腾讯科技(深圳)有限公司 Message acquisition method, device, computer equipment and storage medium
CN112133304A (en) * 2020-09-18 2020-12-25 中科极限元(杭州)智能科技股份有限公司 Low-delay speech recognition model based on feedforward neural network and training method
CN112133304B (en) * 2020-09-18 2022-05-06 中科极限元(杭州)智能科技股份有限公司 Low-delay speech recognition model based on feedforward neural network and training method
CN112131861A (en) * 2020-11-25 2020-12-25 中国科学院自动化研究所 Dialog state generation method based on hierarchical multi-head interaction attention
CN112131861B (en) * 2020-11-25 2021-03-16 中国科学院自动化研究所 Dialog state generation method based on hierarchical multi-head interaction attention
WO2022142028A1 (en) * 2020-12-28 2022-07-07 平安科技(深圳)有限公司 Dialog state determination method, terminal device and storage medium
CN116089593A (en) * 2023-03-24 2023-05-09 齐鲁工业大学(山东省科学院) Multi-pass man-machine dialogue method and device based on time sequence feature screening coding module
KR102610897B1 (en) * 2023-03-24 2023-12-07 치루 유니버시티 오브 테크놀로지 (산동 아카데미 오브 사이언시스) Method and device for multi-pass human-machine conversation based on time sequence feature screening and encoding module
US12014149B1 (en) 2023-03-24 2024-06-18 Qilu University Of Technology (Shandong Academy Of Sciences) Multi-turn human-machine conversation method and apparatus based on time-sequence feature screening encoding module


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201020

Address after: 100085 Floor 102-1, Building No. 35, West Second Banner Road, Haidian District, Beijing

Applicant after: Seashell Housing (Beijing) Technology Co.,Ltd.

Address before: 300457 Unit 5, Room 112, Office Building C, Nangang Industrial Zone, Binhai New Area Economic and Technological Development Zone, Tianjin

Applicant before: BEIKE TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200508