CN110955765A - Corpus construction method and apparatus of intelligent assistant, computer device and storage medium - Google Patents


Info

Publication number
CN110955765A
CN110955765A (application number CN201911158765.7A)
Authority
CN
China
Prior art keywords
question
answer
text
answer text
model
Prior art date
Legal status
Pending
Application number
CN201911158765.7A
Other languages
Chinese (zh)
Inventor
林志达
吴石松
吴丹
Current Assignee
China Southern Power Grid Co Ltd
Southern Power Grid Digital Grid Research Institute Co Ltd
Original Assignee
China Southern Power Grid Co Ltd
Southern Power Grid Digital Grid Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by China Southern Power Grid Co Ltd, Southern Power Grid Digital Grid Research Institute Co Ltd filed Critical China Southern Power Grid Co Ltd
Priority to CN201911158765.7A
Publication of CN110955765A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a corpus construction method and apparatus for an intelligent assistant, a computer device, and a storage medium. The method comprises the following steps: extracting current question and answer text data from a system log of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed; performing a text vectorization operation on the current question and answer text data to obtain a current question and answer text vector; inputting the current question and answer text vector into a corpus construction model, the corpus construction model having been trained on question and answer text data extracted from the system log; and obtaining a question text and an answer text according to an output result of the corpus construction model, associating the obtained question text with the answer text, and taking the associated question text and answer text as question and answer corpus data of the intelligent assistant. By adopting the method, the construction period of the corpus of the intelligent assistant can be shortened, and the applicability of the constructed question and answer corpus data to its application scenario can be improved.

Description

Corpus construction method and apparatus of intelligent assistant, computer device and storage medium
Technical Field
The present application relates to the field of power technologies, and in particular, to a corpus construction method and apparatus for an intelligent assistant, a computer device, and a storage medium.
Background
With the rapid development of the power industry and artificial intelligence technologies, more and more power services are beginning to use intelligent assistants (also referred to as "intelligent conversation assistants," or simply "assistants"). A user may chat, speak, or otherwise interact with the intelligent assistant through a user interface provided by a computer device, causing the assistant to output information in response to the user's needs or to perform certain operations. The intelligent assistant is similar to a chatbot in its implementation logic but adds a business-processing flow: it can handle related business according to the results returned by dialogue management.
As an important branch of the field of artificial intelligence, intelligent assistants have gained increasing attention and application. An intelligent assistant interacts with the user through natural language understanding and a question-and-answer system, where the acquisition and construction of the original corpus is crucial to the assistant's response accuracy and functional coverage. Existing corpus data for intelligent assistants is mainly constructed through manual collection and labeling or from templates, so building a corpus takes a long time, and the resulting corpus data is often poorly suited to the corresponding application scenario.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a corpus construction method, an apparatus, a computer device, and a storage medium for an intelligent assistant, which can shorten a period of corpus construction of the intelligent assistant and improve applicability to an application scenario.
A corpus construction method of an intelligent assistant comprises the following steps:
extracting current question and answer text data from a system log of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed;
performing text vectorization operation on the current question and answer text data to obtain a current question and answer text vector;
inputting the current question and answer text vector into a corpus construction model, the corpus construction model being obtained by training on question and answer text data extracted from the system log;
and obtaining a question text and an answer text according to an output result of the corpus construction model, associating the obtained question text with the answer text, and taking the associated question text and the associated answer text as question and answer corpus data of the intelligent assistant.
In one embodiment, the training process of the corpus building model includes:
extracting question and answer text data from a system log, and dividing a training sample set from the question and answer text data, wherein the training sample set comprises a plurality of first question and answer text data;
performing text vectorization operation on the plurality of first question and answer text data to obtain first question and answer text vectors;
constructing an adversarial network, wherein the adversarial network comprises a generator model and a discriminator model; the generator model is used for generating a first question text vector and a first answer text vector corresponding to the first question and answer text vector according to the first question and answer text vector; the discriminator model is used for discriminating the authenticity of the first question text vector and the first answer text vector;
and training based on the generative adversarial network to obtain a corpus construction model.
In one embodiment, the training based on the generative adversarial network to obtain the corpus construction model includes:
acquiring a default real sample set, wherein the default real sample set comprises a default real question text and a default real answer text corresponding to the default real question text;
performing text vectorization operation on the default real question text and the default real answer text to obtain a second question text vector and a second answer text vector corresponding to the second question text vector;
training the adversarial network with the first question text vector, the first answer text vector, the second question text vector, and the second answer text vector, wherein the discriminator model is used for outputting a first probability and a second probability, the first probability being the probability that an input sample is judged to come from the first question text vector and the first answer text vector, and the second probability being the probability that an input sample is judged to come from the second question text vector and the second answer text vector; an objective function of the adversarial network is used for optimizing the network parameters, so that the objective of minimizing the first probability and the objective of maximizing the second probability play a game against each other until equilibrium is reached;
and obtaining a corpus construction model from the trained adversarial network.
In one embodiment, the obtaining a corpus construction model from the trained adversarial network includes:
dividing a test sample set from the question and answer text data, wherein the test sample set comprises a plurality of second question and answer text data;
performing text vectorization on the plurality of second question and answer text data to obtain second question and answer text vectors;
testing the trained adversarial network by using the second question and answer text vector to obtain a test result;
and if the test result meets the test condition, taking the trained adversarial network as the corpus construction model.
In one embodiment, the generator model includes a question generator model and an answer generator model; the question generator model is used for generating a first question text vector according to the first question and answer text vector, and the answer generator model is used for generating a first answer text vector according to the first question and answer text vector.
In one embodiment, the question generator model uses an encoder-decoder model with an attention mechanism, the encoding layer and decoding layer of which use a GRU model, and the answer generator model uses an LSTM model combined with a dialogue generation model.
In one embodiment, the extracting current question and answer text data from the system log of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed includes:
extracting initial corpus data and/or user behavior data from the system log, wherein the user behavior data comprises user click behavior data, user search behavior data and/or user dialogue data;
and obtaining current question and answer text data according to the initial corpus data, the user click behavior data, the user search behavior data and/or the user dialogue data.
In one embodiment, the training process of the corpus building model further includes: and acquiring generation time information of the question and answer text data, and dividing the question and answer text data into a training sample set and a test sample set according to the generation time information.
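The time-based train/test split described above can be sketched as follows; the `timestamp` field and the 80/20 ratio are assumptions for illustration, not values from the patent.

```python
def split_by_time(samples, train_ratio=0.8):
    """Sort question-answer samples by generation time and put the
    earliest portion into the training set, the remainder into the
    test set, so the two sets do not intersect."""
    ordered = sorted(samples, key=lambda s: s["timestamp"])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]

samples = [{"timestamp": t, "text": f"qa-{t}"} for t in (3, 1, 4, 2, 5)]
train, test = split_by_time(samples)
```

Splitting on time (rather than randomly) keeps the test set representative of the most recent user behavior.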
A corpus construction apparatus of a smart assistant, the apparatus comprising:
the data acquisition module is used for extracting current question and answer text data from a system log of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed;
the vectorization operation module is used for performing text vectorization operation on the current question and answer text data to obtain a current question and answer text vector;
the input module is used for inputting the current question-answer text vector into a corpus construction model, and the corpus construction model is obtained by training according to the question-answer text data extracted from the system log;
and the corpus building module is used for obtaining a question text and an answer text according to an output result of the corpus building model, associating the obtained question text with the answer text, and taking the associated question text and the associated answer text as question and answer corpus data of the intelligent assistant.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
extracting current question and answer text data from a system log of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed;
performing text vectorization operation on the current question and answer text data to obtain a current question and answer text vector;
inputting the current question and answer text vector into a corpus construction model, the corpus construction model being obtained by training on question and answer text data extracted from the system log;
and obtaining a question text and an answer text according to an output result of the corpus construction model, associating the obtained question text with the answer text, and taking the associated question text and the associated answer text as question and answer corpus data of the intelligent assistant.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
extracting current question and answer text data from a system log of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed;
performing text vectorization operation on the current question and answer text data to obtain a current question and answer text vector;
inputting the current question and answer text vector into a corpus construction model, the corpus construction model being obtained by training on question and answer text data extracted from the system log;
and obtaining a question text and an answer text according to an output result of the corpus construction model, associating the obtained question text with the answer text, and taking the associated question text and the associated answer text as question and answer corpus data of the intelligent assistant.
The corpus construction method, apparatus, computer device, and storage medium of the intelligent assistant extract current question and answer text data from a system log of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed; perform a text vectorization operation on the current question and answer text data to obtain a current question and answer text vector; input the current question and answer text vector into a corpus construction model, the corpus construction model being obtained by training on question and answer text data extracted from the system log; obtain a question text and an answer text according to an output result of the corpus construction model; associate the obtained question text with the answer text; and use the associated question text and answer text as question and answer corpus data of the intelligent assistant. In these embodiments, the corpus is constructed from the current question and answer text data extracted from the system log of the electric power operation management system corresponding to the intelligent assistant, and the corpus construction model is trained on question and answer text data extracted from the same system log, so the resulting question and answer corpus data matches the intelligent assistant whose corpus is to be constructed, improving the corpus's applicability to its application scenario. Meanwhile, because the question text and answer text are obtained from the model's output result and associated automatically, manual labeling work is reduced and the period for constructing the corpus of the intelligent assistant can be shortened.
Drawings
FIG. 1 is a diagram of an embodiment of an application environment for a corpus construction method of an intelligent assistant;
FIG. 2 is a flowchart illustrating a corpus construction method of an intelligent assistant in one embodiment;
FIG. 3 is a schematic flow chart illustrating a corpus construction model training process in one embodiment;
FIG. 4 is a schematic flow chart illustrating the training step based on the generative adversarial network in one embodiment;
FIG. 5 is a flow diagram illustrating the testing steps for a trained adversarial network in one embodiment;
FIG. 6 is a flowchart illustrating a corpus construction method of an intelligent assistant in one embodiment;
FIG. 7 is a diagram of an LSTM + DA model of the answer generation model in one embodiment;
FIG. 8 is a schematic diagram of a CNN-based discriminator model;
FIG. 9 is a block diagram of a corpus construction device of the intelligent assistant in one embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It is to be understood that the term "and/or", as used herein, describes an associative relationship between associated objects, meaning that three relationships may exist; e.g., "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The corpus construction method of the intelligent assistant provided by the application can be applied to the application environment shown in fig. 1. The application environment includes a terminal 102, a server 104, a database device 106, and a network 108; the terminal 102, the server 104, and the database device 106 may be communicatively connected via the network 108. The network system formed by the terminal 102, the server 104, the database device 106, and the network 108 may be based on the internet, on a local area network, or on a combination of the two, which is not described in detail here.
The terminal 102 may be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers. The database device 106 includes one or more databases or database clusters and is configured to store the relevant system logs, for example, the system logs of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed; the database device 106 may also be connected to that electric power operation management system. The network 108 is used to implement the network connections between the terminal 102 and the server 104, between the server 104 and the database device 106, and so on, and may include various types of wired or wireless networks. The network 108 may include the internet, a local area network ("LAN"), a wide area network ("WAN"), an intranet, a mobile phone network, a Virtual Private Network (VPN), a cellular or other mobile communication network, Bluetooth, NFC, or any combination thereof. Data transmission over the network 108 may also use a corresponding communication protocol; for example, a web browser may receive the service code corresponding to a web page over HTTP, and a mobile application may receive service code over the WebSocket protocol.
In a specific application, after receiving a corpus construction instruction sent by the terminal 102, the server 104 may determine the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed, extract current question and answer text data from the system log of that system, perform a text vectorization operation on the current question and answer text data to obtain a current question and answer text vector, and input the current question and answer text vector into a corpus construction model, the corpus construction model being obtained by training on question and answer text data extracted from the system log. The server then obtains a question text and an answer text according to an output result of the corpus construction model, associates the obtained question text with the answer text, and uses the associated question text and answer text as question and answer corpus data of the intelligent assistant. The question and answer corpus data obtained by the server 104 can be fed back to the terminal 102, stored locally, or sent to other computer devices. The resulting corpus data may be used to train a dialogue generation model that generates dialogue between the user and the intelligent assistant.
In an embodiment, as shown in fig. 2, a corpus building method of an intelligent assistant is provided, which is described by taking the method as an example for being applied to the server in fig. 1, and includes the following steps:
step 202, extracting current question and answer text data from a system log of the power operation management system corresponding to the intelligent assistant of the corpus to be constructed.
When a user logs in to an electric power operation management system (also referred to as an intelligent assistant operation management and control platform), a large amount of user behavior sequence data, together with a small amount of dialogue text between the user and the system's intelligent assistant or human customer service, is recorded in the system log. The server can therefore extract this information, namely the current question and answer text data, from the system log of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed; the user behavior sequence data may include user click behavior data, user search behavior data, and/or user dialogue data. For example: the user clicks a "human resources" button and jumps to the human resources billboard, which serves as user click behavior data; the user searches for a certain financial index and the system displays its data, which serves as user search behavior data; the user asks the human customer service "how do I query the XX index" and is told "you need to enter the XX billboard first and then click the XX index", which serves as user dialogue data.
Specifically, the server determines an intelligent assistant to be constructed with linguistic data, obtains a system log of an electric power operation management system corresponding to the intelligent assistant, and extracts current question and answer text data from the system log.
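The patent does not detail how the log entries are turned into question and answer text; as an illustration only, a minimal sketch might look like the following, where the record fields (`type`, `user_text`, `reply_text`, etc.) are assumptions, not taken from the patent.

```python
# Hypothetical sketch of pulling question/answer text out of system-log records.

def extract_qa_text(log_records):
    """Collect dialogue, search, and click entries as raw question/answer pairs."""
    qa_texts = []
    for record in log_records:
        kind = record.get("type")
        if kind == "dialogue":
            # A user/assistant (or user/customer-service) exchange is already a pair.
            qa_texts.append((record["user_text"], record["reply_text"]))
        elif kind == "search":
            # A search query and the result shown can be treated as a Q/A pair.
            qa_texts.append((record["query"], record["result_summary"]))
        elif kind == "click":
            # A click can be phrased as an implicit question about the target page.
            qa_texts.append((record["button_label"], record["target_page"]))
    return qa_texts

logs = [
    {"type": "dialogue", "user_text": "How do I query the XX index?",
     "reply_text": "Enter the XX billboard first, then click the XX index."},
    {"type": "click", "button_label": "human resources",
     "target_page": "human resources billboard"},
]
pairs = extract_qa_text(logs)
```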
And step 204, performing text vectorization operation on the current question and answer text data to obtain a current question and answer text vector.
Specifically, the server may convert the question and answer text data into vector form through a word2vec model, thereby obtaining the current question and answer text vector. word2vec is a family of related models used to generate word vectors: shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words.
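As a dependency-free illustration of text vectorization: a real system would use a trained word2vec model (e.g. via gensim) rather than this hashing stand-in, which merely maps words to fixed buckets and averages.

```python
import hashlib

def text_to_vector(text, dim=16):
    """Toy stand-in for word2vec: hash each word to a fixed bucket and
    average, so texts sharing words get similar vectors. A trained
    word2vec model would produce dense semantic vectors instead."""
    vec = [0.0] * dim
    words = text.lower().split()
    for w in words:
        h = int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    n = max(len(words), 1)
    return [v / n for v in vec]

v = text_to_vector("how to query the XX index")
```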
Step 206, inputting the current question and answer text vector into a corpus construction model, wherein the corpus construction model is obtained by training on question and answer text data extracted from the system log corresponding to the intelligent assistant whose corpus is to be constructed.
Specifically, the server may train a model in advance on question and answer text data extracted from the system log corresponding to the intelligent assistant whose corpus is to be constructed, thereby obtaining the corpus construction model; after obtaining a current question and answer text vector, it inputs the vector into the corpus construction model to obtain an output result, which generally includes at least a question text vector and an answer text vector.
And 208, obtaining a question text and an answer text according to the output result of the corpus building model, associating the obtained question text with the answer text, and taking the associated question text and the associated answer text as question and answer corpus data of the intelligent assistant.
Specifically, the server obtains a question text and an answer text according to an output result of the corpus building model, associates the obtained question text with the answer text, and uses the associated question text and the associated answer text as question and answer corpus data of the intelligent assistant.
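The association step can be sketched minimally as below; the assumption that the model emits question texts and answer texts in matching order is for illustration only.

```python
def build_corpus(question_texts, answer_texts):
    """Associate each question text with its answer text to form
    question-answer corpus entries for the intelligent assistant."""
    return [{"question": q, "answer": a}
            for q, a in zip(question_texts, answer_texts)]

corpus = build_corpus(
    ["How do I query the XX index?"],
    ["Enter the XX billboard first, then click the XX index."],
)
```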
In this embodiment, the corpus is constructed from the current question and answer text data extracted from the system log of the electric power operation management system corresponding to the intelligent assistant whose corpus is to be constructed, and the corpus construction model is trained on question and answer text data extracted from the same system log, so the resulting question and answer corpus data matches the intelligent assistant, improving the corpus's applicability to its application scenario. Meanwhile, the current question and answer text data is vectorized to obtain a current question and answer text vector, the vector is input into the corpus construction model, a question text and an answer text are obtained from the model's output result and associated, and the associated question text and answer text are used as the question and answer corpus data of the intelligent assistant; this reduces manual labeling work and can shorten the period for constructing the corpus of the intelligent assistant.
In one embodiment, as shown in fig. 3, the above training process of the corpus building model may include the following steps:
step 302, extracting question and answer text data from a system log of the power operation management system corresponding to the intelligent assistant of the corpus to be constructed, and dividing a training sample set from the question and answer text data, wherein the training sample set comprises a plurality of first question and answer text data.
Generally, the question and answer text data needs to be divided into a training sample set and a test sample set: the training sample set is used for model training, and the test sample set is used for testing the trained model. The two sets generally do not intersect.
Step 304, performing text vectorization operation on the plurality of first question and answer text data to obtain first question and answer text vectors.
Specifically, the server may convert the plurality of first question and answer text data into a vector form through a word2vec model, so as to obtain a first question and answer text vector.
Step 306, constructing an adversarial network, wherein the adversarial network comprises a generator model and a discriminator model; the generator model is used for generating, from the first question and answer text vector, a corresponding first question text vector and first answer text vector; the discriminator model is used for discriminating the authenticity of the first question text vector and the first answer text vector.
Here, the generator model generally includes a question generator model and an answer generator model; the question generator model is used for generating a first question text vector according to the first question and answer text vector, and the answer generator model is used for generating a first answer text vector according to the first question and answer text vector.
In one embodiment, the question generator model adopts an encoder-decoder model with an attention mechanism, whose encoding layer and decoding layer both use a GRU (Gated Recurrent Unit) model. Because a GRU has one fewer gate than an LSTM (Long Short-Term Memory) model, this can improve the training speed of the question generator model.
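For reference, a single GRU step can be sketched in NumPy. This is the standard GRU formulation, not code from the patent; it shows concretely why a GRU has fewer gates (two, plus a candidate state) than an LSTM.

```python
import numpy as np

def gru_cell(x, h_prev, params):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde.
    No separate cell state or output gate, unlike an LSTM cell."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde        # interpolated new state

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 3
# Input-to-hidden matrices are (dim_h, dim_x); hidden-to-hidden are (dim_h, dim_h).
params = [rng.normal(size=(dim_h, dim_x)) if i % 2 == 0
          else rng.normal(size=(dim_h, dim_h)) for i in range(6)]
h = gru_cell(rng.normal(size=dim_x), np.zeros(dim_h), params)
```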
In one embodiment, the answer generator model uses an LSTM model combined with a dialogue generation model. A dialogue act (DA) model is introduced into the LSTM model to determine which information is retained for future time steps and which is discarded, so that the answer generator model can be optimized.
Step 308, training based on the generative countermeasure network is performed to obtain a corpus construction model.
Specifically, training of the generator model and the discriminator model may be performed alternately until both the generator model and the discriminator model converge, and the corpus building model is finally obtained.
In this embodiment, training based on the generative countermeasure network makes the first question text vector and the first answer text vector generated by the generator model as close to real sample data as possible, so that the model is optimized as much as possible.
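The alternating training scheme can be sketched as follows. The two step functions and the loss-change convergence test are stand-ins for the real generator and discriminator updates, which the patent leaves to the chosen deep learning framework.

```python
def train_adversarially(train_d_step, train_g_step, max_rounds=100, tol=1e-3):
    """Alternate discriminator and generator updates until both losses settle.

    `train_d_step` / `train_g_step` each perform one update and return a loss;
    convergence here is a simple loss-change criterion, a stand-in for the
    convergence test a real implementation would use."""
    prev_d, prev_g = float("inf"), float("inf")
    for r in range(max_rounds):
        d_loss = train_d_step()
        g_loss = train_g_step()
        if abs(prev_d - d_loss) < tol and abs(prev_g - g_loss) < tol:
            return r + 1, d_loss, g_loss
        prev_d, prev_g = d_loss, g_loss
    return max_rounds, d_loss, g_loss

# Toy losses that decay toward fixed points, standing in for real updates.
state = {"d": 1.0, "g": 1.0}
def d_step():
    state["d"] *= 0.5
    return state["d"]
def g_step():
    state["g"] *= 0.5
    return state["g"]

rounds, d_loss, g_loss = train_adversarially(d_step, g_step)
```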
In one embodiment, as shown in fig. 4, the training based on generation of the confrontation network to obtain the corpus construction model includes the following steps:
step 402, acquiring a default real sample set, wherein the default real sample set comprises a default real question text and a default real answer text corresponding to the default real question text;
in particular, the user dialog data may be taken as a default set of true samples.
Step 404, performing text vectorization operation on the default real question text and the default real answer text to obtain a second question text vector and a second answer text vector corresponding to the second question text vector;
step 406, training the countermeasure network by using the first question text vector, the first answer text vector, the second question text vector and the second answer text vector, wherein the discriminator model is used for outputting a first probability and a second probability: the first probability is the probability that the input sample is judged to come from the first question text vector and the first answer text vector, and the second probability is the probability that the input sample is judged to come from the second question text vector and the second answer text vector. An objective function of the countermeasure network is used for optimizing the network parameters of the countermeasure network, so that the objective function minimizing the first probability and the objective function maximizing the second probability play a game against each other until balance is reached.
And the first question text vector and the corresponding first answer text vector or the second question text vector and the corresponding second answer text vector are used as input samples of the discriminator model.
And step 408, obtaining a corpus construction model according to the trained confrontation network.
Specifically, the trained confrontation network may be directly used as the corpus construction model, or the trained confrontation network satisfying the specified condition may be used as the corpus construction model.
In one embodiment, as shown in fig. 5, the obtaining a corpus construction model according to the trained confrontation network may include:
step 502, dividing a test sample set from the question and answer text data, wherein the test sample set comprises a plurality of second question and answer text data;
step 504, performing text vectorization on the plurality of second question and answer text data to obtain second question and answer text vectors;
specifically, the server may convert the plurality of second question and answer text data into a vector form through a word2vec model, so as to obtain a second question and answer text vector.
Step 506, testing the trained confrontation network by using a second question-and-answer text vector to obtain a test result;
here, the test result may be a quality score value.
And step 508, if the test result meets the test condition, taking the trained confrontation network as a corpus construction model.
Here, the test condition may be set according to actual needs, for example, the quality score value being greater than a preset threshold value.
If the test result does not meet the test condition, prompt information can be generated to prompt the user that the trained confrontation network does not meet the requirements, and the user can further optimize or adjust the model according to the prompt information.
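A minimal sketch of the test-condition check described above, assuming the quality-score form of the condition (the threshold value here is illustrative):

```python
def meets_test_condition(quality_score, threshold=0.8):
    """Accept the trained network as the corpus construction model only when
    its quality score on the test set exceeds the preset threshold."""
    return quality_score > threshold

# One accepted network and one that would trigger the prompt information.
accepted = meets_test_condition(0.91)
needs_more_training = not meets_test_condition(0.42)
```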
In one embodiment, the extracting current question and answer text data from the system log of the power operation management system corresponding to the intelligent assistant to be used for constructing the corpus includes: extracting initial corpus data and/or user behavior data from a system log of an electric power operation management system corresponding to an intelligent assistant of a corpus to be constructed, wherein the user behavior data comprises user click behavior data, user search behavior data and/or user dialogue data; and obtaining current question and answer text data according to the initial corpus data, the user click behavior data, the user search behavior data and/or the user dialogue data.
In one embodiment, the generation time information of the question and answer text data is obtained, and the question and answer text data is divided into a training sample set and a test sample set according to the generation time information.
Specifically, time may be divided into a plurality of consecutive equal-interval or unequal-interval time periods, for example, time period 1, time period 2, time period 3, and so on. The question and answer text data whose generation time falls in an odd time period is divided into the training sample set (or the test sample set), and the question and answer text data whose generation time falls in an even time period is divided into the test sample set (or the training sample set). The length of each odd time period may be equal, the length of each even time period may also be equal, and the ratio of the length of an odd time period to the length of an even time period may be set according to actual needs, for example, 7 to 3.
In this embodiment, the question and answer text data is divided into a training sample set and a test sample set according to the generated time information, so that the influence of time on the model training effect or the test result can be eliminated as much as possible.
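A sketch of the time-based split, assuming equal-length periods for simplicity (the text also allows unequal lengths, e.g. a 7:3 ratio); the record format and period length below are illustrative.

```python
from datetime import datetime, timedelta

def split_by_time(records, start, period=timedelta(days=7)):
    """Split (timestamp, text) records into train/test sets by alternating
    time periods: odd periods go to the training set, even periods to the
    test set (0-based period index: even index = odd period)."""
    train, test = [], []
    for ts, text in records:
        index = int((ts - start) / period)  # 0-based period index
        (train if index % 2 == 0 else test).append(text)
    return train, test

start = datetime(2019, 1, 1)
records = [
    (start + timedelta(days=1), "q1"),   # period 0 -> train
    (start + timedelta(days=8), "q2"),   # period 1 -> test
    (start + timedelta(days=15), "q3"),  # period 2 -> train
]
train, test = split_by_time(records, start)
```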
In order to facilitate understanding of the present application, a preferred embodiment of the present application will be described in detail. As shown in fig. 6, the corpus construction method of the intelligent assistant in this embodiment includes the following steps:
step 602, obtaining a basic corpus, a user access node sequence in a system log, and node text content.
The data processing method adopted by this embodiment is as follows: system log data is acquired from a real intelligent-assistant operation management and control platform. After a user logs in to the platform, there are a large number of behavior sequences such as clicks, and a small number of various dialogue texts addressed to the platform assistant. Each user behavior node (e.g., a click node) corresponds to a piece of key text information. The acquired data can be divided into a training set and a test set according to the time relation, and the labels of the two sets do not intersect. The data processing input may include node numbers and the corresponding key text information, and may also include dialogue text formed from part of the existing access sequences, i.e., the base corpus.
In order to facilitate understanding of the subsequent contents, user click behavior data is taken as the question and answer text data for illustration.
Click behavior: the user clicks the "human resources monitoring" button and enters the human resources dashboard.
The training results of the initial generator model are:
Generator input: click "human resources monitoring" and jump to the human resources dashboard.
Generator output (question): view human resources dashboard data?
Generator output (answer): "human resources monitoring" button.
Discriminator model in the generative countermeasure network (GAN):
Real data: "Q: How do I view the human resources data? A: Click the 'human resources monitoring' button to view it."
Judged against the real data, the generator's output is not yet perfect enough, and further training of the generator is needed.
The training result of the generator model after the countermeasure network is introduced is:
Generator model input: click "human resources monitoring" and jump to the human resources dashboard.
Generator model output (question): How do I view the human resources dashboard data?
Generator model output (answer): Please click the "human resources monitoring" button.
The generator model and the discriminator model are subjected to antagonistic training through an antagonistic network, so that the generation result of the generator model is closer to real data.
The construction of the generator model and the discriminator model and the countermeasure training are explained below.
Step 604, a generator model is constructed, wherein the generator model comprises a question generation model and an answer generation model.
The generator model is used for generating question texts (or question text vectors) and answer texts (or answer text vectors). The input data of the generator model are the vectors formed from the sequence numbers of the n nodes and their corresponding text content, and the output data of the generator model are the question text and the answer text.
The specific steps are as follows: the text content of each node in the training set is converted into vector form X_n through a word2vec model.
For a given text sequence X = {x_1, x_2, ..., x_n} as the first question and answer text vector, the model outputs a question text vector Y = {y_1, y_2, ..., y_n} and an answer text vector A = {A_1, A_2, ..., A_n}.
The question generation model mainly adopts an encoder-attention-decoder structure, i.e. an encoding-decoding model with an attention mechanism, in which both the encoder layer and the decoder layer use GRU models. The specific structure is as follows:
Encoder: question text vector generation is performed using a GRU model.
q_i = GRU(x_i, q_{i-1})    (1)
wherein i represents the i-th character in the sequence, x represents the initial text vector of the text data, and q represents the hidden layer state output by the encoding layer.
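Equation (1) can be sketched as a standard GRU step (the patent does not spell out the gate equations; the usual update/reset-gate formulation is assumed and biases are omitted), applied along a sequence to produce the hidden states q_i:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, W, U):
    """One GRU step: hidden state q_i from input x_i and q_{i-1} (eq. 1).
    W and U hold input/recurrent weight matrices for the update gate (z),
    reset gate (r) and candidate state (h); biases omitted."""
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev)              # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev)              # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde

dim_x, dim_h = 4, 3
rng = np.random.default_rng(1)
# Random stand-ins for trained parameters.
W = {k: rng.standard_normal((dim_h, dim_x)) for k in "zrh"}
U = {k: rng.standard_normal((dim_h, dim_h)) for k in "zrh"}

# Encode a length-5 sequence: q_i = GRU(x_i, q_{i-1}), with q_0 = 0.
q = np.zeros(dim_h)
for x in rng.standard_normal((5, dim_x)):
    q = gru_cell(x, q, W, U)
```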
Attention: utilizing an Attention mechanism;
c_j = Σ_{i=1}^{n} a_{ij} q_i    (2)
wherein a_{ij} is a weight parameter that is not fixed but is obtained by neural network training. Summing the sequence of hidden vectors by these weights means that the attention allocation is different each time the j-th output is generated: the larger the value of a_{ij}, the more attention the j-th output assigns to the i-th input, and the more the j-th output is influenced by the i-th input when it is generated. c represents the text semantic vector that is the final output.
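Equation (2) is a weighted sum of the encoder hidden states. A small sketch with already-normalized attention weights (in training, the a_ij would be produced by a learned alignment network):

```python
import numpy as np

def attention_context(weights, hidden_states):
    """Context vector c_j as the attention-weighted sum of encoder hidden
    states (eq. 2); the weights a_ij are assumed normalized to sum to 1."""
    return np.sum(weights[:, None] * hidden_states, axis=0)

# Three encoder states of dimension 4; the second input dominates c_j.
q = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])
a_j = np.array([0.1, 0.7, 0.2])  # attention over inputs for output step j
c_j = attention_context(a_j, q)
```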
Decoder:
s_j = GRU(y_{j-1}, s_{j-1}, c_j)    (3)
wherein s represents the hidden layer state of the decoding layer, and y represents the question text vector output by the decoding layer.
The LSTM and the GRU retain important features through their various gate functions, which ensures that these features are not lost even during long-term propagation. The GRU has one fewer gate function than the LSTM and therefore fewer parameters, so the GRU as a whole trains faster than the LSTM. The probability distribution for generating all words is produced conditioned on the real data, the source text vector and the hidden layer state:
t_j = g(y_{j-1}, c_j, s_j)    (4)
P_j = softmax(W t_j)    (5)
wherein P is the probability distribution of each generated target text vector, g is a linear function, W is a parameter matrix of the neural network, and softmax is an activation function, which can be understood as projecting the probability of generating each text into the range 0 to 1, with the probabilities summing to 1.
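Equation (5) can be sketched directly; the projection matrix W and the feature vector t_j below are random stand-ins for trained values:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: projects scores to probabilities in (0, 1)
    that sum to 1 (eq. 5)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(2)
vocab, dim = 6, 4
W = rng.standard_normal((vocab, dim))   # output projection matrix
t_j = rng.standard_normal(dim)          # decoder feature vector (eq. 4)
P_j = softmax(W @ t_j)                  # distribution over the vocabulary
```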
As shown in fig. 7, the answer generation model uses an LSTM model and a dialogue generation model (DA) to generate the answer text vector. The core idea is to improve the LSTM network by introducing a DA (Dialogue Act) mechanism and adding keywords. The keywords are represented as one-hot codes whose length equals the length of the keyword table, where 0 indicates the keyword is absent and 1 indicates the keyword is present. For example, with the keyword table ["China", "US", "France"] (here N is 3): "China" -> 100, "US" -> 010, "France" -> 001. The DA unit plays the role of sentence planning because it can accurately encode the surface realizations of the input information during generation. The DA unit decides which information to keep for future time steps and discards the rest. In the mathematical formulation of the LSTM, introducing the DA unit adds a term to the expression of the final output answer text vector that directly influences the planning of the whole generated sentence.
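The one-hot keyword coding in the example above can be sketched as:

```python
def one_hot_keyword(keyword, keyword_table):
    """One-hot code of length N (the keyword-table length): 1 at the
    keyword's position, 0 elsewhere."""
    return [1 if k == keyword else 0 for k in keyword_table]

table = ["China", "US", "France"]  # N = 3
assert one_hot_keyword("China", table) == [1, 0, 0]
assert one_hot_keyword("US", table) == [0, 1, 0]
assert one_hot_keyword("France", table) == [0, 0, 1]
```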
The standard LSTM module is:
i_t = σ(W_wi x_t + W_hi h_{t-1})    (6)
f_t = σ(W_wf x_t + W_hf h_{t-1})    (7)
o_t = σ(W_wo x_t + W_ho h_{t-1})    (8)
c′_t = tanh(W_wc x_t + W_hc h_{t-1})    (9)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c′_t    (10)
h_t = o_t ⊙ tanh(c_t)    (11)
wherein i is the input gate, o is the output gate and f is the forget gate: the input gate indicates whether the current input information is allowed to be added into the hidden layer state, the output gate indicates whether the output value of the current hidden layer node is allowed to be transmitted to the next layer, and the forget gate indicates whether the historical state of the current node is retained. c is the cell state of the LSTM model, h is the hidden layer state of the LSTM module, t is the current moment, σ is the sigmoid activation function, W is a parameter matrix of the neural network (different subscripts distinguish the parameter matrices multiplied by different vectors in the calculation), and ⊙ denotes element-wise multiplication of matrices.
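Equations (6)-(11) can be sketched as a single LSTM step in this standard formulation (biases omitted; the weight matrices are random stand-ins for trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, Ww, Wh):
    """Standard LSTM step implementing eqs. (6)-(11); biases omitted.
    Ww/Wh hold input and recurrent matrices for gates i, f, o and the
    candidate cell state c'."""
    i = sigmoid(Ww["i"] @ x + Wh["i"] @ h_prev)       # input gate,  eq. (6)
    f = sigmoid(Ww["f"] @ x + Wh["f"] @ h_prev)       # forget gate, eq. (7)
    o = sigmoid(Ww["o"] @ x + Wh["o"] @ h_prev)       # output gate, eq. (8)
    c_cand = np.tanh(Ww["c"] @ x + Wh["c"] @ h_prev)  # candidate,   eq. (9)
    c = f * c_prev + i * c_cand                       # cell state,  eq. (10)
    h = o * np.tanh(c)                                # hidden state, eq. (11)
    return h, c

dx, dh = 4, 3
rng = np.random.default_rng(3)
Ww = {k: rng.standard_normal((dh, dx)) for k in "ifoc"}
Wh = {k: rng.standard_normal((dh, dh)) for k in "ifoc"}

h, c = np.zeros(dh), np.zeros(dh)
for x in rng.standard_normal((5, dx)):
    h, c = lstm_cell(x, h, c, Ww, Wh)
```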
The Dialogue Act (DA) generation module is as follows:
r_t = σ(W_wr x_t + W_hr h_{t-1})    (12)
d_t = r_t ⊙ d_{t-1}    (13)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c′_t + tanh(W_dc d_t)    (14)
wherein r is the DA unit state, d is the keyword vector, t is the current moment, c represents the final output answer text vector, W is a parameter matrix of the neural network (different subscripts distinguish the parameter matrices multiplied by different vectors in the calculation), and tanh is the activation function.
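Equations (12)-(13) can be sketched as follows: the gate r_t progressively consumes the one-hot keyword vector d, and equation (14) would then add tanh(W_dc d_t) to the LSTM cell state. The weight matrices are random stand-ins, and the hidden state is held at zero for simplicity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def da_step(x, h_prev, d_prev, Wwr, Whr):
    """Dialogue-act (DA) unit, eqs. (12)-(13): the gate r_t decides how much
    of the keyword vector d is kept for future time steps."""
    r = sigmoid(Wwr @ x + Whr @ h_prev)  # eq. (12)
    d = r * d_prev                       # eq. (13): retained keyword info
    return r, d

dx, dh = 4, 3
rng = np.random.default_rng(4)
Wwr = rng.standard_normal((dh, dx))
Whr = rng.standard_normal((dh, dh))

d = np.array([0.0, 1.0, 0.0])  # one-hot keyword vector (length-3 table)
h = np.zeros(dh)               # LSTM hidden state, held at zero here
for x in rng.standard_normal((3, dx)):
    r, d = da_step(x, h, d, Wwr, Whr)
```

Because r_t lies in (0, 1), the keyword component only shrinks over time: once a keyword has been "spent" in the output, little of it remains for later steps.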
Step 606, constructing a discriminator model and discriminating the samples generated by the generator model.
The samples generated by the generator are the question text vector and the answer text vector produced by the generator model, and discriminating these generated samples can increase the realism of the text.
As shown in fig. 8, the discriminator model uses a Convolutional Neural Network (CNN) based model to discriminate the authenticity of the samples generated by the generator, thereby increasing the realism of the generated text.
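A minimal text-CNN discriminator sketch in the spirit of fig. 8 (filter widths, dimensions and the single sigmoid output are illustrative; the patent does not fix these details): convolutions over the word-vector matrix, max-over-time pooling, and a sigmoid giving the probability that the input sample is real.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cnn_discriminator(text_matrix, filters, w_out):
    """Slide each filter over the (seq_len, dim) word-vector matrix,
    max-pool over time, then map the pooled features to P(sample is real)."""
    seq_len, dim = text_matrix.shape
    pooled = []
    for f in filters:                      # each filter: (width, dim)
        width = f.shape[0]
        feats = [np.sum(f * text_matrix[i:i + width])
                 for i in range(seq_len - width + 1)]
        pooled.append(max(feats))          # max-over-time pooling
    return sigmoid(np.dot(w_out, pooled))  # probability the sample is real

rng = np.random.default_rng(5)
sample = rng.standard_normal((7, 4))              # 7 words, 4-dim vectors
filters = [rng.standard_normal((w, 4)) for w in (2, 3)]
w_out = rng.standard_normal(2)
p_real = cnn_discriminator(sample, filters, w_out)
```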
The countermeasure network is mainly divided into two parts: a generator model and a discriminator model. The generator model is used for simulating the distribution of real data, and the discriminator model is used for judging whether a sample is a real sample or a generated sample. The aim of the countermeasure network is to train the generator model to fit the distribution of the real data so well that the discriminator model cannot tell the difference.
In the generative countermeasure network, the aim is to make the text distribution generated by the generator model fit the real data distribution as closely as possible. The output of the discriminator model is the probability that the input data comes from the real data rather than the generator, so optimization is achieved by raising the probability that the generator distribution is close to the real distribution while lowering the probability that the discriminator model judges the generated data to come from the real data.
At step 608, an objective function is constructed.
The countermeasure network trains the generator model and the discriminator model in a countermeasure generation manner, and a deep learning neural network optimization method based on Adam (Adaptive Moment Estimation) can be adopted.
Training of the neural networks is performed using the Adam optimization algorithm, whose optimization objective function is defined as:
V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z}[log(1 − D(G(z)))]    (15)
wherein P_data is the distribution of real samples and P_z is the distribution of the samples generated by the generator. The discriminator model D is responsible for discriminating between real samples and generated samples: its goal is to make D(x), the probability that the input is judged to come from real data, as large as possible, and to make D(G(z)), the probability that the input is judged to come from generated data, as small as possible, i.e. to make V(D, G) as large as possible overall. The generator model G is responsible for generating samples that are as realistic as possible, i.e. making D(G(z)) as large as possible and thus V(D, G) as small as possible. The generator model G and the discriminator model D oppose each other, i.e.
min_G max_D V(D, G)    (16)
And finally, global optimization is achieved.
The Adam algorithm is a self-adaptive optimization algorithm that can well resolve some problems of the stochastic gradient descent method: it adjusts a different learning rate for each parameter, updating frequently-changing parameters with a smaller step size and sparse parameters with a larger step size. The update step is computed from the first moment estimate of the gradient (i.e., the mean of the gradient) and the second moment estimate (i.e., the uncentered variance of the gradient).
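The Adam update can be sketched from this description: maintain the first and second moment estimates, apply bias correction, and scale the step per parameter. A toy run minimizing f(x) = x^2 illustrates it (the learning rate and iteration count are illustrative):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: first-moment (mean) and second-moment (uncentered
    variance) estimates of the gradient, with bias correction."""
    m = b1 * m + (1 - b1) * grad          # first moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)             # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = x^2 (gradient 2x); Adam adapts the step per parameter.
x, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
```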
And step 610, training based on generation of the countermeasure network is carried out, and the neural network construction is completed.
Based on the above steps, question and answer text data are extracted from the system logs, a first question and answer text vector is constructed, and question-answer matching is achieved through training of the neural network, thereby completing the construction of the corpus data (or original corpus) of the intelligent assistant. Specifically, the first question text vector and the first answer text vector output by the generator model may be input into the discriminator model, the classification result returned by the discriminator model may be taken as the reward, and a reinforcement learning architecture may be adopted to train the generator model. Meanwhile, the first question text vector and the first answer text vector generated by the generator model, together with the second question text vector and the second answer text vector obtained from the default real sample set, may be trained in the training mode of a generative countermeasure network. The generator model and the discriminator model are trained alternately until both converge, finally yielding the corpus construction model. The corpus data of the intelligent assistant can then be obtained by performing text vectorization on the question and answer text data extracted from the system log and inputting the result into the corpus construction model.
It should be understood that although the various steps in the flow charts of fig. 2-6 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not performed in a strict order and may be performed in other orders. Moreover, at least some of the steps in fig. 2-6 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a corpus building apparatus of an intelligent assistant, including: a data acquisition module 902, a vectorization operation module 904, an input module 906, and a corpus construction module 908, wherein:
a data obtaining module 902, configured to extract current question and answer text data from a system log of an electric power operation management system corresponding to an intelligent assistant of a corpus to be constructed;
a vectorization operation module 904, configured to perform text vectorization operation on the current question-and-answer text data to obtain a current question-and-answer text vector;
an input module 906, configured to input the current question-answer text vector to a corpus construction model, where the corpus construction model is obtained by training according to question-answer text data extracted from a system log;
the corpus construction module 908 is configured to obtain a question text and an answer text according to an output result of the corpus construction model, associate the obtained question text and the obtained answer text, and use the associated question text and answer text as query and answer corpus data of the intelligent assistant.
In one embodiment, the corpus building apparatus of the smart assistant may further include a training module (not shown in the figure), and the training module may include an obtaining unit, a converting unit, and a building unit, wherein:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for extracting question and answer text data from a system log and dividing a training sample set from the question and answer text data, and the training sample set comprises a plurality of first question and answer text data;
the conversion unit is used for carrying out text vectorization operation on the plurality of first question and answer text data to obtain first question and answer text vectors;
the construction unit is used for constructing a confrontation network, and the confrontation network comprises a generator model and a discriminator model; the generator model is used for generating a first question text vector and a first answer text vector corresponding to the first question and answer text vector according to the first question and answer text vector; the discriminator model is used for discriminating the authenticity of the first question text vector and the first answer text vector;
and the training unit is used for training based on the generated countermeasure network to obtain the corpus construction model.
In one embodiment, the training unit may obtain a default real sample set, the default real sample set including a default real question text and a default real answer text corresponding to the default real question text, and perform text vectorization on the default real question text and the default real answer text to obtain a second question text vector and a second answer text vector corresponding to the second question text vector. The training unit may then train the countermeasure network by using the first question text vector, the first answer text vector, the second question text vector and the second answer text vector, wherein the discriminator model is configured to output a first probability and a second probability, the first probability being the probability that the input sample is judged to come from the first question text vector and the first answer text vector, and the second probability being the probability that the input sample is judged to come from the second question text vector and the second answer text vector. The objective function of the countermeasure network is used for optimizing the network parameters of the countermeasure network, so that the objective function minimizing the first probability and the objective function maximizing the second probability play a game against each other to reach balance, and a corpus construction model is obtained according to the trained countermeasure network.
In one embodiment, the training unit may divide a test sample set from the question and answer text data, where the test sample set includes a plurality of second question and answer text data, perform text vectorization on the plurality of second question and answer text data to obtain a second question and answer text vector, test the trained confrontation network with the second question and answer text vector to obtain a test result, and if the test result satisfies a test condition, use the trained confrontation network as a corpus construction model.
In one embodiment, the generator model includes a question generator model and an answer generator model; the question generator model is used for generating a first question text vector according to the first question and answer text vector, and the answer generator model is used for generating a first answer text vector according to the first question and answer text vector.
In one embodiment, the question generator model uses an encoding-decoding model with attention mechanism, the encoding layer and the decoding layer of the encoding-decoding model with attention mechanism use a GRU model, and the answer generator model uses an LSTM model and a dialogue generation model.
In one embodiment, the data obtaining module 902 may extract initial corpus data and/or user behavior data from the system log of the electric power operation management system corresponding to the intelligent assistant of the corpus to be constructed, wherein the user behavior data includes user click behavior data, user search behavior data, and/or user dialogue data, and obtain the current question and answer text data according to the initial corpus data, the user click behavior data, the user search behavior data, and/or the user dialogue data.
In one embodiment, the obtaining unit may be further configured to obtain generation time information of the question and answer text data, and divide the question and answer text data into a training sample set and a test sample set according to the generation time information.
For the specific definition of the corpus building device of the intelligent assistant, reference may be made to the above definition of the corpus building method of the intelligent assistant, which is not described herein again. The modules in the corpus building device of the intelligent assistant can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the question and answer corpus data of the intelligent assistant. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a corpus construction method for an intelligent assistant.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: extracting current question and answer text data from a system log of an electric power operation management system corresponding to the intelligent assistant of the corpus to be constructed; performing text vectorization operation on the current question and answer text data to obtain a current question and answer text vector; inputting the current question and answer text vector into a corpus construction model, the corpus construction model being obtained by training according to question and answer text data extracted from the system log; and obtaining a question text and an answer text according to an output result of the corpus construction model, associating the obtained question text with the answer text, and taking the associated question text and answer text as the question and answer corpus data of the intelligent assistant.
In one embodiment, the processor further implements a corpus construction model training step when executing the computer program, and the processor implements the following steps when executing the computer program to implement the corpus construction model training step: extracting question and answer text data from the system log, and dividing a training sample set from the question and answer text data, wherein the training sample set comprises a plurality of first question and answer text data; performing text vectorization operation on the plurality of first question and answer text data to obtain first question and answer text vectors; constructing an antagonistic network, wherein the antagonistic network comprises a generator model and a discriminator model, the generator model is used for generating a first question text vector and a first answer text vector corresponding to the first question and answer text vector according to the first question and answer text vector, and the discriminator model is used for discriminating the authenticity of the first question text vector and the first answer text vector; and training based on the generated countermeasure network to obtain the corpus construction model.
In one embodiment, when the processor executes the computer program to implement the above step of training based on generation of the confrontation network to obtain the corpus building model, the following steps are specifically implemented: acquiring a default real sample set, wherein the default real sample set comprises a default real question text and a default real answer text corresponding to the default real question text; performing text vectorization operation on the default real question text and the default real answer text to obtain a second question text vector and a second answer text vector corresponding to the second question text vector; training a countermeasure network through a first question text vector, a first answer text vector, a second question text vector and a second answer text vector, wherein a discriminator model is used for outputting a first probability and a second probability, the first probability is the probability that an input sample is judged to be from the first question text vector and the first answer text vector, the second probability is the probability that an input sample is judged to be from the second question text vector and the second answer text vector, and an objective function of the countermeasure network is used for optimizing network parameters of the countermeasure network so that the objective function for minimizing the first probability and the objective function for maximizing the second probability mutually play a game to achieve balance; and obtaining a corpus construction model according to the trained confrontation network.
In one embodiment, when the processor executes the computer program to implement the above step of obtaining the corpus construction model according to the trained adversarial network, the following steps are specifically implemented: dividing a test sample set from the question and answer text data, wherein the test sample set comprises a plurality of second question and answer text data; performing text vectorization on the plurality of second question and answer text data to obtain second question and answer text vectors; testing the trained adversarial network with the second question and answer text vectors to obtain a test result; and if the test result meets a test condition, taking the trained adversarial network as the corpus construction model.
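The acceptance gate at the end of this step can be sketched as follows. The threshold-on-mean-score test condition is an assumption for illustration; the disclosure leaves the exact test condition open:

```python
def accept_model(test_scores, threshold=0.8):
    """Return True when the trained network passes the (assumed) test
    condition: its mean score on the held-out test set reaches the threshold."""
    mean_score = sum(test_scores) / len(test_scores)
    return mean_score >= threshold

# Illustrative scores from testing the trained network on the second
# question-and-answer text vectors.
print(accept_model([0.82, 0.91, 0.77]))  # mean ~0.833 -> accepted
print(accept_model([0.55, 0.61, 0.70]))  # mean 0.62  -> rejected, keep training
```

Only a network that passes this gate is promoted to serve as the corpus construction model; otherwise training would continue or be restarted.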
The generator model comprises a question generator model and an answer generator model; the question generator model is used for generating a first question text vector according to the first question and answer text vector, and the answer generator model is used for generating a first answer text vector according to the first question and answer text vector.
In one embodiment, the question generator model adopts an encoding-decoding model with an attention mechanism, the encoding layer and the decoding layer of which adopt a GRU model, and the answer generator model adopts an LSTM-based dialogue generation model.
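To make the GRU encoder-decoder with attention concrete, the following numpy sketch runs a GRU cell over a toy input sequence and performs one attention-weighted decoder step. The cell sizes, dot-product attention, and sequence length are illustrative assumptions, not specified in this disclosure:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 8  # hidden size (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    def __init__(self, dim_in, dim_h):
        s = 0.1
        self.Wz = rng.normal(0, s, (dim_in + dim_h, dim_h))  # update gate
        self.Wr = rng.normal(0, s, (dim_in + dim_h, dim_h))  # reset gate
        self.Wh = rng.normal(0, s, (dim_in + dim_h, dim_h))  # candidate state

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.Wz)
        r = sigmoid(xh @ self.Wr)
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ self.Wh)
        return (1 - z) * h + z * h_tilde

def attention(query, encoder_states):
    # Dot-product attention: weight each encoder state by its score.
    scores = encoder_states @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ encoder_states  # context vector

# Encode a toy sequence of 5 word vectors with the encoder GRU.
enc = GRUCell(H, H)
dec = GRUCell(2 * H, H)  # decoder input = word vector + context vector
inputs = rng.normal(size=(5, H))
h = np.zeros(H)
enc_states = []
for x in inputs:
    h = enc.step(x, h)
    enc_states.append(h)
enc_states = np.array(enc_states)

# One decoder step: attend over the encoder states, then update the state.
dec_h = h
context = attention(dec_h, enc_states)
dec_h = dec.step(np.concatenate([rng.normal(size=H), context]), dec_h)
print(dec_h.shape)
```

The answer generator's LSTM-based dialogue model would follow the same encode-attend-decode pattern with LSTM cells in place of the GRU cells.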
In one embodiment, when the processor executes the computer program to implement the step of extracting the current question and answer text data from the system log of the power operation management system corresponding to the intelligent assistant of the corpus to be constructed, the processor specifically implements the following steps: extracting initial corpus data and/or user behavior data from a system log of an electric power operation management system corresponding to an intelligent assistant of a corpus to be constructed, wherein the user behavior data comprises user click behavior data, user search behavior data and/or user dialogue data; and obtaining current question and answer text data according to the initial corpus data, the user click behavior data, the user search behavior data and/or the user dialogue data.
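The extraction step above can be sketched as a small log parser. The JSON log format and all field names below are hypothetical stand-ins for whatever the power operation management system actually logs; only the idea of turning dialogue, search, and click records into question-answer pairs comes from the text:

```python
import json

# Hypothetical log lines; the real system log format is not specified here.
log_lines = [
    '{"type": "dialogue", "question": "How do I reset a meter?", "answer": "Open the device panel and hold the reset switch."}',
    '{"type": "search", "query": "outage report form", "clicked_doc": "Outage reporting guide"}',
    '{"type": "click", "item": "FAQ: billing cycle"}',
]

def extract_qa_records(lines):
    """Turn dialogue entries and search entries with a clicked result into
    (question, answer) pairs; bare clicks carry no answer text and are
    skipped in this sketch."""
    pairs = []
    for line in lines:
        entry = json.loads(line)
        if entry["type"] == "dialogue":
            pairs.append((entry["question"], entry["answer"]))
        elif entry["type"] == "search" and entry.get("clicked_doc"):
            pairs.append((entry["query"], entry["clicked_doc"]))
    return pairs

qa_pairs = extract_qa_records(log_lines)
print(len(qa_pairs))
```

The resulting pairs are the "current question and answer text data" that is then vectorized and fed to the corpus construction model.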
In one embodiment, when the processor executes the computer program to implement the training step of the corpus building model, the following steps are further specifically implemented: and acquiring generation time information of the question and answer text data, and dividing the question and answer text data into a training sample set and a test sample set according to the generation time information.
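A time-based split as described above can be sketched with a cutoff date: records generated before the cutoff train the model, later records test it. The record layout and cutoff value are assumptions for illustration:

```python
from datetime import datetime

records = [
    {"text": "Q/A pair 1", "time": "2019-01-05"},
    {"text": "Q/A pair 2", "time": "2019-03-20"},
    {"text": "Q/A pair 3", "time": "2019-07-01"},
    {"text": "Q/A pair 4", "time": "2019-10-15"},
]

def split_by_time(records, cutoff="2019-06-30"):
    """Older records form the training sample set; newer ones form the
    test sample set, so the model is always tested on later data."""
    cut = datetime.fromisoformat(cutoff)
    train = [r for r in records if datetime.fromisoformat(r["time"]) <= cut]
    test = [r for r in records if datetime.fromisoformat(r["time"]) > cut]
    return train, test

train_set, test_set = split_by_time(records)
print(len(train_set), len(test_set))
```

Splitting by generation time rather than at random avoids leaking future question-answer patterns into training.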
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: extracting current question and answer text data from a system log of an electric power operation management system corresponding to the intelligent assistant of the corpus to be constructed; performing a text vectorization operation on the current question and answer text data to obtain a current question and answer text vector; inputting the current question and answer text vector into a corpus construction model, wherein the corpus construction model is obtained by training on question and answer text data extracted from the system log; and obtaining a question text and an answer text according to an output result of the corpus construction model, associating the obtained question text with the obtained answer text, and taking the associated question text and the associated answer text as question and answer corpus data of the intelligent assistant.
In one embodiment, the computer program, when executed by the processor, further implements a training step of the corpus construction model, and when the processor implements the training step of the corpus construction model, the computer program implements the following steps: extracting question and answer text data from a system log, and dividing a training sample set from the question and answer text data, wherein the training sample set comprises a plurality of first question and answer text data; performing a text vectorization operation on the plurality of first question and answer text data to obtain first question and answer text vectors; constructing an adversarial network, wherein the adversarial network comprises a generator model and a discriminator model; the generator model is used for generating, according to the first question and answer text vector, a first question text vector and a first answer text vector corresponding to the first question and answer text vector; the discriminator model is used for discriminating the authenticity of the first question text vector and the first answer text vector; and training the generative adversarial network to obtain the corpus construction model.
In one embodiment, when executed by the processor to implement the above step of training the generative adversarial network to obtain the corpus construction model, the computer program specifically implements the following steps: acquiring a default real sample set, wherein the default real sample set comprises a default real question text and a default real answer text corresponding to the default real question text; performing a text vectorization operation on the default real question text and the default real answer text to obtain a second question text vector and a second answer text vector corresponding to the second question text vector; training the adversarial network with the first question text vector, the first answer text vector, the second question text vector, and the second answer text vector, wherein the discriminator model is used for outputting a first probability and a second probability, the first probability being the probability that an input sample is determined to be from the first question text vector and the first answer text vector, and the second probability being the probability that an input sample is determined to be from the second question text vector and the second answer text vector, and the objective function of the adversarial network is used for optimizing the network parameters of the adversarial network so that the objective of minimizing the first probability and the objective of maximizing the second probability play a game against each other until equilibrium is reached; and obtaining the corpus construction model according to the trained adversarial network.
In one embodiment, when executed by the processor to implement the above step of obtaining the corpus construction model according to the trained adversarial network, the computer program specifically implements the following steps: dividing a test sample set from the question and answer text data, wherein the test sample set comprises a plurality of second question and answer text data; performing text vectorization on the plurality of second question and answer text data to obtain second question and answer text vectors; testing the trained adversarial network with the second question and answer text vectors to obtain a test result; and if the test result meets a test condition, taking the trained adversarial network as the corpus construction model.
The generator model comprises a question generator model and an answer generator model; the question generator model is used for generating a first question text vector according to the first question and answer text vector, and the answer generator model is used for generating a first answer text vector according to the first question and answer text vector.
In one embodiment, the question generator model adopts an encoding-decoding model with an attention mechanism, the encoding layer and the decoding layer of which adopt a GRU model, and the answer generator model adopts an LSTM-based dialogue generation model.
In one embodiment, when the processor performs the step of extracting the current question and answer text data from the system log of the power operation management system corresponding to the intelligent assistant of the corpus to be constructed, the computer program specifically performs the following steps: extracting initial corpus data and/or user behavior data from a system log of an electric power operation management system corresponding to an intelligent assistant of a corpus to be constructed, wherein the user behavior data comprises user click behavior data, user search behavior data and/or user dialogue data; and obtaining current question and answer text data according to the initial corpus data, the user click behavior data, the user search behavior data and/or the user dialogue data.
In one embodiment, when the computer program is executed by the processor to implement the training step of the corpus building model, the following steps are further specifically implemented: and acquiring generation time information of the question and answer text data, and dividing the question and answer text data into a training sample set and a test sample set according to the generation time information.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments merely express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A corpus construction method of an intelligent assistant, the method comprising:
extracting current question and answer text data from a system log of an electric power operation management system corresponding to the intelligent assistant of the corpus to be constructed;
performing text vectorization operation on the current question and answer text data to obtain a current question and answer text vector;
inputting the current question-answer text vector into a corpus construction model, wherein the corpus construction model is obtained by training according to question-answer text data extracted from the system log;
and obtaining a question text and an answer text according to an output result of the corpus construction model, associating the obtained question text with the obtained answer text, and taking the associated question text and the associated answer text as question and answer corpus data of the intelligent assistant.
2. The method according to claim 1, wherein the training process of the corpus building model comprises:
extracting question and answer text data from the system log, and dividing a training sample set from the question and answer text data, wherein the training sample set comprises a plurality of first question and answer text data;
performing text vectorization operation on the plurality of first question and answer text data to obtain first question and answer text vectors;
constructing an adversarial network, the adversarial network comprising a generator model and a discriminator model; the generator model being used for generating, according to the first question and answer text vector, a first question text vector and a first answer text vector corresponding to the first question and answer text vector; the discriminator model being used for discriminating the authenticity of the first question text vector and the first answer text vector;
and training the generative adversarial network to obtain the corpus construction model.
3. The method according to claim 2, wherein the training the generative adversarial network to obtain the corpus construction model comprises:
acquiring a default real sample set, wherein the default real sample set comprises a default real question text and a default real answer text corresponding to the default real question text;
performing text vectorization operation on the default real question text and the default real answer text to obtain a second question text vector and a second answer text vector corresponding to the second question text vector;
training the adversarial network with the first question text vector, the first answer text vector, the second question text vector, and the second answer text vector, wherein the discriminator model outputs a first probability that an input sample is determined to be from the first question text vector and the first answer text vector, and a second probability that an input sample is determined to be from the second question text vector and the second answer text vector, and the objective function of the adversarial network is used for optimizing the network parameters of the adversarial network so that the objective of minimizing the first probability and the objective of maximizing the second probability play a game against each other until equilibrium is reached;
and obtaining the corpus construction model according to the trained adversarial network.
4. The method of claim 3, wherein obtaining the corpus construction model according to the trained adversarial network comprises:
dividing a test sample set from the question and answer text data, wherein the test sample set comprises a plurality of second question and answer text data;
performing text vectorization on the plurality of second question and answer text data to obtain second question and answer text vectors;
testing the trained adversarial network with the second question and answer text vectors to obtain a test result;
and if the test result meets a test condition, taking the trained adversarial network as the corpus construction model.
5. The method of any of claims 2 to 4, wherein the generator model comprises a question generator model and an answer generator model; the question generator model is used for generating the first question text vector according to the first question and answer text vector, and the answer generator model is used for generating the first answer text vector according to the first question and answer text vector;
preferably, the question generator model adopts an encoding-decoding model with an attention mechanism, the encoding layer and the decoding layer of which adopt a GRU model, and the answer generator model adopts an LSTM-based dialogue generation model.
6. The method according to claim 5, wherein the extracting current question and answer text data from the system log of the power operation management system corresponding to the intelligent assistant of the corpus to be constructed comprises:
extracting initial corpus data and/or user behavior data from the system log, wherein the user behavior data comprises user click behavior data, user search behavior data and/or user dialogue data;
and obtaining the current question and answer text data according to the initial corpus data, the user click behavior data, the user search behavior data and/or the user dialogue data.
7. The method according to any of claims 2 to 4, wherein the training process of the corpus construction model further comprises:
and acquiring generation time information of the question and answer text data, and dividing the question and answer text data into the training sample set and the test sample set according to the generation time information.
8. A corpus construction apparatus of an intelligent assistant, the apparatus comprising:
the data acquisition module is used for extracting current question and answer text data from a system log of the power operation management system corresponding to the intelligent assistant of the corpus to be constructed;
the vectorization operation module is used for performing text vectorization operation on the current question and answer text data to obtain a current question and answer text vector;
the input module is used for inputting the current question-answer text vector into a corpus construction model, and the corpus construction model is obtained by training according to question-answer text data extracted from the system log;
and the corpus construction module is used for obtaining a question text and an answer text according to an output result of the corpus construction model, associating the obtained question text with the obtained answer text, and taking the associated question text and the associated answer text as question and answer corpus data of the intelligent assistant.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201911158765.7A 2019-11-22 2019-11-22 Corpus construction method and apparatus of intelligent assistant, computer device and storage medium Pending CN110955765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911158765.7A CN110955765A (en) 2019-11-22 2019-11-22 Corpus construction method and apparatus of intelligent assistant, computer device and storage medium


Publications (1)

Publication Number Publication Date
CN110955765A (en) 2020-04-03

Family

ID=69978284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911158765.7A Pending CN110955765A (en) 2019-11-22 2019-11-22 Corpus construction method and apparatus of intelligent assistant, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN110955765A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160180242A1 (en) * 2014-12-17 2016-06-23 International Business Machines Corporation Expanding Training Questions through Contextualizing Feature Search
US20170242915A1 (en) * 2014-08-21 2017-08-24 National Institute Of Information And Communications Technology Question sentence generating device and computer program
CN109614471A (en) * 2018-12-07 2019-04-12 北京大学 A kind of open-ended question automatic generation method based on production confrontation network
CN110196899A (en) * 2019-06-11 2019-09-03 中央民族大学 A kind of low-resource language question and answer corpus library generating method
CN110390006A (en) * 2019-07-23 2019-10-29 腾讯科技(深圳)有限公司 Question and answer corpus generation method, device and computer readable storage medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985238A (en) * 2020-06-30 2020-11-24 联想(北京)有限公司 Answer generation method and equipment
CN113408305A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Model training method, device, equipment and storage medium
CN117195915A (en) * 2023-11-06 2023-12-08 深圳市腾讯计算机系统有限公司 Information extraction method and device for session content, computer equipment and storage medium
CN117195915B (en) * 2023-11-06 2024-02-23 深圳市腾讯计算机系统有限公司 Information extraction method and device for session content, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111368565B (en) Text translation method, text translation device, storage medium and computer equipment
CN111897941B (en) Dialogue generation method, network training method, device, storage medium and equipment
CN108319599B (en) Man-machine conversation method and device
CN111061847A (en) Dialogue generation and corpus expansion method and device, computer equipment and storage medium
CN111460807B (en) Sequence labeling method, device, computer equipment and storage medium
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN109923558A (en) Mixture of expert neural network
CN106875940B (en) Machine self-learning construction knowledge graph training method based on neural network
CN108595436B (en) Method and system for generating emotional dialogue content and storage medium
CN109523014B (en) News comment automatic generation method and system based on generative confrontation network model
CN111078847A (en) Power consumer intention identification method and device, computer equipment and storage medium
CN110955765A (en) Corpus construction method and apparatus of intelligent assistant, computer device and storage medium
CN108959388B (en) Information generation method and device
CN112131888B (en) Method, device, equipment and storage medium for analyzing semantic emotion
CN109299479A (en) Translation memory is incorporated to the method for neural machine translation by door control mechanism
CN110069611B (en) Topic-enhanced chat robot reply generation method and device
CN112000788B (en) Data processing method, device and computer readable storage medium
CN112632252B (en) Dialogue response method, dialogue response device, computer equipment and storage medium
CN109977394A (en) Text model training method, text analyzing method, apparatus, equipment and medium
CN113704419A (en) Conversation processing method and device
WO2023231513A1 (en) Conversation content generation method and apparatus, and storage medium and terminal
CN112560456A (en) Generation type abstract generation method and system based on improved neural network
CN112084769A (en) Dependency syntax model optimization method, device, equipment and readable storage medium
CN114048301B (en) Satisfaction-based user simulation method and system
CN111444399A (en) Reply content generation method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200403
