CN116204616A - Artificial intelligence question-answering method based on semantic training algorithm - Google Patents
Abstract
The invention relates to artificial intelligence technology, and in particular to an artificial intelligence question-answering method based on a semantic training algorithm, which comprises the following steps: S1.1: collecting corpus data; S1.2: preprocessing the collected corpus data; S1.3: constructing a semantic training model based on the collected corpus data; S1.4: performing model optimization training on the constructed semantic training model. According to the invention, the collected corpus data is desensitized, which prevents the leakage of sensitive data and ensures the safety of artificial intelligence question answering. The constructed semantic training model undergoes mask training and sequential-logic training, which improve the learning capacity of the model, the extensibility of the artificial intelligence question-answering system, and the accuracy and reply rate of the answers, so that questions can be answered more efficiently. The method can be applied to a variety of intelligent fields.
Description
Technical Field
The invention relates to artificial intelligence technology, and in particular to an artificial intelligence question-answering method based on a semantic training algorithm.
Background
Natural language processing (NLP) is the scientific study of processing the media, such as language and writing, that human beings use to transmit information. It is an important field within today's highly active area of artificial intelligence, and dialogue systems are an important research direction within NLP; the central question is how to give a computer the intelligence to interact with human beings, which makes it both an important task for artificial intelligence and a very challenging one. In 1950, Turing published an evaluation method for computer systems in "Computing Machinery and Intelligence", named the "Turing test", which for the first time gave computer intelligence a clear target: to measure the level of machine intelligence by means of human-machine conversation. It attracted broad attention from scholars. Current dialogue systems mainly comprise question-answering systems, which help users answer questions, and task-oriented dialogue systems, which mainly provide users with corresponding operation prompts for specified scenario tasks. A question-answering system is chiefly a large-scale system that provides knowledge queries for users: it processes a question entered by a user, analyses its key content, and retrieves or generates an answer among the existing candidate question-answer pairs.
A traditional task-oriented system mainly obtains the intent input from the user side, applies a series of vectorization steps to it using existing text-processing methods, converts the input information into a vector representation that the machine system can recognize, computes matching scores between this vector and the candidate answers, selects among the existing candidate answers according to a preset matching strategy and the computed scores, and returns the corresponding answer to the user. Such a system treats the dialogue as a pipeline. This approach mostly requires manual annotation of semantic features during the dialogue, which consumes a great deal of manpower and material resources and is therefore costly. Moreover, being task-oriented, it can only complete the work of the specific scenario it was designed for, cannot be applied to other fields, and has poor extensibility.
Disclosure of Invention
The invention aims to remedy the defects of the background art by providing an artificial intelligence question-answering method based on a semantic training algorithm.
The technical scheme adopted by the invention is as follows:
the artificial intelligence question-answering method based on the semantic training algorithm comprises the following steps:
s1.1: collecting corpus data;
s1.2: carrying out data preprocessing on the collected corpus data;
s1.3: constructing a semantic training model based on the collected corpus data;
s1.4: and carrying out model optimization training on the constructed semantic training model.
As a preferred technical scheme of the invention: in step S1.2, error detection and correction processing and desensitization processing are performed on the collected corpus data.
As a preferred technical scheme of the invention: in step S1.3, the input corpus data is defined as $D=\{u_1,u_2,\dots,u_m\}$, where $m$ is the total number of dialogue turns, $i\in[1,m]$ indexes the $i$-th turn, and $u_i$ is the utterance replied in the $i$-th turn; further, $u_i=\{w_i^1,w_i^2,\dots,w_i^{n_i}\}$, where $n_i$ is the length of the utterance replied in the $i$-th turn and $w_i^j$, $j\in[1,n_i]$, is the $j$-th character of that utterance.
As a preferred technical scheme of the invention: in S1.3, the input of the semantic training model comprises token encoding, segment encoding, and position encoding, and their sum is taken as the input. The token encoding looks up the character embedding table $E_t\in\mathbb{R}^{V\times d}$ ($d$ being the embedding dimension), where $V$ is the vocabulary size; the segment encoding looks up the segment embedding table $E_s\in\mathbb{R}^{S\times d}$, where $S$ is the maximum number of segments; and the position encoding looks up the position embedding table $E_p\in\mathbb{R}^{N\times d}$, where $N$ is the sequence length of the entire dialogue, i.e. $N=\sum_{i=1}^{m}n_i$, with $m$ the total number of dialogue turns, $i\in[1,m]$ indexing the $i$-th turn, and $n_i$ the length of the utterance replied in the $i$-th turn.

The total input of the semantic training model is then obtained as

$$e_{ij} = t_{ij} + s_{ij} + p_{ij}$$

where $e_{ij}$, the input of the semantic training model, is the embedding vector corresponding to each position, and $t_{ij}$, $s_{ij}$, $p_{ij}$ are the token, segment, and position embeddings at that position, $i\in[1,m]$, $j\in[1,n_i]$. Feature extraction gives

$$E_{ij} = \mathrm{transformer}(e_{ij})$$

where $E_{ij}$ is the output vector for each character of the sequence.
As a preferred technical scheme of the invention: the embedding vector corresponding to each position is recognized by a nonlinear classifier, and mask training is used to judge whether embedding vectors at the other positions are masked and what actual text the masks correspond to.
As a preferred technical scheme of the invention: the mask training is as follows:
for the recovered utterancePredicting characters by a nonlinear character classifier, the formula is as follows:
wherein ,representing +.>Predicted value of E tT Transpose of the character-embedded table, b 1 Is a bias parameter of the nonlinear classifier.
As a preferred technical scheme of the invention: in the mask-training process, the mask-training loss function $L_1(\theta,\theta_1)$ is expressed as

$$L_1(\theta,\theta_1) = -\sum_{m'=1}^{M}\log p\big(\hat w_{m'}=w_{m'}\big)$$

where $\theta$ denotes the encoder parameters of the semantic training model, $\theta_1$ the parameters of the nonlinear character classifier, $M$ the number of masked characters in the input sequence (indexed by $m'$), and $V$ the vocabulary size over which the prediction is made.
As a preferred technical scheme of the invention: for the utterance $u_i$ replied in the $i$-th turn, sequential-logic training is performed: the first embedding vector $E_{i1}$ of the $i$-th segment is used to predict the turn $\hat r_i$ that the segment occupies in the dialogue, and the prediction is compared with the actual turn:

$$\hat r_i = \mathrm{softmax}\big(E_{i1}\,W_2 + b_2\big)$$

where $\hat r_i$ is the predicted turn of the $i$-th segment, $W_2$ is the unit vector of the segment embedding table, and $b_2$ is the bias parameter of the nonlinear classifier.
As a preferred technical scheme of the invention: in the sequential-logic training, the sequential-logic-training loss function $L_2(\theta,\theta_2)$ is expressed as

$$L_2(\theta,\theta_2) = -\sum_{e'=1}^{E}\log p\big(\hat r_{e'}=r_{e'}\big)$$

where $\theta$ denotes the encoder parameters of the semantic training model, $\theta_2$ the parameters of the nonlinear round classifier, $\hat r_{e'}$ the predicted turn of the $e'$-th segment, $E$ the number of segments of the dialogue (indexed by $e'$), and $S$ the maximum number of segments.
As a preferred technical scheme of the invention: in step S1.4, the total loss function $L$ of the semantic training model is obtained as

$$L = L_1(\theta,\theta_1) + L_2(\theta,\theta_2)$$

and the semantic training model is trained with minimization of the loss function as the objective, where $\theta$ denotes the encoder parameters of the semantic training model, $\theta_1$ the parameters of the nonlinear character classifier, $\theta_2$ the parameters of the nonlinear round classifier, $L_1(\theta,\theta_1)$ the mask-training loss function, and $L_2(\theta,\theta_2)$ the sequential-logic-training loss function.
Compared with the prior art, the artificial intelligence question-answering method based on the semantic training algorithm has the following beneficial effects:

The collected corpus data is desensitized, which prevents the leakage of sensitive data and ensures the safety of artificial intelligence question answering. By constructing a semantic training model and carrying out mask training and sequential-logic training on it, the learning capacity of the model and the extensibility of the artificial intelligence question-answering system are improved, as are the accuracy and reply rate of the answers, so that questions can be answered more efficiently; the method can be applied to a variety of intelligent fields.
Drawings
FIG. 1 is a flow chart of a method of a preferred embodiment of the present invention.
Detailed Description
It should be noted that, in the absence of conflict, the embodiments and the features of the embodiments may be combined with each other. The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, and not all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.
Referring to fig. 1, a preferred embodiment of the present invention provides an artificial intelligence question-answering method based on a semantic training algorithm, comprising the steps of:
s1.1: collecting corpus data;
s1.2: carrying out data preprocessing on the collected corpus data;
s1.3: constructing a semantic training model based on the collected corpus data;
s1.4: and carrying out model optimization training on the constructed semantic training model.
In step S1.2, error detection and correction processing and desensitization processing are performed on the collected corpus data.
In step S1.3, the input corpus data is defined as $D=\{u_1,u_2,\dots,u_m\}$, where $m$ is the total number of dialogue turns, $i\in[1,m]$ indexes the $i$-th turn, and $u_i$ is the utterance replied in the $i$-th turn; further, $u_i=\{w_i^1,w_i^2,\dots,w_i^{n_i}\}$, where $n_i$ is the length of the utterance replied in the $i$-th turn and $w_i^j$, $j\in[1,n_i]$, is the $j$-th character of that utterance.
In S1.3, the input of the semantic training model comprises token encoding, segment encoding, and position encoding, and their sum is taken as the input. The token encoding looks up the character embedding table $E_t\in\mathbb{R}^{V\times d}$ ($d$ being the embedding dimension), where $V$ is the vocabulary size; the segment encoding looks up the segment embedding table $E_s\in\mathbb{R}^{S\times d}$, where $S$ is the maximum number of segments; and the position encoding looks up the position embedding table $E_p\in\mathbb{R}^{N\times d}$, where $N$ is the sequence length of the entire dialogue, i.e. $N=\sum_{i=1}^{m}n_i$, with $m$ the total number of dialogue turns and $n_i$ the length of the utterance replied in the $i$-th turn.

The total input of the semantic training model is then obtained as

$$e_{ij} = t_{ij} + s_{ij} + p_{ij}$$

where $e_{ij}$, the input of the semantic training model, is the embedding vector corresponding to each position, and $t_{ij}$, $s_{ij}$, $p_{ij}$ are the token, segment, and position embeddings at that position, $i\in[1,m]$, $j\in[1,n_i]$. Feature extraction gives

$$E_{ij} = \mathrm{transformer}(e_{ij})$$

where $E_{ij}$ is the output vector for each character of the sequence.
The embedding vector corresponding to each position is recognized by a nonlinear classifier, and mask training is used to judge whether embedding vectors at the other positions are masked and what actual text the masks correspond to.
The mask training is as follows:

For the replied utterance $u_i=\{w_i^1,\dots,w_i^{n_i}\}$, the masked characters are predicted by a nonlinear character classifier, with the formula

$$p(\hat w_i^j) = \mathrm{softmax}\big(E_{ij}\,E_t^{\mathsf{T}} + b_1\big)$$

where $\hat w_i^j$ is the predicted value of the masked character $w_i^j$, $E_t^{\mathsf{T}}$ is the transpose of the character embedding table, and $b_1$ is the bias parameter of the nonlinear classifier.
In the mask-training process, the mask-training loss function $L_1(\theta,\theta_1)$ is expressed as

$$L_1(\theta,\theta_1) = -\sum_{m'=1}^{M}\log p\big(\hat w_{m'}=w_{m'}\big)$$

where $\theta$ denotes the encoder parameters of the semantic training model, $\theta_1$ the parameters of the nonlinear character classifier, $M$ the number of masked characters in the input sequence (indexed by $m'$), and $V$ the vocabulary size over which the prediction is made.
For the utterance $u_i$ replied in the $i$-th turn, sequential-logic training is performed: the first embedding vector $E_{i1}$ of the $i$-th segment is used to predict the turn $\hat r_i$ that the segment occupies in the dialogue, and the prediction is compared with the actual turn:

$$\hat r_i = \mathrm{softmax}\big(E_{i1}\,W_2 + b_2\big)$$

where $\hat r_i$ is the predicted turn of the $i$-th segment, $W_2$ is the unit vector of the segment embedding table, and $b_2$ is the bias parameter of the nonlinear classifier.
In the sequential-logic training, the sequential-logic-training loss function $L_2(\theta,\theta_2)$ is expressed as

$$L_2(\theta,\theta_2) = -\sum_{e'=1}^{E}\log p\big(\hat r_{e'}=r_{e'}\big)$$

where $\theta$ denotes the encoder parameters of the semantic training model, $\theta_2$ the parameters of the nonlinear round classifier, $\hat r_{e'}$ the predicted turn of the $e'$-th segment, $E$ the number of segments of the dialogue (indexed by $e'$), and $S$ the maximum number of segments.
In step S1.4, the total loss function $L$ of the semantic training model is obtained as

$$L = L_1(\theta,\theta_1) + L_2(\theta,\theta_2)$$

and the semantic training model is trained with minimization of the loss function as the objective, where $\theta$ denotes the encoder parameters of the semantic training model, $\theta_1$ the parameters of the nonlinear character classifier, $\theta_2$ the parameters of the nonlinear round classifier, $L_1(\theta,\theta_1)$ the mask-training loss function, and $L_2(\theta,\theta_2)$ the sequential-logic-training loss function.
In this embodiment, corpus data is collected, error detection and correction are performed on it, and desensitization processing is applied. In the desensitization processing, digits appearing in the corpus data (such as identity card numbers, payment collection numbers, bank card numbers, and house numbers) and sensitive information such as names and addresses can be randomly replaced to prevent information leakage.
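The desensitization step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the concrete patterns (runs of six or more digits standing in for ID, payment, bank-card, and house numbers, plus an explicit name list) and the `[MASKED]` placeholder are assumptions made here for the example.

```python
import random
import re

def desensitize(text, name_list=()):
    """Randomly replace long digit runs and listed names to mask sensitive data.

    The rules here are illustrative assumptions; the patent does not fix them.
    """
    def mask_digits(match):
        # Replace every digit with a random digit, preserving the run length.
        return "".join(str(random.randint(0, 9)) for _ in match.group())

    # Runs of 6+ digits: ID cards, payment/bank-card numbers, house numbers, etc.
    text = re.sub(r"\d{6,}", mask_digits, text)
    # Listed names/addresses are replaced with a placeholder token.
    for name in name_list:
        text = text.replace(name, "[MASKED]")
    return text

sample = "Zhang San, ID 110105199001011234, card 6222021234567890."
clean = desensitize(sample, name_list=["Zhang San"])
```

In practice the replacement rules would be tuned to the corpus, but the principle, random substitution of sensitive spans before training, is as described in the embodiment.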
The input corpus data is defined as $D=\{u_1,u_2,\dots,u_m\}$, where $m$ is the total number of dialogue turns, $i\in[1,m]$ indexes the $i$-th turn, and $u_i$ is the utterance replied in the $i$-th turn; further, $u_i=\{w_i^1,w_i^2,\dots,w_i^{n_i}\}$, where $n_i$ is the length of the utterance replied in the $i$-th turn and $w_i^j$, $j\in[1,n_i]$, is the $j$-th character of that utterance.
The input of the semantic training model comprises token encoding, segment encoding, and position encoding, and their sum is taken as the input. The token encoding looks up the character embedding table $E_t\in\mathbb{R}^{V\times d}$ ($d$ being the embedding dimension), where $V$ is the vocabulary size; the segment encoding looks up the segment embedding table $E_s\in\mathbb{R}^{S\times d}$, where $S$ is the maximum number of segments; and the position encoding looks up the position embedding table $E_p\in\mathbb{R}^{N\times d}$, where $N$ is the sequence length of the entire dialogue, i.e. $N=\sum_{i=1}^{m}n_i$, with $m$ the total number of dialogue turns and $n_i$ the length of the utterance replied in the $i$-th turn.

The total input of the semantic training model is then obtained as

$$e_{ij} = t_{ij} + s_{ij} + p_{ij}$$

where $e_{ij}$, the input of the semantic training model, is the embedding vector corresponding to each position, and $t_{ij}$, $s_{ij}$, $p_{ij}$ are the token, segment, and position embeddings at that position, $i\in[1,m]$, $j\in[1,n_i]$. Feature extraction gives

$$E_{ij} = \mathrm{transformer}(e_{ij})$$

where $E_{ij}$ is the output vector for each character of the sequence.
The embedding vector corresponding to each position is recognized by a nonlinear classifier, and mask training is used to judge whether embedding vectors at the other positions are masked and what actual text the masks correspond to.
For the replied utterance $u_i=\{w_i^1,w_i^2,w_i^3,w_i^4,\dots,w_i^{n_i}\}$, if the 2nd and 3rd characters are selected as masks, masking them gives $u_i=\{w_i^1,[\mathrm{MASK}],[\mathrm{MASK}],w_i^4,\dots,w_i^{n_i}\}$. The masked characters are predicted by a nonlinear character classifier, with the formula

$$p(\hat w_i^j) = \mathrm{softmax}\big(E_{ij}\,E_t^{\mathsf{T}} + b_1\big)$$

where $\hat w_i^j$ is the predicted value of the masked character $w_i^j$, $E_t^{\mathsf{T}}$ is the transpose of the character embedding table, and $b_1$ is the bias parameter of the nonlinear classifier.
Through mask learning on the characters, the learning capacity of the semantic training model is improved, and the answer efficiency of the semantic training model is improved.
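The mask-prediction step can be sketched as a toy NumPy illustration under stated assumptions: a random stand-in for the transformer output $E_{ij}$ and a small vocabulary. It shows the softmax over scores against the transposed character embedding table $E_t^{\mathsf T}$ plus bias $b_1$, and the cross-entropy term that such a position contributes to the mask loss $L_1$.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
V, d = 100, 8                  # vocabulary size, embedding dimension (assumed)
E_t = rng.normal(size=(V, d))  # character embedding table
b_1 = np.zeros(V)              # classifier bias b_1

# Stand-in for the transformer output E_ij at one masked position.
E_ij = rng.normal(size=d)

# p = softmax(E_ij @ E_t^T + b_1): a distribution over all vocabulary characters.
p = softmax(E_ij @ E_t.T + b_1)
predicted_char = int(p.argmax())

# The cross-entropy term this position contributes to the mask loss L_1.
true_char = 42
loss_term = -np.log(p[true_char])
```

Reusing the character embedding table as the classifier weights ties input and output representations, which is why the formula multiplies by $E_t^{\mathsf T}$ rather than by a separate weight matrix.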
Interactive information has sequential-logic structure; for example, an answer may in fact be a reply to a sentence from earlier in the interaction. Therefore, for the utterance $u_i$ replied in the $i$-th turn, sequential-logic training is performed: the first embedding vector $E_{i1}$ of the $i$-th segment is used to predict the turn $\hat r_i$ that the segment occupies in the dialogue, and the prediction is compared with the actual turn:

$$\hat r_i = \mathrm{softmax}\big(E_{i1}\,W_2 + b_2\big)$$

where $\hat r_i$ is the predicted turn of the $i$-th segment, $W_2$ is the unit vector of the segment embedding table, and $b_2$ is the bias parameter of the nonlinear classifier.
Sequential-logic training on the replied utterances improves the accuracy and reply rate of the replies.
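The turn-prediction step can be sketched similarly. Again a toy under assumptions: $E_{i1}$ is a random stand-in for the transformer output of a segment's first character, and $W_2$ is drawn randomly here rather than tied to the segment embedding table as the text describes.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(2)
S, d = 4, 8                     # maximum number of segments, embedding dimension
W_2 = rng.normal(size=(d, S))   # round-classifier weights (random stand-in)
b_2 = np.zeros(S)               # classifier bias b_2

# Stand-in for E_i1, the output vector of the first character of segment i.
E_i1 = rng.normal(size=d)

# Predict which turn of the dialogue this segment occupies.
r_hat = softmax(E_i1 @ W_2 + b_2)
predicted_turn = int(r_hat.argmax())

# The cross-entropy term this segment contributes to the loss L_2.
actual_turn = 1
loss_term = -np.log(r_hat[actual_turn])
```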
From the two training processes, the loss functions are obtained respectively:

$$L_1(\theta,\theta_1) = -\sum_{m'=1}^{M}\log p\big(\hat w_{m'}=w_{m'}\big)$$

where $\theta$ denotes the encoder parameters of the semantic training model, $\theta_1$ the parameters of the nonlinear character classifier, $M$ the number of masked characters in the input sequence (indexed by $m'$), and $V$ the vocabulary size; and

$$L_2(\theta,\theta_2) = -\sum_{e'=1}^{E}\log p\big(\hat r_{e'}=r_{e'}\big)$$

where $\theta_2$ denotes the parameters of the nonlinear round classifier, $\hat r_{e'}$ the predicted turn of the $e'$-th segment, $E$ the number of segments of the dialogue (indexed by $e'$), and $S$ the maximum number of segments.
The total loss function of the semantic training model is obtained as

$$L = L_1(\theta,\theta_1) + L_2(\theta,\theta_2)$$

and the semantic training model is trained with minimization of the loss function as the objective, where $\theta$ denotes the encoder parameters of the semantic training model, $\theta_1$ the parameters of the nonlinear character classifier, $\theta_2$ the parameters of the nonlinear round classifier, $L_1(\theta,\theta_1)$ the mask-training loss function, and $L_2(\theta,\theta_2)$ the sequential-logic-training loss function. This improves the accuracy of the model and reduces its error.
The model can be further trained by expanding the collected corpus data so as to improve the question-answer effect of the semantic training model.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted merely for clarity; the specification should be taken as a whole, and the technical solutions in the embodiments may be combined appropriately to form other implementations that will be understood by those skilled in the art.
Claims (10)
1. An artificial intelligence question-answering method based on a semantic training algorithm, characterized by comprising the following steps:
s1.1: collecting corpus data;
s1.2: carrying out data preprocessing on the collected corpus data;
s1.3: constructing a semantic training model based on the collected corpus data;
s1.4: and carrying out model optimization training on the constructed semantic training model.
2. The artificial intelligence question-answering method based on semantic training algorithm according to claim 1, wherein: in step S1.2, error detection and correction processing and desensitization processing are performed on the collected corpus data.
3. The artificial intelligence question-answering method based on semantic training algorithm according to claim 1, wherein: in step S1.3, the input corpus data is defined as $D=\{u_1,u_2,\dots,u_m\}$, where $m$ is the total number of dialogue turns, $i\in[1,m]$ indexes the $i$-th turn, and $u_i$ is the utterance replied in the $i$-th turn; further, $u_i=\{w_i^1,w_i^2,\dots,w_i^{n_i}\}$, where $n_i$ is the length of the utterance replied in the $i$-th turn and $w_i^j$, $j\in[1,n_i]$, is the $j$-th character of that utterance.
4. The artificial intelligence question-answering method based on semantic training algorithm according to claim 3, wherein: in S1.3, the input of the semantic training model comprises token encoding, segment encoding, and position encoding, and their sum is taken as the input. The token encoding looks up the character embedding table $E_t\in\mathbb{R}^{V\times d}$ ($d$ being the embedding dimension), where $V$ is the vocabulary size; the segment encoding looks up the segment embedding table $E_s\in\mathbb{R}^{S\times d}$, where $S$ is the maximum number of segments; and the position encoding looks up the position embedding table $E_p\in\mathbb{R}^{N\times d}$, where $N$ is the sequence length of the entire dialogue, i.e. $N=\sum_{i=1}^{m}n_i$, with $m$ the total number of dialogue turns and $n_i$ the length of the utterance replied in the $i$-th turn.

The total input of the semantic training model is obtained as $$e_{ij}=t_{ij}+s_{ij}+p_{ij}$$ where $e_{ij}$, the input of the semantic training model, is the embedding vector corresponding to each position, $i\in[1,m]$, $j\in[1,n_i]$; feature extraction gives $$E_{ij}=\mathrm{transformer}(e_{ij})$$ where $E_{ij}$ is the output vector for each character of the sequence.
5. The artificial intelligence question-answering method based on semantic training algorithm according to claim 4, wherein: the embedding vector corresponding to each position is recognized by a nonlinear classifier, and mask training is used to judge whether embedding vectors at the other positions are masked and what actual text the masks correspond to.
6. The artificial intelligence question-answering method based on semantic training algorithm according to claim 5, wherein the mask training is as follows: for the replied utterance $u_i=\{w_i^1,\dots,w_i^{n_i}\}$, the masked characters are predicted by a nonlinear character classifier, with the formula $$p(\hat w_i^j)=\mathrm{softmax}\big(E_{ij}\,E_t^{\mathsf{T}}+b_1\big)$$ where $\hat w_i^j$ is the predicted value of the masked character, $E_t^{\mathsf{T}}$ is the transpose of the character embedding table, and $b_1$ is the bias parameter of the nonlinear classifier.
7. The artificial intelligence question-answering method based on semantic training algorithm according to claim 6, wherein: in the mask-training process, the mask-training loss function $L_1(\theta,\theta_1)$ is expressed as $$L_1(\theta,\theta_1)=-\sum_{m'=1}^{M}\log p\big(\hat w_{m'}=w_{m'}\big)$$ where $\theta$ denotes the encoder parameters of the semantic training model, $\theta_1$ the parameters of the nonlinear character classifier, $M$ the number of masked characters in the input sequence (indexed by $m'$), and $V$ the vocabulary size.
8. The artificial intelligence question-answering method based on semantic training algorithm according to claim 7, wherein: for the utterance $u_i$ replied in the $i$-th turn, sequential-logic training is performed: the first embedding vector $E_{i1}$ of the $i$-th segment is used to predict the turn $\hat r_i$ that the segment occupies in the dialogue, $$\hat r_i=\mathrm{softmax}\big(E_{i1}\,W_2+b_2\big)$$ where $W_2$ is the unit vector of the segment embedding table and $b_2$ is the bias parameter of the nonlinear classifier.
9. The artificial intelligence question-answering method based on semantic training algorithm according to claim 8, wherein: in the sequential-logic training, the sequential-logic-training loss function $L_2(\theta,\theta_2)$ is expressed as $$L_2(\theta,\theta_2)=-\sum_{e'=1}^{E}\log p\big(\hat r_{e'}=r_{e'}\big)$$ where $\theta_2$ denotes the parameters of the nonlinear round classifier, $E$ is the number of segments of the dialogue, and $S$ is the maximum number of segments.
10. The artificial intelligence question-answering method based on semantic training algorithm according to claim 1, wherein: in the step S1.4, a total loss function L of the semantic training model is obtained:
L = L_1(θ, θ_1) + L_2(θ, θ_2)
training the semantic training model with minimization of the loss function as the target, wherein θ is the encoder parameter of the semantic training model, θ_1 is the nonlinear character classifier parameter, θ_2 is the nonlinear round classifier parameter, L_1(θ, θ_1) is the loss function of mask training, and L_2(θ, θ_2) is the loss function of sequential logic training.
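The total objective is simply the sum of the two terms; a minimal sketch with toy stand-in values for the mask-training and sequential-logic losses:

```python
def total_loss(L1, L2):
    """L = L1(θ, θ1) + L2(θ, θ2): the joint objective minimized when
    training the semantic training model, combining the mask-training
    term and the sequential-logic term."""
    return L1 + L2

# toy values standing in for the two loss terms
L = total_loss(0.434, 0.312)
```

Both terms share the encoder parameters θ, so minimizing L trains the encoder on the two tasks jointly while θ_1 and θ_2 remain task-specific.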
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211711312.4A CN116204616A (en) | 2022-12-29 | 2022-12-29 | Artificial intelligence question-answering method based on semantic training algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116204616A true CN116204616A (en) | 2023-06-02 |
Family
ID=86508635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211711312.4A Pending CN116204616A (en) | 2022-12-29 | 2022-12-29 | Artificial intelligence question-answering method based on semantic training algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116204616A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115358243A (en) * | 2022-07-27 | 2022-11-18 | 上海浦东发展银行股份有限公司 | Training method, device, equipment and storage medium for multi-round dialogue recognition model |
CN115391512A (en) * | 2022-08-30 | 2022-11-25 | 上海浦东发展银行股份有限公司 | Training method, device, equipment and storage medium of dialogue language model |
Similar Documents
Publication | Title | |
---|---|---|
CN110134771B (en) | Implementation method of multi-attention-machine-based fusion network question-answering system | |
CN110781680B (en) | Semantic similarity matching method based on twin network and multi-head attention mechanism | |
CN110134946B (en) | Machine reading understanding method for complex data | |
CN111738016A (en) | Multi-intention recognition method and related equipment | |
CN112101044B (en) | Intention identification method and device and electronic equipment | |
CN114926150B (en) | Digital intelligent auditing method and device for transformer technology compliance assessment | |
CN113742733B (en) | Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type | |
CN112800184B (en) | Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction | |
CN112037773A (en) | N-optimal spoken language semantic recognition method and device and electronic equipment | |
CN113723105A (en) | Training method, device and equipment of semantic feature extraction model and storage medium | |
CN114676255A (en) | Text processing method, device, equipment, storage medium and computer program product | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN114239574A (en) | Miner violation knowledge extraction method based on entity and relationship joint learning | |
CN112328748A (en) | Method for identifying insurance configuration intention | |
CN111597816A (en) | Self-attention named entity recognition method, device, equipment and storage medium | |
CN114492460A (en) | Event causal relationship extraction method based on derivative prompt learning | |
CN112488111B (en) | Indication expression understanding method based on multi-level expression guide attention network | |
CN113065352B (en) | Method for identifying operation content of power grid dispatching work text | |
CN116595023A (en) | Address information updating method and device, electronic equipment and storage medium | |
CN116341519A (en) | Event causal relation extraction method, device and storage medium based on background knowledge | |
CN116483314A (en) | Automatic intelligent activity diagram generation method | |
CN109960782A (en) | A kind of Tibetan language segmenting method and device based on deep neural network | |
CN114416991A (en) | Method and system for analyzing text emotion reason based on prompt | |
CN116204616A (en) | Artificial intelligence question-answering method based on semantic training algorithm | |
CN114461779A (en) | Case writing element extraction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||