CN113672708A - Language model training method, question-answer pair generation method, apparatus, and device

Language model training method, question-answer pair generation method, apparatus, and device

Info

Publication number: CN113672708A
Application number: CN202010400998.XA
Authority: CN (China)
Prior art keywords: question, answer, training, text, vector
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventor: 张高升
Current assignee: Wuhan TCL Group Industrial Research Institute Co Ltd
Original assignee: Wuhan TCL Group Industrial Research Institute Co Ltd
Application filed by Wuhan TCL Group Industrial Research Institute Co Ltd
Priority to CN202010400998.XA
Publication of CN113672708A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The application relates to the technical field of information processing and provides a language model training method, a question-answer pair generation method, an apparatus, and a device. The language model training method includes: obtaining a training text; generating a plurality of training text sequences from the training text, where each training text sequence includes the training text, an answer text contained in the training text, and a question text corresponding to the answer text; and training a language model based on a supervised learning algorithm, taking the training text and the answer text in each training text sequence as features and the question text in each training text sequence as the label, to obtain a trained target language model. A target language model obtained by this training method suits application scenarios in which the question corresponding to an answer is predicted from that answer.

Description

Language model training method, question-answer pair generation method, apparatus, and device
Technical Field
The application belongs to the technical field of information processing, and particularly relates to a language model training method, a question-answer pair generation method and apparatus, and a language model training device.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence; it enables interaction between humans and computers through natural language. Language model training is one of the core methods of natural language processing and is widely applied in various NLP tasks, such as intelligent question answering, classroom-assisted teaching, and after-sales service tasks.
Language model training in current NLP tasks generally encodes the words in the context of a training text, and then identifies the start position and end position of a target word in the training text based on the contextual semantic information, so as to obtain the target word.
It follows that the target words in the above language model training method all come from the training text. This training method therefore suits application scenarios where an answer (contained in the training text) is predicted from a question, but not application scenarios where a question (not contained in the training text) is inferred from a real answer.
Disclosure of Invention
In view of this, embodiments of the present application provide a language model training method, apparatus, and device, to solve the problem that a language model trained in the prior art cannot meet the requirement of predicting, from an answer, the question corresponding to that answer.
In a first aspect, an embodiment of the present application provides a language model training method, including:
acquiring a training text;
generating a plurality of training text sequences according to the training texts; each training text sequence comprises a training text, an answer text contained in the training text and a question text corresponding to the answer text;
training a language model based on a supervised learning algorithm, taking the training text and the answer text in each training text sequence as features and the question text in each training text sequence as the label, to obtain a trained target language model.
In a possible implementation manner of the first aspect, training a language model based on a supervised learning algorithm with a training text and an answer text in each training text sequence as features and a question text in each training text sequence as a label to obtain a trained target language model, includes:
aiming at each training text sequence, coding the training text sequence according to a preset dictionary to generate a multi-dimensional vector corresponding to the training text sequence; the multidimensional vector comprises a first vector corresponding to a training text, a second vector corresponding to an answer text and a third vector corresponding to a question text;
splicing the first vector of the multi-dimensional vector and the second vector of the multi-dimensional vector to generate a fourth vector of the multi-dimensional vector;
taking the fourth vector of the multi-dimensional vector as a feature, taking the third vector of the multi-dimensional vector as a label, and generating a training vector corresponding to the multi-dimensional vector;
and taking a plurality of training vectors as input, and training the language model based on a supervised learning algorithm to obtain a trained target language model.
In a possible implementation manner of the first aspect, training a language model based on a supervised learning algorithm with a plurality of training vectors as inputs to obtain a trained target language model includes:
performing mask replacement on the label of each training vector to generate a masked training vector;
inputting the training vector after the mask into a language model for processing to obtain the probability distribution of the predicted value of the label of the training vector on a preset dictionary;
determining the value of a cross entropy function according to the label and probability distribution of the training vector;
when the value of the cross entropy function does not satisfy a preset condition, updating the parameters of the language model and returning to the step of performing mask replacement on the label of each training vector to generate a masked training vector, until the value of the cross entropy function satisfies the preset condition;
and saving the model parameters of the current language model, and generating a trained target language model.
In a second aspect, an embodiment of the present application provides a question-answer pair generating method, including:
acquiring text information to be analyzed;
generating a candidate answer set according to the text information; the candidate answer set comprises at least one candidate answer;
inputting the text information and the candidate answer set into a target language model for processing, obtaining the question corresponding to each candidate answer in the candidate answer set, and generating a question set according to the question corresponding to each candidate answer;
according to the text information and the question set, obtaining a predicted answer corresponding to each question in the question set respectively;
and aiming at each question in the question set, performing semantic similarity comparison on the candidate answer corresponding to the question and the predicted answer corresponding to the question, and generating a question-answer pair according to the comparison result.
In a possible implementation manner of the second aspect, inputting the text information and the candidate answer set into a target language model for processing, and obtaining a question corresponding to each candidate answer in the candidate answer set, respectively, includes:
combining the text information and the candidate answers to generate a text sequence to be predicted aiming at each candidate answer in the candidate answer set;
coding a text sequence to be predicted to generate a corresponding array vector to be predicted;
inputting the array vector to be predicted into a target language model, and determining a predicted value corresponding to the array vector to be predicted;
if the predicted value is not matched with the preset termination value, splicing the array vector to be predicted and the predicted value to generate a spliced array vector; taking the spliced array vector as an array vector to be predicted, returning to execute the step of inputting the array vector to be predicted into the target language model and determining a predicted value corresponding to the array vector to be predicted until the current predicted value is matched with a preset termination value;
splicing the current array vector to be predicted and the current predicted value to generate a target array vector;
and decoding the target array vector to generate a question corresponding to the candidate answer.
In a possible implementation manner of the second aspect, for each question in the question set, performing semantic similarity comparison between a candidate answer corresponding to the question and a predicted answer corresponding to the question, and generating a question-answer pair according to a comparison result, the method includes:
calculating semantic similarity of candidate answers corresponding to the questions and predicted answers corresponding to the questions aiming at each question in the question set;
deleting the question and the predicted answer corresponding to the question under the condition that the semantic similarity is smaller than a preset threshold value;
and combining the question and the candidate answer corresponding to the question to generate a question-answer pair under the condition that the semantic similarity is greater than or equal to a preset threshold value.
In one possible implementation manner of the second aspect, generating the candidate answer set according to the text information includes:
dividing text information into a plurality of natural sentences;
extracting entity words from the natural sentences aiming at each natural sentence in the plurality of natural sentences to generate entity word vectors of the natural sentences;
and generating a hierarchical candidate answer set according to the entity word vectors of each natural sentence in the plurality of natural sentences.
In a possible implementation manner of the second aspect, the method for generating question-answer pairs further includes:
searching whether a candidate answer exists in a plurality of question-answer pairs or not aiming at each candidate answer in the hierarchical candidate answer set;
if the candidate answers do not exist in the multiple question-answer pairs, deleting the candidate answers from the hierarchical candidate answer set;
if the candidate answers exist in the plurality of question-answer pairs, adding the questions corresponding to the candidate answers to the corresponding positions of the candidate answers in the hierarchical candidate answer set to generate a hierarchical question-answer pair set.
In a third aspect, an embodiment of the present application provides a language model training apparatus, including:
the first acquisition module is used for acquiring a training text;
the first generation module is used for generating a plurality of training text sequences according to the training texts; each training text sequence comprises a training text, an answer text contained in the training text and a question text corresponding to the answer text;
and the training module, used for training the language model based on a supervised learning algorithm, taking the training text and the answer text in each training text sequence as features and the question text in each training text sequence as the label, to obtain the trained target language model.
In a fourth aspect, an embodiment of the present application provides a question-answer pair generating apparatus, including:
the second acquisition module is used for acquiring the text information to be analyzed;
the generation module, used for generating a candidate answer set according to the text information; the candidate answer set includes at least one candidate answer;
the processing module is used for inputting the text information and the candidate answer set into the target language model for processing to obtain a question corresponding to each candidate answer in the candidate answer set; generating a question set according to the question corresponding to each candidate answer;
the prediction module is used for obtaining a prediction answer corresponding to each question in the question set according to the text information and the question set;
and the comparison module, used for comparing, for each question in the question set, the semantic similarity between the candidate answer corresponding to the question and the predicted answer corresponding to the question, and generating a target question-answer pair according to the comparison result.
In a fifth aspect, an embodiment of the present application provides a language model training device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the steps of any one of the methods in the first aspect.
In a sixth aspect, an embodiment of the present application provides a question-answer pair generating device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the methods in the second aspect when executing the computer program.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the methods in the first aspect.
In an eighth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the methods in the second aspect.
In a ninth aspect, embodiments of the present application provide a computer program product, which, when run on a terminal device, causes the terminal device to execute the method of any one of the first aspect.
In a tenth aspect, embodiments of the present application provide a computer program product, which, when run on a terminal device, causes the terminal device to execute the method of any one of the second aspects.
The language model training method provided by the embodiments of the present application trains a language model based on a supervised learning algorithm, where the input of the supervised task is text content together with an answer contained in that text content, and the output is the question corresponding to the answer. When the target language model obtained by this training method is used, the question for an answer can be generated simply by inputting the text content and an answer contained in it, and the generated question need not appear in the text content; the method therefore suits application scenarios in which the question corresponding to an answer is predicted from that answer.
On the other hand, the training text can be determined according to the application field of the target language model, so that the prediction accuracy of the trained target language model is higher.
It is understood that the beneficial effects of the third, fifth, seventh and ninth aspects can be referred to the related description of the first aspect, and are not described herein again.
According to the question-answer pair generation method provided by the embodiments of the present application, the candidate answer set is obtained directly from the text information to be analyzed. Because each candidate answer in the candidate answer set is taken directly from the text information rather than generated by prediction, every candidate answer is semantically clear and free of grammatical errors.
After the candidate answer set is obtained, the candidate answer set and the text information are input into the target language model to obtain the question corresponding to each candidate answer in the candidate answer set, generating a question set; a predicted answer is then obtained for each question in the question set; finally, for each question in the question set, the semantic similarity between the candidate answer corresponding to the question and the predicted answer is compared, and the predicted answers are screened according to the comparison result, so that each retained predicted answer has high semantic similarity to its corresponding candidate answer, i.e., the retained predicted answers are semantically clear and free of grammatical errors.
Because a predicted answer is obtained by prediction from its question, when the predicted answer is semantically clear and free of grammatical errors, the question corresponding to that predicted answer can also be determined to be semantically clear and free of grammatical errors; consequently, the question-answer pairs generated from such questions are clear in meaning and grammatically correct.
By comparing the semantic similarity of corresponding candidate answers and predicted answers, the question-answer pair generation method provided by the embodiments of the present application screens the predicted answers and thereby the inferred questions, greatly reducing cases where the question or answer in a question-answer pair is semantically unclear or grammatically incorrect, and improving the quality of the question-answer pairs.
It can be understood that, the beneficial effects of the fourth, sixth, eighth, and tenth aspects can be referred to the related description in the second aspect, and are not described herein again.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a language model training method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of generating a target language model according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a question-answer pair generating method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of obtaining the question corresponding to a candidate answer according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of generating a hierarchical candidate answer set according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of generating a hierarchical question-answer pair set according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a language model training apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a question-answer pair generating apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the hardware composition of a language model training device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of the hardware composition of a question-answer pair generating device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. It is worth mentioning that the specific embodiments listed below may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 1 is a schematic flowchart of a language model training method according to an embodiment of the present application; the executing subject of this embodiment is a language model training device. The language model training device includes, but is not limited to, mobile terminals such as smartphones, tablet computers, and wearable devices, and may also be a desktop computer, a robot, a server, or the like. The language model training method shown in fig. 1 may include:
and S11, acquiring the training text.
In this embodiment, the training text may refer to a large-scale text sequence library that is scientifically sampled and processed. Specifically, the training text may be obtained according to the application field of the language model training.
For example, the language model training device may be applied to machine teaching, where it is used to automatically generate questions and answers for reading-comprehension question types. Relevant information such as professional terms, keywords, and course questions of the target course can be collected in advance through course material collation and expert consultation, and the training text retrieved on the basis of this information, so as to improve the pertinence of the training text.
The training text may include a plurality of article texts, and each article text may include one or more sentences. For example, the training text may be all articles in a junior high school chinese text.
It should be appreciated that in one embodiment, a plurality of article texts may be obtained, each article text serving as a training text.
And S12, generating a plurality of training text sequences according to the training texts.
In this embodiment, each training text sequence includes a training text, an answer text included in the training text, and a question text corresponding to the answer text.
The answer text and the question text corresponding to the answer are obtained by analyzing the training text in advance.
In one embodiment, the training text may include a large amount of article text, each article text including one or more sentences.
Before the model is trained, for each article text, a plurality of questions are extracted based on the article text to generate a plurality of question texts, where the real answers corresponding to all of the question texts can be found in the article text. Each question and its corresponding answer then form a determined question-answer pair.
Illustratively, a plurality of determined question-answer pairs for a known article text can be obtained by crawling, for example from an online question-and-answer community.
Optionally, the plurality of questions extracted based on the article text need not be contained in the article text; for example, general questions may be set according to the application field of the language training device, and it is only necessary to ensure that the real answers to these general questions can be found in the article text.
Illustratively, if the language training device is applied to classroom-assisted teaching, the general questions may be "What is the subject of the text?", "Who is the protagonist's father in the text?", and the like.
After obtaining a plurality of determined question-answer pairs in the article text, aiming at each question-answer pair, combining the article text, the answer text in the question-answer pair and the question text of the question-answer pair to generate a training text sequence. For example, the article text, the answer text, and the question text may be combined in this order, and an identifier may be added between the article text, the answer text, and the question text.
Alternatively, the article text, answer text, and question text may be combined in the form <bos>article text<answer_text>answer text<question_text>question text<eos> to generate a training text sequence Z.
Here <bos>, <answer_text>, <question_text>, and <eos> are all preset identifier symbols, where <bos> marks the start of the sequence, <eos> marks its end, <answer_text> marks the answer text, and <question_text> marks the question text.
Illustratively, the article text is "Back Shadow" (背影) by Zhu Ziqing, and the question-answer pairs extracted based on "Back Shadow" may be: question 1 "What did I go home with father to do?", answer 1 "to attend grandmother's funeral"; question 2 "Why did father decide to see me off himself?", answer 2 "he worried that the attendant would not look after me properly"; question 3 "What did father go to buy for me?", answer 3 "oranges".
Then, for each question-answer pair, the article text, answer text, and question text are combined to generate a training text sequence. The three training text sequences obtained are:
Training text sequence Z1: <bos>"Back Shadow"<answer_text>to attend grandmother's funeral<question_text>What did I go home with father to do<eos>;
Training text sequence Z2: <bos>"Back Shadow"<answer_text>he worried that the attendant would not look after me properly<question_text>Why did father decide to see me off himself<eos>;
Training text sequence Z3: <bos>"Back Shadow"<answer_text>oranges<question_text>What did father go to buy for me<eos>;
It should be understood that "Back Shadow" in each training text sequence refers to the full text content of the article "Back Shadow".
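For illustration only (not part of the patent), the following Python sketch shows one way such training text sequences could be assembled; the helper function and variable names are hypothetical.

```python
BOS, ANS, QUE, EOS = "<bos>", "<answer_text>", "<question_text>", "<eos>"

def build_training_sequence(article: str, answer: str, question: str) -> str:
    # Combine article text, answer text, and question text with the preset
    # identifier symbols, in the order described above.
    return f"{BOS}{article}{ANS}{answer}{QUE}{question}{EOS}"

article_text = "..."  # the full text of "Back Shadow" would go here
qa_pairs = [
    ("to attend grandmother's funeral", "What did I go home with father to do"),
    ("he worried the attendant would not look after me properly",
     "Why did father decide to see me off himself"),
    ("oranges", "What did father go to buy for me"),
]
sequences = [build_training_sequence(article_text, a, q) for a, q in qa_pairs]
```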
In another embodiment, the training text may be a complete article text, and each training text sequence includes the training text, the answer text, and the question text corresponding to the answer text.
And S13, training the language model based on the supervised learning algorithm by taking the training text and the answer text in each training text sequence as features and taking the question text in each training text sequence as a label to obtain the trained target language model.
In this embodiment, the language model is trained based on a supervised learning algorithm. Among other things, training samples in supervised learning algorithms need to include inputs and outputs, i.e., features and labels.
In this embodiment, the training text and the answer text in each training text sequence are used as features, and the question text in each training text sequence is used as a label.
For example, the process of determining the label and features of each training text sequence may include:
step A1: aiming at each training text sequence, coding the training text sequence according to a preset dictionary to generate a multi-dimensional vector corresponding to the training text sequence; the multidimensional vector comprises a first vector corresponding to the training text, a second vector corresponding to the answer text and a third vector corresponding to the question text.
In this step, the preset dictionary is used to encode the training text sequence. Optionally, the preset dictionary may contain all words in a standard modern Chinese corpus together with the domain keywords and professional terms of the target course; the preset dictionary further records a numerical value for each of these words. It should be understood that the numerical values corresponding to different words in the preset dictionary are generally different.
Optionally, an update frequency may be set for the preset dictionary, and the domain keywords and professional terms of the target course may be added to the preset dictionary periodically.
It should be understood that the preset dictionary also contains the preset identifier symbols, i.e., the numerical values corresponding to <bos>, <answer_text>, <question_text>, and <eos>.
In this step, the training text sequence is encoded according to the preset dictionary, which may mean that for each training text sequence, each word in the training text sequence is mapped to a corresponding numerical value in the preset dictionary, and a multidimensional vector corresponding to the training text sequence is obtained.
Each multi-dimensional vector may include a first vector corresponding to the training text, a second vector corresponding to the answer text, and a third vector corresponding to the question text.
It can be understood that the multi-dimensional vector obtained by encoding represents all the semantic information of the training text sequence; within it, the second vector represents the semantic information of the answer text, and the third vector represents the semantic information of the question text. When semantic information is represented by vectors, two vectors that are close to each other in space correspond to texts with similar semantic content.
Step A2: and splicing the first vector of the multi-dimensional vector and the second vector of the multi-dimensional vector to generate a fourth vector of the multi-dimensional vector.
Step A3: and generating a training vector corresponding to the multi-dimensional vector by taking the fourth vector of the multi-dimensional vector as a characteristic and the third vector of the multi-dimensional vector as a label.
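As a purely illustrative sketch of steps A1-A3 (not from the patent), the following Python fragment uses whitespace tokenization and a growable stand-in for the preset dictionary; in practice the dictionary would be fixed in advance, and all names here are hypothetical.

```python
preset_dictionary = {}  # stand-in: a real implementation loads a fixed dictionary

def encode(tokens):
    # Step A1: map each word/identifier symbol to its numerical value.
    for t in tokens:
        if t not in preset_dictionary:
            preset_dictionary[t] = len(preset_dictionary)
    return [preset_dictionary[t] for t in tokens]

def make_training_vector(article_tokens, answer_tokens, question_tokens):
    first = encode(article_tokens)    # first vector: training text
    second = encode(answer_tokens)    # second vector: answer text
    third = encode(question_tokens)   # third vector: question text
    fourth = first + second           # step A2: splice first and second vectors
    return {"feature": fourth, "label": third}  # step A3: feature/label pair
```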
In this embodiment, after obtaining the plurality of training vectors, the plurality of training vectors are used as input, and the language model is trained based on the supervised learning algorithm to obtain a trained target language model.
For example, a plurality of training vectors may be input to the language model, the value of a loss function calculated, and the parameters of the language model adjusted according to that value; when the value of the loss function reaches a preset condition, for example falls below a first preset threshold, the model iteration ends, and the parameters of the current language model are saved to obtain the trained target language model. The first preset threshold is a threshold preset by the user.
The value of the loss function is determined from the difference between the value predicted by the language model and the value of the third vector in the training vector.
The language model training method provided by the embodiments of the present application trains a language model based on a supervised learning algorithm, where the input of the supervised task is text content together with an answer contained in that text content, and the output is the question corresponding to the answer. When the target language model obtained by this training method is used, the question for an answer can be generated simply by inputting the text content and an answer contained in it, and the generated question need not appear in the text content; the method therefore suits application scenarios in which the question corresponding to an answer is predicted from that answer.
On the other hand, the training text can be determined according to the application field of the target language model, so that the prediction accuracy of the trained target language model is higher.
Fig. 2 is a schematic flowchart of generating a target language model according to an embodiment of the present application, describing one possible implementation of generating the target language model after obtaining the plurality of training vectors in step S13 of the embodiment of fig. 1. As shown in fig. 2, training the language model based on a supervised learning algorithm with a plurality of training vectors as input to obtain a trained target language model includes:
s131, carrying out mask replacement on the label of each training vector to generate a masked training vector.
Each training vector has the same composition: it includes a first vector corresponding to the training text, a second vector corresponding to the answer text, and a third vector corresponding to the question text, combined in the order first vector + second vector + third vector.
The label of the training vector is the third vector, corresponding to the question text.
In some embodiments, if the training text sequence Z is combined in the form <bos>article text<answer_text>answer text<question_text>question text<eos>, then in the training vector obtained by encoding the training text sequence, the third vector starts with the numerical value x corresponding to <question_text> and ends with the numerical value y corresponding to <eos>. Performing mask replacement on the label of the training vector therefore means replacing the numerical values between x and y in the training vector with a mask, yielding the masked training vector Y.
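The masking operation itself can be sketched as follows (illustrative only; the mask value and list-based interface are assumptions):

```python
def mask_label(training_vector, x_value, y_value, mask_value):
    # Replace every numerical value strictly between the <question_text>
    # value x and the <eos> value y with an assumed reserved mask value.
    start = training_vector.index(x_value)
    masked = list(training_vector)
    for i in range(start + 1, len(masked)):
        if masked[i] == y_value:
            break
        masked[i] = mask_value
    return masked
```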
S132, inputting the training vector after the mask into the language model for processing, and obtaining the probability distribution of the predicted value of the label of the training vector on a preset dictionary.
A language model is typically constructed as a probability distribution over a sequence of words: for a sequence of given length m, it yields the probability P(w_1, w_2, …, w_m) of the entire word sequence occurring.
The preset dictionary in this embodiment is the same as the preset dictionary in the embodiment of fig. 1.
For example, if the masked training vector is Y, then Y is used as input, and through forward propagation of the language model the probability distribution of the next word over the preset dictionary is predicted, i.e., the probability that the next word is each word in the preset dictionary.
And S133, determining the value of the cross entropy function according to the label and the probability distribution of the training vector.
In this embodiment, the loss function of the language model may be a cross entropy function, and a value of the cross entropy function is a value of the loss function.
In this embodiment, the first vector corresponding to the article text and the second vector corresponding to the answer text in each training vector are used as the features (input), and the third vector corresponding to the question text is used as the label (output); since the input data and the output are already determined, the true probability distribution is known.
The value of the cross entropy function may be determined according to a difference between the true probability distribution and the predicted probability distribution to update the parameters of the language model according to the value of the cross entropy function.
It should be understood that the smaller the value of the cross-entropy function, the better the predicted result is represented.
And S134, when the value of the cross entropy function does not satisfy the preset condition, updating the parameters of the language model and returning to the step of performing mask replacement on the label of each training vector to generate a masked training vector, until the value of the cross entropy function satisfies the preset condition.
The preset condition may be that the value of the cross entropy function is less than or equal to a second preset threshold, or that the difference between the values of the cross entropy functions of two adjacent times is less than a third preset threshold.
In this embodiment, parameters of the language model may be updated according to the Adam optimizer.
And S135, saving the model parameters of the current language model, and generating the trained target language model.
And when the cross entropy function value reaches a preset condition, saving the model parameters of the current language model to generate the target language model.
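For orientation only, a highly simplified PyTorch-style sketch of the S131-S135 loop is given below. The stand-in model, hyperparameters, and stopping condition are assumptions; the patent does not fix a concrete architecture.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 30000, 256  # assumed sizes
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),  # stand-in language model; the patent
    nn.Linear(embed_dim, vocab_size),     # does not specify an architecture
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, as in S134
loss_fn = nn.CrossEntropyLoss()  # cross entropy function, as in S133

def train_step(masked_inputs: torch.Tensor, labels: torch.Tensor) -> float:
    # masked_inputs, labels: LongTensors of shape (batch, seq_len)
    logits = model(masked_inputs)  # distribution over the preset dictionary (S132)
    loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Training repeats masking (S131) and train_step (S132-S134) until the loss
# satisfies the preset condition, then saves the parameters (S135), e.g.:
# torch.save(model.state_dict(), "target_language_model.pt")
```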
The target language model can be widely applied to intelligent question-answering tasks, classroom-assisted teaching tasks, or after-sales service tasks. For example, in a classroom-assisted teaching task, question-answer pairs for a target reading text can be generated automatically based on the target language model, so that a machine can act as a teacher and interact with students by asking questions, or reading-comprehension test papers can be generated automatically.
However, since the questions produced by the target language model are obtained by inference, question-answer pairs generated from such questions may inevitably contain semantic ambiguity or grammatical errors, which does not adequately support classroom-assisted teaching tasks.
Based on this, the embodiments of the present application further provide a question-answer pair generating method to solve the above technical problem, and the following exemplary descriptions are provided in specific embodiments.
Fig. 3 is a schematic flow chart of a question-answer pair generating method according to an embodiment of the present application, where an executing subject of the present embodiment is a question-answer pair generating device; the question-answer pair generating device comprises but is not limited to a mobile terminal such as a smart phone, a tablet computer, a wearable device and the like, and can also be a desktop computer, a robot, a server and the like. The question-answer pair generating method shown in fig. 3 may include:
and S21, acquiring the text information to be analyzed.
In this embodiment, the text information may be text content input by a user, or text content extracted from any multimedia form including text content, such as a document, a web page, a text picture, and the like, or text content obtained by recognizing voice input by the user, which is not limited herein. For example, the text information may be an article for reading and understanding teaching and learning.
The text content includes a plurality of text sequences, wherein the text sequences are character strings formed by more than one character in sequence, and each text sequence can include one or more sentences.
Illustratively, acquiring the text information to be analyzed may mean acquiring webpage information from a network through an information analysis script, preprocessing the webpage information (for example, removing navigation bars and advertisement noise) to obtain initial text information, and then extracting text sequences from the initial text information to obtain the text information to be analyzed.
And S22, generating a candidate answer set according to the text information, wherein the candidate answer set comprises at least one candidate answer.
In natural language, questions follow the 5W2H pattern, where 5W refers to "What", "Where", "When", "Who", and "Why", and 2H refers to "How" and "How much"; the answer to any one or more of the 5W2H questions may constitute a candidate answer.
In this embodiment, semantic text segments of the text information that contain one or more pieces of description information may form candidate answers; the description information may be event description information, state description information, or entity description information, without limitation here.
In one embodiment, the text information may be divided into a plurality of natural sentences, and each natural sentence may be used as a candidate answer.
The sentences or paragraphs of the text information are generally demarcated by delimiters, and the text information can be divided into a plurality of natural sentences by identifying the delimiters in the text information.
For example, the text information may be an article Pa used for reading-comprehension teaching. Dividing the text into a plurality of natural sentences S gives:

Pa = {S1, S2, S3, …}

where Pa is the text information, Si denotes the i-th natural sentence in the text information, and i is a positive integer.
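A minimal sketch of this delimiter-based division (the delimiter set is an assumption):

```python
import re

def split_into_natural_sentences(text: str):
    # Split after Chinese or ASCII sentence-final delimiters (assumed set).
    parts = re.split(r"(?<=[。！？!?])", text)
    return [s.strip() for s in parts if s.strip()]

Pa = split_into_natural_sentences("我与父亲回家奔丧。父亲决定亲自送我。")
# Pa == ["我与父亲回家奔丧。", "父亲决定亲自送我。"], i.e. {S1, S2, ...}
```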
In another embodiment, domain information such as professional terms and course questions of the target course can be collected in advance through course material collation and expert consultation, search keywords determined from this domain information, and searches performed in the text information according to the search keywords to generate the candidate answer set.
It should be understood that multiple searches may be performed to obtain a candidate answer subset for each search keyword, and these subsets combined to generate the candidate answer set of the text information.
Each candidate answer in the candidate answer set obtained in this step is taken directly from the text information rather than generated by prediction, so each candidate answer is semantically clear and free of grammatical errors.
S23, inputting the text information and the candidate answer set into a target language model for processing to obtain questions corresponding to each candidate answer in the candidate answer set; and generating a question set according to the question corresponding to each candidate answer.
The target language model in this embodiment is the trained target language model obtained in the embodiment of fig. 1 or fig. 2. The input of the target language model is text information and candidate answers, the output is a question, and the question and the candidate answers have correlation.
In this embodiment, for each candidate answer in the candidate answer set, text information and the candidate answer may be combined to generate a text sequence to be predicted, and then the text sequence to be predicted is encoded according to a preset dictionary to generate a corresponding array vector to be predicted; and inputting the array vector to be predicted into a pre-trained language model to obtain a question corresponding to the candidate answer.
The preset dictionary is used to encode and decode text sequences; the preset dictionary in this embodiment is the same as that in the embodiment of fig. 1. Optionally, an update frequency may be set for the preset dictionary, and the domain keywords and professional terms of the target course may be added to it periodically.
After the question corresponding to each candidate answer is obtained, all the questions are combined to generate a question set.
And S24, obtaining the predicted answer corresponding to each question in the question set according to the text information and the question set.
The purpose of this step is to obtain, from the text information and the question set, the answer corresponding to each question in the question set; the answers may be predicted, for example, based on natural language processing models such as BERT, XLNet, GPT, or Transformer-XL, without specific limitation here.
In some embodiments, the predicted answer corresponding to each question may be obtained based on a BERT language model. Wherein the BERT language model is a pre-trained deep bidirectional Transformer language model.
For example, the above process of obtaining the predicted answer corresponding to each question based on the BERT language model may include:
step B1, converting each question in the text information and question set into a word vector sequence that the BERT language model can recognize. For example, word embedding processing may be performed on text information to obtain a first word vector sequence of the text information, and word embedding processing may be performed on each question in the question set to obtain a second word vector sequence corresponding to each question.
And B2, splicing the first word vector sequence with each second word vector sequence to obtain a third vector sequence corresponding to the second word vector sequences one by one.
And step B3, inputting the third vector sequence into a BERT language model for each third vector sequence to obtain a predicted answer corresponding to the third vector sequence.
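A hedged sketch of steps B1-B3 using the Hugging Face transformers library is shown below; the pipeline handles the word-embedding and splicing steps internally, and the checkpoint name is an assumption (in practice a QA-fine-tuned BERT model would be used).

```python
from transformers import pipeline

# Assumed checkpoint; any extractive question-answering BERT model would do.
qa = pipeline("question-answering", model="bert-base-chinese")

def predict_answers(text_information: str, question_set):
    # For each question, the pipeline embeds question and context, splices
    # them into one sequence, and extracts the answer span (steps B1-B3).
    return {q: qa(question=q, context=text_information)["answer"]
            for q in question_set}
```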
And S25, aiming at each question in the question set, performing semantic similarity comparison on the candidate answer corresponding to the question and the predicted answer corresponding to the question, and generating a question-answer pair according to the comparison result.
The purpose of this step is to screen the predicted answer according to the semantic similarity between the corresponding candidate answer and the predicted answer.
Because each candidate answer in the candidate answer set is taken directly from the text information rather than generated by prediction, each candidate answer is semantically clear and free of grammatical errors. If the semantic similarity between the candidate answer corresponding to a question and the predicted answer corresponding to that question is high, it indicates that the predicted answer is semantically accurate and clear and grammatically correct.
Comparing the semantic similarity of the candidate answer corresponding to a question with the predicted answer corresponding to that question may mean calculating the semantic similarity between the two.
For example, the candidate answer and the predicted answer may each be converted into word vectors, and their semantic similarity characterized by the distance between the word vectors in space.
Generating question-answer pairs according to the comparison result may mean: deleting the question and the predicted answer corresponding to the question when the semantic similarity is smaller than a preset threshold; and combining the question and the candidate answer corresponding to the question to generate a question-answer pair when the semantic similarity is greater than or equal to the preset threshold.
The preset threshold may be a fourth preset threshold, and the fourth preset threshold is a threshold preset by the user.
When the semantic similarity is smaller than the fourth preset threshold, the predicted answer is semantically unclear or grammatically incorrect; since the predicted answer is obtained by prediction from the question, the question corresponding to such a predicted answer may suffer from the same defects, so the question and its corresponding predicted answer are deleted to safeguard the quality of the generated question-answer pairs.
When the semantic similarity is greater than or equal to the fourth preset threshold, the predicted answer is semantically clear and free of grammatical errors, and it can further be determined that the question corresponding to the predicted answer is semantically clear and free of grammatical errors. The question and the candidate answer corresponding to the question are then combined to generate a question-answer pair of high quality.
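The screening logic of S25 can be sketched as follows (the embedding function, data layout, and threshold are assumptions):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def screen_question_answer_pairs(questions, candidate, predicted, embed,
                                 threshold=0.8):
    # candidate / predicted: dicts mapping each question to its answer text;
    # embed: assumed function mapping a text to a word vector.
    qa_pairs = []
    for q in questions:
        sim = cosine_similarity(embed(candidate[q]), embed(predicted[q]))
        if sim >= threshold:  # keep: combine question + candidate answer
            qa_pairs.append((q, candidate[q]))
        # else: the question and its predicted answer are discarded
    return qa_pairs
```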
By comparing the semantic similarity of corresponding candidate answers and predicted answers, the question-answer pair generation method provided by this embodiment screens the predicted answers and thereby the inferred questions, greatly reducing semantic unclarity and grammatical errors in the questions or answers of the question-answer pairs and improving their quality.
Fig. 4 is a schematic flowchart of obtaining the question corresponding to a candidate answer according to an embodiment of the present application, describing one possible implementation of obtaining the question corresponding to each candidate answer in the candidate answer set in step S23 of the embodiment of fig. 3. As shown in fig. 4, inputting the text information and the candidate answer set into the target language model for processing to obtain the question corresponding to each candidate answer in the candidate answer set includes:
and S231, aiming at each candidate answer in the candidate answer set, combining the text information and the candidate answer to generate a text sequence to be predicted.
In this embodiment, the text information and the candidate answer may be combined in the form <bos>text information<answer_text>candidate answer<eos> to generate a text sequence A to be predicted.
Here <bos>, <answer_text>, and <eos> are all preset tag symbols; each tag symbol can be regarded as a word, and the numerical value corresponding to each tag symbol can be looked up in the preset dictionary.
And S232, coding the text sequence to be predicted to generate a corresponding array vector to be predicted.
In this embodiment, the text sequence to be predicted may be encoded according to a preset dictionary;
the preset dictionary is the same as the preset dictionary in the embodiment of fig. 3, and includes all words in the training text and numerical values corresponding to each word. Wherein, the numerical values corresponding to each word are different from each other. It should be understood that the predetermined dictionary contains all words in the textual information.
The predetermined dictionary further includes predetermined tag symbols, such as values corresponding to < bos >, < answer _ text >, < query _ text >, and < eos >.
In this embodiment, encoding the text sequence to be predicted according to the preset dictionary may specifically refer to mapping each word in the text sequence to be predicted a to a corresponding numerical value in the preset dictionary, and obtaining an array vector B to be predicted corresponding to the text sequence to be predicted.
Each array vector to be predicted may include a fourth vector corresponding to the text information and a fifth vector corresponding to the candidate answer.
And S233, inputting the array vector to be predicted into the target language model, and determining a predicted value corresponding to the array vector to be predicted.
In this embodiment, inputting the array vector to be predicted into the target language model may mean that the array vector B to be predicted is used as input and, through the forward propagation of the model, the probability distribution of the next word over the preset dictionary is obtained; the numerical value in the preset dictionary corresponding to the maximum probability is selected as the predicted value C.
S234, if the predicted value does not match a preset termination value, splicing the array vector to be predicted and the predicted value to generate a spliced array vector; taking the spliced array vector as the array vector to be predicted and returning to step S233, until the current predicted value matches the preset termination value.
In this embodiment, the preset termination value refers to a numerical value of the termination tag symbol < eos > in the text sequence to be predicted, which corresponds to the preset dictionary.
After the predicted value C is obtained, it is determined whether C matches the preset termination value; matching may specifically mean that the predicted value C is identical to the preset termination value.
If the predicted value C is the same as the preset termination value, the final prediction result has been obtained.
If the predicted value C differs from the preset termination value, C is spliced onto B to obtain a splicing result B1; B1 is then taken as the new input value, and the next predicted value C1 is obtained through forward propagation of the model. This process is repeated until the current predicted value matches the preset termination value.
And S235, splicing the current array vector to be predicted and the current predicted value to generate a target array vector.
Here the current predicted value is the predicted value C_N obtained after N iterations; the current array vector to be predicted is then B + C_1 + C_2 + … + C_{N-1}, and the target array vector is B + C_1 + C_2 + … + C_{N-1} + C_N, where N is an integer greater than or equal to 2 and "+" denotes splicing (concatenation) of the arrays.
And S236, decoding the target array vector according to a preset dictionary to generate a question corresponding to the candidate answer.
In this embodiment, decoding the target array vector according to the preset dictionary may specifically mean mapping each numerical value in the target array vector to the corresponding word in the preset dictionary, obtaining the question text corresponding to the target array vector.
In this embodiment, each time a predicted value, i.e., a new word, is generated, the word is appended to the end of the previously generated text sequence to be predicted to form a new text sequence to be predicted; this new sequence containing the predicted value then becomes the next input of the language model. This forms an autoregressive mechanism, which can greatly improve the accuracy of prediction.
In the field of machine reading comprehension or machine-assisted teaching, reading comprehension questions and answers can be automatically generated by the question-answer pair generating device and used to simulate teacher-student interaction in order to assist teaching. Questions in tutoring generally include concrete questions, which can usually be answered in sentences, and abstract questions, which usually require the answer to be refined to concrete elements in a sentence, such as characters, time, and actions. In actual teaching, a teacher generally first presents an abstract question and then presents a concrete question about the answer to the abstract question. Therefore, to match the question-and-answer habits of reading comprehension teaching, a hierarchical candidate answer set can be constructed, and a hierarchical question-answer pair set can be obtained from it. Exemplary descriptions are given below through the embodiments of fig. 5 and fig. 6, respectively.
Fig. 5 is a flowchart illustrating a process for generating a candidate answer set according to an embodiment of the present application, and describes one possible implementation manner of generating the candidate answer set according to the text information in step 22 in fig. 3. As shown in fig. 5, generating the candidate answer set according to the text information includes:
And S221, dividing the text information into a plurality of natural sentences.
In natural language processing, boundaries between sentences or paragraphs of text information are generally marked by delimiters, so the text information can be divided into a plurality of natural sentences by identifying the delimiters in it.
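As a minimal sketch of step S221, assuming sentence boundaries are marked by common end-of-sentence delimiters (the delimiter set itself is an assumption):

```python
import re

# A sketch of step S221: divide text information into natural sentences at
# end-of-sentence delimiters; the delimiter set is an illustrative assumption.
def split_sentences(text):
    parts = re.split(r"(?<=[。！？.!?])", text)
    return [p.strip() for p in parts if p.strip()]

# e.g. split_sentences("小明今天在家。他做了三个小时数学作业！")
# -> ["小明今天在家。", "他做了三个小时数学作业！"]
```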
S222, for each natural sentence in the plurality of natural sentences, extracting an entity word from the natural sentence, and generating an entity word vector of the natural sentence.
The entity words include nouns, pronouns, and noun phrases.
The entity words in each natural sentence can be identified based on a part-of-speech tagging method in natural language processing, and then the entity words obtained by identification are combined in sequence to generate an entity word vector.
Illustratively, if the natural sentence S is "Xiaoming has been doing math homework at home for three hours today", performing part-of-speech tagging and entity word extraction on the natural sentence S yields the entity word vector {Xiaoming, today, three hours, math homework}.
Part-of-speech tagging in natural language processing refers to the process of determining the grammatical category of each word in a given sentence, i.e., determining and tagging the part of speech of each word. Common methods include rule-based methods, statistical-model-based methods and deep-learning-based methods, which are not limited herein.
In this embodiment, verbs in the natural sentence may also be extracted, and the extracted entity words and verbs are combined according to the order in the natural sentence to generate a word vector.
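As one possible sketch of step S222, assuming a statistical part-of-speech tagger such as the third-party jieba library is acceptable (the tagger choice and its flag conventions are assumptions):

```python
import jieba.posseg as pseg  # one possible POS tagger; this choice is an assumption

# A sketch of step S222: keep nouns (flags beginning with 'n'), pronouns ('r')
# and time words ('t'), and optionally verbs ('v'), in their order of
# appearance in the natural sentence.
def entity_word_vector(sentence, keep_verbs=False):
    kept = []
    for pair in pseg.cut(sentence):
        if pair.flag.startswith(("n", "r", "t")) or (keep_verbs and pair.flag.startswith("v")):
            kept.append(pair.word)
    return kept
```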
And S223, generating a hierarchical candidate answer set according to the entity word vectors of each of the plurality of natural sentences and the plurality of natural sentences.
The candidate answer set comprises two layers of answer sets, wherein the first layer of answer set is a natural sentence obtained by dividing the text information, and the second layer of answer set is an entity word vector of each natural sentence.
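A hedged sketch of step S223 follows, with an assumed numbering scheme so that later steps can match candidate answers by serial number:

```python
# A sketch of step S223: build a two-layer hierarchical candidate answer set.
# The dictionary fields and the serial numbering are illustrative assumptions.
def build_hierarchical_answers(sentences, entity_vectors):
    answers, number = [], 0
    for sentence, entities in zip(sentences, entity_vectors):
        number += 1
        answers.append({"number": number, "level": 1, "answer": sentence})  # first-layer answer
        for word in entities:
            number += 1
            answers.append({"number": number, "level": 2, "answer": word})  # second-layer answer
    return answers
```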
Fig. 6 is a schematic flowchart of a process of generating a hierarchical question-answer pair set according to an embodiment of the present application, and describes one possible implementation manner of generating a hierarchical question-answer pair set after obtaining a plurality of question-answer pairs. As shown in fig. 6, the question-answer pair generating method further includes:
And S261, for each candidate answer in the hierarchical candidate answer set, searching whether the candidate answer exists in the plurality of question-answer pairs.
As can be seen from the embodiment shown in fig. 3, for each question in the question set, a semantic similarity comparison is performed between the candidate answer corresponding to the question and the predicted answer corresponding to the question, and the question and its corresponding candidate answer are combined into a question-answer pair only when the semantic similarity is greater than or equal to the preset threshold.
Therefore, the number of candidate answers covered by the plurality of question-answer pairs is not greater than the number of candidate answers in the candidate answer set.
Whether a target question corresponding to a candidate answer exists in the plurality of question-answer pairs can be searched by matching the serial number of the candidate answer recorded in the question-answer pairs against the serial number of the candidate answer in the hierarchical candidate answer set.
And S262, if the candidate answers do not exist in the multiple question-answer pairs, deleting the candidate answers from the hierarchical candidate answer set.
After the candidate answer and the number of the candidate answer are deleted, the numbers of other candidate answers in the hierarchical candidate answer set are kept unchanged.
For example, for the candidate answer S2, if the number 2 is not included among the candidate answer numbers of the plurality of question-answer pairs, the candidate answer S2 is deleted from the hierarchical candidate answer set.
And S263, if the candidate answer exists in the plurality of question-answer pairs, adding the question corresponding to the candidate answer to the position of the candidate answer in the hierarchical candidate answer set to generate a hierarchical question-answer pair set.
In this embodiment, the candidate answers may be examined one by one according to their numbers in the hierarchical candidate answer set, with the deletion or addition action performed according to each examination result, until the last candidate answer in the hierarchical candidate answer set has been processed, finally obtaining the hierarchical question-answer pair set.
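A minimal sketch of steps S261 to S263, assuming each question-answer pair records the serial number of its candidate answer (the field names are hypothetical):

```python
# A sketch of steps S261-S263 over the structure assumed in the previous sketch.
def build_hierarchical_qa_set(hierarchical_answers, qa_pairs):
    question_by_number = {qa["number"]: qa["question"] for qa in qa_pairs}
    qa_set = []
    for item in hierarchical_answers:
        question = question_by_number.get(item["number"])
        if question is None:
            continue                                   # S262: delete the candidate answer
        qa_set.append({**item, "question": question})  # S263: add the question at its position
    return qa_set
```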
In the embodiment of the application, by constructing a hierarchical candidate answer set, a hierarchical question-answer pair set corresponding to the hierarchical candidate answer set is generated, and none of the question-answer pairs in the hierarchical question-answer pair set contains unclear semantics or grammatical errors. The hierarchical question-answer pairs conform to the question-and-answer habits of reading comprehension teaching, can assist manual teaching, automatically generate a hierarchical initial draft of reading comprehension questions, and interact with users.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Based on the question-answer pair generation method provided by the above embodiment, the embodiment of the present invention further provides an embodiment of an apparatus for implementing the above embodiment of the method.
Fig. 7 is a schematic structural diagram of a language model training device according to an embodiment of the present application. The included units are used for executing steps in the embodiments corresponding to fig. 1 and fig. 2, and refer to the related descriptions in the embodiments corresponding to fig. 1 and fig. 2. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 7, the language model training device 30 includes a first obtaining module 301, a first generating module 302, and a processing module 303.
The first obtaining module 301 is configured to obtain a training text.
A first generating module 302, configured to generate a plurality of training text sequences according to a training text; each training text sequence comprises a training text, an answer text contained in the training text and a question text corresponding to the answer text.
And the processing module 303 is configured to take the training text and the answer text in each training text sequence as features, take the question text in each training text sequence as a label, and train the language model based on a supervised learning algorithm to obtain a trained target language model.
Optionally, the processing module 303 taking the training text and the answer text in each training text sequence as features, taking the question text in each training text sequence as a label, and training the language model based on a supervised learning algorithm to obtain a trained target language model may include:
aiming at each training text sequence, coding the training text sequence according to a preset dictionary to generate a multi-dimensional vector corresponding to the training text sequence; the multidimensional vector comprises a first vector corresponding to a training text, a second vector corresponding to an answer text and a third vector corresponding to a question text;
splicing the first vector of the multi-dimensional vector and the second vector of the multi-dimensional vector to generate a fourth vector of the multi-dimensional vector;
taking the fourth vector of the multi-dimensional vector as a feature, taking the third vector of the multi-dimensional vector as a label, and generating a training vector corresponding to the multi-dimensional vector;
and taking a plurality of training vectors as input, and training the language model based on a supervised learning algorithm to obtain a trained target language model.
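For illustration, the construction of a single training vector described above can be sketched as follows; `vocab` is a preset dictionary as in the earlier encoding sketch, and the field names are assumptions:

```python
# A sketch of building one training vector from a training text sequence.
def build_training_vector(text_words, answer_words, question_words, vocab):
    first = [vocab[w] for w in text_words]       # first vector: training text
    second = [vocab[w] for w in answer_words]    # second vector: answer text
    third = [vocab[w] for w in question_words]   # third vector: question text
    fourth = first + second                      # splice first and second -> fourth vector
    return {"feature": fourth, "label": third}   # feature/label training vector
```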
Optionally, the processing module 303 takes a plurality of training vectors as input, and trains the language model based on a supervised learning algorithm to obtain a trained target language model, which may include:
performing mask replacement on the label of each training vector to generate a masked training vector;
inputting the training vector after the mask into a language model for processing to obtain the probability distribution of the predicted value of the label of the training vector on a preset dictionary;
determining the value of a cross entropy function according to the label and probability distribution of the training vector;
when the value of the cross entropy function does not meet the preset condition, updating the parameters of the language model, returning to execute the step of performing mask replacement on the label of the training vector aiming at each training vector and generating the masked training vector until the value of the cross entropy function reaches the preset condition;
and saving the model parameters of the current language model, and generating a trained target language model.
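Purely as a hedged sketch of one such training iteration (the model object, the mask value and the tensor shapes are assumptions; PyTorch is used only as one possible framework):

```python
import torch
import torch.nn.functional as F

# A sketch of one supervised iteration: mask the label positions, run the
# language model, and compute the cross entropy against the original label.
# `model` must return logits of shape (batch, sequence_length, dictionary_size).
def train_step(model, optimizer, feature, label, mask_id):
    inputs = torch.tensor([feature + [mask_id] * len(label)])  # mask replacement of the label
    targets = torch.tensor([label])
    logits = model(inputs)
    label_logits = logits[:, -len(label):, :]                  # the masked positions
    loss = F.cross_entropy(label_logits.reshape(-1, label_logits.size(-1)),
                           targets.reshape(-1))                # value of the cross entropy function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                           # update the language model parameters
    return loss.item()
```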
The language model training device provided by the embodiment of the application trains the language model based on a supervised learning algorithm; the input of the supervised task is the text content and an answer contained in the text content, and the output is the question corresponding to the answer. When the target language model obtained by this training method is used, only the text content and an answer contained in it need to be input to generate the question for that answer, and the output question need not itself appear in the text content, so the method is suitable for application scenarios in which the question corresponding to an answer is obtained by prediction from the answer.
On the other hand, the training text can be determined according to the application field of the target language model, so that the prediction accuracy of the trained target language model is higher.
Fig. 8 is a schematic structural diagram of a question-answer pair generating device according to an embodiment of the present application. The units included in the embodiments are used for executing the steps in the embodiments corresponding to fig. 3 to fig. 6, and refer to the related descriptions in the embodiments corresponding to fig. 3 to fig. 6. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 8, the question-answer pair generating apparatus 40 includes a second acquiring module 401, a second generating module 402, a processing module 403, a predicting module 404, and a comparing module 405.
The second obtaining module 401 is configured to obtain text information to be parsed.
A second generating module 402, configured to generate a candidate answer set according to the text information; the candidate answer set includes at least one candidate answer.
A processing module 403, configured to input the text information and the candidate answer set into a target language model for processing, so as to obtain a question corresponding to each candidate answer in the candidate answer set; generating a question set according to the question corresponding to each candidate answer;
and the predicting module 404 is configured to obtain a predicted answer corresponding to each question in the question set according to the text information and the question set.
The comparing module 405 is configured to, for each question in the question set, perform semantic similarity comparison between the candidate answer corresponding to the question and the predicted answer corresponding to the question, and generate a question-answer pair according to a comparison result.
Optionally, the inputting, by the processing module 403, the text information and the candidate answer set into the target language model for processing, to obtain a question corresponding to each candidate answer in the candidate answer set, where the obtaining includes:
combining the text information and the candidate answers to generate a text sequence to be predicted aiming at each candidate answer in the candidate answer set;
coding a text sequence to be predicted according to a preset dictionary to generate a corresponding array vector to be predicted;
inputting the array vector to be predicted into a target language model, and determining a predicted value corresponding to the array vector to be predicted;
if the predicted value is not matched with the preset termination value, splicing the array vector to be predicted and the predicted value to generate a spliced array vector; taking the spliced array vector as an array vector to be predicted, returning to execute the step of inputting the array vector to be predicted into the target language model and determining a predicted value corresponding to the array vector to be predicted until the current predicted value is matched with a preset termination value;
splicing the current array vector to be predicted and the current predicted value to generate a target array vector;
and decoding the target array vector according to a preset dictionary to generate a question corresponding to the candidate answer.
Optionally, for each question in the question set, the comparing module 405 compares the candidate answer corresponding to the question with the predicted answer corresponding to the question, and generates a question-answer pair according to the comparison result, which may include:
calculating semantic similarity between the candidate answer corresponding to the question and the predicted answer corresponding to the question;
deleting the question and the predicted answer corresponding to the question under the condition that the semantic similarity is smaller than a preset threshold value;
and combining the question and the candidate answer corresponding to the question to generate a question-answer pair under the condition that the semantic similarity is greater than or equal to a preset threshold value.
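As a sketch of this screening, assuming a hypothetical sentence-embedding function `embed` and using cosine similarity as a stand-in for the semantic similarity measure:

```python
import math

# A sketch of the comparison performed by module 405; `embed` and the preset
# threshold value are illustrative assumptions.
def screen_question(question, candidate_answer, predicted_answer, embed, threshold=0.8):
    a, b = embed(candidate_answer), embed(predicted_answer)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    similarity = dot / norm if norm else 0.0
    if similarity >= threshold:
        return {"question": question, "answer": candidate_answer}  # generate a question-answer pair
    return None                                                    # delete question and predicted answer
```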
Optionally, the generating the candidate answer set by the second generating module 402 according to the text information may include:
dividing text information into a plurality of natural sentences;
extracting entity words from the natural sentences aiming at each natural sentence in the plurality of natural sentences to generate entity word vectors of the natural sentences;
and generating a hierarchical candidate answer set according to the entity word vectors of each natural sentence in the plurality of natural sentences.
The question-answer pair generating device further comprises a third generating module, and the third generating module is used for:
searching whether a candidate answer exists in a plurality of question-answer pairs or not aiming at each candidate answer in the hierarchical candidate answer set;
if the candidate answers do not exist in the multiple question-answer pairs, deleting the candidate answers from the hierarchical candidate answer set;
if the candidate answers exist in the plurality of question-answer pairs, adding the questions corresponding to the candidate answers to the corresponding positions of the candidate answers in the hierarchical candidate answer set to generate a hierarchical question-answer pair set.
The question-answer pair generation device provided by the embodiment of the application realizes the screening of the predicted answers by comparing the semantic similarity of the corresponding candidate answers and the predicted answers, further realizes the screening of the questions obtained by reasoning, greatly reduces the problems of unclear semantics and wrong grammar in the questions or the answers in the question-answer pair, and improves the quality of the question-answer pair.
Fig. 9 is a schematic diagram of a language model training device according to an embodiment of the present application. As shown in fig. 9, the language model training apparatus 50 of this embodiment includes: at least one first processor 501, a first memory 502 and a computer program stored in said first memory 502 and executable on said first processor 501. The language model training device further comprises a first communication means 503, wherein the first processor 501, the first memory 502 and the first communication means 503 are connected by a first bus 504.
The first processor 501, when executing the computer program, implements the steps in the above-described language model training method embodiments, such as steps S11 to S13 in the embodiment shown in fig. 1. Alternatively, the first processor 501, when executing the computer program, implements the functions of the modules/units in the above-described device embodiments, such as the functions of the modules 301 to 303 shown in fig. 7.
Those skilled in the art will appreciate that FIG. 9 is merely an example of a language model training device and is not intended to be limiting and may include more or fewer components than shown, or some components in combination, or different components such as input output devices, network access devices, buses, etc.
Fig. 10 is a schematic diagram of a question-answer pair generating device provided in an embodiment of the present application. As shown in fig. 10, the question-answer pair generating device 60 of this embodiment includes: at least one second processor 601, a second memory 602, and a computer program stored in said second memory 602 and executable on said second processor 601. The question-answer pair generating device further comprises a second communication means 603, wherein the second processor 601, the second memory 602 and the second communication means 603 are connected by a second bus 604.
The second processor 601, when executing the computer program, implements the steps in the above-described respective embodiments of the question-answer pair generation method, such as steps S21 to S25 in the embodiment shown in fig. 3. Alternatively, the second processor 601, when executing the computer program, implements the functions of the modules/units in the above-described device embodiments, for example, the functions of the modules 401 to 405 shown in fig. 8.
Those skilled in the art will appreciate that fig. 10 is merely an example of a question and answer pair generating device and does not constitute a limitation of a question and answer pair generating device and may include more or fewer components than shown, or combine certain components, or different components, such as input output devices, network access devices, buses, etc.
The first processor or the second processor referred to in the embodiments of fig. 9 and fig. 10 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application further provide a computer program product which, when run on a mobile terminal, enables the mobile terminal to implement the steps in the above method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (11)

1. A method for training a language model, comprising:
acquiring a training text;
generating a plurality of training text sequences according to the training texts; wherein each training text sequence comprises the training text, an answer text contained in the training text and a question text corresponding to the answer text;
training a language model based on a supervised learning algorithm by taking the training text and the answer text in each training text sequence as features and taking the question text in each training text sequence as a label to obtain a trained target language model.
2. The method for training a language model according to claim 1, wherein the training text and the answer text in each training text sequence are used as features, the question text in each training text sequence is used as a label, and the language model is trained based on a supervised learning algorithm to obtain the trained target language model, comprising:
aiming at each training text sequence, coding the training text sequence according to a preset dictionary to generate a multi-dimensional vector corresponding to the training text sequence; the multidimensional vector comprises a first vector corresponding to a training text, a second vector corresponding to an answer text and a third vector corresponding to a question text;
splicing the first vector of the multi-dimensional vector and the second vector of the multi-dimensional vector to generate a fourth vector of the multi-dimensional vector;
taking a fourth vector of the multi-dimensional vector as a feature, taking a third vector of the multi-dimensional vector as a label, and generating a training vector corresponding to the multi-dimensional vector;
and taking a plurality of training vectors as input, and training a language model based on a supervised learning algorithm to obtain a trained target language model.
3. The method for training a language model according to claim 2, wherein the training a language model based on a supervised learning algorithm using a plurality of the training vectors as input to obtain a trained target language model comprises:
performing mask replacement on the label of each training vector to generate a masked training vector;
inputting the training vector after the mask to the language model for processing to obtain the probability distribution of the predicted value of the label of the training vector on the preset dictionary;
determining the value of a cross entropy function according to the label of the training vector and the probability distribution;
when the value of the cross entropy function does not meet a preset condition, updating the parameters of the language model, returning to execute the step of performing mask replacement on the label of the training vector aiming at each training vector and generating a masked training vector until the value of the cross entropy function reaches the preset condition;
and saving the model parameters of the current language model, and generating a trained target language model.
4. A method for generating a question-answer pair, comprising:
acquiring text information to be analyzed;
generating a candidate answer set according to the text information; the candidate answer set comprises at least one candidate answer;
inputting the text information and the candidate answer set into a target language model for processing to obtain questions corresponding to each candidate answer in the candidate answer set respectively, and generating a question set according to the questions corresponding to each candidate answer;
according to the text information and the question set, obtaining a predicted answer corresponding to each question in the question set respectively;
and aiming at each question in the question set, performing semantic similarity comparison on the candidate answer corresponding to the question and the predicted answer corresponding to the question, and generating a question-answer pair according to the comparison result.
5. The question-answer pair generating method according to claim 4, wherein the inputting the text information and the candidate answer set into a target language model for processing to obtain a question corresponding to each candidate answer in the candidate answer set respectively comprises:
for each candidate answer in the candidate answer set, combining the text information and the candidate answer to generate a text sequence to be predicted;
coding the text sequence to be predicted to generate a corresponding array vector to be predicted;
inputting the array vector to be predicted into a target language model, and determining a predicted value corresponding to the array vector to be predicted;
if the predicted value is not matched with a preset termination value, splicing the array vector to be predicted and the predicted value to generate a spliced array vector; taking the spliced array vector as an array vector to be predicted, returning to execute the step of inputting the array vector to be predicted into a target language model and determining a predicted value corresponding to the array vector to be predicted until the current predicted value is matched with the preset termination value;
splicing the current array vector to be predicted and the current predicted value to generate a target array vector;
and decoding the target array vector to generate a question corresponding to the candidate answer.
6. The question-answer pair generating method according to claim 4 or 5, wherein for each question in the question set, comparing semantic similarity between a candidate answer corresponding to the question and a predicted answer corresponding to the question, and generating a question-answer pair according to a result of the comparison, comprises:
calculating and obtaining semantic similarity between a candidate answer corresponding to the question and a predicted answer corresponding to the question for each question in the question set;
deleting the question and the predicted answer corresponding to the question under the condition that the semantic similarity is smaller than a preset threshold value;
and combining the question and the candidate answer corresponding to the question to generate a question-answer pair under the condition that the semantic similarity is greater than or equal to a preset threshold value.
7. The question-answer pair generating method according to claim 4 or 5, wherein the generating of the candidate answer set based on the text information comprises:
dividing the text information into a plurality of natural sentences;
for each natural sentence in the plurality of natural sentences, extracting entity words from the natural sentence to generate an entity word vector of the natural sentence;
and generating a hierarchical candidate answer set according to the plurality of natural sentences and the entity word vector of each natural sentence in the plurality of natural sentences.
8. The question-answer pair generating method of claim 7, wherein the method further comprises:
for each candidate answer in the hierarchical candidate answer set, searching whether the candidate answer exists in the question-answer pairs;
if the candidate answer does not exist in the question-answer pairs, deleting the candidate answer from the hierarchical candidate answer set;
if the candidate answers exist in the multiple question-answer pairs, adding the questions corresponding to the candidate answers to the corresponding positions of the candidate answers in the hierarchical candidate answer set to generate a hierarchical question-answer pair set.
9. A language model training apparatus, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the steps of the method according to any one of claims 1 to 3 are implemented when the computer program is executed by the processor.
10. A question-answer pair generating device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor implements the steps of the method according to any one of claims 4 to 8 when executing said computer program.
11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 3 or carries out the steps of the method according to any one of claims 4 to 8.
CN202010400998.XA 2020-05-13 2020-05-13 Language model training method, question and answer pair generation method, device and equipment Pending CN113672708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010400998.XA CN113672708A (en) 2020-05-13 2020-05-13 Language model training method, question and answer pair generation method, device and equipment

Publications (1)

Publication Number Publication Date
CN113672708A true CN113672708A (en) 2021-11-19

Family

ID=78536765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010400998.XA Pending CN113672708A (en) 2020-05-13 2020-05-13 Language model training method, question and answer pair generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN113672708A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190340172A1 (en) * 2018-05-03 2019-11-07 Thomson Reuters Global Resources Unlimited Company Systems and methods for generating a contextually and conversationally correct response to a query
CN110852110A (en) * 2018-07-25 2020-02-28 富士通株式会社 Target sentence extraction method, question generation method, and information processing apparatus
CN109657041A (en) * 2018-12-04 2019-04-19 南京理工大学 The problem of based on deep learning automatic generation method
CN109726274A (en) * 2018-12-29 2019-05-07 北京百度网讯科技有限公司 Problem generation method, device and storage medium
CN110263143A (en) * 2019-06-27 2019-09-20 苏州大学 Improve the neurologic problems generation method of correlation
CN110532369A (en) * 2019-09-04 2019-12-03 腾讯科技(深圳)有限公司 A kind of generation method of question and answer pair, device and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Alec Radford et al.: "Improving Language Understanding by Generative Pre-Training", https://www.mikecaptain.com/resources/pdf/GPT-1.pdf, 31 December 2018 (2018-12-31) *
Ying-Hong Chan and Yao-Chung Fan: "A Recurrent BERT-based Model for Question Generation", in Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 154-162 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416936A (en) * 2021-12-27 2022-04-29 北京百度网讯科技有限公司 Answer selection method, answer selection model training method and related equipment
CN114416936B (en) * 2021-12-27 2023-05-26 北京百度网讯科技有限公司 Answer selection method, training method of answer selection model and related equipment
CN114996424A (en) * 2022-06-01 2022-09-02 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning
CN114996424B (en) * 2022-06-01 2023-05-09 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning
CN115080722A (en) * 2022-08-19 2022-09-20 科大讯飞股份有限公司 Question generation method, question generation device, and storage medium
CN115080722B (en) * 2022-08-19 2023-02-17 科大讯飞股份有限公司 Question generation method, question generation device, and storage medium
CN115905500A (en) * 2023-02-07 2023-04-04 北京面壁智能科技有限责任公司 Question-answer pair data generation method and device
CN116842155A (en) * 2023-06-30 2023-10-03 北京百度网讯科技有限公司 Text generation method, training method and device of text generation model
CN116523031A (en) * 2023-07-05 2023-08-01 深圳须弥云图空间科技有限公司 Training method of language generation model, language generation method and electronic equipment
CN117271751A (en) * 2023-11-16 2023-12-22 北京百悟科技有限公司 Interaction method, device, equipment and storage medium
CN117271751B (en) * 2023-11-16 2024-02-13 北京百悟科技有限公司 Interaction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
CN110162749B (en) Information extraction method, information extraction device, computer equipment and computer readable storage medium
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN110727779A (en) Question-answering method and system based on multi-model fusion
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN114757176B (en) Method for acquiring target intention recognition model and intention recognition method
CN114580382A (en) Text error correction method and device
CN113342958B (en) Question-answer matching method, text matching model training method and related equipment
CN116820429B (en) Training method and device of code processing model, electronic equipment and storage medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN110597968A (en) Reply selection method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN111353026A (en) Intelligent law attorney assistant customer service system
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN113128431B (en) Video clip retrieval method, device, medium and electronic equipment
CN113704434A (en) Knowledge base question and answer method, electronic equipment and readable storage medium
CN113705207A (en) Grammar error recognition method and device
CN117009456A (en) Medical query text processing method, device, equipment, medium and electronic product
CN113468311B (en) Knowledge graph-based complex question and answer method, device and storage medium
CN115221306A (en) Automatic response evaluation method and device
CN115221284A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN114638239A (en) Machine translation method and system based on knowledge base
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination