WO2021217935A1 - Method for training question generation model, question generation method, and related device - Google Patents

Method for training question generation model, question generation method, and related device

Info

Publication number
WO2021217935A1
WO2021217935A1 PCT/CN2020/105777 CN2020105777W
Authority
WO
WIPO (PCT)
Prior art keywords
model
text
question
training
answer
Prior art date
Application number
PCT/CN2020/105777
Other languages
French (fr)
Chinese (zh)
Inventor
曹辰捷
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021217935A1 publication Critical patent/WO2021217935A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/951 - Indexing; Web crawling techniques
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a training method of a question generation model, a question generation method and related equipment.
  • Question generation involves machine learning and natural language processing in the field of artificial intelligence, and also relates to smart life in the field of smart cities.
  • Question generation, which studies how to generate natural-language questions, is an important problem in the field of natural language processing.
  • Question generation has a wide range of applications.
  • Machine knowledge bases can use active questioning to build or supplement knowledge bases and expand data sets; in the field of education, question generation can help students ask questions; in the field of dialogue, question generation can start a topic as a cold start, or obtain feedback by asking questions; the application scenarios are very rich.
  • The inventor realizes that existing question generation techniques are usually based on known grammatical rules, using syntax trees to generate questions and filling existing templates with entities from a knowledge base.
  • Such techniques have poor migration capability.
  • A large amount of prior expert knowledge is required for construction or migration; another technique uses deep learning models to generate questions based on pre-labeled answers.
  • This technique requires manual labeling of large amounts of data in advance, which is time-consuming and labor-intensive, and most of the labeled texts are short, which affects question generation. It can be seen that existing question generation techniques have poor question generation performance.
  • An embodiment of the present application provides a method for training a question generation model, which adopts the following technical solution:
  • adjusting the mask matrix during pre-training so that the network in the initial model implements a one-way model, a two-way model, and a sequence-to-sequence model;
  • adjusting the pre-trained language model according to the prediction error until the prediction error satisfies the training stop condition, to obtain a question generation model.
  • An embodiment of the present application further provides a question generation method, including:
  • the question generation model is a model obtained by using any one of the above-mentioned training methods of the question generation model.
  • An embodiment of the present application also provides a training device for a question generation model, including:
  • a model training module, configured to pre-train an initial model to obtain a pre-trained language model, and to adjust the mask matrix during pre-training so that the network in the initial model implements a one-way model, a two-way model, and a sequence-to-sequence model;
  • An information acquisition module for acquiring question and answer information from a web page through a web crawler, where the question and answer information includes question text and answer text;
  • An entity extraction module for extracting key entities related to the question text from the answer text;
  • a model setting module configured to set the network in the pre-training language model to a sequence-to-sequence model to obtain a pre-training language model for Chinese text generation;
  • a text input module configured to input the key entity and the answer text into the pre-training language model to obtain the predicted question text output by the pre-training language model;
  • An error determination module configured to determine a prediction error according to the predicted question text and the question text; and
  • a model adjustment module configured to adjust the pre-trained language model according to the prediction error until the prediction error satisfies the training stop condition, to obtain a question generation model.
  • An embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the following steps:
  • adjusting the mask matrix during pre-training so that the network in the initial model implements a one-way model, a two-way model, and a sequence-to-sequence model;
  • adjusting the pre-trained language model according to the prediction error until the prediction error satisfies the training stop condition, to obtain a question generation model.
  • An embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the steps of the above question generation method.
  • Embodiments of the present application also provide a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions, when executed by a processor, implement the following steps:
  • adjusting the mask matrix during pre-training so that the network in the initial model implements a one-way model, a two-way model, and a sequence-to-sequence model;
  • adjusting the pre-trained language model according to the prediction error until the prediction error satisfies the training stop condition, to obtain a question generation model.
  • Embodiments of the present application also provide a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions, when executed by a processor, implement the steps of the above question generation method.
  • The embodiments of the training method of the question generation model of the present application mainly have the following beneficial effects: the network in the initial model implements three language models by adjusting the mask matrix, so that the initial model is comprehensively pre-trained to obtain a pre-trained language model that can both understand and generate natural language; through web crawlers, a large amount of question and answer information can be obtained from web pages for model training.
  • The question and answer information includes question text and answer text, and key entities related to the question text are automatically extracted from the answer text without relying on large-scale manual annotation, which improves the efficiency of obtaining key entities and thereby the efficiency of model training; the network in the pre-trained language model is adjusted to a sequence-to-sequence model, so that the pre-trained language model is oriented to text generation tasks and has good text generation capability; the key entities and the answer text are input into the pre-trained language model to obtain the predicted question text, and the pre-trained language model is adjusted according to the error between the predicted question text and the real question text to obtain the question generation model.
  • The question generation model is obtained by fine-tuning the pre-trained language model according to the downstream task, which ensures the quality of the generated questions and thereby improves question generation performance.
  • Figure 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Fig. 2 is a flowchart of an embodiment of a training method for a question generation model according to the present application
  • FIG. 3 is a flowchart of a specific implementation of step 201 in FIG. 2;
  • FIG. 4 is a flowchart of a specific implementation of step 203 in FIG. 2;
  • FIG. 5 is a flowchart of a specific implementation of step 205 in FIG. 2;
  • Fig. 6 is a flowchart of an embodiment of the question generation method according to the present application.
  • FIG. 7 is a flowchart of a specific implementation of step 302 in FIG. 6;
  • Fig. 8 is a schematic structural diagram of an embodiment of a training device for a question generation model according to the present application.
  • Fig. 9 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications can be installed on the terminal devices 101, 102, 103.
  • The terminal devices 101, 102, and 103 may be various electronic devices that have display screens and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services.
  • The training method of the question generation model provided by the embodiments of the present application is generally executed by the server, and accordingly, the training device of the question generation model is generally provided in the server.
  • terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • The training method of the question generation model includes the following steps:
  • Step 201 Pre-train the initial model to obtain a pre-trained language model, and adjust the mask matrix in the pre-training to implement a one-way model, a two-way model, and a sequence-to-sequence model for the network in the initial model.
  • The electronic device (for example, the server shown in FIG. 1) on which the training method of the question generation model runs can communicate with the terminal through various wired or wireless connection methods.
  • the initial model may be a model that has not been pre-trained.
  • the mask matrix can be the mask matrix of the network in the initial model, which is used to control the context information used in training;
  • The one-way model is a one-way LM, the two-way model is a two-way LM, and the sequence-to-sequence model is a seq2seq LM.
  • the server first obtains the pre-built initial model, and pre-trains the initial model.
  • The server sets the initial model to three different language models by adjusting the mask matrix of the network in the initial model, including the one-way model, the two-way model, and the sequence-to-sequence model, so as to enrich the pre-training objectives; in this way, a pre-trained language model that can both understand natural language and generate natural language is obtained.
  • Step 202 Obtain question and answer information from a web page through a web crawler, where the question and answer information includes question text and answer text.
  • the user can configure the web crawler at the terminal, and the terminal generates an information acquisition instruction according to the crawler configuration information input by the user, and sends the information acquisition instruction to the server.
  • the configured web crawler is used to crawl information from the World Wide Web.
  • the crawler configuration information may include the URL of the page, the storage address of the information, and so on.
  • After the server receives the information acquisition instruction, it extracts the crawler configuration information from the instruction and generates a web crawler according to the crawler configuration information.
  • the server runs the generated web crawler, the web crawler crawls the question and answer information from the web page, and the server saves the question and answer information crawled by the web crawler into the database.
  • the question and answer information may be composed of question text and answer text corresponding to the question text.
  • the web crawler may be a Scrapy-based web crawler.
  • Scrapy is a fast, high-level screen scraping and web crawling framework developed in Python, used to crawl web sites and extract structured data from pages.
  • A Scrapy-based web crawler can crawl a large amount of question and answer information from public question and answer community websites such as Zhihu and Baidu Zhidao, and store the crawled question and answer information as JSON files in the server's database.
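  • As an illustration of this step, the following is a minimal Scrapy spider sketch for collecting question and answer pairs; the start URL and CSS selectors are placeholders that would have to be adapted to the actual question and answer community pages being crawled.

```python
# Minimal Scrapy spider sketch: crawl question pages and emit one item per
# question, containing the question text and the list of sub-answer texts.
import scrapy


class QASpider(scrapy.Spider):
    name = "qa_spider"
    start_urls = ["https://example.com/questions"]  # hypothetical listing page

    def parse(self, response):
        # Follow each question link found on the listing page.
        for href in response.css("a.question-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_question)

    def parse_question(self, response):
        # One question may have several answers; each becomes a sub-answer text.
        yield {
            "question_text": response.css("h1.question-title::text").get(default="").strip(),
            "answer_texts": [
                " ".join(ans.css("::text").getall()).strip()
                for ans in response.css("div.answer-content")
            ],
        }
```

  • Running the spider with `scrapy runspider qa_spider.py -o qa_info.json` stores the crawled question and answer information as a JSON file, matching the storage format described above.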
  • A question on a web page has at least one answer, and at least one sub-answer text is obtained after crawling the at least one answer; the at least one sub-answer text corresponding to one question text constitutes the answer text corresponding to that question text.
  • The step of obtaining question and answer information from a web page through a web crawler specifically includes: receiving a target text; splitting the target text to obtain several sentences; generating the same number of web crawlers as the number of sentences; embedding the sentences into the web crawlers respectively; and running each web crawler to obtain the question and answer information that each web crawler crawls from web pages according to its embedded sentence.
  • the target text may be text that instructs the web crawler to crawl the question and answer information.
  • the server receives the target text sent by the user through the terminal, and performs sentence-level disassembly of the target text according to punctuation to obtain several sentences.
  • the server generates the same number of web crawlers as the sentences obtained by the split, and embeds the sentences obtained by the split into the code layer of each web crawler.
  • The server runs the web crawlers after the sentences are embedded, and each web crawler crawls the question and answer information related to its embedded sentence from web pages through columnar crawling.
  • After the target text is received, it is split into several sentences, and the sentences are embedded into different web crawlers; after the web crawlers are run, question and answer information related to the embedded sentences can be crawled.
  • Step 203 Extract key entities related to the question text from the answer text.
  • the key entity can be an entity in the answer text, and the key entity is related to the question text.
  • The server performs word segmentation on the question text and the answer text respectively, obtaining multiple entities from each.
  • The server recognizes the part of speech of each entity and filters the entities with preset parts of speech, which may be verbs and nouns.
  • The server performs exact matching and fuzzy matching between the entities selected from the question text and those in the answer text, and uses the matched entities in the answer text as the key entities.
  • the answer text includes at least one sub-answer text; the server respectively extracts key entities related to the question text from the sub-answer texts, and associates the sub-answer texts with the key entities extracted from the sub-answer texts.
  • Before the step of extracting key entities related to the question text from the answer text, the method further includes: matching the question and answer information through regular expressions to obtain character strings to be cleaned; and deleting the matched character strings to be cleaned, so as to perform data cleaning on the question and answer information.
  • A character string to be cleaned may be a meaningless character string in the question and answer information.
  • The server matches the question and answer information against preset regular expressions to obtain the character strings to be cleaned in the question and answer information, and deletes the matched character strings to clean the question and answer information.
  • Regular expressions are pre-configured, and a regular expression can correspond to a meaningless string.
  • When the Q&A information is crawled from Zhihu, it may include hyperlinks, dividing lines, and invalid characters, as well as content unrelated to the main body of the text, such as the "Source:" and "Author:" headers of Zhihu columns.
  • When the question and answer information is crawled from Baidu, it may include a large number of meaningless characters.
  • the server can delete the above meaningless content through regular expressions.
  • the question and answer information is matched by a regular expression to obtain the character string to be cleaned, and the matched character string to be cleaned is deleted, so as to realize the data cleaning of the question and answer information and increase the proportion of effective content in the question and answer information.
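  • The following is a sketch of this cleaning step; each pattern targets one kind of meaningless string, and the exact patterns are illustrative assumptions rather than the ones used in the application.

```python
# Regular-expression cleaning sketch: delete hyperlinks, dividing lines,
# "Source:" / "Author:" headers and invalid characters from Q&A text.
import re

CLEANING_PATTERNS = [
    re.compile(r"https?://\S+"),           # hyperlinks
    re.compile(r"[-=]{3,}"),               # dividing lines
    re.compile(r"(来源|作者)[:：]\S*"),     # "Source:" / "Author:" column headers
    re.compile(r"[\u200b\ufeff\xa0]+"),    # zero-width and other invalid characters
]


def clean_qa_text(text: str) -> str:
    """Delete every matched character string to be cleaned from a piece of Q&A text."""
    for pattern in CLEANING_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()


if __name__ == "__main__":
    raw = "作者：xxx 复旦大学位于上海。 https://example.com ----"
    print(clean_qa_text(raw))  # -> 复旦大学位于上海。
```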
  • Step 204 Set the network in the pre-training language model to a sequence-to-sequence model to obtain a pre-training language model for Chinese text generation.
  • The pre-trained language model (Unified pre-trained Language Model, UNILM) is a model that can handle both natural language understanding and natural language generation.
  • The pre-training of the pre-trained language model adopts three unsupervised language model objectives: the one-way model is a one-way LM (including left-to-right and right-to-left), the two-way model is a two-way LM, and the sequence-to-sequence model is a sequence-to-sequence LM (seq2seq LM), where LM stands for language model.
  • the pre-training language model uses a Transformer network with shared parameters, and also uses specific self-attention masks to control the context information used in prediction.
  • the above three LMs are realized by adjusting the mask matrix in the Transformer network.
  • the pre-training language model can be regarded as a one-way encoder, a two-way encoder or a sequence-to-sequence model.
  • The mask matrix in the Transformer network can be adjusted to adapt to different downstream tasks (natural language understanding tasks and generation tasks).
  • Seq2seq is an Encoder-Decoder structure model with good text generation performance; the input of seq2seq is a sequence, and the output is also a sequence.
  • the Encoder turns a variable-length input sequence into a fixed-length vector, and the Decoder decodes the fixed-length vector into a variable-length output sequence.
  • the server obtains a pre-trained language model, and the pre-trained language model is used for Chinese processing, can be used for natural language understanding, and can also be used for text generation.
  • The pre-trained language model needs to be fine-tuned into a model for question generation, so the mask matrix of the Transformer network in the pre-trained language model is set so as to implement the sequence-to-sequence model, that is, the seq2seq LM.
  • In the mask matrix of the seq2seq LM, the matrix elements for the source segment are all 0, which means that both the preceding and the following information can be used; in the part of the matrix for the target segment, the upper-right elements are negative infinity, which means that only the preceding information can be used.
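  • The sketch below illustrates how such self-attention masks can be built, assuming the common additive convention in which allowed positions are 0 and blocked positions are negative infinity (added to the attention scores before softmax); it illustrates the masking idea rather than the application's actual implementation.

```python
# Attention-mask sketches for the three language-model objectives.
import torch


def seq2seq_attention_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    """Source (answer + key entity) tokens see the whole source; target
    (question) tokens see the source plus earlier target tokens only."""
    total = src_len + tgt_len
    mask = torch.zeros(total, total)
    # Source tokens may not look at the target segment at all.
    mask[:src_len, src_len:] = float("-inf")
    # Upper-right triangle of the target block is masked (no peeking ahead).
    mask[src_len:, src_len:] = torch.triu(
        torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
    return mask


def bidirectional_attention_mask(length: int) -> torch.Tensor:
    # Two-way LM: every token sees every other token.
    return torch.zeros(length, length)


def left_to_right_attention_mask(length: int) -> torch.Tensor:
    # One-way LM (left to right): only preceding tokens are visible.
    return torch.triu(torch.full((length, length), float("-inf")), diagonal=1)


print(seq2seq_attention_mask(src_len=3, tgt_len=2))
```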
  • Step 205 Input the key entity and the answer text into the pre-training language model to obtain the predicted question text output by the pre-training language model.
  • the predictive question text may be a question text related to the answer text generated by the pre-training language model according to the key entity and the answer text.
  • the server fine-tunes the pre-training language model according to key entities, question text, and answer text.
  • The pre-trained language model converts the key entities and the answer text into vectors, processes the vectors, and outputs the predicted question text.
  • The pre-trained language model splits the key entities and the answer text into individual characters, converts each character into a vector according to the character conversion table, and processes the vectors.
  • the character conversion table is created in advance, and the correspondence between words and vectors is determined.
  • When converting a character, the server looks the character up in the character conversion table and uses the vector corresponding to the found character as the converted vector.
  • Step 206 Determine the prediction error according to the prediction question text and the question text.
  • the above-mentioned prediction question text can also be stored in a node of a blockchain.
  • the question text in the question and answer information is the target output of the pre-trained language model.
  • the server obtains the prediction question text output by the pre-training language model and the question text in the Q&A information, and calculates the prediction error according to the preset error formula.
  • The calculation formula of the prediction error is the softmax cross-entropy loss:
  • softmaxLoss = -(1/N) * Σ_{i=1..N} log( exp(logits_i[y_i]) / Σ_{j=1..M} exp(logits_i[j]) )
  • where y_i is the identifier of the i-th character of the question text when it is converted into a vector according to the character conversion table; logits_i is the score vector of the i-th character of the predicted question text over the character conversion table (of size M); N is the number of characters in the question text; and softmaxLoss is the prediction error between the predicted question text and the question text.
  • each word can be regarded as a token, and each token has a unique identifier in the character conversion table, that is, the identifier token_id.
  • For example, when the size of the character conversion table is 20000, that is, the character conversion table records the conversion relationship between 20000 characters and vectors, the range of token_id is 0 to 19999.
  • The goal of the pre-trained language model is to obtain the token_id sequence of the predicted question text.
  • the question text contains N (N is a positive integer) words.
  • The pre-trained language model encodes the answer text and the key entities to obtain N hidden states H, where each H corresponds to one character of the predicted question text to be generated.
  • The pre-trained language model calculates the score logits of H against each character in the character conversion table; the score logits can be understood as the similarity between H and each character in the character table, and the character with the highest score is selected as the character corresponding to H.
  • y_i is the identifier token_id of the i-th character of the crawled question text, and logits_i is the score of the i-th character of the predicted question text; the prediction error is obtained by calculating the cross-entropy between them.
  • the prediction error can be accurately measured by the error formula, which ensures that the pre-training language model can be accurately adjusted according to the error.
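  • A minimal sketch of this computation follows, assuming the standard softmax cross-entropy form given above; the tensor shapes and random inputs are illustrative only.

```python
# Prediction-error sketch: cross-entropy between the predicted scores and the
# token_ids of the crawled question text.
import torch
import torch.nn.functional as F

N, M = 8, 20000                          # question length, character table size
logits = torch.randn(N, M)               # logits_i: one score per table entry and character
target_ids = torch.randint(0, M, (N,))   # y_i: token_ids of the real question text

# Equivalent to -(1/N) * sum_i log softmax(logits_i)[y_i]
prediction_error = F.cross_entropy(logits, target_ids)
print(prediction_error.item())
```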
  • Step 207 Adjust the pre-trained language model according to the prediction error until the prediction error meets the training stop condition, to obtain the question generation model.
  • the training stop condition is a condition for stopping model training, and the training stop condition may be that the prediction error is less than a predetermined error threshold.
  • the terminal obtains a predetermined error threshold, and compares the prediction error with the error threshold.
  • the terminal adjusts the model parameters in the pre-training language model in the direction of reducing the prediction error.
  • The key entities and the answer text are reprocessed to obtain the predicted question text, the prediction error is obtained from the predicted question text and the question text, and the prediction error is compared with the error threshold; if the prediction error is still greater than or equal to the error threshold, the model is adjusted again, and this loop iterates until the prediction error is less than the error threshold, at which point training stops and the pre-trained language model at the time training stops is used as the question generation model.
  • During back-propagation, the output of the current layer and the gradient propagated back from the following layer are required, so the output of each layer is normally stored in video memory.
  • For a 24-layer network, the outputs of all 24 layers would need to be saved, which occupies a large amount of video memory; for this reason, only the outputs of some of the layers can be saved.
  • When back-propagation needs to update the model parameters, the output of the current layer can be recomputed from the saved output of a nearby layer, thereby saving video memory resources and reducing the hardware requirements for model training.
  • For example, the Transformer network has 24 layers, and only the outputs of layers 1, 7, 13, 19, and 24 are saved.
  • The outputs of layers 2 to 6 are recomputed from the output of layer 1, the outputs of layers 8 to 12 from the output of layer 7, the outputs of layers 14 to 18 from the output of layer 13, and the outputs of layers 20 to 23 from the output of layer 19.
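  • This memory-saving scheme corresponds to gradient checkpointing; a minimal PyTorch sketch follows, in which a stack of linear layers stands in for the 24 Transformer layers and the four checkpoint segments play the role of the saved anchor layers.

```python
# Gradient-checkpointing sketch: keep only segment-boundary activations and
# recompute the intermediate layer outputs during the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

layers = nn.Sequential(*[nn.Linear(768, 768) for _ in range(24)])  # stand-in for 24 Transformer layers
x = torch.randn(4, 768, requires_grad=True)

# Splitting the 24 layers into 4 segments keeps roughly one activation per
# segment boundary (comparable to saving layers 1, 7, 13, 19, 24) and
# recomputes the rest when gradients are needed.
out = checkpoint_sequential(layers, 4, x)
out.sum().backward()
print(x.grad.shape)
```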
  • the network in the initial model is adjusted to realize three language models by adjusting the mask matrix, so as to perform all-round pre-training on the initial model to obtain a pre-trained language model that can understand natural language and generate natural language;
  • a large amount of question and answer information can be obtained from web pages for model training.
  • The question and answer information includes question text and answer text, and key entities related to the question text are automatically extracted from the answer text without relying on large-scale manual labeling, which improves the efficiency of obtaining key entities and thereby the efficiency of model training;
  • the network in the pre-training language model is adjusted to a sequence-to-sequence model, so that the pre-training language model is oriented to text generative tasks and has good text generation capabilities;
  • The key entities and the answer text are input into the pre-trained language model to obtain the predicted question text.
  • The pre-trained language model is adjusted according to the error between the predicted question text and the real question text to obtain the question generation model.
  • The question generation model is obtained by fine-tuning the pre-trained language model according to the downstream task, which ensures the quality of the generated questions and thereby improves question generation performance.
  • step 201 specifically includes:
  • Step 2011 Obtain an initial model for pre-training and multiple sets of pre-training samples.
  • the pre-training sample set may be a data set used to train the initial model.
  • the built initial model and multiple sets of pre-training sample sets for pre-training the initial model are pre-stored in the server.
  • the server obtains the initial model and the pre-training sample set, and needs to pre-train the initial model to obtain the pre-trained language model.
  • Step 2012 Randomly generate the mask identifier corresponding to each group of pre-training sample sets, where the mask matrices corresponding to the mask identifiers implement a one-way model, a two-way model, and a sequence-to-sequence model.
  • The mask identifier may be the identifier of the mask matrix of the network in the model.
  • the initial model constructed is a Transformer network, and the Transformer network can be 12 layers or 24 layers.
  • Pre-training uses three unsupervised language model targets: one-way LM (including left-to-right and right-to-left), two-way LM and seq2seq LM.
  • The server randomly generates the mask identifier of each training sample set; the mask identifier corresponds to a mask matrix, and the server sets the Transformer network to a different LM according to that mask matrix. Randomly generating the mask identifier of each group of training sample sets realizes roughly equal pre-training of the different LMs.
  • The model parameters in the initial model are half precision, and before the step of randomly generating the mask identifiers corresponding to each group of pre-training sample sets, the method further includes: setting the model parameters of the layernorm layer and the embedding layer in the initial model to single precision.
  • Half precision, or a half-precision floating-point number (FP16), is a binary floating-point data type used by computers.
  • Half-precision floating-point numbers use 2 bytes (16 bits) for storage; and single-precision floating-point numbers (FP32) occupy 4 bytes (32 bits) of storage space.
  • Model training places high demands on the computer's hardware resources, and the training time is long.
  • Therefore, the model parameters in the initial model are set to half precision;
  • the model parameters of the embedding layer in the initial model are set to single precision;
  • and the model parameters of the layernorm layer in the initial model are set to single precision.
  • Setting the model parameters in the initial model to half precision while keeping the model parameters of the layernorm layer and the embedding layer at single precision improves both the speed and the accuracy of model training.
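  • A sketch of this mixed-precision setup is shown below; the tiny module stack is only a stand-in for the Transformer-based initial model, and the exact layer sizes are illustrative.

```python
# Mixed-precision sketch: cast the model to FP16, then restore the LayerNorm
# and embedding parameters to FP32 for numerical stability.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Embedding(20000, 768),
    nn.Linear(768, 768),
    nn.LayerNorm(768),
).half()  # all parameters become FP16

for module in model.modules():
    if isinstance(module, (nn.LayerNorm, nn.Embedding)):
        module.float()  # back to FP32

for name, param in model.named_parameters():
    print(name, param.dtype)
```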
  • Step 2013 Input each group of pre-training sample sets into the initial model respectively, and adjust the mask matrix of the network in the initial model according to the mask identifier corresponding to that pre-training sample set.
  • the server sequentially inputs the pre-training sample set into the initial model. After inputting a set of pre-training sample sets, the server adjusts the mask matrix of the Transformer network in the initial model according to the mask identifier corresponding to the pre-training sample set, thereby setting the Transformer network to one-way LM, two-way LM or seq2seq LM.
  • Step 2014 Sequentially pre-train the initial model adjusted by the mask matrix according to the input pre-training sample sets to obtain the pre-trained language model.
  • After the server adjusts the mask matrix, it pre-trains the initial model according to the pre-training sample set; when training on one set of pre-training samples is completed, the next set of pre-training samples is input, the mask matrix is adjusted, and the next round of pre-training proceeds. After all the pre-training sample sets have been used for training, the server obtains the pre-trained language model.
  • the Transformer network randomly switches between one-way LM (including left-to-right and right-to-left), two-way LM, and seq2seq LM. Each layer in the Transformer network shares model parameters in multiple rounds of pre-training.
  • The mask identifier of each pre-training sample set is randomly generated, and the mask matrix in the initial model is adjusted according to the mask identifier, so that the initial model can complete the pre-training objectives of the three language models evenly, which ensures the soundness of pre-training.
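  • A small sketch of this random assignment follows; the identifier names are illustrative, and the mask each identifier selects would be built as in the attention-mask sketch above.

```python
# Randomly assign one mask identifier per group of pre-training samples so that
# the one-way, two-way, and seq2seq objectives are used with similar frequency.
import random

MASK_IDS = ["left_to_right", "right_to_left", "bidirectional", "seq2seq"]


def assign_mask_identifiers(num_sample_sets: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    return [rng.choice(MASK_IDS) for _ in range(num_sample_sets)]


print(assign_mask_identifiers(6))  # one identifier per pre-training sample set
```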
  • the above step 203 may include:
  • Step 2031 Extract text entities from the question text and the answer text in the question and answer information respectively.
  • the text entity can be an entity in the question text and the answer text.
  • the server may segment the question text and the answer text to obtain multiple entities.
  • the server can use pkuseg for word segmentation to segment the question text and the answer text in word units.
  • pkuseg is an open source Chinese word segmentation toolkit released by Peking University, which has a high accuracy rate of word segmentation.
  • First, stop words are removed; stop words are words that have no obvious meaning and can be deleted, such as common Chinese function words (for example, "的" and "了"). Then, entities whose parts of speech are verbs and nouns are extracted as text entities.
  • Step 2032 Calculate the similarity between each text entity in the answer text and each text entity in the question text.
  • The text entities in the answer text form the first data set, the text entities in the question text form the second data set, and the server calculates the similarity between each entity in the first data set and each entity in the second data set.
  • the server can calculate the similarity through exact matching and fuzzy matching, and the similarity between text entities that can be accurately matched is 100%.
  • The server can convert the text entities into vectors and calculate the cosine similarity between the vectors; or it can calculate the text edit distance between text entities (also known as the Levenshtein distance, the minimum number of edit operations required to convert one string into another, where the operations are insertion, deletion, and replacement). The shorter the edit distance, the higher the similarity.
  • Step 2033 Extract text entities whose similarity meets a preset similarity threshold from each text entity of the answer text as key entities.
  • Assuming the answer text yields M text entities and the question text yields N text entities, M*N similarities are calculated.
  • the server obtains the preset similarity threshold, and selects the similarity whose similarity value is greater than the similarity threshold from the M*N group of similarities.
  • For each selected similarity, of the two text entities corresponding to it, the text entity from the first data set is used as a key entity.
  • the server may also arrange the M*N groups of similarities in descending order, select a preset number of similarities according to the arrangement order, and use the first data set text entity corresponding to the selected similarity as the key entity.
  • For example, the question text is "What is the approximate ranking of Fudan University in China?", which pkuseg divides into {"Fudan University", "in", "domestic", "of", "ranking", "approximately", "how many", "?"}.
  • The stop words (such as "in" and "of") are removed, the verbs and nouns {"Fudan University", "ranking"} are extracted, and the answer text is processed in the same way.
  • the similarity between "Fudan University” and “Fudan” is calculated to meet the similarity threshold, and "Fudan” is taken as the key entity.
  • the extracted key entities are highly related to the question text and the answer text, which can assist the pre-training language model to output the question text.
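  • The following sketch illustrates the extraction pipeline (segmentation with pkuseg, stop-word removal, part-of-speech filtering, and exact plus fuzzy matching); the stop-word list, similarity threshold, and example texts are illustrative assumptions.

```python
# Key-entity extraction sketch: keep answer-text nouns/verbs that match a
# question-text entity exactly or by a Levenshtein-based fuzzy similarity.
import pkuseg

STOP_WORDS = {"的", "了", "是", "在", "吗"}          # illustrative stop-word list
seg = pkuseg.pkuseg(postag=True)                      # returns (word, part-of-speech) pairs


def text_entities(text):
    return [w for w, pos in seg.cut(text)
            if w not in STOP_WORDS and pos.startswith(("n", "v"))]


def edit_distance(a, b):
    # Levenshtein distance via dynamic programming over a single row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]


def similarity(a, b):
    if a == b:                                                 # exact match
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))     # fuzzy match


def key_entities(question, answer, threshold=0.5):
    q_entities, a_entities = text_entities(question), text_entities(answer)
    return [a for a in a_entities
            if any(similarity(a, q) >= threshold for q in q_entities)]


print(key_entities("复旦大学在国内的排名大约是多少?", "复旦在国内综合排名前五。"))
```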
  • the answer text includes at least one sub-answer text.
  • the above step 205 may include:
  • Step 2051 Input at least one sub-answer text and key entities corresponding to the sub-answer text into the pre-training language model to obtain at least one three-dimensional word vector matrix.
  • the answer text corresponding to one question text may be composed of at least one sub-answer text, and each sub-answer text is extracted to obtain a key entity.
  • the server performs batch processing, and at least one sub-answer text corresponding to a question text and key entities corresponding to the sub-answer text are processed as a batch.
  • The server pads the sub-answer texts to the same text length (that is, the same number of characters) by adding zeros, and then converts them into one-hot vectors (also known as "one-hot encoding") according to the character conversion table to obtain a one-hot matrix.
  • The number of sub-answer texts is batch, the padded text length is length, and the number of characters in the character conversion table is M; the three dimensions of the one-hot matrix are, in turn, batch, length, and M, where batch indicates which sub-answer text the one-hot matrix comes from, length is the number of rows of the one-hot matrix, and M is the number of columns of the one-hot matrix.
  • The server needs to convert the one-hot vectors into word vectors: the three-dimensional one-hot matrix is input into the embedding layer of the pre-trained language model, and the M dimension is replaced by the dim dimension to obtain a three-dimensional word vector matrix. Here dim is the feature dimension, which is a uniform constant within a model; for example, dim can be 512, 768, or 1024.
  • Step 2052 Combine the converted three-dimensional word vector matrix into a two-dimensional word vector matrix.
  • the three-dimensional word vector matrices are combined to obtain a larger matrix, that is, the two-dimensional word vector matrix.
  • Merging the matrices eliminates the batch dimension, so that the matrix calculations in the pre-trained language model become operations on a two-dimensional matrix, which improves calculation speed and reduces training time.
  • Step 2053 Process the two-dimensional word vector matrix through the pre-trained language model to obtain the predicted question text output by the pre-trained language model, where the predicted question text is stored in the blockchain.
  • The server processes the two-dimensional word vector matrix through the pre-trained language model to obtain the score logits at each character position of the predicted question text; at each position, the character with the highest score is selected, thereby outputting the predicted question text.
  • the server can also upload the predicted question text to the blockchain for storage to record the training process of the pre-trained language model, while ensuring the privacy and security of the predicted question text.
  • Each sub-answer text and its corresponding key entities are converted into a three-dimensional word vector matrix, the three-dimensional word vector matrices are then merged into a two-dimensional word vector matrix, and the pre-trained language model processes the two-dimensional word vector matrix, which improves the efficiency of outputting the predicted question text.
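  • A shape-level sketch of this batching scheme is given below; the sizes are illustrative, and a linear layer stands in for the embedding layer of the pre-trained language model.

```python
# Batching sketch: (batch, length, M) one-hot matrix -> (batch, length, dim)
# word-vector matrix -> merged (batch*length, dim) two-dimensional matrix.
import torch
import torch.nn.functional as F

batch, length, M, dim = 3, 10, 20000, 768            # sub-answers, padded length, table size, feature dim

token_ids = torch.randint(0, M, (batch, length))      # zero-padded token_id sequences
one_hot = F.one_hot(token_ids, num_classes=M).float() # (batch, length, M)

embedding = torch.nn.Linear(M, dim, bias=False)        # stand-in for the embedding layer
word_vectors = embedding(one_hot)                      # (batch, length, dim)

two_d = word_vectors.reshape(batch * length, dim)      # merged 2-D word-vector matrix
print(one_hot.shape, word_vectors.shape, two_d.shape)
```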
  • a method for generating a question is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:
  • Step 301 Obtain the source text for question generation.
  • the question generation model generates question text based on the input text.
  • the user sends the source text to the server through the terminal, and the question generation model generates the question text based on the source text.
  • the terminal may also send voice data to the server, and the server converts the voice data into text data through voice recognition to obtain the source text.
  • Step 302 Filter several groups of source entities from the source text.
  • the server performs word segmentation on the source text to obtain multiple entities.
  • the server can randomly screen multiple entities to obtain a group of source entities, and can screen several groups of source entities.
  • the server can also filter several groups of source entities according to the instruction information sent by the terminal.
  • Step 303 Input several groups of source entities into the question generation model respectively; wherein, the question generation model is a model obtained by using the training method of the above question generation model.
  • the server inputs the selected groups of source entities into the question generation model, and the question generation model converts the source entities into vectors in units of characters to perform question generation processing.
  • the question generation model is a model obtained using the training method of the above question generation model.
  • When the server generates the question text, it can generate the question text based on the entire source text, or based on several groups of source entities extracted from the source text.
  • Step 304 Obtain several question texts generated by the question generation model based on several groups of source entities.
  • the question generation model is based on a set of source entities to process and generate a set of question texts.
  • When there are several groups of source entities, the server generates the question texts corresponding to the several groups of source entities.
  • the server sends several generated question texts to the terminal, and the user selects the question texts through the terminal for subsequent use.
  • step 302 may include:
  • Step 3021 Identify text entities in the source text.
  • After receiving the source text, the server performs word segmentation on it to obtain multiple entities, recognizes the part of speech of each entity, and uses the entities that meet the preset parts of speech as text entities.
  • the part of speech of the text entity can include nouns, verbs, adjectives, etc.
  • Step 3022 Randomly extract several groups of text entities from the recognized text entities to obtain several groups of source entities.
  • After the server recognizes the text entities, it randomly selects several groups of text entities, and uses each group of text entities as a group of source entities, thereby obtaining multiple groups of source entities.
  • Step 3023 Perform semantic annotation on the text entities in the source text according to a preset semantic knowledge base to obtain a semantic annotation result.
  • a semantic knowledge base is preset in the server.
  • the server recognizes the semantics of each text entity according to the semantic knowledge base, and performs semantic annotation on each text entity to obtain the semantic annotation result.
  • Step 3024 According to the semantic annotation result, filter several text entities that meet the preset semantic range to obtain several groups of source entities.
  • the semantic information expressed by the text entity can be determined according to the semantic annotation result.
  • the server obtains the preset semantic range, filters several text entities whose semantic information meets the preset semantic range, and obtains several sets of source entities.
  • the preset semantic range may come from the instruction information sent by the terminal.
  • the preset semantic range in the instruction information is set to the financial field, and the server filters the text entities belonging to the financial field to obtain the source entity.
  • the text entities in the source text are recognized, and the text entities are extracted randomly or semantically, so as to ensure the flexibility of text entity extraction, thereby ensuring the flexibility of generating the question text.
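  • A short sketch of the random screening variant follows; the part-of-speech set, group size, and group count are illustrative assumptions, and the semantic-range filter described above would replace the random sampling when the instruction information specifies a domain.

```python
# Source-entity screening sketch: segment the source text, keep entities with
# the preset parts of speech, and randomly sample several groups of them.
import random
import pkuseg

seg = pkuseg.pkuseg(postag=True)


def source_entity_groups(source_text, num_groups=3, group_size=2, seed=0):
    entities = [w for w, pos in seg.cut(source_text)
                if pos.startswith(("n", "v", "a"))]   # nouns, verbs, adjectives
    rng = random.Random(seed)
    return [rng.sample(entities, min(group_size, len(entities)))
            for _ in range(num_groups)]


for group in source_entity_groups("复旦大学位于上海，综合排名位居国内前列。"):
    print(group)   # each group is fed to the question generation model separately
```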
  • this application provides an embodiment of a training device for a question generation model.
  • The device embodiment corresponds to the method embodiment shown in FIG. 2.
  • the device can be applied to various electronic devices.
  • The training device 400 for the question generation model described in this embodiment includes: a model training module 401, an information acquisition module 402, an entity extraction module 403, a model setting module 404, a text input module 405, an error determination module 406, and a model adjustment module 407, in which:
  • the model training module 401 is used to pre-train the initial model to obtain the pre-trained language model, and adjust the mask matrix in the pre-training to realize the one-way model, the two-way model and the sequence-to-sequence model of the network in the initial model.
  • the information obtaining module 402 is configured to obtain question and answer information from a web page through a web crawler, and the question and answer information includes question text and answer text.
  • the entity extraction module 403 is used to extract key entities related to the question text from the answer text.
  • the model setting module 404 is used to set the network in the pre-training language model to a sequence-to-sequence model to obtain a pre-training language model for Chinese text generation.
  • the text input module 405 is used to input key entities and answer text into the pre-training language model to obtain the predicted question text output by the pre-training language model.
  • the error determination module 406 is configured to determine the prediction error according to the prediction question text and the question text.
  • The model adjustment module 407 is configured to adjust the pre-trained language model according to the prediction error until the prediction error meets the training stop condition, to obtain the question generation model.
  • In this embodiment, the network in the pre-trained language model is adjusted to a sequence-to-sequence model, so that the pre-trained language model is oriented to text generation tasks and has good text generation capability; the pre-trained language model is then fine-tuned to obtain the question generation model, which ensures the quality of the generated questions.
  • the above-mentioned model training module 401 includes: an acquisition sub-module, an identity generation sub-module, an input sub-module, and a pre-training sub-module, wherein:
  • the acquisition sub-module is used to acquire the initial model used for pre-training and multiple sets of pre-training samples
  • the identification generation sub-module is used to randomly generate the mask identification corresponding to each group of pre-training sample sets; the mask matrix corresponding to the mask identification realizes one-way model, two-way model and sequence-to-sequence model;
  • the input sub-module is used to input each group of pre-training sample sets into the initial model, and adjust the mask matrix of the network in the initial model according to the mask identifier corresponding to the pre-training sample set;
  • the pre-training sub-module is used to sequentially pre-train the initial model adjusted by the mask matrix according to the input pre-training sample set to obtain the pre-training language model.
  • the model parameters in the initial model are half-precision
  • The above model training module 401 further includes a parameter setting sub-module, which is used to set the model parameters of the layernorm layer and the embedding layer in the initial model to single precision.
  • the entity extraction module 403 is further configured to: extract text entities from the question text and answer text in the question and answer information, respectively; calculate each text entity and question text in the answer text The similarity of each text entity in the answer text; from each text entity of the answer text, extract the text entity whose similarity meets the preset similarity threshold as the key entity.
  • the answer text includes at least one sub-answer text
  • The text input module 405 is further configured to: input at least one sub-answer text and the key entities corresponding to the sub-answer text into the pre-trained language model to obtain at least one three-dimensional word vector matrix; merge the converted three-dimensional word vector matrices into a two-dimensional word vector matrix; and process the two-dimensional word vector matrix through the pre-trained language model to obtain the predicted question text output by the pre-trained language model, where the predicted question text is stored in the blockchain.
  • a question generation device including: a source text acquisition module, a source entity extraction module, a source entity input module, and a question generation module, wherein:
  • the source text obtaining module is used to obtain the source text used for question generation.
  • the source entity extraction module is used to filter several groups of source entities from the source text.
  • the source entity input module is used to input several groups of source entities into the question generation model; wherein, the question generation model is a model obtained by using the training method of the above question generation model.
  • the question generation module is used to obtain several question texts generated by the question generation model based on several groups of source entities.
  • the aforementioned source entity extraction module is further used to: identify text entities in the source text; randomly extract several groups of text entities from the recognized text entities to obtain several groups of source entities; Or, perform semantic annotation on the text entities in the source text according to a preset semantic knowledge base to obtain a semantic annotation result; according to the semantic annotation result, filter several text entities that meet the preset semantic range to obtain several groups of source entities.
  • FIG. 9 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 5 includes a memory 51, a processor 52, and a network interface 53 that communicate with each other through a system bus. It should be pointed out that the figure only shows the computer device 5 with the components 51-53, and it is not required to implement all the shown components, and more or fewer components may be implemented instead.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • the memory 51 includes at least one type of computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 51 may be an internal storage unit of the computer device 5, such as a hard disk or a memory of the computer device 5.
  • the memory 51 may also be an external storage device of the computer device 5, for example, a plug-in hard disk equipped on the computer device 5, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, Flash Card, etc.
  • the memory 51 may also include both the internal storage unit of the computer device 5 and its external storage device.
  • the memory 51 is generally used to store an operating system and various application software installed in the computer device 5, such as a training method of a question generation model, or computer readable instructions of a question generation method, and the like.
  • the memory 51 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 52 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 52 is generally used to control the overall operation of the computer device 5.
  • The processor 52 is configured to run the computer-readable instructions or process data stored in the memory 51, for example, to run the computer-readable instructions of the training method of the question generation model or of the question generation method.
  • the network interface 53 may include a wireless network interface or a wired network interface, and the network interface 53 is generally used to establish a communication connection between the computer device 5 and other electronic devices.
  • The computer device provided in this embodiment can execute the steps of the training method of the question generation model described above.
  • the steps of the training method of the question generation model may be the steps in the training method of the question generation model of each of the foregoing embodiments.
  • The network in the pre-trained language model is adjusted to a sequence-to-sequence model, so that the pre-trained language model is oriented to text generation tasks and has good text generation capability; the pre-trained language model is then fine-tuned to obtain the question generation model, which ensures the quality of the generated questions.
  • the computer device provided in this embodiment can execute the steps of the above-mentioned problem generation method.
  • the steps of the question generation method may be the steps in the question generation method of each of the foregoing embodiments.
  • This application also provides another implementation manner, that is, a computer-readable storage medium storing computer-readable instructions for training a question generation model, where the computer-readable instructions for training the question generation model may be executed by at least one processor, so that the at least one processor executes the steps of the training method of the question generation model as described above.
  • The network in the pre-trained language model is adjusted to a sequence-to-sequence model, so that the pre-trained language model is oriented to text generation tasks and has good text generation capability; the pre-trained language model is then fine-tuned to obtain the question generation model, which ensures the quality of the generated questions.
  • This application also provides another implementation manner, that is, a computer-readable storage medium storing computer-readable instructions for question generation, where the computer-readable instructions for question generation may be executed by at least one processor, so that the at least one processor executes the steps of the question generation method as described above.
  • The method of the above embodiments can be implemented by means of software plus a necessary general hardware platform; of course, it can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of this application essentially or the part that contributes to the existing technology can be embodied in the form of a software product, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, The optical disc) includes several instructions to make a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of the present application.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database, a chain of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for training a question generation model, a question generation method, and a related device. The method comprises: pre-training an initial model to obtain a pre-trained language model, and adjusting a mask matrix during pre-training so as to realize three language models; acquiring question-and-answer information that comprises a question text and an answer text; extracting, from the answer text, a key entity related to the question text; configuring a network in the pre-trained language model such that same adapts to the generation of a Chinese text; inputting the key entity and the answer text into the pre-trained language model, so as to obtain a predicted question text; according to the predicted question text and the question text, determining a prediction error; and adjusting the model according to the prediction error, so as to obtain a question generation model. The method does not need to rely on manual data labeling. The method belongs to the field of artificial intelligence and further relates to blockchain technology, and the predicted question text can be stored in a blockchain node.

Description

Method for training question generation model, question generation method, and related device
This application claims the priority of the Chinese patent application filed with the Chinese Patent Office on April 29, 2020, with application number 202010356637.X and entitled "Method for training question generation model, question generation method, and related device", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a method for training a question generation model, a question generation method, and related devices.
Background
With the development of natural language processing technology, question generation technology has emerged. Question generation involves machine learning and natural language processing in the field of artificial intelligence, and also relates to smart life in the field of smart cities. Question generation, which studies how to generate questions in natural language, is an important topic in natural language processing. Question generation has a wide range of applications: a machine knowledge base can use active questioning to build or supplement the knowledge base and expand its data set; in the field of education, question generation can help students ask questions; in the field of dialogue, question generation can be used as a cold start to open a topic, or to obtain feedback by asking questions. The application scenarios are very rich.
The inventor realizes that existing question generation techniques usually rely on known grammatical rules, using syntax trees to generate questions and filling existing templates with entities from a knowledge base; such techniques transfer poorly and require a large amount of prior expert knowledge when being built or migrated. Other techniques use deep learning models to generate questions based on pre-labeled answers; they require a large amount of data to be labeled manually in advance, which is time-consuming and labor-intensive, and most of the labeled texts are short, which affects question generation. It can be seen that existing question generation techniques have poor question generation performance.
Summary of the Invention
The purpose of the embodiments of the present application is to propose a method for training a question generation model, a question generation method, and related devices that improve question generation performance. In order to solve the above technical problems, an embodiment of the present application provides a method for training a question generation model, which adopts the following technical solution:
pre-training an initial model to obtain a pre-trained language model, and in the pre-training, adjusting a mask matrix so that the network in the initial model realizes a unidirectional model, a bidirectional model, and a sequence-to-sequence model;
obtaining question-and-answer information from web pages through a web crawler, the question-and-answer information including a question text and an answer text;
extracting, from the answer text, key entities related to the question text;
setting the network in the pre-trained language model to a sequence-to-sequence model to obtain a pre-trained language model for Chinese text generation;
inputting the key entities and the answer text into the pre-built pre-trained language model for Chinese text generation to obtain a predicted question text output by the pre-trained language model;
determining a prediction error according to the predicted question text and the question text; and
adjusting the pre-trained language model according to the prediction error until the prediction error satisfies a training stop condition, to obtain a question generation model.
A question generation method includes:
obtaining a source text for question generation;
selecting several groups of source entities from the source text;
inputting the several groups of source entities into a question generation model respectively, where the question generation model is a model obtained by using any one of the above training methods of the question generation model; and
obtaining several question texts generated by the question generation model based on the several groups of source entities.
In order to solve the above technical problems, an embodiment of the present application also provides a training device for a question generation model, including:
a model training module, configured to pre-train an initial model to obtain a pre-trained language model, and in the pre-training, adjust a mask matrix so that the network in the initial model realizes a unidirectional model, a bidirectional model, and a sequence-to-sequence model;
an information acquisition module, configured to obtain question-and-answer information from web pages through a web crawler, the question-and-answer information including a question text and an answer text;
an entity extraction module, configured to extract, from the answer text, key entities related to the question text;
a model setting module, configured to set the network in the pre-trained language model to a sequence-to-sequence model to obtain a pre-trained language model for Chinese text generation;
a text input module, configured to input the key entities and the answer text into the pre-trained language model to obtain a predicted question text output by the pre-trained language model;
an error determination module, configured to determine a prediction error according to the predicted question text and the question text; and
a model adjustment module, configured to adjust the pre-trained language model according to the prediction error until the prediction error satisfies a training stop condition, to obtain a question generation model.
In order to solve the above technical problems, an embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
pre-training an initial model to obtain a pre-trained language model, and in the pre-training, adjusting a mask matrix so that the network in the initial model realizes a unidirectional model, a bidirectional model, and a sequence-to-sequence model;
obtaining question-and-answer information from web pages through a web crawler, the question-and-answer information including a question text and an answer text;
extracting, from the answer text, key entities related to the question text;
setting the network in the pre-trained language model to a sequence-to-sequence model to obtain a pre-trained language model for Chinese text generation;
inputting the key entities and the answer text into the pre-trained language model to obtain a predicted question text output by the pre-trained language model;
determining a prediction error according to the predicted question text and the question text; and
adjusting the pre-trained language model according to the prediction error until the prediction error satisfies a training stop condition, to obtain a question generation model.
In order to solve the above technical problems, an embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining a source text for question generation;
selecting several groups of source entities from the source text;
inputting the several groups of source entities into a question generation model respectively, where the question generation model is a model obtained by using the above training method of the question generation model; and
obtaining several question texts generated by the question generation model based on the several groups of source entities.
In order to solve the above technical problems, embodiments of the present application also provide a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions implement the following steps when executed by a processor:
pre-training an initial model to obtain a pre-trained language model, and in the pre-training, adjusting a mask matrix so that the network in the initial model realizes a unidirectional model, a bidirectional model, and a sequence-to-sequence model;
obtaining question-and-answer information from web pages through a web crawler, the question-and-answer information including a question text and an answer text;
extracting, from the answer text, key entities related to the question text;
setting the network in the pre-trained language model to a sequence-to-sequence model to obtain a pre-trained language model for Chinese text generation;
inputting the key entities and the answer text into the pre-trained language model to obtain a predicted question text output by the pre-trained language model;
determining a prediction error according to the predicted question text and the question text; and
adjusting the pre-trained language model according to the prediction error until the prediction error satisfies a training stop condition, to obtain a question generation model.
In order to solve the above technical problems, embodiments of the present application also provide a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions implement the following steps when executed by a processor:
obtaining a source text for question generation;
selecting several groups of source entities from the source text;
inputting the several groups of source entities into a question generation model respectively, where the question generation model is a model obtained by using the above training method of the question generation model; and
obtaining several question texts generated by the question generation model based on the several groups of source entities.
Compared with the prior art, the embodiments of the method for training a question generation model of the present application mainly have the following beneficial effects. By adjusting the mask matrix, the network in the initial model realizes three language models, so that the initial model is pre-trained in an all-round way to obtain a pre-trained language model that can both understand and generate natural language. Through a web crawler, a large amount of question-and-answer information can be obtained from web pages for model training; the question-and-answer information includes question texts and answer texts, and key entities related to the question text are automatically extracted from the answer text, without relying on a large amount of manual labeling, which improves the efficiency of obtaining key entities and thereby the efficiency of model training. The network in the pre-trained language model is adjusted to a sequence-to-sequence model, so that the pre-trained language model is oriented to text generation tasks and has good text generation capability. The key entities and the answer text are input into the pre-trained language model to obtain a predicted question text, and the pre-trained language model is adjusted according to the error between the predicted question text and the real question text, thereby obtaining a question generation model. The question generation model is obtained by fine-tuning the pre-trained language model according to the downstream task, which guarantees the quality of the generated questions and thus improves question generation performance.
Description of the Drawings
In order to explain the solution in this application more clearly, the drawings used in the description of the embodiments of the application are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the application; for those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.
Figure 1 is an exemplary system architecture diagram to which the present application can be applied;
Figure 2 is a flowchart of an embodiment of a method for training a question generation model according to the present application;
Figure 3 is a flowchart of a specific implementation of step 201 in Figure 2;
Figure 4 is a flowchart of a specific implementation of step 203 in Figure 2;
Figure 5 is a flowchart of a specific implementation of step 205 in Figure 2;
Figure 6 is a flowchart of an embodiment of a question generation method according to the present application;
Figure 7 is a flowchart of a specific implementation of step 302 in Figure 4;
Figure 8 is a schematic structural diagram of an embodiment of a training device for a question generation model according to the present application;
Figure 9 is a schematic structural diagram of an embodiment of a computer device according to the present application.
Detailed Description
The terms used in the specification of the application herein are only for the purpose of describing specific embodiments and are not intended to limit the application.
The technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the accompanying drawings.
As shown in Figure 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and so on. Various communication client applications can be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on. The server 105 may be a server that provides various services.
It should be noted that the method for training a question generation model provided by the embodiments of the present application is generally executed by the server; accordingly, the processing device of the question generation model is generally set in the server.
It should be understood that the numbers of terminal devices, networks, and servers in Figure 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
Continuing to refer to Figure 2, a flowchart of an embodiment of the method for training a question generation model according to the present application is shown. The method for training a question generation model includes the following steps:
Step 201: pre-train the initial model to obtain a pre-trained language model, and in the pre-training, adjust the mask matrix so that the network in the initial model realizes a unidirectional model, a bidirectional model, and a sequence-to-sequence model.
In this embodiment, the electronic device (for example, the server shown in Figure 1) on which the method for training a question generation model runs can communicate with the terminal through various wired or wireless connection methods.
The initial model may be a model that has not been pre-trained. The mask matrix may be the mask matrix of the network in the initial model, which is used to control the context information used in training; the unidirectional model is a unidirectional LM, the bidirectional model is a bidirectional LM, and the sequence-to-sequence model is a seq2seq LM.
Specifically, the server first obtains a pre-built initial model and pre-trains the initial model. During the pre-training process, the server sets the initial model to three different language models, including the unidirectional model, the bidirectional model, and the sequence-to-sequence model, by adjusting the mask matrix of the network in the initial model, so as to enrich the information obtained in pre-training and obtain a pre-trained language model that can both understand and generate natural language.
Step 202: obtain question-and-answer information from web pages through a web crawler, where the question-and-answer information includes a question text and an answer text.
Specifically, the user can configure the web crawler at the terminal; the terminal generates an information acquisition instruction according to the crawler configuration information input by the user and sends the information acquisition instruction to the server. The configured web crawler is used to crawl information from the World Wide Web. The crawler configuration information may include the URLs of the pages, the storage address of the information, and so on.
After the server receives the information acquisition instruction, it extracts the crawler configuration information from the information acquisition instruction and generates a web crawler according to the crawler configuration information. The server runs the generated web crawler, the web crawler crawls question-and-answer information from web pages, and the server saves the crawled question-and-answer information into a database. The question-and-answer information may consist of a question text and an answer text corresponding to the question text.
In one embodiment, the web crawler may be a Scrapy-based web crawler. Scrapy is a fast, high-level screen scraping and web scraping framework developed in Python, used to crawl web sites and extract structured data from pages. A Scrapy-based web crawler can crawl a large amount of question-and-answer information from public question-and-answer community websites such as Zhihu and Baidu Zhidao, and store the crawled question-and-answer information in the server's database in the form of JSON files.
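As an illustration only, a minimal Scrapy spider of the kind described above might look like the following sketch; the start URL, CSS selectors, and field names are hypothetical placeholders and would have to be adapted to the structure of the actual question-and-answer pages being crawled.
```python
import scrapy


class QAPairSpider(scrapy.Spider):
    """A minimal sketch of a spider that collects question/answer pairs."""
    name = "qa_pair_spider"
    # hypothetical listing page; real Q&A sites need site-specific start URLs
    start_urls = ["https://example.com/questions?page=1"]

    def parse(self, response):
        # the CSS selectors below are placeholders for the real page structure
        for item in response.css("div.question-item"):
            yield {
                "question": item.css("h2.title::text").get(),
                "answers": item.css("div.answer p::text").getall(),
            }
        # follow pagination so that a large amount of Q&A data is collected
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
Running such a spider with `scrapy crawl qa_pair_spider -o qa.json` would write the crawled items to a JSON file, matching the JSON storage described above.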
In one embodiment, a question in a web page has at least one answer, and at least one sub-answer text is obtained after crawling the at least one answer; the at least one sub-answer text corresponding to one question text constitutes the answer text corresponding to that question text.
In one embodiment, the step of obtaining question-and-answer information from web pages through a web crawler specifically includes: receiving a target text; splitting the target text to obtain several sentences; generating the same number of web crawlers as the several sentences; embedding the several sentences into the respective web crawlers; and running each web crawler to obtain the question-and-answer information that each web crawler crawls from web pages according to the embedded sentence.
The target text may be a text that instructs the web crawlers what question-and-answer information to crawl.
Specifically, the server receives the target text sent by the user through the terminal, and splits the target text at the sentence level according to punctuation marks to obtain several sentences. The server generates the same number of web crawlers as the sentences obtained by splitting, and embeds the split sentences into the code layer of each web crawler respectively. The server runs the web crawlers with the embedded sentences, and the web crawlers crawl the question-and-answer information related to the embedded sentences from web pages through column-wise crawling.
In this embodiment, after the target text is received, the target text is split into several sentences, the sentences are embedded into different web crawlers, and question-and-answer information related to the embedded sentences can be crawled after the web crawlers are run, for example as in the sketch below.
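A minimal sketch of the sentence-splitting step and the one-crawler-per-sentence idea follows; the punctuation set and the way each sentence is handed to a spider (as a Scrapy spider argument) are assumptions for illustration.
```python
import re


def split_into_sentences(target_text: str) -> list[str]:
    """Split the target text at sentence-ending punctuation (Chinese and English)."""
    parts = re.split(r"[。！？!?；;]\s*", target_text)
    return [p.strip() for p in parts if p.strip()]


# One crawler per sentence: each sentence is passed to its own spider instance,
# e.g. with Scrapy's CrawlerProcess (QAPairSpider is the sketch above and would
# need to accept a `query` argument and build its start URL from it).
#
# from scrapy.crawler import CrawlerProcess
# process = CrawlerProcess()
# for sentence in split_into_sentences(target_text):
#     process.crawl(QAPairSpider, query=sentence)
# process.start()
```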
Step 203: extract key entities related to the question text from the answer text.
A key entity may be an entity in the answer text that is correlated with the question text.
Specifically, the server performs word segmentation on the question text and the answer text respectively, obtaining multiple entities from each. The server recognizes the part of speech of each entity and selects the entities with preset parts of speech; the preset parts of speech may be verbs and nouns. The server performs exact matching and fuzzy matching between the entities selected from the question text and those selected from the answer text, and takes the entities in the answer text that can be matched as the key entities.
In one embodiment, the answer text includes at least one sub-answer text; the server extracts the key entities related to the question text from each sub-answer text respectively, and associates each sub-answer text with the key entities extracted from it.
In one embodiment, before the step of extracting the key entities from the question text and the answer text in the question-and-answer information, the method further includes: matching the question-and-answer information with regular expressions to obtain character strings to be cleaned; and deleting the matched character strings to be cleaned so as to perform data cleaning on the question-and-answer information.
A character string to be cleaned may be a meaningless character string in the question-and-answer information.
Specifically, the crawled question-and-answer information contains meaningless content. In order to increase the proportion of effective content, the server matches the question-and-answer information with preset regular expressions to obtain the character strings to be cleaned in the question-and-answer information, and deletes the matched character strings to be cleaned, so as to perform data cleaning on the question-and-answer information. The regular expressions are pre-configured, and one regular expression can correspond to one kind of meaningless character string.
For example, when question-and-answer information is crawled from Zhihu, it may include hyperlinks, divider lines, and invalid characters, as well as content unrelated to the main text such as "来源: ..." ("Source: ...") and "作者: ..." ("Author: ...") in Zhihu column articles. When question-and-answer information is crawled from Baidu Zhidao, it may include a large number of meaningless characters. The server can delete the above meaningless content through regular expressions, for example as in the sketch below.
In this embodiment, the question-and-answer information is matched with regular expressions to obtain the character strings to be cleaned, and the matched character strings to be cleaned are deleted, thereby realizing data cleaning of the question-and-answer information and increasing the proportion of effective content in the question-and-answer information.
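A minimal sketch of such regex-based cleaning is given below; the concrete patterns are illustrative assumptions, not the exact expressions used in the embodiment.
```python
import re

# Each pattern corresponds to one kind of meaningless string mentioned above.
CLEANING_PATTERNS = [
    re.compile(r"https?://\S+"),   # hyperlinks
    re.compile(r"[-=*]{4,}"),      # divider lines
    re.compile(r"来源[:：].*"),     # "Source: ..." lines in column articles
    re.compile(r"作者[:：].*"),     # "Author: ..." lines in column articles
]


def clean_qa_text(text: str) -> str:
    """Delete every substring matched by one of the cleaning patterns."""
    for pattern in CLEANING_PATTERNS:
        text = pattern.sub("", text)
    return text
```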
Step 204: set the network in the pre-trained language model to a sequence-to-sequence model to obtain a pre-trained language model for Chinese text generation.
The pre-trained language model (Unified pre-trained Language Model, UNILM) is a model that can handle both natural language understanding and natural language generation.
The pre-training of the pre-trained language model adopts three unsupervised language model objectives: the unidirectional LM (including left-to-right and right-to-left), the bidirectional LM, and the sequence-to-sequence LM (seq2seq LM), where LM stands for language model.
The pre-trained language model uses a Transformer network with shared parameters, and also uses specific self-attention masks to control the context information used in prediction. During pre-training, the above three LMs are realized by adjusting the mask matrix in the Transformer network.
When fine-tuning for downstream tasks, the pre-trained language model can be regarded as a unidirectional encoder, a bidirectional encoder, or a sequence-to-sequence model, and the mask matrix in the Transformer network is adjusted to adapt to different downstream tasks (natural language understanding and generation tasks).
Seq2seq is a model with an Encoder-Decoder structure and has a good text generation effect; the input of seq2seq is a sequence, and the output is also a sequence. The Encoder turns a variable-length input sequence into a fixed-length vector, and the Decoder decodes this fixed-length vector into a variable-length output sequence.
Specifically, the server obtains the pre-trained language model; the pre-trained language model is used for Chinese processing and can be used for natural language understanding as well as text generation. This application needs to fine-tune the pre-trained language model into a question generation model, so the mask matrix of the Transformer network in the pre-trained language model needs to be set so as to realize the sequence-to-sequence model, i.e. the seq2seq LM. In the mask matrix of the seq2seq LM, the matrix elements on the left are all 0, which means that both the preceding and the following context can be seen; in the right part of the matrix, the elements in the upper-right corner are infinite, which means that only the preceding context can be seen.
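The mask description above can be made concrete with a small sketch. The helper below builds a UNILM-style seq2seq attention mask in which 0 means "may attend" and a large negative value (added to the attention scores) means "masked out"; the function name and the use of -1e9 as "infinity" are illustrative assumptions.
```python
import numpy as np


def seq2seq_attention_mask(src_len: int, tgt_len: int) -> np.ndarray:
    """Build a seq2seq LM attention mask for src_len source tokens (answer text
    plus key entities) followed by tgt_len target tokens (the question)."""
    total = src_len + tgt_len
    neg_inf = -1e9
    mask = np.zeros((total, total), dtype=np.float32)
    # left part (source columns) stays 0: every position may attend to the
    # whole source segment, i.e. both preceding and following source context.
    # upper-right block: source positions must not attend to the target segment.
    mask[:src_len, src_len:] = neg_inf
    # within the target segment, each position may only attend to the already
    # generated part (strictly upper-triangular entries are masked).
    mask[src_len:, src_len:] = np.triu(np.full((tgt_len, tgt_len), neg_inf), k=1)
    return mask
```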
Step 205: input the key entities and the answer text into the pre-trained language model to obtain the predicted question text output by the pre-trained language model.
The predicted question text may be a question text related to the answer text that the pre-trained language model generates based on the key entities and the answer text.
Specifically, after the network in the pre-trained language model is set to a sequence-to-sequence model, the server fine-tunes the pre-trained language model based on the key entities, the question text, and the answer text. The pre-trained language model converts the key entities and the question text into vectors, processes the vectors, and outputs the predicted question text.
In one embodiment, the pre-trained language model splits the key entities and the question text into individual characters, converts each character into a vector according to a character conversion table, and processes the vectors. The character conversion table is created in advance and determines the correspondence between characters and vectors. When the server converts a character, it looks up the character in the character conversion table and takes the vector corresponding to the found character as the vector of that character.
Step 206: determine the prediction error according to the predicted question text and the question text.
It should be emphasized that, in order to further ensure the privacy and security of the predicted question text, the predicted question text may also be stored in a node of a blockchain.
Specifically, the question text in the question-and-answer information is the target output of the pre-trained language model. The server obtains the predicted question text output by the pre-trained language model and the question text in the question-and-answer information, and calculates the prediction error according to a preset error formula.
Further, in an embodiment, the prediction error is calculated with a softmax cross-entropy of the form
softmaxLoss = -\sum_{i=1}^{N} \log \frac{\exp(logits_{i, y_i})}{\sum_{j} \exp(logits_{i, j})}
where y_i is the identifier of the i-th character of the question text when it is converted into a vector according to the character conversion table, logits_i is the score of the i-th character of the predicted question text over the character conversion table, and softmaxLoss is the prediction error between the predicted question text and the question text.
Specifically, in the character conversion table, each character can be regarded as a token, and each token has a unique identifier in the character conversion table, namely the identifier token_id. For example, when the size of the character conversion table is 20000, that is, the character conversion table records the conversion relationship between 20000 characters and vectors, the range of token_id is 0-19999. The purpose of the pre-trained language model is to obtain the token_id sequence of the predicted question text.
Suppose the question text contains N (N is a positive integer) characters. The pre-trained language model encodes the answer text and the key entities to obtain N hidden states H, where each H corresponds to a character of the predicted question text to be generated. The pre-trained language model calculates the score logits of H at every character of the character conversion table; it can be understood that the score logits is equivalent to the similarity between H and each character in the character table, and the character with the highest score is selected as the character corresponding to H.
After the pre-trained language model determines each character of the predicted question text and its corresponding score logits, the prediction error is calculated: y_i is the identifier token_id of the i-th character of the crawled question text, logits_i is the score of the i-th character of the predicted question text, and the prediction error is obtained by cross-entropy.
In this embodiment, the prediction error can be accurately measured by the error formula, which ensures that the pre-trained language model can be accurately adjusted according to the error, as in the sketch below.
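A minimal NumPy sketch of the cross-entropy computation described above (the loss is summed over the N positions; whether the patent's formula sums or averages over positions is not recoverable from the text):
```python
import numpy as np


def softmax_loss(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """Cross-entropy between predicted scores and the real question text.

    logits:    shape (N, M) - score of every character of the character
               conversion table for each of the N question positions.
    token_ids: shape (N,)   - token_id of each character of the real question.
    """
    # numerically stable log-softmax over the vocabulary dimension
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # negative log-likelihood of the true characters, summed over positions
    return float(-log_probs[np.arange(len(token_ids)), token_ids].sum())
```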
Step 207: adjust the pre-trained language model according to the prediction error until the prediction error satisfies the training stop condition, to obtain the question generation model.
The training stop condition is the condition for stopping model training; the training stop condition may be that the prediction error is smaller than a predetermined error threshold.
Specifically, the terminal obtains the predetermined error threshold and compares the prediction error with the error threshold. When the prediction error is greater than or equal to the error threshold, the terminal adjusts the model parameters of the pre-trained language model in the direction of reducing the prediction error. Each time the parameters of the pre-trained language model are adjusted, the key entities and the answer text are processed again to obtain a new predicted question text, the prediction error is obtained from the predicted question text and the question text, and the prediction error is compared with the error threshold; if the prediction error is still greater than or equal to the error threshold, the model is adjusted again. This is iterated until the prediction error is smaller than the error threshold, at which point training stops and the pre-trained language model at the time training stops is taken as the question generation model.
When adjusting the model parameters of each layer of the pre-trained language model, the output of the current layer and the gradient propagated back are needed, and the output of each layer is kept in GPU memory. When the Transformer network in the pre-trained language model has many layers, for example 24 layers, the outputs of all 24 layers need to be saved, which occupies a large amount of GPU memory. For this reason, only the outputs of some of the layers may be saved; when back-propagation needs to update the model parameters, the output of the current layer can be recomputed from the saved outputs of those layers, thereby saving GPU memory and lowering the hardware requirements of model training.
For example, if the Transformer network has 24 layers, only the outputs of some of the layers are saved, namely the outputs of layers 1, 7, 13, 19, and 24. When back-propagation is performed, the outputs of layers 2-6 are recomputed from the output of layer 1, the outputs of layers 8-12 are recomputed from the output of layer 7, the outputs of layers 14-18 are recomputed from the output of layer 13, and the outputs of layers 20-23 are recomputed from the output of layer 19.
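This "save only some layer outputs and recompute the rest" strategy is what deep-learning frameworks call gradient (activation) checkpointing. The patent does not name a framework; the sketch below shows one way to get a comparable effect with PyTorch's built-in utility, where splitting the 24 blocks into 4 segments keeps roughly every sixth activation, similar to the example above.
```python
import torch
from torch.utils.checkpoint import checkpoint_sequential


def forward_with_checkpointing(transformer_blocks: torch.nn.Sequential,
                               hidden_states: torch.Tensor) -> torch.Tensor:
    """Run a stack of Transformer blocks, keeping only segment-boundary
    activations and recomputing the intermediate ones during backward."""
    # `transformer_blocks` is a hypothetical nn.Sequential of the 24 blocks.
    return checkpoint_sequential(transformer_blocks, 4, hidden_states)
```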
In this embodiment, by adjusting the mask matrix, the network in the initial model realizes three language models, so that the initial model is pre-trained in an all-round way to obtain a pre-trained language model that can both understand and generate natural language. Through the web crawler, a large amount of question-and-answer information can be obtained from web pages for model training; the question-and-answer information includes question texts and answer texts, and the key entities related to the question text are automatically extracted from the answer text, without relying on a large amount of manual labeling, which improves the efficiency of obtaining key entities and thereby the efficiency of model training. The network in the pre-trained language model is adjusted to a sequence-to-sequence model, so that the pre-trained language model is oriented to text generation tasks and has good text generation capability. The key entities and the answer text are input into the pre-trained language model to obtain the predicted question text, and the pre-trained language model is adjusted according to the error between the predicted question text and the real question text, thereby obtaining the question generation model. The question generation model is obtained by fine-tuning the pre-trained language model according to the downstream task, which guarantees the quality of the generated questions and thus improves question generation performance.
Further, as shown in Figure 3, the above step 201 specifically includes:
Step 2011: obtain the initial model for pre-training and multiple groups of pre-training sample sets.
A pre-training sample set may be a data set used to train the initial model.
Specifically, the built initial model and the multiple groups of pre-training sample sets used for pre-training the initial model are pre-stored in the server. The server obtains the initial model and the pre-training sample sets, and first needs to pre-train the initial model so as to obtain the pre-trained language model.
Step 2012: randomly generate the mask identifier corresponding to each group of pre-training sample sets; the mask matrices corresponding to the mask identifiers realize the unidirectional model, the bidirectional model, and the sequence-to-sequence model.
A mask identifier may be the identifier of a mask matrix of the network in the model.
Specifically, the initial model is built as a Transformer network, and the Transformer network may have 12 layers or 24 layers. Pre-training adopts three unsupervised language model objectives: the unidirectional LM (including left-to-right and right-to-left), the bidirectional LM, and the seq2seq LM.
For each group of training sample sets, the server randomly generates the mask identifier of the training sample set; the mask identifier corresponds to a mask matrix, and the server sets the Transformer network to a different LM according to the mask matrix. By randomly generating the mask identifier of each group of training sample sets, equal pre-training of the different LMs is achieved, for example as in the sketch below.
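A trivial sketch of this random assignment (the objective names are illustrative; left-to-right and right-to-left together form the unidirectional LM):
```python
import random

# one "mask identifier" per pre-training objective
LM_OBJECTIVES = ["left_to_right", "right_to_left", "bidirectional", "seq2seq"]


def assign_mask_identifiers(num_sample_sets: int) -> list[str]:
    """Randomly assign one LM objective to each pre-training sample set so that,
    on average, the three kinds of LM are pre-trained equally often."""
    return [random.choice(LM_OBJECTIVES) for _ in range(num_sample_sets)]
```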
In one embodiment, the model parameters in the initial model are half-precision, and before the step of randomly generating the mask identifiers corresponding to the groups of pre-training sample sets, the method further includes: setting the model parameters of the layernorm layers and the embedding layer in the initial model to single precision.
Half precision, i.e. the half-precision floating-point format (FP16), is a binary floating-point data type used by computers. A half-precision floating-point number is stored in 2 bytes (16 bits), while a single-precision floating-point number (FP32) occupies 4 bytes (32 bits) of storage space.
Specifically, model training places high demands on the computer's hardware resources and takes a long time. In order to increase the training speed and reduce GPU (Graphics Processing Unit) usage, the model parameters in the initial model are half-precision. To prevent the initial model from failing to converge, the model parameters of the embedding layer in the initial model are set to single precision; to avoid large losses caused by insufficient precision in operations such as computing the mean and variance during training, the model parameters of the layernorm layers in the initial model are set to single precision.
In this embodiment, the model parameters in the initial model are set to half precision, while the model parameters of the layernorm layers and the embedding layer are set to single precision, which improves the speed and accuracy of model training.
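A minimal PyTorch-style sketch of this precision arrangement (the patent does not specify a framework, so the helper below is an assumption for illustration):
```python
import torch


def set_mixed_precision(model: torch.nn.Module) -> torch.nn.Module:
    """Cast the model to FP16 but keep LayerNorm and Embedding weights in FP32,
    matching the precision arrangement described above."""
    model.half()
    for module in model.modules():
        if isinstance(module, (torch.nn.LayerNorm, torch.nn.Embedding)):
            module.float()
    return model
```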
Step 2013: input each group of pre-training sample sets into the initial model respectively, and adjust the mask matrix of the network in the initial model according to the mask identifier corresponding to the pre-training sample set.
Specifically, the server inputs the pre-training sample sets into the initial model in turn. After a group of pre-training sample sets is input, the server adjusts the mask matrix of the Transformer network in the initial model according to the mask identifier corresponding to that pre-training sample set, thereby setting the Transformer network to the unidirectional LM, the bidirectional LM, or the seq2seq LM.
Step 2014: pre-train the mask-matrix-adjusted initial model in turn according to the input pre-training sample sets, to obtain the pre-trained language model.
Specifically, after adjusting the mask matrix, the server pre-trains the initial model according to the pre-training sample set; when training with one group of pre-training sample sets is completed, the next group of pre-training sample sets is input, the mask matrix is adjusted, and the next round of pre-training is carried out. When training with all the pre-training sample sets has been completed, the server obtains the pre-trained language model.
During pre-training, the Transformer network switches randomly among the unidirectional LM (including left-to-right and right-to-left), the bidirectional LM, and the seq2seq LM, and the layers of the Transformer network share model parameters across the multiple rounds of pre-training.
In this embodiment, the mask identifier of each pre-training sample set is randomly generated, and when the initial model is pre-trained according to a pre-training sample set, the mask matrix in the initial model is adjusted according to the mask identifier, so that the initial model fulfils the pre-training objectives of the three language models evenly, which ensures that the pre-training is well founded.
Further, as shown in Figure 4, the above step 203 may include:
Step 2031: extract text entities from the question text and the answer text in the question-and-answer information respectively.
A text entity may be an entity in the question text or the answer text.
Specifically, the server can perform word segmentation on the question text and the answer text to obtain multiple entities. The server can use pkuseg for word segmentation, splitting the question text and the answer text into words. pkuseg is an open-source Chinese word segmentation toolkit released by Peking University with high segmentation accuracy.
After word segmentation, stop words are removed from the entities; stop words are words without obvious meaning that can be deleted, such as "在", "的" and "是". Then the entities whose parts of speech are verbs and nouns are extracted as text entities, for example as in the sketch below.
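A minimal sketch of this extraction step using pkuseg with part-of-speech tagging (the stop-word list is illustrative, and pkuseg's POS-tagging model must be available locally):
```python
import pkuseg

STOP_WORDS = {"在", "的", "是"}          # illustrative stop-word list
seg = pkuseg.pkuseg(postag=True)          # segmentation with POS tagging


def extract_text_entities(text: str) -> list[str]:
    """Segment the text, drop stop words, and keep only nouns and verbs."""
    entities = []
    for word, tag in seg.cut(text):
        if word in STOP_WORDS:
            continue
        if tag.startswith("n") or tag.startswith("v"):   # nouns and verbs
            entities.append(word)
    return entities
```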
步骤2032,计算答案文本中的各文本实体与问题文本中的各文本实体的相似度。Step 2032: Calculate the similarity between each text entity in the answer text and each text entity in the question text.
具体地,答案文本中的文本实体组成第一数据集,问题文本中的文本实体组成第二数据集,服务器计算第一数据集中每个实体与第二数据集中每个实体间的相似度。服务器可以通过精确匹配和模糊匹配来计算相似度,能够精确匹配的文本实体间的相似度为100%。在进行模糊匹配时,服务器可以将文本实体转化为向量,计算向量之间的余弦相似度;或者计算文本实体之间的文本编辑距离(又称Levenshtein距离,是将一个字符串转化成另一个字符串所需的最少操作次数,操作包括插入、删除、替换),文本编辑距离越短,相似度越高。Specifically, the text entities in the answer text form the first data set, and the text entities in the question text form the second data set, and the server calculates the similarity between each entity in the first data set and each entity in the second data set. The server can calculate the similarity through exact matching and fuzzy matching, and the similarity between text entities that can be accurately matched is 100%. When performing fuzzy matching, the server can convert text entities into vectors and calculate the cosine similarity between vectors; or calculate the text edit distance between text entities (also known as Levenshtein distance, which converts a string into another character) The minimum number of operations required for string, operations include insert, delete, and replace). The shorter the text editing distance, the higher the similarity.
步骤2033,从答案文本的各文本实体中,提取相似度符合预设相似度阈值的文本实体作为关键实体。Step 2033: Extract text entities whose similarity meets a preset similarity threshold from each text entity of the answer text as key entities.
具体地,假设第一数据集中有M(M为正整数)个文本实体,第二数据集中有N(N为正整数)个文本实体,则计算得到M*N组相似度。服务器获取预设的相似度阈值,从M*N组相似度中,选取相似度数值大于相似度阈值的相似度,每个选取到的相似度所对应的两个文本实体中,将来自第一数据集的文本实体作为关键实体。服务器还可以将M*N组相似度按照从大到小的顺序进行排列,按照排列顺序选取预设数量的相似度,将选取到的相似度对应的第一数据集文本实体作为关键实体。Specifically, assuming that there are M (M is a positive integer) text entities in the first data set, and there are N (N is a positive integer) text entities in the second data set, then M*N groups of similarities are calculated. The server obtains the preset similarity threshold, and selects the similarity whose similarity value is greater than the similarity threshold from the M*N group of similarities. The two text entities corresponding to each selected similarity will be from the first The text entity of the data set is used as the key entity. The server may also arrange the M*N groups of similarities in descending order, select a preset number of similarities according to the arrangement order, and use the first data set text entity corresponding to the selected similarity as the key entity.
举例说明,问题文本为“复旦大学在国内的排名大概是多少?”,通过pkuseg切分为{“复旦大学”,“在”,“国内”,“的”,“排名”,“大概”,“是”,“多少”,“?”}。分词后去除停用词{“在”、“的”,“是”},再提取动词和名词{“复旦大学”,“排名”},并对答案文本进行同样的处理。假设答案文本中提取到实体“复旦”,计算“复旦大学”和“复旦”间的相似度满足相似度阈值,将“复旦”作为关键实体。For example, the question text is "What is the ranking of Fudan University in China?", divided into {"Fudan University", "in", "domestic", "of", "ranking", "approximately" through pkuseg, "how many","?"}. After word segmentation, the stop words {"在", "的", "YES"} are removed, and the verbs and nouns {"Fudan University", "ranking"} are extracted, and the answer text is processed in the same way. Assuming that the entity "Fudan" is extracted from the answer text, the similarity between "Fudan University" and "Fudan" is calculated to meet the similarity threshold, and "Fudan" is taken as the key entity.
本实施例中，提取到的关键实体与问题文本和答案文本均高度关联，可以辅助预训练语言模型输出问题文本。In this embodiment, the extracted key entities are highly correlated with both the question text and the answer text, and can assist the pre-trained language model in outputting the question text.
进一步的,答案文本包括至少一个子答案文本,如图5所示,上述步骤205可以包括:Further, the answer text includes at least one sub-answer text. As shown in FIG. 5, the above step 205 may include:
步骤2051,将至少一个子答案文本以及与子答案文本对应的关键实体输入预训练语言模型,得到至少一个三维字向量矩阵。Step 2051: Input at least one sub-answer text and key entities corresponding to the sub-answer text into the pre-training language model to obtain at least one three-dimensional word vector matrix.
具体地,一个问题文本对应的答案文本可以由至少一个子答案文本组成,每个子答案文本均提取得到关键实体。Specifically, the answer text corresponding to one question text may be composed of at least one sub-answer text, and each sub-answer text is extracted to obtain a key entity.
服务器进行批处理(batch),一个问题文本对应的至少一个子答案文本以及与子答案文本对应的关键实体作为一个batch进行处理。The server performs batch processing, and at least one sub-answer text corresponding to a question text and key entities corresponding to the sub-answer text are processed as a batch.
服务器通过补零的方式将子答案文本的文本长度(即子答案文本中字符的个数)补齐，再依据字符转化表转化为one-hot向量(又称“独热编码”)，得到one-hot矩阵。假设子答案文本数量为batch，补齐后文本长度为length，字符转化表中字符个数为M，则one-hot矩阵的三个维度依次为batch、length和M，其中batch表示one-hot矩阵来自哪一个子答案文本，length为one-hot矩阵的行数，M为one-hot矩阵的列数。The server pads the text length of each sub-answer text (that is, the number of characters in the sub-answer text) with zeros, and then converts it into one-hot vectors (also known as "one-hot encoding") according to a character conversion table to obtain a one-hot matrix. Assuming the number of sub-answer texts is batch, the padded text length is length, and the number of characters in the character conversion table is M, then the three dimensions of the one-hot matrix are batch, length, and M in turn, where batch indicates which sub-answer text the one-hot matrix comes from, length is the number of rows of the one-hot matrix, and M is the number of columns of the one-hot matrix.
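A small NumPy sketch of the padding and one-hot step is given below. The toy character conversion table, the choice to reserve index 0 for padding, and the tiny shapes are illustrative assumptions, not values from the embodiment.

```python
import numpy as np

# Toy character conversion table; reserving index 0 for padding is an assumption.
char_table = {"<pad>": 0, "复": 1, "旦": 2, "排": 3, "名": 4}
M = len(char_table)

def to_one_hot(sub_answers, length):
    """Pad every sub-answer text to `length` characters, then one-hot encode it.
    Returns an array of shape (batch, length, M)."""
    batch = len(sub_answers)
    ids = np.zeros((batch, length), dtype=np.int64)
    for i, text in enumerate(sub_answers):
        for j, ch in enumerate(text[:length]):
            ids[i, j] = char_table.get(ch, 0)
    one_hot = np.zeros((batch, length, M), dtype=np.float32)
    np.put_along_axis(one_hot, ids[..., None], 1.0, axis=-1)
    return one_hot

one_hot = to_one_hot(["复旦", "排名复旦"], length=4)
print(one_hot.shape)   # (2, 4, 5) -> (batch, length, M)
```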
服务器需要将one-hot向量转换为字向量，将三维的one-hot矩阵输入预训练语言模型的embedding层，将M维度替换为dim维度，得到三维字向量矩阵；dim为特征维度，在一个模型中dim是统一的常量，例如dim可以取512、768或者1024。The server needs to convert the one-hot vectors into word vectors: the three-dimensional one-hot matrix is input into the embedding layer of the pre-trained language model, and the M dimension is replaced with the dim dimension to obtain a three-dimensional word vector matrix. Here dim is the feature dimension; within one model dim is a uniform constant, for example 512, 768, or 1024.
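The embedding step can be pictured as a matrix product that swaps the M axis for the dim axis. The sketch below uses a tiny random embedding table purely for illustration; the real embedding layer's weights are learned during pre-training.

```python
import numpy as np

batch, length, M, dim = 2, 4, 5, 8            # dim would typically be 512, 768 or 1024
rng = np.random.default_rng(0)

one_hot = np.zeros((batch, length, M), dtype=np.float32)
one_hot[..., 0] = 1.0                          # placeholder content for the sketch

embedding_weight = rng.normal(size=(M, dim)).astype(np.float32)  # the embedding layer's table

# Multiplying the one-hot matrix by the embedding table replaces the M axis with dim.
word_vectors = one_hot @ embedding_weight
print(word_vectors.shape)                      # (2, 4, 8) -> (batch, length, dim)
```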
步骤2052,将转化得到的三维字向量矩阵合并为二维字向量矩阵。Step 2052: Combine the converted three-dimensional word vector matrix into a two-dimensional word vector matrix.
具体地，为了提高计算效率，将各三维字向量矩阵进行合并，得到一个更大的矩阵即二维字向量矩阵，矩阵合并取消了batch维度，使得预训练语言模型中对矩阵的计算变为对二维矩阵的运算，提高了计算速度，减少了训练时间。Specifically, to improve computational efficiency, the three-dimensional word vector matrices are merged into one larger matrix, namely the two-dimensional word vector matrix. Merging eliminates the batch dimension, so that the matrix computations in the pre-trained language model become operations on a two-dimensional matrix, which increases the calculation speed and reduces the training time.
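A reshape is one way to realize the merge described in step 2052; whether the embodiment merges exactly along the length axis like this is an assumption made for the sketch.

```python
import numpy as np

batch, length, dim = 2, 4, 8
three_d = np.arange(batch * length * dim, dtype=np.float32).reshape(batch, length, dim)

# Merging removes the batch axis: the (batch, length, dim) tensor becomes a
# (batch * length, dim) matrix, so downstream layers operate on a single 2-D matrix.
two_d = three_d.reshape(batch * length, dim)
print(two_d.shape)        # (8, 8)

# The mapping is reversible, so per-sample rows can still be recovered if needed.
restored = two_d.reshape(batch, length, dim)
assert np.array_equal(restored, three_d)
```

Because the merge is a pure reshape, no information is lost; it only changes how the same values are laid out for the matrix operations inside the model.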
步骤2053,通过预训练语言模型对二维字向量矩阵进行处理,得到预训练语言模型输出的预测问题文本,其中,预测问题文本存储在区块链中。In step 2053, the two-dimensional word vector matrix is processed through the pre-training language model to obtain the prediction question text output by the pre-training language model, where the prediction question text is stored in the blockchain.
具体地，服务器通过预训练语言模型对二维字向量矩阵进行处理，得到预测问题文本中每个字处的得分logits，在每一个字处，选取具有最高得分的字作为该处的字，从而输出预测问题文本。服务器还可以将预测问题文本上传至区块链中进行存储，以记录预训练语言模型的训练过程，同时保证预测问题文本的私密性和安全性。Specifically, the server processes the two-dimensional word vector matrix with the pre-trained language model to obtain score logits for each character position of the predicted question text; at each position, the character with the highest score is selected as the character for that position, and the predicted question text is thereby output. The server may also upload the predicted question text to a blockchain for storage, so as to record the training process of the pre-trained language model while ensuring the privacy and security of the predicted question text.
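Step 2053's greedy selection of the highest-scoring character at each position amounts to an argmax over the logits. The vocabulary and scores below are made up for illustration only.

```python
import numpy as np

# Hypothetical logits for a 4-character predicted question over a 5-character vocabulary.
vocab = ["<pad>", "复", "旦", "排", "名"]
logits = np.array([[0.1, 2.3, 0.2, 0.1, 0.0],
                   [0.0, 0.1, 3.0, 0.2, 0.1],
                   [0.2, 0.1, 0.0, 2.8, 0.3],
                   [0.1, 0.0, 0.1, 0.2, 2.5]])

# At every position, keep the character with the highest score.
best_ids = logits.argmax(axis=-1)
predicted = "".join(vocab[i] for i in best_ids)
print(predicted)   # 复旦排名
```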
本实施例中，将各子答案文本以及对应的关键实体转换为多个三维字向量矩阵，再将三维字向量矩阵合并为二维字向量矩阵，使得预训练语言模型对二维字向量矩阵进行处理，提高了输出预测问题文本的效率。In this embodiment, each sub-answer text and its corresponding key entities are converted into multiple three-dimensional word vector matrices, which are then merged into a two-dimensional word vector matrix, so that the pre-trained language model processes the two-dimensional word vector matrix; this improves the efficiency of outputting the predicted question text.
在一个实施例中,如图6所示,提供了一种问题生成方法,以该方法应用于图1中的服务器为例进行说明,包括以下步骤:In an embodiment, as shown in FIG. 6, a method for generating a question is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:
步骤301,获取用于问题生成的源文本。Step 301: Obtain the source text for question generation.
具体地,问题生成模型依据输入的文本生成问题文本。用户通过终端向服务器发送源文本,问题生成模型依据源文本生成问题文本。Specifically, the question generation model generates question text based on the input text. The user sends the source text to the server through the terminal, and the question generation model generates the question text based on the source text.
在一个实施例中,终端还可以向服务器发送语音数据,服务器通过语音识别将语音数据转化为文本数据,得到源文本。In an embodiment, the terminal may also send voice data to the server, and the server converts the voice data into text data through voice recognition to obtain the source text.
步骤302,从源文本中筛选若干组源实体。Step 302: Filter several groups of source entities from the source text.
具体地，服务器将源文本进行分词得到多个实体。服务器可以随机筛选多个实体得到一组源实体，可以筛选若干组源实体。服务器还可以根据终端发送的指示信息筛选若干组源实体。Specifically, the server performs word segmentation on the source text to obtain multiple entities. The server can randomly select multiple entities to form one group of source entities, and can repeat this to obtain several groups of source entities. The server can also filter several groups of source entities according to instruction information sent by the terminal.
步骤303,分别将若干组源实体输入问题生成模型;其中,问题生成模型是采用上述问题生成模型的训练方法获取的模型。Step 303: Input several groups of source entities into the question generation model respectively; wherein, the question generation model is a model obtained by using the training method of the above question generation model.
具体地,服务器将筛选到的若干组源实体输入至问题生成模型,问题生成模型将源实体以字符为单位转化为向量,进行问题生成的处理。问题生成模型是采用上述问题生成模型的训练方法获取的模型。Specifically, the server inputs the selected groups of source entities into the question generation model, and the question generation model converts the source entities into vectors in units of characters to perform question generation processing. The question generation model is a model obtained using the training method of the above question generation model.
服务器在生成问题文本时，可以根据整段源文本生成问题文本，也可以根据从源文本中提取到的若干组源实体生成问题文本。When generating question text, the server can generate the question text based on the entire source text, or based on several groups of source entities extracted from the source text.
步骤304,获取问题生成模型基于若干组源实体生成的若干问题文本。Step 304: Obtain several question texts generated by the question generation model based on several groups of source entities.
具体地，问题生成模型基于一组源实体进行处理，生成一组问题文本。当存在若干组源实体时，服务器生成与若干组源实体分别对应的问题文本。Specifically, the question generation model processes one group of source entities to generate one set of question text. When there are several groups of source entities, the server generates question texts corresponding respectively to the several groups of source entities.
在一个实施例中,服务器将生成的若干问题文本发送至终端,由用户通过终端选取问题文本进行后续使用。In one embodiment, the server sends several generated question texts to the terminal, and the user selects the question texts through the terminal for subsequent use.
本实施例中,从用于问题文本生成的源文本中筛选若干组源实体,可以通过问题生成模型,依据不同的源实体生成不同的问题文本,提高了生成问题文本的灵活性。In this embodiment, several groups of source entities are filtered from the source text used for question text generation, and different question texts can be generated according to different source entities through the question generation model, which improves the flexibility of generating question text.
进一步的,如图7所示,步骤302可以包括:Further, as shown in FIG. 7, step 302 may include:
步骤3021,识别源文本中的文本实体。Step 3021: Identify text entities in the source text.
具体地,服务器接收到源文本后,对源文本进行分词得到多个实体,识别各个实体的词性,将符合预设词性的实体作为文本实体。其中,文本实体的词性可以包括名词、动词、形容词等。Specifically, after receiving the source text, the server performs word segmentation on the source text to obtain multiple entities, recognizes the part of speech of each entity, and uses the entity that meets the preset part of speech as the text entity. Among them, the part of speech of the text entity can include nouns, verbs, adjectives, etc.
步骤3022,从识别到的文本实体中随机抽取若干组文本实体,得到若干组源实体。Step 3022: Randomly extract several groups of text entities from the recognized text entities to obtain several groups of source entities.
具体地,服务器识别到文本实体后,随机抽取若干组文本实体,将每一组文本实体作为一组源实体,得到多组源实体。Specifically, after the server recognizes the text entities, it randomly selects several groups of text entities, and uses each group of text entities as a group of source entities to obtain multiple groups of source entities.
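A possible reading of steps 3021-3022 in Python is sketched below. The number of groups and the group size are parameters the embodiment leaves open, so the values used here are assumptions for illustration.

```python
import random

text_entities = ["复旦大学", "排名", "高校", "招生", "专业", "学费"]

def sample_entity_groups(entities, num_groups=3, group_size=2, seed=None):
    """Randomly draw `num_groups` groups of `group_size` text entities each."""
    rng = random.Random(seed)
    return [rng.sample(entities, group_size) for _ in range(num_groups)]

print(sample_entity_groups(text_entities, seed=42))
# e.g. [['专业', '招生'], ['排名', '学费'], ['高校', '复旦大学']]
```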
步骤3023,根据预设的语义知识库对源文本中的文本实体进行语义标注,得到语义标注结果。Step 3023: Perform semantic annotation on the text entities in the source text according to a preset semantic knowledge base to obtain a semantic annotation result.
具体地,服务器中预设有语义知识库。服务器根据语义知识库,识别各文本实体的语义,并对各文本实体进行语义标注,得到语义标注结果。Specifically, a semantic knowledge base is preset in the server. The server recognizes the semantics of each text entity according to the semantic knowledge base, and performs semantic annotation on each text entity to obtain the semantic annotation result.
步骤3024,根据语义标注结果,筛选符合预设语义范围的若干文本实体,得到若干组源实体。Step 3024: According to the semantic annotation result, filter several text entities that meet the preset semantic range to obtain several groups of source entities.
具体地,根据语义标注结果可以确定文本实体所表达的语义信息。服务器获取预设语义范围,筛选语义信息符合预设语义范围的若干文本实体,得到若干组源实体。预设语义范围可以来自终端发送的指示信息。Specifically, the semantic information expressed by the text entity can be determined according to the semantic annotation result. The server obtains the preset semantic range, filters several text entities whose semantic information meets the preset semantic range, and obtains several sets of source entities. The preset semantic range may come from the instruction information sent by the terminal.
举例说明,当用户想得到金融领域的问题文本时,将指示信息中的预设语义范围设置为金融领域,则服务器筛选属于金融领域的文本实体,得到源实体。For example, when the user wants to obtain the question text in the financial field, the preset semantic range in the instruction information is set to the financial field, and the server filters the text entities belonging to the financial field to obtain the source entity.
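Steps 3023-3024 can be pictured as a lookup against the semantic knowledge base followed by a filter. The toy knowledge base, its label set, and the "金融" (finance) preset range below are illustrative assumptions only; the embodiment's actual knowledge base is not specified here.

```python
# A toy semantic knowledge base mapping entities to domain labels (an assumption).
semantic_kb = {"利率": "金融", "贷款": "金融", "排名": "教育", "招生": "教育"}

def filter_by_semantic_range(entities, preset_range):
    """Keep only entities whose annotated domain falls inside the preset semantic range."""
    annotated = {e: semantic_kb.get(e, "未知") for e in entities}   # semantic annotation result
    return [e for e, label in annotated.items() if label in preset_range]

entities = ["利率", "排名", "贷款", "招生"]
print(filter_by_semantic_range(entities, preset_range={"金融"}))   # ['利率', '贷款']
```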
本实施例中,识别源文本中的文本实体,通过随机抽取或者根据语义抽取文本实体,保证了文本实体抽取的灵活性,从而保证了生成问题文本的灵活性。In this embodiment, the text entities in the source text are recognized, and the text entities are extracted randomly or semantically, so as to ensure the flexibility of text entity extraction, thereby ensuring the flexibility of generating the question text.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,该计算机可读指令可存储于一计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。Those of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium. When the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments.
虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。Although the steps in the flowchart of the drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless explicitly stated in this article, the execution of these steps is not strictly limited in order, and they can be executed in other orders.
进一步参考图8，作为对上述图2所示方法的实现，本申请提供了一种问题生成模型的训练装置的一个实施例，该装置实施例与图2所示的方法实施例相对应，该装置具体可以应用于各种电子设备中。With further reference to FIG. 8, as an implementation of the method shown in FIG. 2 above, this application provides an embodiment of a training apparatus for a question generation model. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
如图8所示，本实施例所述的问题生成模型的训练装置400包括：模型训练模块401、信息获取模块402、实体提取模块403、模型设置模块404、文本输入模块405、误差确定模块406以及模型调整模块407。其中：As shown in FIG. 8, the training apparatus 400 for a question generation model in this embodiment includes: a model training module 401, an information acquisition module 402, an entity extraction module 403, a model setting module 404, a text input module 405, an error determination module 406, and a model adjustment module 407. Specifically:
模型训练模块401，用于对初始模型进行预训练得到预训练语言模型，并在预训练中通过调整掩膜矩阵将初始模型中的网络实现单向模型、双向模型和序列到序列模型。The model training module 401 is configured to pre-train the initial model to obtain the pre-trained language model, and, during pre-training, to adjust the mask matrix so that the network in the initial model implements a unidirectional model, a bidirectional model, and a sequence-to-sequence model (a sketch of such mask matrices follows this module list).
信息获取模块402,用于通过网络爬虫从网络页面中获取问答信息,所述问答信息包括问题文本和答案文本。The information obtaining module 402 is configured to obtain question and answer information from a web page through a web crawler, and the question and answer information includes question text and answer text.
实体提取模块403,用于从所述答案文本中,提取与所述问题文本相关的关键实体。The entity extraction module 403 is used to extract key entities related to the question text from the answer text.
模型设置模块404,用于将预训练语言模型中的网络设置为序列到序列模型,以得到用于中文文本生成的预训练语言模型。The model setting module 404 is used to set the network in the pre-training language model to a sequence-to-sequence model to obtain a pre-training language model for Chinese text generation.
文本输入模块405,用于将关键实体和答案文本输入预训练语言模型,得到预训练语言模型输出的预测问题文本。The text input module 405 is used to input key entities and answer text into the pre-training language model to obtain the predicted question text output by the pre-training language model.
误差确定模块406,用于根据预测问题文本和问题文本,确定预测误差。The error determination module 406 is configured to determine the prediction error according to the prediction question text and the question text.
模型调整模块407,用于根据预测误差对预训练语言模型进行调整,直至预测误差满足训练停止条件,得到问题生成模型。The model adjustment module 407 is configured to adjust the pre-training language model according to the prediction error until the prediction error meets the training stop condition, and the problem generation model is obtained.
本实施例中，将预训练语言模型中的网络调整为序列到序列模型，使得预训练语言模型面向文本生成式任务且具备良好的文本生成能力，再对预训练语言模型进行微调得到问题生成模型，保证了生成问题的质量。In this embodiment, the network in the pre-trained language model is set to a sequence-to-sequence model, so that the pre-trained language model is oriented toward text generation tasks and has good text generation ability; the pre-trained language model is then fine-tuned to obtain the question generation model, which ensures the quality of the generated questions.
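As referenced for the model training module 401 above, the switch between the unidirectional, bidirectional, and sequence-to-sequence behaviours can be pictured as nothing more than a different self-attention mask over the same network. The sketch below is a minimal NumPy illustration under that reading; the function name, the "1 = may attend" convention, and the source/target split are assumptions made for illustration, not the patent's actual implementation.

```python
import numpy as np

def build_mask(src_len: int, tgt_len: int, mode: str) -> np.ndarray:
    """Return an attention mask M where M[i, j] = 1 means position i may attend to position j."""
    total = src_len + tgt_len
    if mode == "bidirectional":          # every token sees every token
        return np.ones((total, total), dtype=np.float32)
    if mode == "unidirectional":         # each token sees itself and the tokens to its left
        return np.tril(np.ones((total, total), dtype=np.float32))
    if mode == "seq2seq":                # source is bidirectional, target is causal
        mask = np.zeros((total, total), dtype=np.float32)
        mask[:, :src_len] = 1.0                            # everyone sees the source
        mask[src_len:, src_len:] = np.tril(                # target sees previously generated target tokens
            np.ones((tgt_len, tgt_len), dtype=np.float32))
        return mask
    raise ValueError(f"unknown mode: {mode}")

# e.g. an answer text of 4 tokens as source and a question text of 3 tokens as target
print(build_mask(4, 3, "seq2seq"))
```

In the sequence-to-sequence mask, the answer-text tokens see each other fully while the question-text tokens see the whole answer plus only the question tokens generated so far, which is the behaviour the embodiment relies on for text generation.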
在本实施例的一些可选的实现方式中，上述模型训练模块401包括：获取子模块、标识生成子模块、输入子模块和预训练子模块，其中：In some optional implementations of this embodiment, the above model training module 401 includes: an acquisition submodule, an identifier generation submodule, an input submodule, and a pre-training submodule, wherein:
获取子模块,用于获取用于预训练的初始模型以及多组预训练样本集;The acquisition sub-module is used to acquire the initial model used for pre-training and multiple sets of pre-training samples;
标识生成子模块,用于随机生成各组预训练样本集所对应的掩膜标识;掩膜标识对应的掩膜矩阵实现单向模型、双向模型和序列到序列模型;The identification generation sub-module is used to randomly generate the mask identification corresponding to each group of pre-training sample sets; the mask matrix corresponding to the mask identification realizes one-way model, two-way model and sequence-to-sequence model;
输入子模块,用于将各组预训练样本集分别输入初始模型,并根据预训练样本集所对应的掩膜标识调整初始模型中网络的掩膜矩阵;The input sub-module is used to input each group of pre-training sample sets into the initial model, and adjust the mask matrix of the network in the initial model according to the mask identifier corresponding to the pre-training sample set;
预训练子模块,用于根据输入的预训练样本集对掩膜矩阵调整后的初始模型依次进行预训练,得到预训练语言模型。The pre-training sub-module is used to sequentially pre-train the initial model adjusted by the mask matrix according to the input pre-training sample set to obtain the pre-training language model.
在本实施例的一些可选的实现方式中，初始模型中模型参数为半精度，上述模型训练模块401还包括参数设置子模块，参数设置子模块用于将初始模型中layernorm层和embedding层的模型参数设置为单精度。In some optional implementations of this embodiment, the model parameters in the initial model are half-precision, and the above model training module 401 further includes a parameter setting submodule, which is configured to set the model parameters of the layernorm layer and the embedding layer in the initial model to single precision.
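One common way to realize this mixed-precision setup is sketched below in PyTorch; the toy model and the use of torch are assumptions, since the embodiment does not name a framework.

```python
import torch
from torch import nn

# A stand-in for the initial model; the real model's architecture is not shown here.
model = nn.Sequential(
    nn.Embedding(100, 64),
    nn.Linear(64, 64),
    nn.LayerNorm(64),
)

model.half()                                   # model parameters become half precision (fp16)

# Keep the numerically sensitive layers in single precision (fp32).
for module in model.modules():
    if isinstance(module, (nn.LayerNorm, nn.Embedding)):
        module.float()

for name, p in model.named_parameters():
    print(name, p.dtype)
# 0.* -> torch.float32 (embedding), 1.* -> torch.float16 (linear), 2.* -> torch.float32 (layernorm)
```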
在本实施例的一些可选的实现方式中，上述实体提取模块403进一步用于：从问答信息内的问题文本和答案文本中，分别提取文本实体；计算答案文本中的各文本实体与问题文本中的各文本实体的相似度；从答案文本的各文本实体中，提取相似度符合预设相似度阈值的文本实体作为关键实体。In some optional implementations of this embodiment, the entity extraction module 403 is further configured to: extract text entities from the question text and the answer text in the question and answer information, respectively; calculate the similarity between each text entity in the answer text and each text entity in the question text; and, from the text entities of the answer text, extract those whose similarity meets a preset similarity threshold as key entities.
在本实施例的一些可选的实现方式中，答案文本包括至少一个子答案文本，上述文本输入模块405进一步用于：将至少一个子答案文本以及与子答案文本对应的关键实体输入预训练语言模型，得到至少一个三维字向量矩阵；将转化得到的三维字向量矩阵合并为二维字向量矩阵；通过预训练语言模型对二维字向量矩阵进行处理，得到预训练语言模型输出的预测问题文本，其中，预测问题文本存储在区块链中。In some optional implementations of this embodiment, the answer text includes at least one sub-answer text, and the text input module 405 is further configured to: input the at least one sub-answer text and the key entities corresponding to the sub-answer text into the pre-trained language model to obtain at least one three-dimensional word vector matrix; merge the converted three-dimensional word vector matrices into a two-dimensional word vector matrix; and process the two-dimensional word vector matrix through the pre-trained language model to obtain the predicted question text output by the pre-trained language model, where the predicted question text is stored in a blockchain.
在一个实施例中,提供了一种问题生成装置,包括:源文本获取模块、源实体抽取模块、源实体输入模块和问题生成模块,其中:In one embodiment, a question generation device is provided, including: a source text acquisition module, a source entity extraction module, a source entity input module, and a question generation module, wherein:
源文本获取模块,用于获取用于问题生成的源文本。The source text obtaining module is used to obtain the source text used for question generation.
源实体抽取模块,用于从源文本中筛选若干组源实体。The source entity extraction module is used to filter several groups of source entities from the source text.
源实体输入模块,用于分别将若干组源实体输入问题生成模型;其中,问题生成模型是采用上述问题生成模型的训练方法获取的模型。The source entity input module is used to input several groups of source entities into the question generation model; wherein, the question generation model is a model obtained by using the training method of the above question generation model.
问题生成模块,用于获取问题生成模型基于若干组源实体生成的若干问题文本。The question generation module is used to obtain several question texts generated by the question generation model based on several groups of source entities.
本实施例中,从用于问题文本生成的源文本中筛选若干组源实体,可以通过问题生成模型,依据不同的源实体生成不同的问题文本,提高了生成问题文本的灵活性。In this embodiment, several groups of source entities are filtered from the source text used for question text generation, and different question texts can be generated according to different source entities through the question generation model, which improves the flexibility of generating question text.
在本实施例的一些可选的实现方式中,上述源实体抽取模块进一步用于:识别源文本中的文本实体;从识别到的文本实体中随机抽取若干组文本实体,得到若干组源实体;或者,根据预设的语义知识库对源文本中的文本实体进行语义标注,得到语义标注结果;根据语义标注结果,筛选符合预设语义范围的若干文本实体,得到若干组源实体。In some optional implementations of this embodiment, the aforementioned source entity extraction module is further used to: identify text entities in the source text; randomly extract several groups of text entities from the recognized text entities to obtain several groups of source entities; Or, perform semantic annotation on the text entities in the source text according to a preset semantic knowledge base to obtain a semantic annotation result; according to the semantic annotation result, filter several text entities that meet the preset semantic range to obtain several groups of source entities.
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图9,图9为本实施例计算机设备基本结构框图。In order to solve the above technical problems, the embodiments of the present application also provide computer equipment. Please refer to FIG. 9 for details. FIG. 9 is a block diagram of the basic structure of the computer device in this embodiment.
所述计算机设备5包括通过系统总线相互通信连接存储器51、处理器52、网络接口53。需要指出的是,图中仅示出了具有组件51-53的计算机设备5,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备。The computer device 5 includes a memory 51, a processor 52, and a network interface 53 that communicate with each other through a system bus. It should be pointed out that the figure only shows the computer device 5 with the components 51-53, and it is not required to implement all the shown components, and more or fewer components may be implemented instead. The computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
所述存储器51至少包括一种类型的计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性,所述计算机可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,所述存储器51可以是所述计算机设备5的内部存储单元,例如该计算机设备5的硬盘或内存。在另一些实施例中,所述存储器51也可以是所述计算机设备5的外部存储设备,例如该计算机设备5上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,所述存储器51还可以既包括所述计算机设备5的内部存储单元也包括其外部存储设备。本实施例中,所述存储器51通常用于存储安装于所述计算机设备5的操作系统和各类应用软件,例如问题生成模型的训练方法、或问题生成方法的计算机可读指令等。此外,所述存储器51还可以用于暂时地存储已经输出或者将要输出的各类数据。The memory 51 includes at least one type of computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium includes flash memory, hard disk, and multimedia card. , Card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), Programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 51 may be an internal storage unit of the computer device 5, such as a hard disk or a memory of the computer device 5. In other embodiments, the memory 51 may also be an external storage device of the computer device 5, for example, a plug-in hard disk equipped on the computer device 5, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, Flash Card, etc. Of course, the memory 51 may also include both the internal storage unit of the computer device 5 and its external storage device. In this embodiment, the memory 51 is generally used to store an operating system and various application software installed in the computer device 5, such as a training method of a question generation model, or computer readable instructions of a question generation method, and the like. In addition, the memory 51 can also be used to temporarily store various types of data that have been output or will be output.
所述处理器52在一些实施例中可以是中央处理器(Central Processing Unit，CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器52通常用于控制所述计算机设备5的总体操作。本实施例中，所述处理器52用于运行所述存储器51中存储的计算机可读指令或者处理数据，例如运行问题生成模型的训练方法、或问题生成方法的计算机可读指令。In some embodiments, the processor 52 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 52 is generally used to control the overall operation of the computer device 5. In this embodiment, the processor 52 is configured to run the computer-readable instructions stored in the memory 51 or to process data, for example, to run the computer-readable instructions of the training method of the question generation model or of the question generation method.
所述网络接口53可包括无线网络接口或有线网络接口,该网络接口53通常用于在所述计算机设备5与其他电子设备之间建立通信连接。The network interface 53 may include a wireless network interface or a wired network interface, and the network interface 53 is generally used to establish a communication connection between the computer device 5 and other electronic devices.
本实施例中提供的计算机设备可以执行上述问题生成模型的训练方法的步骤。此处问题生成模型的训练方法的步骤可以是上述各个实施例的问题生成模型的训练方法中的步骤。The computer device provided in this embodiment can execute the steps of the training method of the problem generation model described above. Here, the steps of the training method of the question generation model may be the steps in the training method of the question generation model of each of the foregoing embodiments.
本实施例中，将预训练语言模型中的网络调整为序列到序列模型，使得预训练语言模型面向文本生成式任务且具备良好的文本生成能力，再对预训练语言模型进行微调得到问题生成模型，保证了生成问题的质量。本实施例中提供的计算机设备可以执行上述问题生成方法的步骤。此处问题生成方法的步骤可以是上述各个实施例的问题生成方法中的步骤。In this embodiment, the network in the pre-trained language model is set to a sequence-to-sequence model, so that the pre-trained language model is oriented toward text generation tasks and has good text generation ability; the pre-trained language model is then fine-tuned to obtain the question generation model, which ensures the quality of the generated questions. The computer device provided in this embodiment can also execute the steps of the above question generation method. Here, the steps of the question generation method may be the steps in the question generation methods of the foregoing embodiments.
本实施例中,从用于问题文本生成的源文本中筛选若干组源实体,可以通过问题生成模型,依据不同的源实体生成不同的问题文本,提高了生成问题文本的灵活性。In this embodiment, several groups of source entities are filtered from the source text used for question text generation, and different question texts can be generated according to different source entities through the question generation model, which improves the flexibility of generating question text.
本申请还提供了另一种实施方式，即提供一种计算机可读存储介质，所述计算机可读存储介质存储有训练问题生成模型的计算机可读指令，所述训练问题生成模型的计算机可读指令可被至少一个处理器执行，以使所述至少一个处理器执行如上述的问题生成模型的训练方法的步骤。This application also provides another implementation, namely a computer-readable storage medium storing computer-readable instructions for training a question generation model; the computer-readable instructions for training the question generation model can be executed by at least one processor, so that the at least one processor executes the steps of the training method of the question generation model described above.
本实施例中，将预训练语言模型中的网络调整为序列到序列模型，使得预训练语言模型面向文本生成式任务且具备良好的文本生成能力，再对预训练语言模型进行微调得到问题生成模型，保证了生成问题的质量。本申请还提供了另一种实施方式，即提供一种计算机可读存储介质，所述计算机可读存储介质存储有用于问题生成的计算机可读指令，所述用于问题生成的计算机可读指令可被至少一个处理器执行，以使所述至少一个处理器执行如上述的问题生成方法的步骤。In this embodiment, the network in the pre-trained language model is set to a sequence-to-sequence model, so that the pre-trained language model is oriented toward text generation tasks and has good text generation ability; the pre-trained language model is then fine-tuned to obtain the question generation model, which ensures the quality of the generated questions. This application also provides another implementation, namely a computer-readable storage medium storing computer-readable instructions for question generation; the computer-readable instructions for question generation can be executed by at least one processor, so that the at least one processor executes the steps of the question generation method described above.
本实施例中,从用于问题文本生成的源文本中筛选若干组源实体,可以通过问题生成模型,依据不同的源实体生成不同的问题文本,提高了生成问题文本的灵活性。In this embodiment, several groups of source entities are filtered from the source text used for question text generation, and different question texts can be generated according to different source entities through the question generation model, which improves the flexibility of generating question text.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform; of course, they can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of this application.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain)，本质上是一个去中心化的数据库，是一串使用密码学方法相关联产生的数据块，每一个数据块中包含了一批次网络交易的信息，用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with each other by cryptographic methods; each data block contains information on a batch of network transactions, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
显然,以上所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例,附图中给出了本申请的较佳实施例,但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现。凡是利用本申请说明书及附图内容所做的等效结构,直接或间接运用在其他相关的技术领域,均同理在本申请专利保护范围之内。Obviously, the above-described embodiments are only a part of the embodiments of the present application, rather than all of the embodiments. The drawings show preferred embodiments of the present application, but do not limit the patent scope of the present application. This application can be implemented in many different forms. All equivalent structures made by using the contents of the description and drawings of this application, directly or indirectly used in other related technical fields, are similarly within the scope of patent protection of this application.

Claims (20)

  1. 一种问题生成模型的训练方法,包括下述步骤:A training method of a problem generation model includes the following steps:
    对初始模型进行预训练得到预训练语言模型,并在预训练中通过调整掩膜矩阵将所述初始模型中的网络实现单向模型、双向模型和序列到序列模型;Pre-training the initial model to obtain a pre-training language model, and in the pre-training, by adjusting the mask matrix, the network in the initial model is realized as a one-way model, a two-way model, and a sequence-to-sequence model;
    通过网络爬虫从网络页面中获取问答信息,所述问答信息包括问题文本和答案文本;Obtaining question and answer information from a web page through a web crawler, where the question and answer information includes question text and answer text;
    从所述答案文本中,提取与所述问题文本相关的关键实体;Extract key entities related to the question text from the answer text;
    将所述预训练语言模型中的网络设置为序列到序列模型,以得到用于中文文本生成的预训练语言模型;Setting the network in the pre-training language model to a sequence-to-sequence model to obtain a pre-training language model for Chinese text generation;
    将所述关键实体和所述答案文本输入所述预训练语言模型,得到所述预训练语言模型输出的预测问题文本;Inputting the key entity and the answer text into the pre-training language model to obtain the predicted question text output by the pre-training language model;
    根据所述预测问题文本和所述问题文本,确定预测误差;Determine the prediction error according to the prediction question text and the question text;
    根据所述预测误差对所述预训练语言模型进行调整,直至所述预测误差满足训练停止条件,得到问题生成模型。The pre-training language model is adjusted according to the prediction error until the prediction error satisfies the training stop condition, and a problem generation model is obtained.
2. 根据权利要求1所述的问题生成模型的训练方法，其中，所述对初始模型进行预训练得到预训练语言模型，并在预训练中通过调整掩膜矩阵将所述初始模型中的网络实现单向模型、双向模型和序列到序列模型的步骤具体包括：The method for training a question generation model according to claim 1, wherein the step of pre-training the initial model to obtain the pre-trained language model and, during the pre-training, implementing a unidirectional model, a bidirectional model, and a sequence-to-sequence model for the network in the initial model by adjusting the mask matrix specifically comprises:
    获取用于预训练的初始模型以及多组预训练样本集;Obtain an initial model for pre-training and multiple sets of pre-training samples;
    随机生成各组预训练样本集所对应的掩膜标识;所述掩膜标识对应的掩膜矩阵实现单向模型、双向模型和序列到序列模型;Randomly generating the mask identifiers corresponding to each group of pre-training sample sets; the mask matrix corresponding to the mask identifiers realizes a one-way model, a two-way model, and a sequence-to-sequence model;
    将所述各组预训练样本集分别输入所述初始模型,并根据预训练样本集所对应的掩膜标识调整所述初始模型中网络的掩膜矩阵;Input each of the pre-training sample sets into the initial model, and adjust the mask matrix of the network in the initial model according to the mask identifier corresponding to the pre-training sample set;
    根据输入的预训练样本集对掩膜矩阵调整后的初始模型依次进行预训练,得到预训练语言模型。The initial model adjusted by the mask matrix is sequentially pre-trained according to the input pre-training sample set to obtain the pre-training language model.
  3. 根据权利要求2所述的问题生成模型的训练方法,其中,所述初始模型中模型参数为半精度,所述随机生成各组预训练样本集所对应的掩膜标识的步骤之前,还包括:The method for training a question generation model according to claim 2, wherein the model parameters in the initial model are half-precision, and before the step of randomly generating mask identifiers corresponding to each set of pre-training sample sets, the method further comprises:
    将所述初始模型中layernorm层和embedding层的模型参数设置为单精度。The model parameters of the layernorm layer and the embedding layer in the initial model are set to single precision.
  4. 根据权利要求1所述的问题生成模型的训练方法,其中,所述从所述答案文本中,提取与所述问题文本相关的关键实体的步骤具体包括:The method for training a question generation model according to claim 1, wherein the step of extracting key entities related to the question text from the answer text specifically comprises:
    从所述问答信息内的问题文本和答案文本中,分别提取文本实体;Extract text entities from the question text and answer text in the question and answer information;
    计算所述答案文本中的各文本实体与所述问题文本中的各文本实体的相似度;Calculating the similarity between each text entity in the answer text and each text entity in the question text;
    从所述答案文本的各文本实体中,提取相似度符合预设相似度阈值的文本实体作为关键实体。From each text entity of the answer text, extract the text entity whose similarity meets the preset similarity threshold as the key entity.
5. 根据权利要求1所述的问题生成模型的训练方法，其中，所述答案文本包括至少一个子答案文本，所述将所述关键实体和所述答案文本输入所述预训练语言模型，得到所述预训练语言模型输出的预测问题文本的步骤具体包括：The method for training a question generation model according to claim 1, wherein the answer text comprises at least one sub-answer text, and the step of inputting the key entities and the answer text into the pre-trained language model to obtain the predicted question text output by the pre-trained language model specifically comprises:
    将至少一个子答案文本以及与子答案文本对应的关键实体输入所述预训练语言模型,得到至少一个三维字向量矩阵;Input at least one sub-answer text and key entities corresponding to the sub-answer text into the pre-training language model to obtain at least one three-dimensional word vector matrix;
    将转化得到的三维字向量矩阵合并为二维字向量矩阵;Combine the converted three-dimensional word vector matrix into a two-dimensional word vector matrix;
    通过所述预训练语言模型对所述二维字向量矩阵进行处理,得到所述预训练语言模型输出的预测问题文本,其中,所述预测问题文本存储在区块链中。The two-dimensional word vector matrix is processed by the pre-training language model to obtain the prediction question text output by the pre-training language model, wherein the prediction question text is stored in a blockchain.
  6. 一种问题生成方法,包括下述步骤:A problem generation method includes the following steps:
    获取用于问题生成的源文本;Obtain the source text used for question generation;
    从所述源文本中筛选若干组源实体;Filter several groups of source entities from the source text;
    分别将所述若干组源实体输入问题生成模型,其中,所述问题生成模型是采用权利要求1-5任一项所述问题生成模型的训练方法获取的模型;Respectively inputting the several groups of source entities into a question generation model, wherein the question generation model is a model obtained by using the training method of the question generation model of any one of claims 1 to 5;
    获取所述问题生成模型基于所述若干组源实体生成的若干问题文本。Acquiring the question generation model based on several question texts generated by the several groups of source entities.
7. 根据权利要求6所述的问题生成方法，其中，所述从所述源文本中筛选若干组源实体包括：The question generation method according to claim 6, wherein the filtering of several groups of source entities from the source text comprises:
    识别所述源文本中的文本实体;Identifying text entities in the source text;
    从识别到的文本实体中随机抽取若干组文本实体,得到若干组源实体;Randomly extract several groups of text entities from the recognized text entities to obtain several groups of source entities;
    或者,or,
    根据预设的语义知识库对所述源文本中的文本实体进行语义标注,得到语义标注结果;Performing semantic annotation on the text entities in the source text according to a preset semantic knowledge base to obtain a semantic annotation result;
    根据所述语义标注结果,筛选符合预设语义范围的若干文本实体,得到若干组源实体。According to the semantic annotation result, several text entities that meet the preset semantic range are screened to obtain several groups of source entities.
  8. 一种问题生成模型的训练装置,包括:A training device for a problem generation model, including:
    模型训练模块,用于对初始模型进行预训练得到预训练语言模型,并在预训练中通过调整掩膜矩阵将所述初始模型中的网络实现单向模型、双向模型和序列到序列模型;The model training module is used to pre-train the initial model to obtain the pre-trained language model, and adjust the mask matrix in the pre-training to realize the one-way model, the two-way model and the sequence-to-sequence model of the network in the initial model;
    信息获取模块,用于通过网络爬虫从网络页面中获取问答信息,所述问答信息包括问题文本和答案文本;An information acquisition module for acquiring question and answer information from a web page through a web crawler, where the question and answer information includes question text and answer text;
    实体提取模块,用于从所述答案文本中,提取与所述问题文本相关的关键实体;An entity extraction module for extracting key entities related to the question text from the answer text;
    模型设置模块,用于将所述预训练语言模型中的网络设置为序列到序列模型,以得到用于中文文本生成的预训练语言模型;A model setting module, configured to set the network in the pre-training language model to a sequence-to-sequence model to obtain a pre-training language model for Chinese text generation;
    文本输入模块,用于将所述关键实体和所述答案文本输入所述预训练语言模型,得到所述预训练语言模型输出的预测问题文本;A text input module, configured to input the key entity and the answer text into the pre-training language model to obtain the predicted question text output by the pre-training language model;
    误差确定模块,用于根据所述预测问题文本和所述问题文本,确定预测误差;An error determination module, configured to determine a prediction error according to the prediction question text and the question text;
    模型调整模块,用于根据所述预测误差对所述预训练语言模型进行调整,直至所述预测误差满足训练停止条件,得到问题生成模型。The model adjustment module is configured to adjust the pre-training language model according to the prediction error until the prediction error satisfies the training stop condition to obtain a problem generation model.
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device includes a memory and a processor. The memory stores computer readable instructions. When the processor executes the computer readable instructions, the following steps are implemented:
    对初始模型进行预训练得到预训练语言模型,并在预训练中通过调整掩膜矩阵将所述初始模型中的网络实现单向模型、双向模型和序列到序列模型;Pre-training the initial model to obtain a pre-training language model, and in the pre-training, by adjusting the mask matrix, the network in the initial model is realized as a one-way model, a two-way model, and a sequence-to-sequence model;
    通过网络爬虫从网络页面中获取问答信息,所述问答信息包括问题文本和答案文本;Obtaining question and answer information from a web page through a web crawler, where the question and answer information includes question text and answer text;
    从所述答案文本中,提取与所述问题文本相关的关键实体;Extract key entities related to the question text from the answer text;
    将所述预训练语言模型中的网络设置为序列到序列模型,以得到用于中文文本生成的预训练语言模型;Setting the network in the pre-training language model to a sequence-to-sequence model to obtain a pre-training language model for Chinese text generation;
    将所述关键实体和所述答案文本输入所述预训练语言模型,得到所述预训练语言模型输出的预测问题文本;Inputting the key entity and the answer text into the pre-training language model to obtain the predicted question text output by the pre-training language model;
    根据所述预测问题文本和所述问题文本,确定预测误差;Determine the prediction error according to the prediction question text and the question text;
    根据所述预测误差对所述预训练语言模型进行调整,直至所述预测误差满足训练停止条件,得到问题生成模型。The pre-training language model is adjusted according to the prediction error until the prediction error satisfies the training stop condition, and a problem generation model is obtained.
10. 根据权利要求9所述的计算机设备，其中，所述对初始模型进行预训练得到预训练语言模型，并在预训练中通过调整掩膜矩阵将所述初始模型中的网络实现单向模型、双向模型和序列到序列模型的步骤具体包括：The computer device according to claim 9, wherein the step of pre-training the initial model to obtain the pre-trained language model and, during the pre-training, implementing a unidirectional model, a bidirectional model, and a sequence-to-sequence model for the network in the initial model by adjusting the mask matrix specifically comprises:
    获取用于预训练的初始模型以及多组预训练样本集;Obtain an initial model for pre-training and multiple sets of pre-training samples;
    随机生成各组预训练样本集所对应的掩膜标识;所述掩膜标识对应的掩膜矩阵实现单向模型、双向模型和序列到序列模型;Randomly generating the mask identifiers corresponding to each group of pre-training sample sets; the mask matrix corresponding to the mask identifiers realizes a one-way model, a two-way model, and a sequence-to-sequence model;
    将所述各组预训练样本集分别输入所述初始模型,并根据预训练样本集所对应的掩膜标识调整所述初始模型中网络的掩膜矩阵;Input each of the pre-training sample sets into the initial model, and adjust the mask matrix of the network in the initial model according to the mask identifier corresponding to the pre-training sample set;
    根据输入的预训练样本集对掩膜矩阵调整后的初始模型依次进行预训练,得到预训练语言模型。The initial model adjusted by the mask matrix is sequentially pre-trained according to the input pre-training sample set to obtain the pre-training language model.
  11. 根据权利要求10所述的计算机设备,其中,所述初始模型中模型参数为半精度,所述随机生成各组预训练样本集所对应的掩膜标识的步骤之前,还包括:10. The computer device according to claim 10, wherein the model parameters in the initial model are half-precision, and before the step of randomly generating mask identifiers corresponding to each group of pre-training sample sets, the method further comprises:
    将所述初始模型中layernorm层和embedding层的模型参数设置为单精度。The model parameters of the layernorm layer and the embedding layer in the initial model are set to single precision.
  12. 根据权利要求9所述的计算机设备,其中,所述从所述答案文本中,提取与所述问题文本相关的关键实体的步骤具体包括:The computer device according to claim 9, wherein the step of extracting key entities related to the question text from the answer text specifically comprises:
    从所述问答信息内的问题文本和答案文本中,分别提取文本实体;Extract text entities from the question text and answer text in the question and answer information;
    计算所述答案文本中的各文本实体与所述问题文本中的各文本实体的相似度;Calculating the similarity between each text entity in the answer text and each text entity in the question text;
    从所述答案文本的各文本实体中,提取相似度符合预设相似度阈值的文本实体作为关键实体。From each text entity of the answer text, extract the text entity whose similarity meets the preset similarity threshold as the key entity.
13. 根据权利要求9所述的计算机设备，其中，所述答案文本包括至少一个子答案文本，所述将所述关键实体和所述答案文本输入所述预训练语言模型，得到所述预训练语言模型输出的预测问题文本的步骤具体包括：The computer device according to claim 9, wherein the answer text comprises at least one sub-answer text, and the step of inputting the key entities and the answer text into the pre-trained language model to obtain the predicted question text output by the pre-trained language model specifically comprises:
    将至少一个子答案文本以及与子答案文本对应的关键实体输入所述预训练语言模型,得到至少一个三维字向量矩阵;Input at least one sub-answer text and key entities corresponding to the sub-answer text into the pre-training language model to obtain at least one three-dimensional word vector matrix;
    将转化得到的三维字向量矩阵合并为二维字向量矩阵;Combine the converted three-dimensional word vector matrix into a two-dimensional word vector matrix;
    通过所述预训练语言模型对所述二维字向量矩阵进行处理,得到所述预训练语言模型输出的预测问题文本,其中,所述预测问题文本存储在区块链中。The two-dimensional word vector matrix is processed by the pre-training language model to obtain the prediction question text output by the pre-training language model, wherein the prediction question text is stored in a blockchain.
  14. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device includes a memory and a processor. The memory stores computer readable instructions. When the processor executes the computer readable instructions, the following steps are implemented:
    获取用于问题生成的源文本;Obtain the source text used for question generation;
    从所述源文本中筛选若干组源实体;Filter several groups of source entities from the source text;
    分别将所述若干组源实体输入问题生成模型,其中,所述问题生成模型是采用权利要求1-5任一项所述问题生成模型的训练方法获取的模型;Respectively inputting the several groups of source entities into a question generation model, wherein the question generation model is a model obtained by using the training method of the question generation model of any one of claims 1 to 5;
    获取所述问题生成模型基于所述若干组源实体生成的若干问题文本。Acquiring the question generation model based on several question texts generated by the several groups of source entities.
  15. 一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下步骤:A computer-readable storage medium having computer-readable instructions stored thereon, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
    对初始模型进行预训练得到预训练语言模型,并在预训练中通过调整掩膜矩阵将所述初始模型中的网络实现单向模型、双向模型和序列到序列模型;Pre-training the initial model to obtain a pre-training language model, and in the pre-training, by adjusting the mask matrix, the network in the initial model is realized as a one-way model, a two-way model, and a sequence-to-sequence model;
    通过网络爬虫从网络页面中获取问答信息,所述问答信息包括问题文本和答案文本;Obtaining question and answer information from a web page through a web crawler, where the question and answer information includes question text and answer text;
    从所述答案文本中,提取与所述问题文本相关的关键实体;Extract key entities related to the question text from the answer text;
    将所述预训练语言模型中的网络设置为序列到序列模型,以得到用于中文文本生成的预训练语言模型;Setting the network in the pre-training language model to a sequence-to-sequence model to obtain a pre-training language model for Chinese text generation;
    将所述关键实体和所述答案文本输入所述预训练语言模型,得到所述预训练语言模型输出的预测问题文本;Inputting the key entity and the answer text into the pre-training language model to obtain the predicted question text output by the pre-training language model;
    根据所述预测问题文本和所述问题文本,确定预测误差;Determine the prediction error according to the prediction question text and the question text;
    根据所述预测误差对所述预训练语言模型进行调整,直至所述预测误差满足训练停止条件,得到问题生成模型。The pre-training language model is adjusted according to the prediction error until the prediction error satisfies the training stop condition, and a problem generation model is obtained.
16. 根据权利要求15所述的一种计算机可读存储介质，其中，所述对初始模型进行预训练得到预训练语言模型，并在预训练中通过调整掩膜矩阵将所述初始模型中的网络实现单向模型、双向模型和序列到序列模型的步骤具体包括：The computer-readable storage medium according to claim 15, wherein the step of pre-training the initial model to obtain the pre-trained language model and, during the pre-training, implementing a unidirectional model, a bidirectional model, and a sequence-to-sequence model for the network in the initial model by adjusting the mask matrix specifically comprises:
    获取用于预训练的初始模型以及多组预训练样本集;Obtain an initial model for pre-training and multiple sets of pre-training samples;
    随机生成各组预训练样本集所对应的掩膜标识;所述掩膜标识对应的掩膜矩阵实现单向模型、双向模型和序列到序列模型;Randomly generating the mask identifiers corresponding to each group of pre-training sample sets; the mask matrix corresponding to the mask identifiers realizes a one-way model, a two-way model, and a sequence-to-sequence model;
    将所述各组预训练样本集分别输入所述初始模型,并根据预训练样本集所对应的掩膜标识调整所述初始模型中网络的掩膜矩阵;Input each of the pre-training sample sets into the initial model, and adjust the mask matrix of the network in the initial model according to the mask identifier corresponding to the pre-training sample set;
    根据输入的预训练样本集对掩膜矩阵调整后的初始模型依次进行预训练,得到预训练语言模型。The initial model adjusted by the mask matrix is sequentially pre-trained according to the input pre-training sample set to obtain the pre-training language model.
  17. 根据权利要求16所述的一种计算机可读存储介质,其中,所述初始模型中模型参数为半精度,所述随机生成各组预训练样本集所对应的掩膜标识的步骤之前,还包括:The computer-readable storage medium according to claim 16, wherein the model parameters in the initial model are half-precision, and before the step of randomly generating the mask identifiers corresponding to each set of pre-training sample sets, the method further comprises :
    将所述初始模型中layernorm层和embedding层的模型参数设置为单精度。The model parameters of the layernorm layer and the embedding layer in the initial model are set to single precision.
18. 根据权利要求15所述的一种计算机可读存储介质，其中，所述从所述答案文本中，提取与所述问题文本相关的关键实体的步骤具体包括：The computer-readable storage medium according to claim 15, wherein the step of extracting key entities related to the question text from the answer text specifically comprises:
    从所述问答信息内的问题文本和答案文本中,分别提取文本实体;Extract text entities from the question text and answer text in the question and answer information;
    计算所述答案文本中的各文本实体与所述问题文本中的各文本实体的相似度;Calculating the similarity between each text entity in the answer text and each text entity in the question text;
    从所述答案文本的各文本实体中,提取相似度符合预设相似度阈值的文本实体作为关键实体。From each text entity of the answer text, extract the text entity whose similarity meets the preset similarity threshold as the key entity.
19. 根据权利要求15所述的一种计算机可读存储介质，其中，所述答案文本包括至少一个子答案文本，所述将所述关键实体和所述答案文本输入所述预训练语言模型，得到所述预训练语言模型输出的预测问题文本的步骤具体包括：The computer-readable storage medium according to claim 15, wherein the answer text comprises at least one sub-answer text, and the step of inputting the key entities and the answer text into the pre-trained language model to obtain the predicted question text output by the pre-trained language model specifically comprises:
    将至少一个子答案文本以及与子答案文本对应的关键实体输入所述预训练语言模型,得到至少一个三维字向量矩阵;Input at least one sub-answer text and key entities corresponding to the sub-answer text into the pre-training language model to obtain at least one three-dimensional word vector matrix;
    将转化得到的三维字向量矩阵合并为二维字向量矩阵;Combine the converted three-dimensional word vector matrix into a two-dimensional word vector matrix;
    通过所述预训练语言模型对所述二维字向量矩阵进行处理,得到所述预训练语言模型输出的预测问题文本,其中,所述预测问题文本存储在区块链中。The two-dimensional word vector matrix is processed by the pre-training language model to obtain the prediction question text output by the pre-training language model, wherein the prediction question text is stored in a blockchain.
  20. 一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下步骤:A computer-readable storage medium having computer-readable instructions stored thereon, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
    获取用于问题生成的源文本;Obtain the source text used for question generation;
    从所述源文本中筛选若干组源实体;Filter several groups of source entities from the source text;
    分别将所述若干组源实体输入问题生成模型,其中,所述问题生成模型是采用权利要求1-5任一项所述问题生成模型的训练方法获取的模型;Respectively inputting the several groups of source entities into a question generation model, wherein the question generation model is a model obtained by using the training method of the question generation model of any one of claims 1 to 5;
    获取所述问题生成模型基于所述若干组源实体生成的若干问题文本。Acquiring the question generation model based on several question texts generated by the several groups of source entities.
PCT/CN2020/105777 2020-04-29 2020-07-30 Method for training question generation model, question generation method, and related device WO2021217935A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010356637.X 2020-04-29
CN202010356637.XA CN111639163A (en) 2020-04-29 2020-04-29 Problem generation model training method, problem generation method and related equipment

Publications (1)

Publication Number Publication Date
WO2021217935A1 true WO2021217935A1 (en) 2021-11-04

Family

ID=72330978

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105777 WO2021217935A1 (en) 2020-04-29 2020-07-30 Method for training question generation model, question generation method, and related device

Country Status (2)

Country Link
CN (1) CN111639163A (en)
WO (1) WO2021217935A1 (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887245A (en) * 2021-12-02 2022-01-04 腾讯科技(深圳)有限公司 Model training method and related device
CN114330512A (en) * 2021-12-13 2022-04-12 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN114970563A (en) * 2022-07-28 2022-08-30 山东大学 Chinese question generation method and system fusing content and form diversity
CN115277626A (en) * 2022-07-29 2022-11-01 平安科技(深圳)有限公司 Address information conversion method, electronic device, and computer-readable storage medium
CN115438176A (en) * 2022-11-08 2022-12-06 阿里巴巴达摩院(杭州)科技有限公司 Method and equipment for generating downstream task model and executing task
CN115600602A (en) * 2022-12-13 2023-01-13 中南大学(Cn) Method, system and terminal device for extracting key elements of long text
CN115713065A (en) * 2022-11-08 2023-02-24 贝壳找房(北京)科技有限公司 Method for generating question, electronic equipment and computer readable storage medium
CN116383365A (en) * 2023-06-01 2023-07-04 广州里工实业有限公司 Learning material generation method and system based on intelligent manufacturing and electronic equipment
CN116402164A (en) * 2023-06-06 2023-07-07 之江实验室 Robot task generation method, device and medium based on pre-training language model
CN116757254A (en) * 2023-08-16 2023-09-15 阿里巴巴(中国)有限公司 Task processing method, electronic device and storage medium
CN116775847A (en) * 2023-08-18 2023-09-19 中国电子科技集团公司第十五研究所 Question answering method and system based on knowledge graph and large language model
CN116910572A (en) * 2023-09-13 2023-10-20 浪潮(北京)电子信息产业有限公司 Training method and device for three-dimensional content generation model based on pre-training language model
CN116935230A (en) * 2023-09-13 2023-10-24 山东建筑大学 Crop pest identification method, device, equipment and medium
CN116932803A (en) * 2023-09-13 2023-10-24 浪潮(北京)电子信息产业有限公司 Data set generation method and training method based on multi-mode pre-training model
CN117011612A (en) * 2023-08-16 2023-11-07 海南省新超豪信息技术有限公司 AI identification method for traditional Chinese medicinal materials
CN117235240A (en) * 2023-11-14 2023-12-15 神州医疗科技股份有限公司 Multi-model result fusion question-answering method and system based on asynchronous consumption queue
CN117290492A (en) * 2023-11-27 2023-12-26 深圳市灵智数字科技有限公司 Knowledge base question-answering method and device, electronic equipment and storage medium
CN117555644A (en) * 2024-01-11 2024-02-13 之江实验室 Front-end page construction method and device based on natural language interaction
CN117609444A (en) * 2023-11-08 2024-02-27 天讯瑞达通信技术有限公司 Searching question-answering method based on large model
CN117710538A (en) * 2023-11-16 2024-03-15 北京百悟科技有限公司 Digital person display method, device, equipment and storage medium
CN117852654A (en) * 2024-02-05 2024-04-09 清华大学 Model training method and method for solving problems in specific field
CN117892139A (en) * 2024-03-14 2024-04-16 中国医学科学院医学信息研究所 Large language model training and using method based on interlayer comparison and related device
CN118350463A (en) * 2024-06-17 2024-07-16 恒生电子股份有限公司 Question-answer model training method, text processing method and rewarding model training method

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100351A (en) * 2020-09-11 2020-12-18 陕西师范大学 Method and equipment for constructing intelligent question-answering system through question generation data set
CN114385809B (en) * 2020-10-22 2024-06-18 中移(成都)信息通信科技有限公司 Training method, device and equipment for entity text extraction model
CN112559702B (en) * 2020-11-10 2022-09-30 西安理工大学 Method for generating natural language problem in civil construction information field based on Transformer
CN112257393B (en) 2020-12-22 2021-04-13 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing text generation
CN112347793B (en) * 2020-12-30 2021-05-14 北京智源人工智能研究院 Semantic analysis method and device based on rules and learning and electronic equipment
CN113420129B (en) * 2021-05-08 2022-11-18 天津大学 Method for controlling dialog generation based on large-scale general pre-training model
CN113743095B (en) * 2021-07-19 2024-09-20 西安理工大学 Chinese problem generation unified pre-training method based on word lattice and relative position embedding
CN113569025B (en) * 2021-07-23 2024-08-20 上海明略人工智能(集团)有限公司 Data processing method and device, electronic equipment and storage medium
CN113673702B (en) * 2021-07-27 2022-07-29 北京师范大学 Method and device for evaluating pre-training language model and storage medium
CN113569033A (en) * 2021-08-04 2021-10-29 工银科技有限公司 Government affairs question generation method and device
CN113590844A (en) * 2021-08-09 2021-11-02 北京智源人工智能研究院 Knowledge graph-based question-answer library generation method and device, electronic equipment and storage medium
CN114360537A (en) * 2021-12-27 2022-04-15 科大讯飞股份有限公司 Spoken question and answer scoring method, spoken question and answer training method, computer equipment and storage medium
CN114461749B (en) * 2022-02-15 2023-04-07 北京百度网讯科技有限公司 Data processing method and device for conversation content, electronic equipment and medium
CN115687031A (en) * 2022-11-15 2023-02-03 北京优特捷信息技术有限公司 Method, device, equipment and medium for generating alarm description text
CN118312598A (en) * 2023-06-30 2024-07-09 北京百度网讯科技有限公司 Text generation method, training method and device of text generation model
CN116860933B (en) * 2023-06-30 2024-07-12 北京百度网讯科技有限公司 Dialogue model training method, reply information generating method, device and medium
CN118228839A (en) * 2024-04-23 2024-06-21 北京面壁智能科技有限责任公司 Method and device for constructing complex instruction training data for model training, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019079922A1 (en) * 2017-10-23 2019-05-02 腾讯科技(深圳)有限公司 Session information processing method and device, and storage medium
CN108846130A (en) * 2018-06-29 2018-11-20 北京百度网讯科技有限公司 Question text generation method, device, equipment and medium
CN109657041A (en) * 2018-12-04 2019-04-19 南京理工大学 Automatic question generation method based on deep learning
CN110162613A (en) * 2019-05-27 2019-08-23 腾讯科技(深圳)有限公司 Question generation method, device, equipment and storage medium
CN110188182A (en) * 2019-05-31 2019-08-30 中国科学院深圳先进技术研究院 Model training method, dialogue generation method, device, equipment and medium
CN110188331A (en) * 2019-06-03 2019-08-30 腾讯科技(深圳)有限公司 Model training method, conversational system evaluation method, device, equipment and storage medium

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887245A (en) * 2021-12-02 2022-01-04 腾讯科技(深圳)有限公司 Model training method and related device
CN114330512A (en) * 2021-12-13 2022-04-12 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN114330512B (en) * 2021-12-13 2024-04-26 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN114970563A (en) * 2022-07-28 2022-08-30 山东大学 Chinese question generation method and system fusing content and form diversity
CN115277626A (en) * 2022-07-29 2022-11-01 平安科技(深圳)有限公司 Address information conversion method, electronic device, and computer-readable storage medium
CN115713065A (en) * 2022-11-08 2023-02-24 贝壳找房(北京)科技有限公司 Method for generating question, electronic equipment and computer readable storage medium
CN115438176B (en) * 2022-11-08 2023-04-07 阿里巴巴达摩院(杭州)科技有限公司 Method and equipment for generating downstream task model and executing task
CN115713065B (en) * 2022-11-08 2023-09-15 贝壳找房(北京)科技有限公司 Method for generating question, electronic equipment and computer readable storage medium
WO2024099144A1 (en) * 2022-11-08 2024-05-16 阿里巴巴达摩院(杭州)科技有限公司 Downstream task model generation method, task execution method, and device
CN115438176A (en) * 2022-11-08 2022-12-06 阿里巴巴达摩院(杭州)科技有限公司 Method and equipment for generating downstream task model and executing task
CN115600602A (en) * 2022-12-13 2023-01-13 中南大学(Cn) Method, system and terminal device for extracting key elements of long text
CN116383365A (en) * 2023-06-01 2023-07-04 广州里工实业有限公司 Learning material generation method and system based on intelligent manufacturing and electronic equipment
CN116383365B (en) * 2023-06-01 2023-09-08 广州里工实业有限公司 Learning material generation method and system based on intelligent manufacturing and electronic equipment
CN116402164A (en) * 2023-06-06 2023-07-07 之江实验室 Robot task generation method, device and medium based on pre-training language model
CN116402164B (en) * 2023-06-06 2023-09-05 之江实验室 Robot task generation method, device and medium based on pre-training language model
CN117011612A (en) * 2023-08-16 2023-11-07 海南省新超豪信息技术有限公司 AI identification method for traditional Chinese medicinal materials
CN116757254A (en) * 2023-08-16 2023-09-15 阿里巴巴(中国)有限公司 Task processing method, electronic device and storage medium
CN116757254B (en) * 2023-08-16 2023-11-14 阿里巴巴(中国)有限公司 Task processing method, electronic device and storage medium
CN116775847B (en) * 2023-08-18 2023-11-28 中国电子科技集团公司第十五研究所 Question answering method and system based on knowledge graph and large language model
CN116775847A (en) * 2023-08-18 2023-09-19 中国电子科技集团公司第十五研究所 Question answering method and system based on knowledge graph and large language model
CN116910572B (en) * 2023-09-13 2024-02-09 浪潮(北京)电子信息产业有限公司 Training method and device for three-dimensional content generation model based on pre-training language model
CN116935230B (en) * 2023-09-13 2023-12-15 山东建筑大学 Crop pest identification method, device, equipment and medium
CN116932803B (en) * 2023-09-13 2024-01-26 浪潮(北京)电子信息产业有限公司 Data set generation method and training method based on multi-mode pre-training model
CN116910572A (en) * 2023-09-13 2023-10-20 浪潮(北京)电子信息产业有限公司 Training method and device for three-dimensional content generation model based on pre-training language model
CN116935230A (en) * 2023-09-13 2023-10-24 山东建筑大学 Crop pest identification method, device, equipment and medium
CN116932803A (en) * 2023-09-13 2023-10-24 浪潮(北京)电子信息产业有限公司 Data set generation method and training method based on multi-mode pre-training model
CN117609444A (en) * 2023-11-08 2024-02-27 天讯瑞达通信技术有限公司 Searching question-answering method based on large model
CN117235240A (en) * 2023-11-14 2023-12-15 神州医疗科技股份有限公司 Multi-model result fusion question-answering method and system based on asynchronous consumption queue
CN117235240B (en) * 2023-11-14 2024-02-20 神州医疗科技股份有限公司 Multi-model result fusion question-answering method and system based on asynchronous consumption queue
CN117710538A (en) * 2023-11-16 2024-03-15 北京百悟科技有限公司 Digital person display method, device, equipment and storage medium
CN117290492A (en) * 2023-11-27 2023-12-26 深圳市灵智数字科技有限公司 Knowledge base question-answering method and device, electronic equipment and storage medium
CN117555644A (en) * 2024-01-11 2024-02-13 之江实验室 Front-end page construction method and device based on natural language interaction
CN117555644B (en) * 2024-01-11 2024-04-30 之江实验室 Front-end page construction method and device based on natural language interaction
CN117852654A (en) * 2024-02-05 2024-04-09 清华大学 Model training method and method for solving problems in specific field
CN117892139A (en) * 2024-03-14 2024-04-16 中国医学科学院医学信息研究所 Large language model training and using method based on interlayer comparison and related device
CN117892139B (en) * 2024-03-14 2024-05-14 中国医学科学院医学信息研究所 Large language model training and using method based on interlayer comparison and related device
CN118350463A (en) * 2024-06-17 2024-07-16 恒生电子股份有限公司 Question-answer model training method, text processing method and reward model training method

Also Published As

Publication number Publication date
CN111639163A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
WO2021217935A1 (en) Method for training question generation model, question generation method, and related device
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN112131366A (en) Method, device and storage medium for training text classification model and text classification
CN111310436B (en) Text processing method and device based on artificial intelligence and electronic equipment
CN115587175B (en) Man-machine conversation and pre-training language model training method and system and electronic equipment
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN112215008A (en) Entity recognition method and device based on semantic understanding, computer equipment and medium
CN111695356A (en) Synonym corpus generation method, synonym corpus generation device, computer system and readable storage medium
CN109117474A (en) Statement similarity calculation method, device and storage medium
CN111767394A (en) Abstract extraction method and device based on artificial intelligence expert system
CN111552798B (en) Name information processing method and device based on name prediction model and electronic equipment
JP2023002690A (en) Semantic recognition method, apparatus, electronic device, and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN111382563A (en) Text relevance determining method and device
WO2022073341A1 (en) Disease entity matching method and apparatus based on voice semantics, and computer device
WO2024125155A1 (en) Entity linking method and apparatus, computer device and storage medium
CN116913278A (en) Voice processing method, device, equipment and storage medium
CN115169370B (en) Corpus data enhancement method and device, computer equipment and medium
CN115310429B (en) Data compression and high-performance calculation method in multi-round listening dialogue model
CN116881446A (en) Semantic classification method, device, equipment and storage medium thereof
KR102541806B1 (en) Method, system, and computer readable record medium for ranking reformulated query
CN113591493A (en) Translation model training method and translation model device
CN113434789A (en) Search sorting method based on multi-dimensional text features and related equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933498

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 160223)

122 Ep: pct application non-entry in european phase

Ref document number: 20933498

Country of ref document: EP

Kind code of ref document: A1