WO2021164292A1 - Reading model optimization method and apparatus based on big data, and device and medium - Google Patents

Reading model optimization method and apparatus based on big data, and device and medium

Info

Publication number
WO2021164292A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set
answer
model
ternary
labeled
Prior art date
Application number
PCT/CN2020/123170
Other languages
French (fr)
Chinese (zh)
Inventor
楼星雨
许开河
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021164292A1


Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06N: Computing arrangements based on specific computational models
    • G06N 20/00: Machine learning
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Definitions

  • This application relates to the field of information technology, and in particular to a method, apparatus, device, and medium for optimizing a reading model based on big data.
  • In artificial intelligence, reading comprehension is a challenging and widely used information processing technology in the field of natural language processing. Reading comprehension technology aims to find the corresponding answer in a given article or document according to a raised question, and can even judge whether the question can be answered at all.
  • An excellent reading comprehension model needs human-like language understanding and knowledge reasoning abilities in order to deeply mine and analyze an article and, according to a specific question, focus on different parts or viewpoints of the article to find the correct answer; the task is therefore highly difficult.
  • Current state-of-the-art reading comprehension models are all built on complex deep learning structures, which require a huge amount of training data for the model to learn.
  • The training data for reading comprehension needs to be pre-labeled to locate the article information, question information, and answer information.
  • The inventor realized that labeling the training data is very difficult, because an annotator must first read the entire article and then generate the corresponding answers according to the given questions, so neither efficiency nor accuracy can be well guaranteed. Due to the high cost of acquiring labeled data, in actual use reading comprehension models are often trained on small-scale training data and cannot find a better solution in the parameter space, which limits the accuracy of the model.
  • The embodiments of the present application provide a reading model optimization method, apparatus, device, and medium based on big data, to solve the problems of small training samples and low model accuracy caused by the high cost of acquiring labeled data in existing reading comprehension technology.
  • a reading model optimization method based on big data including:
  • a reading model optimization device based on big data including:
  • a pre-training module, configured to acquire a labeled data set and pre-train a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
  • a binary data pair generation module, configured to acquire an unlabeled data set and predict the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;
  • a ternary data pair generation module, configured to predict the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;
  • a filtering module, configured to filter the ternary data pairs through the pre-trained first reading comprehension model;
  • a screening module, configured to screen the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;
  • an optimization training module, configured to perform optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • one or more non-volatile readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • FIG. 1 is a flowchart of a method for optimizing a reading model based on big data in an embodiment of the present application
  • FIG. 2 is a flowchart of step S101 in a method for optimizing a reading model based on big data in another embodiment of the present application;
  • FIG. 3 is a flowchart of step S102 in a method for optimizing a reading model based on big data in another embodiment of the present application;
  • FIG. 4 is a flowchart of step S104 in a method for optimizing a reading model based on big data in another embodiment of the present application;
  • FIG. 5 is a flowchart of step S105 in a method for optimizing a reading model based on big data in another embodiment of the present application;
  • FIG. 6 is a schematic block diagram of a reading model optimization device based on big data in an embodiment of the present application.
  • Fig. 7 is a schematic diagram of a computer device in an embodiment of the present application.
  • The method for optimizing a reading model based on big data is a self-training method for improving a reading comprehension model; its purpose is to overcome the problem that a better solution cannot be found in the parameter space due to insufficient labeled reading comprehension data.
  • the method for optimizing a reading model based on big data is applied to a server.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for optimizing a reading model based on big data is provided, which includes the following steps:
  • step S101 a labeled data set is obtained, and a preset first reading comprehension model, a question generation model, and a second reading comprehension model are pre-trained according to the labeled data set.
  • the embodiment of the present application first uses a small number of existing labeled data sets to pre-train the reading comprehension model, so as to pave the way for obtaining pseudo-labeled data from the unlabeled data set.
  • Optionally, as shown in FIG. 2, the acquiring of the labeled data set and the pre-training of the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set in step S101 includes:
  • a labeled data set is obtained.
  • the labeled data set includes several labeled data pairs, and each labeled data pair includes article information, question information, and corresponding answer information.
  • Here, each labeled data pair includes article information, question information, and corresponding answer information, and the article information, question information, and answer information in each labeled data pair have been manually labeled in a specified manner.
  • Exemplarily, the ternary data pair can be implemented in a preset composition format; for example, each labeled data pair may be designed as a triplet: (Passage, Query, Answer). It can also be realized through identification information.
  • Taking a customer service robot as an example, the labeled data set is the collection of data pairs obtained by manually labeling historical customer service conversations; the article information is the historical customer service conversation, the question information is the question raised in that conversation, and the answer information is the answer to that question in the conversation. Because historical customer service conversation data is limited, the labeled data pairs in the labeled data set are also limited; and because manual labeling is expensive and difficult, even if the amount of historical conversation data is large enough, it is difficult to obtain enough training samples in a short period of time.
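For illustration, a minimal sketch of such a labeled data pair in Python follows; the application specifies only the (Passage, Query, Answer) format, so the class name, fields, and example strings below are hypothetical:

```python
from typing import NamedTuple

class LabeledPair(NamedTuple):
    """One labeled data pair: the (Passage, Query, Answer) triplet."""
    passage: str  # article information, e.g. a historical customer service conversation
    query: str    # question information raised in the conversation
    answer: str   # answer information for that question

# A labeled data set is simply a (typically small) collection of such triplets.
labeled_set = [
    LabeledPair(
        passage="User: My card was charged twice. Agent: The duplicate "
                "charge will be refunded within 3 days.",
        query="When will the duplicate charge be refunded?",
        answer="within 3 days",
    ),
]
```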
  • step S202 pre-training the first reading comprehension model is performed using the article information and question information in the labeled data set.
  • The first reading comprehension model is a big data model that predicts answer information from article information and question information: its input is the article information and question information, and its output is the answer information.
  • the first reading comprehension model includes but is not limited to the R-net machine reading comprehension model and the BERT machine reading comprehension model.
  • the article information and question information in each labeled data pair in the labeled data set are used as the input of the first reading comprehension model, and the first reading comprehension model is pre-trained.
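Since BERT is named as one admissible first reading comprehension model, a minimal pre-training (fine-tuning) sketch using the Hugging Face transformers library is given below. The application does not specify an implementation, so the checkpoint name, hyperparameters, and character-to-token span mapping are assumptions:

```python
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForQuestionAnswering.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def pretrain_step(passage: str, query: str, answer_start: int, answer_end: int):
    """One training step: (article, question) in, answer span out."""
    enc = tokenizer(query, passage, return_tensors="pt",
                    truncation=True, max_length=512)
    # Map the answer's character span in the passage to token positions
    # (may return None for whitespace; a real pipeline must handle that).
    start_tok = enc.char_to_token(0, answer_start, sequence_index=1)
    end_tok = enc.char_to_token(0, answer_end - 1, sequence_index=1)
    out = model(**enc,
                start_positions=torch.tensor([start_tok]),
                end_positions=torch.tensor([end_tok]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```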
  • step S203 the article information and answer information in the labeled data set are used to pre-train the question generation model.
  • The question generation model is a big data model that predicts question information from article information and answer information: its input is the article information and answer information, and its output is the question information.
  • Exemplarily, the question generation model may adopt a sequence-to-sequence architecture with a copy mechanism.
  • the article information and answer information of each labeled data pair in the labeled data set are used as the input of the question generation model, and the question generation model is pre-trained.
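The application names a sequence-to-sequence + copy architecture but gives no code. As a rough stand-in, one could fine-tune a generic pretrained encoder-decoder on (article, answer) → question pairs; note the copy mechanism is omitted here (subword seq2seq models partly obviate it), and the checkpoint name and prompt format are placeholders:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/mt5-small")
qg_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def qg_loss(passage: str, answer: str, question: str):
    """Encode the article and answer; train the decoder to emit the question."""
    src = tok(f"answer: {answer} context: {passage}",
              return_tensors="pt", truncation=True, max_length=512)
    tgt = tok(question, return_tensors="pt", truncation=True, max_length=64)
    return qg_model(**src, labels=tgt.input_ids).loss
```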
  • step S204 pre-training the second reading comprehension model is performed using the article information in the labeled data set.
  • The second reading comprehension model is a big data model that predicts answer information from article information alone: its input is the article information, and its output is the answer information.
  • the article information of each labeled data pair in the labeled data set is used as the input of the second reading comprehension model, and the second reading comprehension model is pre-trained.
  • the first reading comprehension model, question generation model, and second reading comprehension model after the pre-training is completed are used to subsequently generate a complete article, question, and answer ternary data pair based on the unlabeled data set, thereby generating a pseudo-labeled data set.
  • the question generation model is used to generate question information based on the unlabeled data set.
  • As noted above, the question generation model is usually pre-trained with a large amount of labeled article information, achieving the effect that the input is a certain sentence in the article information and the output is the sentence preceding it.
  • the second reading comprehension model is used to generate answer information based on the unlabeled data set as the source of one of the answers to the question information.
  • the first reading comprehension model is used to filter the question information and answer information.
  • the method for optimizing a reading model based on big data further includes:
  • step S102 an unlabeled data set is obtained, and the unlabeled data set is predicted by the pre-trained second reading comprehension model to obtain a binary data pair of the unlabeled data set about the article and the answer.
  • The unlabeled data set is article information without any labeling, obtained from the Internet (for example, Wikipedia or official accounts) by crawling or downloading; it is vast compared with the labeled data set.
  • the unlabeled data set is unlabeled article data obtained from the Internet, and historical customer service conversation data that has not yet been labeled.
  • The embodiment of the present application feeds the unlabeled data set as input into the pre-trained second reading comprehension model for prediction to obtain the answer information of the unlabeled data set, and then combines the article information and answer information to obtain article-answer binary data pairs.
  • Optionally, the second reading comprehension model can also be combined with other methods to predict answer information.
  • As shown in FIG. 3, predicting the unlabeled data set through the pre-trained second reading comprehension model in step S102 to obtain the article-answer binary data pairs of the unlabeled data set includes:
  • step S301 the unlabeled data set is input to a pre-trained second reading comprehension model, and the output of the pre-trained second reading comprehension model is obtained as the first predicted answer.
  • the first predicted answer is answer information obtained by predicting the unlabeled data set through a second reading comprehension model.
  • step S302 named entity recognition is performed on the unlabeled data set to obtain a second predicted answer.
  • the second predicted answer is answer information obtained by predicting the unlabeled data set through a named entity recognition technology.
  • Here, Named Entity Recognition (NER) is a rule-based method that can extract predicted answer information through regular expressions, pre-built entity dictionaries, and open-source syntax trees.
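A minimal sketch of this rule-based extraction is shown below; the actual regular expressions, entity dictionaries, and syntax trees are not specified in the application, so the resources here are purely illustrative:

```python
import re

# Illustrative resources; a real system would use domain-specific
# dictionaries and an open-source syntactic parser in addition.
ENTITY_DICT = {"refund", "duplicate charge", "billing address"}
DATE_PATTERN = re.compile(r"within \d+ (?:days|hours)|\d{4}-\d{2}-\d{2}")

def extract_candidate_answers(article: str) -> list[str]:
    """Collect spans matched by the regex or found in the entity dictionary."""
    answers = DATE_PATTERN.findall(article)
    answers += [e for e in ENTITY_DICT if e in article]
    return answers
```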
  • In step S303, a bidirectional long short-term memory network and conditional random field techniques are used to obtain a third predicted answer from the unlabeled data set.
  • The third predicted answer is the answer information obtained by predicting the unlabeled data set through bidirectional Long Short-Term Memory (LSTM) and Conditional Random Field (CRF) techniques.
  • Bidirectional LSTM combined with a conditional random field is currently the mainstream model-based method for entity recognition: the bidirectional LSTM performs feature extraction, and the conditional random field then incorporates the dependencies between tags into the extracted features and predicts the answer location, yielding the answer information.
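A compact sketch of a BiLSTM feature extractor with a CRF tagging layer follows, using the third-party pytorch-crf package; the layer sizes and the assumption of a token-level tag scheme (e.g. BIO over answer spans) are choices made here, not taken from the application:

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size: int, num_tags: int,
                 emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden, num_tags)   # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, tokens, tags=None):
        feats = self.fc(self.lstm(self.emb(tokens))[0])
        if tags is not None:              # training: negative log-likelihood
            return -self.crf(feats, tags)
        return self.crf.decode(feats)     # inference: best tag sequence
```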
  • In step S304, the first predicted answer, the second predicted answer, and the third predicted answer are combined to obtain the article-answer binary data pairs of the unlabeled data set.
  • The foregoing prediction of the first predicted answer through the second reading comprehension model, extraction of the second predicted answer using named entity recognition, and extraction of the third predicted answer using bidirectional LSTM and conditional random field techniques can be performed synchronously in parallel.
  • the obtained first predicted answer, second predicted answer, and third predicted answer are all regarded as one of the answer information corresponding to the unlabeled data set.
  • the embodiment of the present application obtains the answer information corresponding to the unlabeled data set by combining multiple methods, which can effectively increase the number and diversity of predicted answers, and further expand the number of binary data pairs and ternary data pairs.
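A sketch of the combination step, deduplicating the candidate answers from the three sources into article-answer pairs; the function and argument names are illustrative:

```python
def build_binary_pairs(article, mrc_answers, ner_answers, lstm_crf_answers):
    """Union the three answer sources, keeping one pair per distinct answer."""
    candidates = set(mrc_answers) | set(ner_answers) | set(lstm_crf_answers)
    return [(article, ans) for ans in candidates if ans.strip()]
```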
  • step S103 the binary data pair is predicted by the pre-trained question generation model to obtain the ternary data pair of the article, question, and answer in the unlabeled data set.
  • The embodiment of the present application takes the article-answer binary data pair as input and passes it into the pre-trained question generation model, which predicts the corresponding question information. Then, the binary data pair and the question information are combined to obtain the article-question-answer ternary data pairs of the unlabeled data set.
  • Each ternary data pair includes article information, question information, and answer information, and can be expressed in the format (article, question, answer).
  • step S104 the ternary data pair is filtered through the pre-trained first reading comprehension model.
  • The ternary data pairs generated in step S103 are relatively noisy; retaining all of them directly would introduce noise and produce adverse effects. Therefore, the embodiment of the present application further filters the ternary data pairs through the pre-trained first reading comprehension model to reduce noise and improve data quality. As shown in FIG. 4, the filtering of the ternary data pairs through the pre-trained first reading comprehension model in step S104 includes:
  • In step S401, the ternary data pairs are traversed, and the article information and question information in each ternary data pair are predicted by the pre-trained first reading comprehension model to obtain the predicted answer corresponding to that ternary data pair.
  • The embodiment of the present application takes the article information and question information in the ternary data pair as input and passes them to the pre-trained first reading comprehension model, which predicts answer information from them; that is, each ternary data pair obtains a corresponding predicted answer.
  • Note that the answer obtained in step S401 is predicted by the pre-trained first reading comprehension model from the article information and question information in the ternary data pair, whereas the first predicted answer obtained in step S301 is predicted by the pre-trained second reading comprehension model from the article information of the unlabeled data set alone and serves as the answer information in the ternary data pair. Therefore, the predicted answer corresponding to a ternary data pair may be the same as or different from the answer information in that pair.
  • step S402 the predicted answer corresponding to the ternary data pair is compared with the answer information in the ternary data pair.
  • The comparison method includes, but is not limited to: checking whether the predicted answer corresponding to the ternary data pair is completely consistent with the answer information in the ternary data pair, whether the two overlap, and whether one contains the other.
  • One of the above methods, or any combination of them, can be selected according to the data volume and the training effect of the reading comprehension model in the actual application, to implement the comparison and determine which ternary data pairs are retained.
  • step S403 if the predicted answer corresponding to the ternary data pair is different from the answer information in the ternary data pair, the ternary data pair is deleted.
  • step S404 if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair, the ternary data pair is retained.
  • Here, "same" and "different" are determined by the chosen comparison method rather than by exact string identity alone.
  • If the comparison method is complete consistency, the answers are considered the same if and only if the predicted answer corresponding to the ternary data pair is completely consistent with the answer information in the ternary data pair, and different otherwise. If the comparison method is overlap, they are considered the same when the predicted answer overlaps the answer information, and different otherwise. If the comparison method is inclusion, they are considered the same when one of the predicted answer and the answer information contains the other, and different otherwise.
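A sketch of the three comparison modes and the resulting filter follows; the whitespace tokenization is a placeholder (Chinese text would need a word segmenter), and `mrc_model` stands in for the pre-trained first reading comprehension model's prediction function:

```python
def answers_match(predicted: str, labeled: str, mode: str = "exact") -> bool:
    if mode == "exact":      # completely consistent
        return predicted == labeled
    if mode == "overlap":    # any shared tokens
        return bool(set(predicted.split()) & set(labeled.split()))
    if mode == "contain":    # one answer contains the other
        return predicted in labeled or labeled in predicted
    raise ValueError(f"unknown comparison mode: {mode}")

def filter_triples(triples, mrc_model, mode="exact"):
    """Keep only triples whose model-predicted answer matches the stored one."""
    kept = []
    for passage, question, answer in triples:
        predicted = mrc_model(passage, question)  # first reading comprehension model
        if answers_match(predicted, answer, mode):
            kept.append((passage, question, answer))
    return kept
```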
  • step S105 the filtered ternary data pairs are filtered according to the subject of the article in the labeled data set to generate a pseudo-labeled data set.
  • The embodiment of the present application further screens topic-related ternary data pairs from the retained ternary data pairs based on the article topics of the labeled data set. Since these topic-related ternary data pairs are not manually annotated, in order to distinguish them from the labeled data set, this embodiment refers to them as a pseudo-labeled data set.
  • As shown in FIG. 5, the screening of the filtered ternary data pairs according to the article topics in the labeled data set in step S105 to generate a pseudo-labeled data set includes:
  • In step S501, the Dirichlet distribution topic model is used to analyze the similarity between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set.
  • the Dirichlet distribution topic model is an unsupervised model and does not require annotated data.
  • the distribution analysis of an input article can be performed to obtain the probability that the article belongs to each topic.
  • The article information in the ternary data pairs retained after filtering and the article information in the labeled data set are input into the Dirichlet distribution topic model to obtain their topic distributions, from which the topic similarity between each retained ternary data pair and the labeled data set is obtained.
  • step S502 a ternary data pair whose topic similarity is higher than a preset threshold is obtained, and a pseudo-labeled data set is constructed.
  • The topic similarity of each ternary data pair to the labeled data set is compared with the preset threshold, and the ternary data pairs whose topic similarity is higher than the preset threshold are used as pseudo-labeled data to construct the pseudo-labeled data set.
  • By filtering on topic similarity, the embodiment of the present application retains the ternary data pairs whose field is the same as or similar to that of the labeled data set, which helps reduce noise.
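A sketch of this screening step using gensim follows, assuming the "Dirichlet distribution topic model" refers to latent Dirichlet allocation (LDA); the whitespace tokenizer, topic count, similarity measure (cosine over topic distributions), and threshold are all placeholders:

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def topic_vector(lda, dictionary, text):
    """Dense topic-probability vector for one article."""
    bow = dictionary.doc2bow(text.split())
    dense = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = prob
    return dense

def screen_by_topic(labeled_articles, candidate_triples, threshold=0.8):
    texts = [a.split() for a in labeled_articles]
    dictionary = corpora.Dictionary(texts)
    lda = LdaModel([dictionary.doc2bow(t) for t in texts],
                   id2word=dictionary, num_topics=10)
    # Reference: average topic distribution of the labeled data set.
    ref = np.mean([topic_vector(lda, dictionary, a)
                   for a in labeled_articles], axis=0)
    kept = []
    for passage, question, answer in candidate_triples:
        vec = topic_vector(lda, dictionary, passage)
        sim = vec @ ref / (np.linalg.norm(vec) * np.linalg.norm(ref) + 1e-9)
        if sim > threshold:   # topic similarity above the preset threshold
            kept.append((passage, question, answer))
    return kept
```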
  • The pseudo-labeled data set and the labeled data set together constitute new training samples for the reading comprehension model, thereby expanding the labeled data available for training, avoiding manual labeling, and reducing the cost of obtaining labeled data.
  • step S106 the pre-trained first reading comprehension model is optimized and trained according to the pseudo-labeled data set and the labeled data set.
  • The pseudo-labeled data and the labeled data are used together to train the first reading comprehension model, which greatly enriches the labeled data for training and helps find better parameters during training, so that a model better than the previous reading comprehension model can be obtained.
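The final optimization step amounts to concatenating the two data sets and retraining, as in this sketch; `train_mrc` is a placeholder for whatever training loop the first reading comprehension model uses:

```python
def optimize_first_model(model, labeled_set, pseudo_labeled_set, train_mrc):
    """Retrain on the enlarged sample: labeled plus pseudo-labeled triples."""
    combined = list(labeled_set) + list(pseudo_labeled_set)
    return train_mrc(model, combined)  # fine-tune from the pre-trained weights
```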
  • This embodiment of the application pre-trains the reading comprehension models using a small amount of labeled data; it then uses the pre-trained models to predict a large number of unlabeled data in related fields to generate rough (article, question, answer) ternary data pairs; it then selects high-quality ternary data pairs from the rough ones to construct a pseudo-labeled data set, which is added to the original labeled data set to retrain the reading comprehension model.
  • This greatly enriches the annotation data used to train the reading comprehension model, effectively solves the problem of small training samples caused by the high cost of acquiring annotated data in existing reading comprehension technology, helps find better parameters during training, yields a better model than the previous reading comprehension model, and improves the accuracy of the reading comprehension model.
  • the method for optimizing the reading model based on big data can alleviate the problem that the reading comprehension model cannot find a better solution in the parameter space caused by insufficient annotation data, and effectively improve the accuracy of the reading comprehension model.
  • The reading comprehension model trained through the embodiments of this application is mainly used for information extraction tasks, and information extraction is one of the most important modules in systems such as customer service robots and chatbots.
  • The reading comprehension model trained in this application can extract the answers to users' questions from massive documents more accurately and quickly, achieving accurate answers to users' questions and reducing the number of user inquiry rounds in the customer service robot.
  • This in turn reduces the overall load of the customer service robot: if the robot answers a user's question quickly and accurately, the number of inquiry rounds is small; if the answer is inaccurate, the user will often continue to ask questions, increasing the number of inquiries.
  • A reading model optimization apparatus based on big data is provided, which corresponds one-to-one to the reading model optimization method based on big data in the above embodiment.
  • the reading model optimization device based on big data includes a pre-training module 61, a binary data pair generating module 62, a ternary data pair generating module 63, a filtering module 64, a filtering module 65, and an optimization training module 66.
  • the detailed description of each functional module is as follows:
  • the pre-training module 61 is configured to obtain a labeled data set, and perform pre-training on a preset first reading comprehension model, a question generation model, and a second reading comprehension model according to the labeled data set;
  • the binary data pair generation module 62 is configured to obtain an unlabeled data set, and predict the unlabeled data set through the pre-trained second reading comprehension model to obtain the article-answer binary data pairs of the unlabeled data set;
  • the ternary data pair generation module 63 is configured to predict the binary data pair through the pre-trained question generation model to obtain the ternary data pair about the article, question, and answer in the unlabeled data set;
  • the filtering module 64 is configured to filter the ternary data pairs through the pre-trained first reading comprehension model;
  • the screening module 65 is configured to screen the filtered ternary data pairs according to the subject of the article in the labeled data set to generate a pseudo-labeled data set;
  • the optimization training module 66 is configured to perform optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  • the pre-training module 61 includes:
  • an obtaining unit, configured to obtain a labeled data set, where the labeled data set includes a number of labeled data pairs, and each labeled data pair includes article information, question information, and corresponding answer information;
  • the first pre-training unit is configured to pre-train the first reading comprehension model by using the article information and question information in the labeled data set;
  • the second pre-training unit is configured to use the article information and answer information in the labeled data set to pre-train the question generation model;
  • the third pre-training unit is used to pre-train the second reading comprehension model by using the article information in the labeled data set.
  • the binary data pair generation module 62 includes:
  • the first answer prediction unit is configured to input the unlabeled data set into a pre-trained second reading comprehension model, and obtain the output of the pre-trained second reading comprehension model as the first predicted answer;
  • the second answer prediction unit is configured to perform named entity recognition on the unlabeled data set to obtain a second predicted answer;
  • the third answer prediction unit is used to obtain the third predicted answer from the unlabeled data set by using a bidirectional long short-term memory network and conditional random field technology;
  • the binary data pair generating unit is used to combine the first predicted answer, the second predicted answer, and the third predicted answer to obtain the binary data pair of the article and the answer in the unlabeled data set.
  • the filtering module 64 includes:
  • a prediction unit, configured to traverse the ternary data pairs and predict the article information and question information in each ternary data pair through the pre-trained first reading comprehension model to obtain the corresponding predicted answer;
  • the comparison unit is used to compare the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
  • the filtering unit is configured to delete the ternary data pair if the predicted answer corresponding to the ternary data pair is different from the answer information in the ternary data pair, and to retain the ternary data pair if the predicted answer is the same as the answer information in the ternary data pair.
  • the screening module 65 includes:
  • Each module in the device for optimizing a reading model based on big data can be implemented in whole or in part by software, hardware, and a combination thereof.
  • The above modules may be embedded in hardware form in, or independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 7.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer-readable instructions are executed by the processor, a reading model optimization method based on big data is realized.
  • a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • one or more non-volatile readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the computer-readable storage medium may be non-volatile or volatile.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A reading model optimization method based on big data. The method comprises: pre-training a first reading comprehension model, a question generation model and a second reading comprehension model according to a marked data set; predicting an unmarked data set by means of the pre-trained second reading comprehension model to obtain binary data pairs regarding articles and answers; predicting the binary data pairs by means of the pre-trained question generation model to obtain ternary data pairs regarding articles, questions and answers; filtering the ternary data pairs by means of the pre-trained first reading comprehension model; screening the filtered ternary data pairs according to article topics in the marked data set to generate a pseudo-marked data set; and performing optimized training on the pre-trained first reading comprehension model according to the pseudo-marked data set and the marked data set. According to the method, the problems of small training samples and low model precision of the existing reading comprehension technology caused by the high acquisition costs of marked data are solved.

Description

Reading model optimization method, apparatus, device and medium based on big data

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 21, 2020, with application number 202010108092.0 and the invention title "Reading model optimization method, apparatus, device and medium based on big data", the entire content of which is incorporated herein by reference.
Technical Field

This application relates to the field of information technology, and in particular to a method, apparatus, device, and medium for optimizing a reading model based on big data.

Background Art

In artificial intelligence, reading comprehension is a challenging and widely used information processing technology in the field of natural language processing. Reading comprehension technology aims to find the corresponding answer in a given article or document according to a raised question, and can even judge whether the question can be answered at all. An excellent reading comprehension model needs human-like language understanding and knowledge reasoning abilities in order to deeply mine and analyze an article and, according to a specific question, focus on different parts or viewpoints of the article to find the correct answer; the task is therefore highly difficult. Current state-of-the-art reading comprehension models are all built on complex deep learning structures, which require a huge amount of training data for the model to learn. From the definition of reading comprehension technology, its training data needs to be pre-labeled to locate the article information, question information, and answer information. However, the inventor realized that labeling the training data is very difficult, because an annotator must first read the entire article and then generate the corresponding answers according to the given questions, so neither efficiency nor accuracy can be well guaranteed. Due to the high cost of acquiring labeled data, in actual use reading comprehension models are often trained on small-scale training data and cannot find a better solution in the parameter space, which limits the accuracy of the model.

Therefore, finding a way to solve the problems of small training samples and low model accuracy caused by the high cost of acquiring labeled data in existing reading comprehension technology has become an urgent technical problem for those skilled in the art.
Summary of the Invention

The embodiments of the present application provide a reading model optimization method, apparatus, device, and medium based on big data, to solve the problems of small training samples and low model accuracy caused by the high cost of acquiring labeled data in existing reading comprehension technology.
A reading model optimization method based on big data includes:

acquiring a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;

acquiring an unlabeled data set, and predicting the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;

predicting the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;

filtering the ternary data pairs through the pre-trained first reading comprehension model;

screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;

performing optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
A reading model optimization apparatus based on big data includes:

a pre-training module, configured to acquire a labeled data set and pre-train a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;

a binary data pair generation module, configured to acquire an unlabeled data set and predict the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;

a ternary data pair generation module, configured to predict the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;

a filtering module, configured to filter the ternary data pairs through the pre-trained first reading comprehension model;

a screening module, configured to screen the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;

an optimization training module, configured to perform optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:

acquiring a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;

acquiring an unlabeled data set, and predicting the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;

predicting the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;

filtering the ternary data pairs through the pre-trained first reading comprehension model;

screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;

performing optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
One or more non-volatile readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:

acquiring a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;

acquiring an unlabeled data set, and predicting the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;

predicting the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;

filtering the ternary data pairs through the pre-trained first reading comprehension model;

screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;

performing optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
The details of one or more embodiments of the present application are set forth in the following drawings and description; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.

Description of the Drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
FIG. 1 is a flowchart of a method for optimizing a reading model based on big data in an embodiment of the present application;

FIG. 2 is a flowchart of step S101 in a method for optimizing a reading model based on big data in another embodiment of the present application;

FIG. 3 is a flowchart of step S102 in a method for optimizing a reading model based on big data in another embodiment of the present application;

FIG. 4 is a flowchart of step S104 in a method for optimizing a reading model based on big data in another embodiment of the present application;

FIG. 5 is a flowchart of step S105 in a method for optimizing a reading model based on big data in another embodiment of the present application;

FIG. 6 is a schematic block diagram of a reading model optimization apparatus based on big data in an embodiment of the present application;

FIG. 7 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The big-data-based reading model optimization method provided in the embodiments of this application is a self-training method for improving a reading comprehension model; its purpose is to overcome the problem that a better solution cannot be found in the parameter space due to insufficient labeled reading comprehension data. The method is applied to a server, which can be implemented as an independent server or a server cluster composed of multiple servers. In an embodiment, as shown in FIG. 1, a method for optimizing a reading model based on big data is provided, including the following steps:

In step S101, a labeled data set is acquired, and a preset first reading comprehension model, question generation model, and second reading comprehension model are pre-trained according to the labeled data set.

Here, the embodiment of the present application first uses a small amount of existing labeled data to pre-train the reading comprehension models, paving the way for obtaining pseudo-labeled data from the unlabeled data set. Optionally, as shown in FIG. 2, the acquiring of the labeled data set and the pre-training of the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set in step S101 includes:

In step S201, a labeled data set is acquired; the labeled data set includes several labeled data pairs, and each labeled data pair includes article information, question information, and corresponding answer information.

Here, each labeled data pair includes article information, question information, and corresponding answer information, and these have been manually labeled in a specified manner. Exemplarily, the ternary data pair can be implemented in a preset composition format; for example, each labeled data pair may be designed as a triplet: (Passage, Query, Answer). It can also be realized through identification information.

Taking a customer service robot as an example, the labeled data set is the collection of data pairs obtained by manually labeling historical customer service conversations; the article information is the historical customer service conversation, the question information is the question raised in that conversation, and the answer information is the answer to that question in the conversation. Because historical customer service conversation data is limited, the labeled data pairs in the labeled data set are also limited; and because manual labeling is expensive and difficult, even if the amount of historical conversation data is large enough, it is difficult to obtain enough training samples in a short period of time.
In step S202, the article information and question information in the labeled data set are used to pre-train the first reading comprehension model.

Here, the first reading comprehension model is a big data model that predicts answer information from article information and question information: its input is the article and question, and its output is the answer. Optionally, the first reading comprehension model includes but is not limited to the R-net and BERT machine reading comprehension models. The embodiment of the present application uses the article information and question information of each labeled data pair as the input of the first reading comprehension model to pre-train it.

In step S203, the article information and answer information in the labeled data set are used to pre-train the question generation model.

Here, the question generation model is a big data model that predicts question information from article information and answer information: its input is the article and answer, and its output is the question. Exemplarily, the question generation model may adopt a sequence-to-sequence architecture with a copy mechanism. The embodiment of the present application uses the article information and answer information of each labeled data pair as the input of the question generation model to pre-train it.

In step S204, the article information in the labeled data set is used to pre-train the second reading comprehension model.

Here, the second reading comprehension model is a big data model that predicts answer information from article information alone: its input is the article, and its output is the answer. The embodiment of the present application uses the article information of each labeled data pair as the input of the second reading comprehension model to pre-train it.
预训练完成后的所述第一阅读理解模型、问题生成模型、第二阅读理解模型用于后续根据无标注数据集生成完整的文章、问题、答案三元数据对,进而生成伪标注数据集。其中,问题生成模型用于根据无标注数据集生成问题信息,如前所述通常先用大量已标注的文章信息对问题生成模型进行预训练,使其达到输入是文章信息中的某一句、输出是输入的上一句的效果。第二阅读理解模型用于根据无标注数据集生成答案信息,作为问题信息的其中一个答案的来源。第一阅读理解模型则用于对问题信息和答案信息进行过滤操作。所述基于大数据的阅读模型优化方法还包括:The first reading comprehension model, question generation model, and second reading comprehension model after the pre-training is completed are used to subsequently generate a complete article, question, and answer ternary data pair based on the unlabeled data set, thereby generating a pseudo-labeled data set. Among them, the question generation model is used to generate question information based on the unlabeled data set. As mentioned above, the question generation model is usually pre-trained with a large amount of labeled article information, so that the input is a certain sentence in the article information, and the output is It is the effect of the previous sentence entered. The second reading comprehension model is used to generate answer information based on the unlabeled data set as the source of one of the answers to the question information. The first reading comprehension model is used to filter the question information and answer information. The method for optimizing a reading model based on big data further includes:
在步骤S102中,获取无标注数据集,通过预训练后的第二阅读理解模型对无标注数据集进行预测,得到所述无标注数据集关于文章和答案的二元数据对。In step S102, an unlabeled data set is obtained, and the unlabeled data set is predicted by the pre-trained second reading comprehension model to obtain a binary data pair of the unlabeled data set about the article and the answer.
Here, the unlabeled data set consists of article information obtained from the Internet, for example from Wikipedia or WeChat official accounts, by crawling or downloading, without any annotation; its scale is very large relative to the labeled data set. Taking a customer service robot as an example, the unlabeled data set comprises unlabeled article data obtained from the Internet and historical customer service conversation data that has not yet been labeled. In this embodiment, the unlabeled data set is fed as input to the pre-trained second reading comprehension model for prediction, yielding answer information for the unlabeled data set; the article information and answer information are then combined into (article, answer) binary data pairs.
Optionally, the second reading comprehension model can also be combined with several other methods to predict answer information. As shown in FIG. 3, predicting on the unlabeled data set with the pre-trained second reading comprehension model in step S102 to obtain the (article, answer) binary data pairs includes:
In step S301, the unlabeled data set is input into the pre-trained second reading comprehension model, and the output of the pre-trained second reading comprehension model is taken as the first predicted answer.
Here, the first predicted answer is the answer information obtained by predicting on the unlabeled data set with the second reading comprehension model.
In step S302, named entity recognition is performed on the unlabeled data set to obtain the second predicted answer.
Here, the second predicted answer is the answer information obtained by predicting on the unlabeled data set with named entity recognition. Named entity recognition (NER) here refers to a rule-based method that extracts predicted answer information using regular expressions, a pre-built entity dictionary, and open-source syntax trees.
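A minimal sketch of this rule-based extraction follows; the regular expression and the entity dictionary below are hypothetical stand-ins for the pre-built resources mentioned above, and the syntax-tree step is omitted.

```python
# Sketch: rule-based candidate-answer extraction (second predicted answer).
import re

ENTITY_DICT = {"平安科技", "深圳"}                    # hypothetical pre-built dictionary
DATE_RE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")   # dates as one answer type

def rule_based_answers(article):
    answers = set(DATE_RE.findall(article))          # regex-matched spans
    answers |= {e for e in ENTITY_DICT if e in article}  # dictionary hits
    return sorted(answers)
```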
In step S303, a bidirectional long short-term memory network and conditional random field techniques are used to obtain the third predicted answer from the unlabeled data set.
Here, the third predicted answer is the answer information obtained by predicting on the unlabeled data set with a bidirectional long short-term memory network (LSTM) and a conditional random field (CRF). The combination of a bidirectional LSTM and a CRF is currently the mainstream model-based approach to entity recognition: the bidirectional LSTM extracts features, and the CRF layer then incorporates the dependencies between tags into those features and predicts the answer positions, yielding the answer information.
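A minimal BiLSTM-CRF tagger is sketched below, assuming the third-party pytorch-crf package (torchcrf); the vocabulary size, dimensions, and three-tag scheme are illustrative choices.

```python
# Sketch: BiLSTM-CRF tagger for the third predicted answer.
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=21128, embed_dim=128, hidden=256, num_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden // 2, bidirectional=True,
                            batch_first=True)       # BiLSTM feature extractor
        self.fc = nn.Linear(hidden, num_tags)        # per-token tag emissions
        self.crf = CRF(num_tags, batch_first=True)   # models tag dependencies

    def loss(self, tokens, tags):
        feats = self.fc(self.lstm(self.embed(tokens))[0])
        return -self.crf(feats, tags)                # negative log-likelihood

    def decode(self, tokens):
        feats = self.fc(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(feats)                # best tag sequence per sentence
```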
In step S304, the first predicted answer, the second predicted answer, and the third predicted answer are merged to obtain the (article, answer) binary data pairs of the unlabeled data set.
The prediction of the first answer by the second reading comprehension model, the extraction of the second predicted answer by named entity recognition, and the extraction of the third predicted answer by the bidirectional LSTM and CRF described above run synchronously in parallel. Each of the first, second, and third predicted answers is treated as one of the pieces of answer information corresponding to the unlabeled data set. By combining multiple methods to obtain the answer information for the unlabeled data set, this embodiment effectively increases the number and diversity of predicted answers, and thereby expands the number of binary and ternary data pairs.
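The merge itself can be as simple as the sketch below; deduplication by exact string match is an assumption, since the application does not specify how duplicate answers from the three sources are handled.

```python
# Sketch: merging the three answer sources into (article, answer) pairs.
def build_binary_pairs(article, mrc_answers, ner_answers, crf_answers):
    merged = set(mrc_answers) | set(ner_answers) | set(crf_answers)
    return [(article, ans) for ans in merged]
```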
In step S103, the pre-trained question generation model predicts on the binary data pairs to obtain (article, question, answer) ternary data pairs for the unlabeled data set.
Here, this embodiment takes the (article, answer) binary data pairs as input to the pre-trained question generation model, which predicts question information for each pair. The binary data pair and the question information are then combined to obtain the ternary data pairs of the unlabeled data set. Each ternary data pair includes article information, question information, and answer information, and can be represented in the format (article, question, answer).
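In code this expansion is a single pass over the binary pairs; `generate_question` below is a hypothetical wrapper around the question generation model's decoding step.

```python
# Sketch: expanding (article, answer) pairs into (article, question, answer)
# triples with the pre-trained question generation model.
def build_triples(binary_pairs, generate_question):
    return [(article, generate_question(article, answer), answer)
            for article, answer in binary_pairs]
```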
In step S104, the ternary data pairs are filtered by the pre-trained first reading comprehension model.
In this embodiment, the ternary data pairs produced by S103 are relatively noisy; if all of them were retained directly, the introduced noise would actually harm the result. This embodiment therefore further filters the ternary data pairs with the pre-trained first reading comprehension model, so as to reduce noise and improve data quality. As shown in FIG. 4, filtering the ternary data pairs with the pre-trained first reading comprehension model in step S104 includes:
In step S401, the ternary data pairs are traversed, and the pre-trained first reading comprehension model predicts on the article information and question information in each ternary data pair to obtain the predicted answer corresponding to that pair.
Here, this embodiment takes the article information and question information in the ternary data pair as input to the pre-trained first reading comprehension model, which predicts answer information from them, i.e., the predicted answer corresponding to the ternary data pair. Note that the answer information obtained in step S401 is predicted by the pre-trained first reading comprehension model from the article and question information in the ternary data, whereas the first predicted answer obtained in step S301 is predicted by the pre-trained second reading comprehension model from the article information in the unlabeled data set and is used as the answer information in the ternary data pair. The predicted answer corresponding to the ternary data pair may therefore be the same as, or different from, the answer information in the pair.
In step S402, the predicted answer corresponding to the ternary data pair is compared with the answer information in the ternary data pair.
Here, the comparison methods include, but are not limited to: whether the predicted answer corresponding to the ternary data pair is exactly identical to the answer information in the pair, whether the two overlap, and whether one contains the other. One of these methods, or any combination of them, can be selected according to the amount of data in the actual application and the training effect of the reading comprehension model, so as to decide which ternary data pairs to retain.
In step S403, if the predicted answer corresponding to the ternary data pair differs from the answer information in the ternary data pair, the ternary data pair is deleted.
In step S404, if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair, the ternary data pair is retained.
Depending on the comparison method adopted, the ternary data pairs that are deleted and retained are not exactly the same. When the comparison checks for exact identity, the predicted answer and the answer information in the ternary data pair are considered the same if and only if they are completely identical, and different otherwise. When the comparison checks for overlap, they are considered the same if they overlap, and different otherwise. When the comparison checks for containment, they are considered the same if one contains the other, and different otherwise. This embodiment deletes the ternary data pairs whose predicted answer differs from the answer information in the pair, and retains those whose predicted answer is the same as it. A sketch of these comparison modes follows.
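In the sketch below, character-set intersection stands in for the "overlap" test, which the application leaves open, so this is one plausible reading; `predict_answer` is a hypothetical wrapper around the pre-trained first reading comprehension model.

```python
# Sketch: the three comparison modes used to filter ternary data pairs.
def answers_match(predicted, labeled, mode="exact"):
    if mode == "exact":
        return predicted == labeled                       # complete identity
    if mode == "overlap":
        return len(set(predicted) & set(labeled)) > 0     # any shared characters
    if mode == "contains":
        return predicted in labeled or labeled in predicted
    raise ValueError(f"unknown mode: {mode}")

def filter_triples(triples, predict_answer, mode="exact"):
    # Keep a triple only when the first model's answer matches its answer field.
    return [(a, q, ans) for a, q, ans in triples
            if answers_match(predict_answer(a, q), ans, mode)]
```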
In step S105, the filtered ternary data pairs are screened according to the article topics in the labeled data set to generate a pseudo-labeled data set.
After filtering the ternary data pairs, this embodiment further screens the retained pairs for topic-relevant ones, based on the article topics of the labeled data set. Since these topic-relevant ternary data pairs are not manually annotated, they are referred to in this embodiment as a pseudo-labeled data set, to distinguish them from the labeled data set.
Optionally, as shown in FIG. 5, screening the filtered ternary data pairs according to the article topics in the labeled data set to generate the pseudo-labeled data set in step S105 includes:
In step S501, a Dirichlet distribution topic model is used to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, yielding the topic similarity between each ternary data pair and the labeled data set.
Here, the Dirichlet distribution topic model (latent Dirichlet allocation, LDA) is an unsupervised model and requires no labeled data. It can perform a distribution analysis on an input article and obtain the probability that the article belongs to each topic. In this embodiment, the article information in the ternary data pairs retained after filtering and the article information in the labeled data set are input into the topic model, yielding for each article the probability of belonging to each topic. The greater the probability, the greater the similarity, from which the topical similarity of the article information, i.e., the topic similarity between the ternary data pair and the labeled data set, is obtained.
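A minimal sketch of this similarity computation follows, assuming the gensim library; whitespace tokenization (standing in for a real Chinese tokenizer), the topic count, and averaging cosine similarity over the labeled articles are all illustrative simplifications.

```python
# Sketch: LDA topic similarity between a candidate article and the labeled set.
from gensim import corpora, matutils
from gensim.models import LdaModel

def topic_similarity(candidate_article, labeled_articles, num_topics=20):
    texts = [doc.split() for doc in labeled_articles + [candidate_article]]
    dictionary = corpora.Dictionary(texts)
    bows = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    candidate_vec = lda[bows[-1]]     # sparse (topic id, probability) mixture
    # Similarity to the labeled set = mean cosine similarity to its articles.
    sims = [matutils.cossim(lda[bow], candidate_vec) for bow in bows[:-1]]
    return sum(sims) / len(sims)
```

Ternary data pairs whose similarity exceeds the preset threshold of step S502 would then be kept as pseudo-labeled data.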
In step S502, the ternary data pairs whose topic similarity is higher than a preset threshold are obtained to construct the pseudo-labeled data set.
In this embodiment, a threshold is preset, the topic similarity of each ternary data pair to the labeled data set is compared against it, and the ternary data pairs whose topic similarity exceeds the preset threshold are selected as pseudo-labeled data to construct the pseudo-labeled data set.
Since the unlabeled data comes from many domains, and many of those domains differ greatly from the domain of the labeled data set, this embodiment filters by topic similarity and retains the ternary data pairs whose domain is the same as and/or similar to that of the labeled data set, which helps to reduce noise.
Here, the pseudo-labeled data set and the labeled data set together constitute new training samples for the reading comprehension model, which expands the labeled data for training the model, avoids manual annotation, and reduces the cost of acquiring labeled data.
In step S106, the pre-trained first reading comprehension model is optimized and trained according to the pseudo-labeled data set and the labeled data set.
In this embodiment, the pseudo-labeled data and the labeled data are used together to train the first reading comprehension model, which greatly enriches the labeled data for training the model and helps find better parameters during training, so that a model better than the previous reading comprehension model can be obtained.
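One simple realization of this step, sketched below, concatenates the two sets and fine-tunes on the union; the mixing strategy and epoch count are assumptions, since the application only states that both sets are used together. `model_train_step` is a hypothetical per-example training function such as the `pretrain_step` sketched earlier.

```python
# Sketch: optimization training on labeled + pseudo-labeled triples.
def optimize(model_train_step, labeled_set, pseudo_labeled_set, epochs=2):
    combined = list(labeled_set) + list(pseudo_labeled_set)
    for _ in range(epochs):
        for article, question, answer in combined:
            model_train_step(article, question, answer)
```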
This embodiment pre-trains the reading comprehension models using a small labeled data set; then uses the pre-trained models to predict on a large amount of unlabeled data in related domains, generating rough (article, question, answer) ternary data pairs; and then selects high-quality ternary data pairs from the rough ones to construct a pseudo-labeled data set, which is added to the original labeled data set for retraining the reading comprehension model. This greatly enriches the annotated data used for training the reading comprehension model, effectively solves the small-training-sample problem of existing reading comprehension technology caused by the high cost of acquiring annotated data, helps find better parameters during training, yields a model better than the previous reading comprehension model, and improves the model's accuracy.
The big-data-based reading model optimization method provided by the embodiments of the present application can alleviate the problem that, for lack of annotated data, the reading comprehension model cannot find a better solution in the parameter space, effectively improving the model's accuracy. The reading comprehension model trained through these embodiments is mainly applied to information extraction tasks, and information extraction is one of the most important modules of systems such as customer service robots and chatbots. A reading comprehension model trained according to this application can extract the answer to the question a user wants to ask from massive documents more accurately and quickly, answer user questions precisely, and reduce both the number of question rounds a user goes through with the customer service robot and the robot's overall load. If the customer service robot answers a user's question quickly and accurately, few rounds are needed; if the answer is inaccurate, the user will often keep asking, increasing the number of rounds.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a big-data-based reading model optimization apparatus is provided, corresponding one-to-one to the big-data-based reading model optimization method in the foregoing embodiments. As shown in FIG. 6, the apparatus includes a pre-training module 61, a binary data pair generation module 62, a ternary data pair generation module 63, a filtering module 64, a screening module 65, and an optimization training module 66. The functional modules are described in detail as follows:
The pre-training module 61 is configured to obtain a labeled data set, and pre-train a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set.
The binary data pair generation module 62 is configured to obtain an unlabeled data set, and predict on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set.
The ternary data pair generation module 63 is configured to predict on the binary data pairs with the pre-trained question generation model to obtain the (article, question, answer) ternary data pairs of the unlabeled data set.
The filtering module 64 is configured to filter the ternary data pairs with the pre-trained first reading comprehension model.
The screening module 65 is configured to screen the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set.
The optimization training module 66 is configured to optimize and train the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
Optionally, the pre-training module 61 includes:
an obtaining unit, configured to obtain a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
a first pre-training unit, configured to pre-train the first reading comprehension model using the article information and question information in the labeled data set;
a second pre-training unit, configured to pre-train the question generation model using the article information and answer information in the labeled data set;
a third pre-training unit, configured to pre-train the second reading comprehension model using the article information in the labeled data set.
Optionally, the binary data pair generation module 62 includes:
a first answer prediction unit, configured to input the unlabeled data set into the pre-trained second reading comprehension model and take its output as the first predicted answer;
a second answer prediction unit, configured to perform named entity recognition on the unlabeled data set to obtain the second predicted answer;
a third answer prediction unit, configured to obtain the third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques;
a binary data pair generation unit, configured to merge the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
Optionally, the filtering module 64 includes:
a prediction unit, configured to traverse the ternary data pairs and predict on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the pair;
a comparison unit, configured to compare the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
a filtering unit, configured to delete the ternary data pair if its corresponding predicted answer differs from the answer information in the pair, and to retain the ternary data pair if its corresponding predicted answer is the same as the answer information in the pair.
Optionally, the screening module 65 is configured to:
use a Dirichlet distribution topic model to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set; and
obtain the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
For specific limitations on the big-data-based reading model optimization apparatus, refer to the above limitations on the big-data-based reading model optimization method, which are not repeated here. Each module in the apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or be independent of, the processor of a computer device, or be stored in software form in the memory of a computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a big-data-based reading model optimization method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set;
predicting on the binary data pairs with the pre-trained question generation model to obtain the (article, question, answer) ternary data pairs of the unlabeled data set;
filtering the ternary data pairs with the pre-trained first reading comprehension model;
screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;
optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set;
predicting on the binary data pairs with the pre-trained question generation model to obtain the (article, question, answer) ternary data pairs of the unlabeled data set;
filtering the ternary data pairs with the pre-trained first reading comprehension model;
screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;
optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
The computer-readable storage medium may be non-volatile or volatile.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example. In practical applications, the above functions can be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A big-data-based reading model optimization method, comprising:
    obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
    obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain (article, answer) binary data pairs of the unlabeled data set;
    predicting on the binary data pairs with the pre-trained question generation model to obtain (article, question, answer) ternary data pairs of the unlabeled data set;
    filtering the ternary data pairs with the pre-trained first reading comprehension model;
    screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set; and
    optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  2. The big-data-based reading model optimization method according to claim 1, wherein obtaining the labeled data set and pre-training the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set comprises:
    obtaining a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
    pre-training the first reading comprehension model using the article information and question information in the labeled data set;
    pre-training the question generation model using the article information and answer information in the labeled data set; and
    pre-training the second reading comprehension model using the article information in the labeled data set.
  3. The big-data-based reading model optimization method according to claim 1 or 2, wherein predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set comprises:
    inputting the unlabeled data set into the pre-trained second reading comprehension model, and taking the output of the pre-trained second reading comprehension model as a first predicted answer;
    performing named entity recognition on the unlabeled data set to obtain a second predicted answer;
    obtaining a third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques; and
    merging the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
  4. The big-data-based reading model optimization method according to claim 1 or 2, wherein filtering the ternary data pairs with the pre-trained first reading comprehension model comprises:
    traversing the ternary data pairs, and predicting on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the ternary data pair;
    comparing the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
    deleting the ternary data pair if the predicted answer corresponding to the ternary data pair differs from the answer information in the ternary data pair; and
    retaining the ternary data pair if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair.
  5. The big-data-based reading model optimization method according to claim 1 or 2, wherein screening the filtered ternary data pairs according to the article topics in the labeled data set to generate the pseudo-labeled data set comprises:
    using a Dirichlet distribution topic model to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set; and
    obtaining the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
  6. A big-data-based reading model optimization apparatus, comprising:
    a pre-training module, configured to obtain a labeled data set, and pre-train a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
    a binary data pair generation module, configured to obtain an unlabeled data set, and predict on the unlabeled data set with the pre-trained second reading comprehension model to obtain (article, answer) binary data pairs of the unlabeled data set;
    a ternary data pair generation module, configured to predict on the binary data pairs with the pre-trained question generation model to obtain (article, question, answer) ternary data pairs of the unlabeled data set;
    a filtering module, configured to filter the ternary data pairs with the pre-trained first reading comprehension model;
    a screening module, configured to screen the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set; and
    an optimization training module, configured to optimize and train the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  7. The big-data-based reading model optimization apparatus according to claim 6, wherein the pre-training module comprises:
    an obtaining unit, configured to obtain a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
    a first pre-training unit, configured to pre-train the first reading comprehension model using the article information and question information in the labeled data set;
    a second pre-training unit, configured to pre-train the question generation model using the article information and answer information in the labeled data set; and
    a third pre-training unit, configured to pre-train the second reading comprehension model using the article information in the labeled data set.
  8. The big-data-based reading model optimization apparatus according to claim 6 or 7, wherein the binary data pair generation module comprises:
    a first answer prediction unit, configured to input the unlabeled data set into the pre-trained second reading comprehension model and take its output as a first predicted answer;
    a second answer prediction unit, configured to perform named entity recognition on the unlabeled data set to obtain a second predicted answer;
    a third answer prediction unit, configured to obtain a third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques; and
    a binary data pair generation unit, configured to merge the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
  9. The big-data-based reading model optimization apparatus according to claim 6 or 7, wherein the filtering module comprises:
    a prediction unit, configured to traverse the ternary data pairs and predict on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the ternary data pair;
    a comparison unit, configured to compare the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair; and
    a filtering unit, configured to delete the ternary data pair if its corresponding predicted answer differs from the answer information in the ternary data pair, and to retain the ternary data pair if its corresponding predicted answer is the same as the answer information in the ternary data pair.
  10. The big-data-based reading model optimization apparatus according to claim 6 or 7, wherein the screening module is configured to:
    use a Dirichlet distribution topic model to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set; and
    obtain the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
    obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
    obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain (article, answer) binary data pairs of the unlabeled data set;
    predicting on the binary data pairs with the pre-trained question generation model to obtain (article, question, answer) ternary data pairs of the unlabeled data set;
    filtering the ternary data pairs with the pre-trained first reading comprehension model;
    screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set; and
    optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  12. The computer device according to claim 11, wherein obtaining the labeled data set and pre-training the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set comprises:
    obtaining a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
    pre-training the first reading comprehension model using the article information and question information in the labeled data set;
    pre-training the question generation model using the article information and answer information in the labeled data set; and
    pre-training the second reading comprehension model using the article information in the labeled data set.
  13. The computer device according to claim 11 or 12, wherein predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set comprises:
    inputting the unlabeled data set into the pre-trained second reading comprehension model, and taking the output of the pre-trained second reading comprehension model as a first predicted answer;
    performing named entity recognition on the unlabeled data set to obtain a second predicted answer;
    obtaining a third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques; and
    merging the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
  14. The computer device according to claim 11 or 12, wherein filtering the ternary data pairs with the pre-trained first reading comprehension model comprises:
    traversing the ternary data pairs, and predicting on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the ternary data pair;
    comparing the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
    deleting the ternary data pair if the predicted answer corresponding to the ternary data pair differs from the answer information in the ternary data pair; and
    retaining the ternary data pair if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair.
  15. The computer device according to claim 11 or 12, wherein screening the filtered ternary data pairs according to the article topics in the labeled data set to generate the pseudo-labeled data set comprises:
    using a Dirichlet distribution topic model to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set; and
    obtaining the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
  16. One or more non-volatile readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
    obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain (article, answer) binary data pairs of the unlabeled data set;
    predicting on the binary data pairs with the pre-trained question generation model to obtain (article, question, answer) ternary data pairs of the unlabeled data set;
    filtering the ternary data pairs with the pre-trained first reading comprehension model;
    screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set; and
    optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  17. The non-volatile readable storage medium according to claim 16, wherein obtaining the labeled data set and pre-training the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set comprises:
    obtaining a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
    pre-training the first reading comprehension model using the article information and question information in the labeled data set;
    pre-training the question generation model using the article information and answer information in the labeled data set; and
    pre-training the second reading comprehension model using the article information in the labeled data set.
  18. The non-volatile readable storage medium according to claim 16 or 17, wherein predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set comprises:
    inputting the unlabeled data set into the pre-trained second reading comprehension model, and taking the output of the pre-trained second reading comprehension model as a first predicted answer;
    performing named entity recognition on the unlabeled data set to obtain a second predicted answer;
    obtaining a third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques; and
    merging the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
  19. The non-volatile readable storage medium according to claim 16 or 17, wherein filtering the ternary data pairs with the pre-trained first reading comprehension model comprises:
    traversing the ternary data pairs, and predicting on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the ternary data pair;
    comparing the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
    deleting the ternary data pair if the predicted answer corresponding to the ternary data pair differs from the answer information in the ternary data pair; and
    retaining the ternary data pair if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair.
  20. The non-volatile readable storage medium according to claim 16 or 17, wherein screening the filtered ternary data pairs according to the article topics in the labeled data set to generate the pseudo-labeled data set comprises:
    performing a similarity analysis, using a Dirichlet distribution topic model, between the article information of the filtered ternary data pairs and the article information of the labeled data set to obtain a topic similarity between each ternary data pair and the labeled data set;
    acquiring the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
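One plausible realization of claim 20's topic screen uses latent Dirichlet allocation. The sketch below assumes gensim as the LDA implementation; the topic count, the whitespace tokenization, the averaged reference vector, and the 0.8 threshold are illustrative assumptions rather than values fixed by the patent.

```python
# Sketch of claim 20's topic-similarity screen, assuming gensim's LDA.
# Threshold, topic count, and tokenization are illustrative assumptions.

from gensim import corpora, models
from gensim.matutils import cossim


def topic_filter(labeled_articles, candidate_triples,
                 threshold=0.8, num_topics=20):
    # Fit LDA on the labeled set's articles (whitespace tokens for brevity).
    texts = [a.lower().split() for a in labeled_articles]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)

    # Average topic distribution of the labeled set as a reference vector.
    reference = {}
    for bow in corpus:
        for topic_id, p in lda.get_document_topics(bow, minimum_probability=0.0):
            reference[topic_id] = reference.get(topic_id, 0.0) + p / len(corpus)
    ref_vec = sorted(reference.items())

    # Keep only triples whose article is topically close to the labeled set.
    pseudo_labeled = []
    for article, question, answer in candidate_triples:
        bow = dictionary.doc2bow(article.lower().split())
        doc_vec = lda.get_document_topics(bow, minimum_probability=0.0)
        if cossim(ref_vec, doc_vec) >= threshold:
            pseudo_labeled.append((article, question, answer))
    return pseudo_labeled
```

Cosine similarity between topic distributions is one assumed choice of similarity measure; the claim itself only requires a similarity analysis against a preset threshold.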
PCT/CN2020/123170 2020-02-21 2020-10-23 Reading model optimization method and apparatus based on big data, and device and medium WO2021164292A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010108092.0 2020-02-21
CN202010108092.0A CN111444677A (en) 2020-02-21 2020-02-21 Reading model optimization method, device, equipment and medium based on big data

Publications (1)

Publication Number Publication Date
WO2021164292A1 true WO2021164292A1 (en) 2021-08-26

Family

ID=71653936

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123170 WO2021164292A1 (en) 2020-02-21 2020-10-23 Reading model optimization method and apparatus based on big data, and device and medium

Country Status (2)

Country Link
CN (1) CN111444677A (en)
WO (1) WO2021164292A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117436551A (en) * 2023-12-18 2024-01-23 杭州宇谷科技股份有限公司 Training method and system for intelligent customer service model

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444677A (en) * 2020-02-21 2020-07-24 平安科技(深圳)有限公司 Reading model optimization method, device, equipment and medium based on big data
CN112711938B (en) * 2021-03-26 2021-07-06 北京沃丰时代数据科技有限公司 Reading understanding model construction method and device, electronic equipment and storage medium
CN114495130B (en) * 2021-12-27 2023-03-24 北京百度网讯科技有限公司 Cross-modal information-based document reading understanding model training method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059152A (en) * 2018-12-25 2019-07-26 阿里巴巴集团控股有限公司 A kind of training method, device and the equipment of text information prediction model
WO2019200748A1 (en) * 2018-04-17 2019-10-24 平安科技(深圳)有限公司 Transfer learning method, device, computer device, and storage medium
CN110457675A (en) * 2019-06-26 2019-11-15 平安科技(深圳)有限公司 Prediction model training method, device, storage medium and computer equipment
CN110633730A (en) * 2019-08-07 2019-12-31 中山大学 Deep learning machine reading understanding training method based on course learning
CN110717324A (en) * 2019-09-06 2020-01-21 暨南大学 Judgment document answer information extraction method, device, extractor, medium and equipment
CN111444677A (en) * 2020-02-21 2020-07-24 平安科技(深圳)有限公司 Reading model optimization method, device, equipment and medium based on big data


Also Published As

Publication number Publication date
CN111444677A (en) 2020-07-24

Similar Documents

Publication Publication Date Title
WO2021164292A1 (en) Reading model optimization method and apparatus based on big data, and device and medium
US11663409B2 (en) Systems and methods for training machine learning models using active learning
WO2021243828A1 (en) Text processing method and apparatus based on machine learning, and computer device and medium
TWI621077B (en) Character recognition method and server for claim documents
WO2021120543A1 (en) Natural language and knowledge graph-based method and device for representating learning
CN103207855B (en) For the fine granularity sentiment analysis system and method for product review information
US10878011B2 (en) Cognitive ranking of terms used during a conversation
US20190180196A1 (en) Systems and methods for generating and updating machine hybrid deep learning models
WO2021114810A1 (en) Graph structure-based official document recommendation method, apparatus, computer device, and medium
US20190179903A1 (en) Systems and methods for multi language automated action response
WO2019113122A1 (en) Systems and methods for improved machine learning for conversations
CN112287089B (en) Classification model training and automatic question-answering method and device for automatic question-answering system
WO2022227162A1 (en) Question and answer data processing method and apparatus, and computer device and storage medium
WO2020010834A1 (en) Faq question and answer library generalization method, apparatus, and device
US10885080B2 (en) Cognitive ranking of terms used during a conversation
WO2023050754A1 (en) Model training method and apparatus for private data set
CN111737432A (en) Automatic dialogue method and system based on joint training model
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
CN109614627A (en) A kind of text punctuate prediction technique, device, computer equipment and storage medium
CN114757178A (en) Core product word extraction method, device, equipment and medium
CN112445899B (en) Attribute matching method in knowledge base question and answer based on neural network
Cui et al. BiLSTM-Attention-CRF model for entity extraction in internet recruitment data
CN114969544A (en) Hot data-based recommended content generation method, device, equipment and medium
CN114722821A (en) Text matching method and device, storage medium and electronic equipment
CN114238798A (en) Search ranking method, system, device and storage medium based on neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20920623
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20920623
    Country of ref document: EP
    Kind code of ref document: A1