WO2021164292A1 - Reading model optimization method and apparatus based on big data, and device and medium - Google Patents

Reading model optimization method and apparatus based on big data, and device and medium

Info

Publication number
WO2021164292A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set
answer
model
ternary
labeled
Prior art date
Application number
PCT/CN2020/123170
Other languages
French (fr)
Chinese (zh)
Inventor
楼星雨
许开河
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021164292A1


Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06N: Computing arrangements based on specific computational models
    • G06N 20/00: Machine learning
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Definitions

  • This application relates to the field of information technology, and in particular to a method, apparatus, device, and medium for optimizing a reading model based on big data.
  • In artificial intelligence, reading comprehension is a challenging and widely used information processing technology in the field of natural language processing. Reading comprehension technology aims to find the corresponding answer in a given article or document according to a raised question, and can even judge whether the question can be answered at all.
  • An excellent reading comprehension model needs human-like language understanding and knowledge reasoning abilities in order to deeply mine and analyze an article and, according to a specific question, focus on different parts or viewpoints of the article to find the correct answer; the task is therefore highly difficult.
  • Current state-of-the-art reading comprehension models are all built on complex deep learning structures, which require a huge amount of training data for the model to learn.
  • The training data for reading comprehension needs to be pre-labeled to locate the article information, question information, and answer information.
  • The inventor realized that labeling the training data is very difficult, because an annotator must first read the entire article and then generate the corresponding answers according to the given questions, so neither efficiency nor accuracy can be well guaranteed. Due to the high cost of acquiring labeled data, in actual use reading comprehension models are often trained on small-scale training data and cannot find a better solution in the parameter space, which limits the accuracy of the model.
  • The embodiments of the present application provide a reading model optimization method, apparatus, device, and medium based on big data, to solve the problems of small training samples and low model accuracy caused by the high cost of acquiring labeled data in existing reading comprehension technology.
  • a reading model optimization method based on big data including:
  • a reading model optimization device based on big data including:
  • a pre-training module, configured to acquire a labeled data set and pre-train a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
  • a binary data pair generation module, configured to acquire an unlabeled data set and predict the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;
  • a ternary data pair generation module, configured to predict the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;
  • a filtering module, configured to filter the ternary data pairs through the pre-trained first reading comprehension model;
  • a screening module, configured to screen the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;
  • an optimization training module, configured to perform optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • one or more non-volatile readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • FIG. 1 is a flowchart of a method for optimizing a reading model based on big data in an embodiment of the present application
  • FIG. 2 is a flowchart of step S101 in a method for optimizing a reading model based on big data in another embodiment of the present application;
  • FIG. 3 is a flowchart of step S102 in a method for optimizing a reading model based on big data in another embodiment of the present application;
  • FIG. 4 is a flowchart of step S104 in a method for optimizing a reading model based on big data in another embodiment of the present application;
  • FIG. 5 is a flowchart of step S105 in a method for optimizing a reading model based on big data in another embodiment of the present application;
  • FIG. 6 is a schematic block diagram of a reading model optimization device based on big data in an embodiment of the present application.
  • Fig. 7 is a schematic diagram of a computer device in an embodiment of the present application.
  • The method for optimizing a reading model based on big data is a self-training method for improving a reading comprehension model; its purpose is to overcome the problem that a better solution cannot be found in the parameter space due to insufficient labeled reading comprehension data.
  • the method for optimizing a reading model based on big data is applied to a server.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for optimizing a reading model based on big data is provided, which includes the following steps:
  • step S101 a labeled data set is obtained, and a preset first reading comprehension model, a question generation model, and a second reading comprehension model are pre-trained according to the labeled data set.
  • the embodiment of the present application first uses a small number of existing labeled data sets to pre-train the reading comprehension model, so as to pave the way for obtaining pseudo-labeled data from the unlabeled data set.
  • Optionally, as shown in FIG. 2, the acquiring of the labeled data set and the pre-training of the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set in step S101 includes:
  • a labeled data set is obtained.
  • the labeled data set includes several labeled data pairs, and each labeled data pair includes article information, question information, and corresponding answer information.
  • Here, each labeled data pair includes article information, question information, and corresponding answer information, and the article information, question information, and answer information in each labeled data pair have been manually labeled in a specified manner.
  • Exemplarily, the ternary data pair can be implemented in a preset composition format; for example, each labeled data pair may be designed as a triplet: (Passage, Query, Answer). It can also be realized through identification information.
  • Taking a customer service robot as an example, the labeled data set is the collection of data pairs obtained by manually labeling historical customer service conversations; the article information is the historical customer service conversation, the question information is the question raised in that conversation, and the answer information is the answer to that question in the conversation. Because historical customer service conversation data is limited, the labeled data pairs in the labeled data set are also limited; and because manual labeling is expensive and difficult, even if the amount of historical conversation data is large enough, it is difficult to obtain enough training samples in a short period of time.
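For illustration, a minimal sketch of such a labeled data pair in Python follows; the application specifies only the (Passage, Query, Answer) format, so the class name, fields, and example strings below are hypothetical:

```python
from typing import NamedTuple

class LabeledPair(NamedTuple):
    """One labeled data pair: the (Passage, Query, Answer) triplet."""
    passage: str  # article information, e.g. a historical customer service conversation
    query: str    # question information raised in the conversation
    answer: str   # answer information for that question

# A labeled data set is simply a (typically small) collection of such triplets.
labeled_set = [
    LabeledPair(
        passage="User: My card was charged twice. Agent: The duplicate "
                "charge will be refunded within 3 days.",
        query="When will the duplicate charge be refunded?",
        answer="within 3 days",
    ),
]
```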
  • step S202 pre-training the first reading comprehension model is performed using the article information and question information in the labeled data set.
  • The first reading comprehension model is a big data model that predicts answer information from article information and question information: its input is the article information and question information, and its output is the answer information.
  • the first reading comprehension model includes but is not limited to the R-net machine reading comprehension model and the BERT machine reading comprehension model.
  • the article information and question information in each labeled data pair in the labeled data set are used as the input of the first reading comprehension model, and the first reading comprehension model is pre-trained.
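Since BERT is named as one admissible first reading comprehension model, a minimal pre-training (fine-tuning) sketch using the Hugging Face transformers library is given below. The application does not specify an implementation, so the checkpoint name, hyperparameters, and character-to-token span mapping are assumptions:

```python
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForQuestionAnswering.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def pretrain_step(passage: str, query: str, answer_start: int, answer_end: int):
    """One training step: (article, question) in, answer span out."""
    enc = tokenizer(query, passage, return_tensors="pt",
                    truncation=True, max_length=512)
    # Map the answer's character span in the passage to token positions
    # (may return None for whitespace; a real pipeline must handle that).
    start_tok = enc.char_to_token(0, answer_start, sequence_index=1)
    end_tok = enc.char_to_token(0, answer_end - 1, sequence_index=1)
    out = model(**enc,
                start_positions=torch.tensor([start_tok]),
                end_positions=torch.tensor([end_tok]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```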
  • step S203 the article information and answer information in the labeled data set are used to pre-train the question generation model.
  • The question generation model is a big data model that predicts question information from article information and answer information: its input is the article information and answer information, and its output is the question information.
  • Exemplarily, the question generation model may adopt a sequence-to-sequence architecture with a copy mechanism.
  • the article information and answer information of each labeled data pair in the labeled data set are used as the input of the question generation model, and the question generation model is pre-trained.
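The application names a sequence-to-sequence + copy architecture but gives no code. As a rough stand-in, one could fine-tune a generic pretrained encoder-decoder on (article, answer) → question pairs; note the copy mechanism is omitted here (subword seq2seq models partly obviate it), and the checkpoint name and prompt format are placeholders:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/mt5-small")
qg_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def qg_loss(passage: str, answer: str, question: str):
    """Encode the article and answer; train the decoder to emit the question."""
    src = tok(f"answer: {answer} context: {passage}",
              return_tensors="pt", truncation=True, max_length=512)
    tgt = tok(question, return_tensors="pt", truncation=True, max_length=64)
    return qg_model(**src, labels=tgt.input_ids).loss
```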
  • step S204 pre-training the second reading comprehension model is performed using the article information in the labeled data set.
  • The second reading comprehension model is a big data model that predicts answer information from article information alone: its input is the article information, and its output is the answer information.
  • the article information of each labeled data pair in the labeled data set is used as the input of the second reading comprehension model, and the second reading comprehension model is pre-trained.
  • the first reading comprehension model, question generation model, and second reading comprehension model after the pre-training is completed are used to subsequently generate a complete article, question, and answer ternary data pair based on the unlabeled data set, thereby generating a pseudo-labeled data set.
  • the question generation model is used to generate question information based on the unlabeled data set.
  • As noted above, the question generation model is usually pre-trained with a large amount of labeled article information, achieving the effect that the input is a certain sentence in the article information and the output is the sentence preceding it.
  • the second reading comprehension model is used to generate answer information based on the unlabeled data set as the source of one of the answers to the question information.
  • the first reading comprehension model is used to filter the question information and answer information.
  • the method for optimizing a reading model based on big data further includes:
  • step S102 an unlabeled data set is obtained, and the unlabeled data set is predicted by the pre-trained second reading comprehension model to obtain a binary data pair of the unlabeled data set about the article and the answer.
  • The unlabeled data set is article information without any labeling, obtained from the Internet (for example, Wikipedia or official accounts) by crawling or downloading; it is vast compared with the labeled data set.
  • the unlabeled data set is unlabeled article data obtained from the Internet, and historical customer service conversation data that has not yet been labeled.
  • The embodiment of the present application feeds the unlabeled data set as input into the pre-trained second reading comprehension model for prediction to obtain the answer information of the unlabeled data set, and then combines the article information and answer information to obtain article-answer binary data pairs.
  • Optionally, the second reading comprehension model can also be combined with other methods to predict answer information.
  • As shown in FIG. 3, predicting the unlabeled data set through the pre-trained second reading comprehension model in step S102 to obtain the article-answer binary data pairs of the unlabeled data set includes:
  • step S301 the unlabeled data set is input to a pre-trained second reading comprehension model, and the output of the pre-trained second reading comprehension model is obtained as the first predicted answer.
  • the first predicted answer is answer information obtained by predicting the unlabeled data set through a second reading comprehension model.
  • step S302 named entity recognition is performed on the unlabeled data set to obtain a second predicted answer.
  • the second predicted answer is answer information obtained by predicting the unlabeled data set through a named entity recognition technology.
  • Here, Named Entity Recognition (NER) is a rule-based method that can extract predicted answer information through regular expressions, pre-built entity dictionaries, and open-source syntax trees.
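A minimal sketch of this rule-based extraction is shown below; the actual regular expressions, entity dictionaries, and syntax trees are not specified in the application, so the resources here are purely illustrative:

```python
import re

# Illustrative resources; a real system would use domain-specific
# dictionaries and an open-source syntactic parser in addition.
ENTITY_DICT = {"refund", "duplicate charge", "billing address"}
DATE_PATTERN = re.compile(r"within \d+ (?:days|hours)|\d{4}-\d{2}-\d{2}")

def extract_candidate_answers(article: str) -> list[str]:
    """Collect spans matched by the regex or found in the entity dictionary."""
    answers = DATE_PATTERN.findall(article)
    answers += [e for e in ENTITY_DICT if e in article]
    return answers
```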
  • In step S303, a bidirectional long short-term memory network and conditional random field techniques are used to obtain a third predicted answer from the unlabeled data set.
  • The third predicted answer is the answer information obtained by predicting the unlabeled data set through bidirectional Long Short-Term Memory (LSTM) and Conditional Random Field (CRF) techniques.
  • Bidirectional LSTM combined with a conditional random field is currently the mainstream model-based method for entity recognition: the bidirectional LSTM performs feature extraction, and the conditional random field then incorporates the dependencies between tags into the extracted features and predicts the answer location, yielding the answer information.
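A compact sketch of a BiLSTM feature extractor with a CRF tagging layer follows, using the third-party pytorch-crf package; the layer sizes and the assumption of a token-level tag scheme (e.g. BIO over answer spans) are choices made here, not taken from the application:

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size: int, num_tags: int,
                 emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden, num_tags)   # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, tokens, tags=None):
        feats = self.fc(self.lstm(self.emb(tokens))[0])
        if tags is not None:              # training: negative log-likelihood
            return -self.crf(feats, tags)
        return self.crf.decode(feats)     # inference: best tag sequence
```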
  • In step S304, the first predicted answer, the second predicted answer, and the third predicted answer are combined to obtain the article-answer binary data pairs of the unlabeled data set.
  • The foregoing prediction of the first predicted answer through the second reading comprehension model, extraction of the second predicted answer using named entity recognition, and extraction of the third predicted answer using bidirectional LSTM and conditional random field techniques can be performed synchronously in parallel.
  • the obtained first predicted answer, second predicted answer, and third predicted answer are all regarded as one of the answer information corresponding to the unlabeled data set.
  • the embodiment of the present application obtains the answer information corresponding to the unlabeled data set by combining multiple methods, which can effectively increase the number and diversity of predicted answers, and further expand the number of binary data pairs and ternary data pairs.
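A sketch of the combination step, deduplicating the candidate answers from the three sources into article-answer pairs; the function and argument names are illustrative:

```python
def build_binary_pairs(article, mrc_answers, ner_answers, lstm_crf_answers):
    """Union the three answer sources, keeping one pair per distinct answer."""
    candidates = set(mrc_answers) | set(ner_answers) | set(lstm_crf_answers)
    return [(article, ans) for ans in candidates if ans.strip()]
```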
  • step S103 the binary data pair is predicted by the pre-trained question generation model to obtain the ternary data pair of the article, question, and answer in the unlabeled data set.
  • The embodiment of the present application takes the article-answer binary data pair as input and passes it into the pre-trained question generation model, which predicts the corresponding question information. Then, the binary data pair and the question information are combined to obtain the article-question-answer ternary data pairs of the unlabeled data set.
  • Each ternary data pair includes article information, question information, and answer information, and can be expressed in the format (article, question, answer).
  • step S104 the ternary data pair is filtered through the pre-trained first reading comprehension model.
  • The ternary data pairs generated in step S103 are relatively noisy; retaining all of them directly would introduce noise and produce adverse effects. Therefore, the embodiment of the present application further filters the ternary data pairs through the pre-trained first reading comprehension model to reduce noise and improve data quality. As shown in FIG. 4, the filtering of the ternary data pairs through the pre-trained first reading comprehension model in step S104 includes:
  • In step S401, the ternary data pairs are traversed, and the article information and question information in each ternary data pair are predicted by the pre-trained first reading comprehension model to obtain the predicted answer corresponding to that ternary data pair.
  • The embodiment of the present application takes the article information and question information in the ternary data pair as input and passes them to the pre-trained first reading comprehension model, which predicts answer information from them; that is, each ternary data pair obtains a corresponding predicted answer.
  • Note that the answer obtained in step S401 is predicted by the pre-trained first reading comprehension model from the article information and question information in the ternary data pair, whereas the first predicted answer obtained in step S301 is predicted by the pre-trained second reading comprehension model from the article information of the unlabeled data set alone and serves as the answer information in the ternary data pair. Therefore, the predicted answer corresponding to a ternary data pair may be the same as or different from the answer information in that pair.
  • step S402 the predicted answer corresponding to the ternary data pair is compared with the answer information in the ternary data pair.
  • The comparison method includes, but is not limited to: checking whether the predicted answer corresponding to the ternary data pair is completely consistent with the answer information in the ternary data pair, whether the two overlap, and whether one contains the other.
  • One of the above methods, or any combination of them, can be selected according to the data volume and the training effect of the reading comprehension model in the actual application, to implement the comparison and determine which ternary data pairs are retained.
  • step S403 if the predicted answer corresponding to the ternary data pair is different from the answer information in the ternary data pair, the ternary data pair is deleted.
  • step S404 if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair, the ternary data pair is retained.
  • Here, "same" and "different" are determined by the chosen comparison method rather than by exact string identity alone.
  • If the comparison method is complete consistency, the answers are considered the same if and only if the predicted answer corresponding to the ternary data pair is completely consistent with the answer information in the ternary data pair, and different otherwise. If the comparison method is overlap, they are considered the same when the predicted answer overlaps the answer information, and different otherwise. If the comparison method is inclusion, they are considered the same when one of the predicted answer and the answer information contains the other, and different otherwise.
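A sketch of the three comparison modes and the resulting filter follows; the whitespace tokenization is a placeholder (Chinese text would need a word segmenter), and `mrc_model` stands in for the pre-trained first reading comprehension model's prediction function:

```python
def answers_match(predicted: str, labeled: str, mode: str = "exact") -> bool:
    if mode == "exact":      # completely consistent
        return predicted == labeled
    if mode == "overlap":    # any shared tokens
        return bool(set(predicted.split()) & set(labeled.split()))
    if mode == "contain":    # one answer contains the other
        return predicted in labeled or labeled in predicted
    raise ValueError(f"unknown comparison mode: {mode}")

def filter_triples(triples, mrc_model, mode="exact"):
    """Keep only triples whose model-predicted answer matches the stored one."""
    kept = []
    for passage, question, answer in triples:
        predicted = mrc_model(passage, question)  # first reading comprehension model
        if answers_match(predicted, answer, mode):
            kept.append((passage, question, answer))
    return kept
```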
  • step S105 the filtered ternary data pairs are filtered according to the subject of the article in the labeled data set to generate a pseudo-labeled data set.
  • The embodiment of the present application further screens topic-related ternary data pairs from the retained ternary data pairs based on the article topics of the labeled data set. Since these topic-related ternary data pairs are not manually annotated, in order to distinguish them from the labeled data set, this embodiment refers to them as a pseudo-labeled data set.
  • As shown in FIG. 5, the screening of the filtered ternary data pairs according to the article topics in the labeled data set in step S105 to generate a pseudo-labeled data set includes:
  • In step S501, the Dirichlet distribution topic model is used to analyze the similarity between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set.
  • the Dirichlet distribution topic model is an unsupervised model and does not require annotated data.
  • the distribution analysis of an input article can be performed to obtain the probability that the article belongs to each topic.
  • The article information in the ternary data pairs retained after filtering and the article information in the labeled data set are input into the Dirichlet distribution topic model to obtain their topic distributions, from which the topic similarity between each retained ternary data pair and the labeled data set is obtained.
  • step S502 a ternary data pair whose topic similarity is higher than a preset threshold is obtained, and a pseudo-labeled data set is constructed.
  • The topic similarity of each ternary data pair to the labeled data set is compared with the preset threshold, and the ternary data pairs whose topic similarity is higher than the preset threshold are used as pseudo-labeled data to construct the pseudo-labeled data set.
  • By filtering on topic similarity, the embodiment of the present application retains the ternary data pairs whose field is the same as or similar to that of the labeled data set, which helps reduce noise.
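A sketch of this screening step using gensim follows, assuming the "Dirichlet distribution topic model" refers to latent Dirichlet allocation (LDA); the whitespace tokenizer, topic count, similarity measure (cosine over topic distributions), and threshold are all placeholders:

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def topic_vector(lda, dictionary, text):
    """Dense topic-probability vector for one article."""
    bow = dictionary.doc2bow(text.split())
    dense = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = prob
    return dense

def screen_by_topic(labeled_articles, candidate_triples, threshold=0.8):
    texts = [a.split() for a in labeled_articles]
    dictionary = corpora.Dictionary(texts)
    lda = LdaModel([dictionary.doc2bow(t) for t in texts],
                   id2word=dictionary, num_topics=10)
    # Reference: average topic distribution of the labeled data set.
    ref = np.mean([topic_vector(lda, dictionary, a)
                   for a in labeled_articles], axis=0)
    kept = []
    for passage, question, answer in candidate_triples:
        vec = topic_vector(lda, dictionary, passage)
        sim = vec @ ref / (np.linalg.norm(vec) * np.linalg.norm(ref) + 1e-9)
        if sim > threshold:   # topic similarity above the preset threshold
            kept.append((passage, question, answer))
    return kept
```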
  • The pseudo-labeled data set and the labeled data set together constitute new training samples for the reading comprehension model, thereby expanding the labeled data available for training, avoiding manual labeling, and reducing the cost of obtaining labeled data.
  • step S106 the pre-trained first reading comprehension model is optimized and trained according to the pseudo-labeled data set and the labeled data set.
  • The pseudo-labeled data and the labeled data are used together to train the first reading comprehension model, which greatly enriches the labeled data for training and helps find better parameters during training, so that a model better than the previous reading comprehension model can be obtained.
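The final optimization step amounts to concatenating the two data sets and retraining, as in this sketch; `train_mrc` is a placeholder for whatever training loop the first reading comprehension model uses:

```python
def optimize_first_model(model, labeled_set, pseudo_labeled_set, train_mrc):
    """Retrain on the enlarged sample: labeled plus pseudo-labeled triples."""
    combined = list(labeled_set) + list(pseudo_labeled_set)
    return train_mrc(model, combined)  # fine-tune from the pre-trained weights
```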
  • This embodiment of the application pre-trains the reading comprehension models using a small amount of labeled data; it then uses the pre-trained models to predict a large number of unlabeled data in related fields to generate rough (article, question, answer) ternary data pairs; it then selects high-quality ternary data pairs from the rough ones to construct a pseudo-labeled data set, which is added to the original labeled data set to retrain the reading comprehension model.
  • This greatly enriches the annotation data used to train the reading comprehension model, effectively solves the problem of small training samples caused by the high cost of acquiring annotated data in existing reading comprehension technology, helps find better parameters during training, yields a better model than the previous reading comprehension model, and improves the accuracy of the reading comprehension model.
  • the method for optimizing the reading model based on big data can alleviate the problem that the reading comprehension model cannot find a better solution in the parameter space caused by insufficient annotation data, and effectively improve the accuracy of the reading comprehension model.
  • The reading comprehension model trained through the embodiments of this application is mainly used for information extraction tasks, and information extraction is one of the most important modules in systems such as customer service robots and chatbots.
  • The reading comprehension model trained in this application can extract the answers to users' questions from massive documents more accurately and quickly, achieving accurate answers to users' questions and reducing the number of user inquiry rounds in the customer service robot.
  • This in turn reduces the overall load of the customer service robot: if the robot answers a user's question quickly and accurately, the number of inquiry rounds is small; if the answer is inaccurate, the user will often continue to ask questions, increasing the number of inquiries.
  • A reading model optimization apparatus based on big data is provided, which corresponds one-to-one to the reading model optimization method based on big data in the above embodiment.
  • the reading model optimization device based on big data includes a pre-training module 61, a binary data pair generating module 62, a ternary data pair generating module 63, a filtering module 64, a filtering module 65, and an optimization training module 66.
  • the detailed description of each functional module is as follows:
  • the pre-training module 61 is configured to obtain a labeled data set, and perform pre-training on a preset first reading comprehension model, a question generation model, and a second reading comprehension model according to the labeled data set;
  • the binary data pair generation module 62 is configured to obtain an unlabeled data set, and predict the unlabeled data set through the pre-trained second reading comprehension model to obtain the article-answer binary data pairs of the unlabeled data set;
  • the ternary data pair generation module 63 is configured to predict the binary data pair through the pre-trained question generation model to obtain the ternary data pair about the article, question, and answer in the unlabeled data set;
  • the filtering module 64 is configured to filter the ternary data pairs through the pre-trained first reading comprehension model;
  • the screening module 65 is configured to screen the filtered ternary data pairs according to the subject of the article in the labeled data set to generate a pseudo-labeled data set;
  • the optimization training module 66 is configured to perform optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  • the pre-training module 61 includes:
  • an obtaining unit, configured to obtain a labeled data set, where the labeled data set includes a number of labeled data pairs, and each labeled data pair includes article information, question information, and corresponding answer information;
  • the first pre-training unit is configured to pre-train the first reading comprehension model by using the article information and question information in the labeled data set;
  • the second pre-training unit is configured to use the article information and answer information in the labeled data set to pre-train the question generation model;
  • the third pre-training unit is used to pre-train the second reading comprehension model by using the article information in the labeled data set.
  • the binary data pair generation module 62 includes:
  • the first answer prediction unit is configured to input the unlabeled data set into a pre-trained second reading comprehension model, and obtain the output of the pre-trained second reading comprehension model as the first predicted answer;
  • the second answer prediction unit is configured to perform named entity recognition on the unlabeled data set to obtain a second predicted answer;
  • the third answer prediction unit is used to obtain the third predicted answer from the unlabeled data set by using a bidirectional long short-term memory network and conditional random field technology;
  • the binary data pair generating unit is used to combine the first predicted answer, the second predicted answer, and the third predicted answer to obtain the binary data pair of the article and the answer in the unlabeled data set.
  • the filtering module 64 includes:
  • a prediction unit, configured to traverse the ternary data pairs and predict the article information and question information in each ternary data pair through the pre-trained first reading comprehension model to obtain the corresponding predicted answer;
  • the comparison unit is used to compare the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
  • the filtering unit is configured to delete the ternary data pair if the predicted answer corresponding to the ternary data pair is different from the answer information in the ternary data pair, and to retain the ternary data pair if the predicted answer is the same as the answer information in the ternary data pair.
  • the screening module 65 includes:
  • Each module in the device for optimizing a reading model based on big data can be implemented in whole or in part by software, hardware, and a combination thereof.
  • The above modules may be embedded in hardware form in, or independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 7.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer-readable instructions are executed by the processor, a reading model optimization method based on big data is realized.
  • a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • one or more non-volatile readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the computer-readable storage medium may be non-volatile or volatile.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A reading model optimization method based on big data. The method comprises: pre-training a first reading comprehension model, a question generation model and a second reading comprehension model according to a marked data set; predicting an unmarked data set by means of the pre-trained second reading comprehension model to obtain binary data pairs regarding articles and answers; predicting the binary data pairs by means of the pre-trained question generation model to obtain ternary data pairs regarding articles, questions and answers; filtering the ternary data pairs by means of the pre-trained first reading comprehension model; screening the filtered ternary data pairs according to article topics in the marked data set to generate a pseudo-marked data set; and performing optimized training on the pre-trained first reading comprehension model according to the pseudo-marked data set and the marked data set. According to the method, the problems of small training samples and low model precision of the existing reading comprehension technology caused by the high acquisition costs of marked data are solved.

Description

Reading model optimization method, apparatus, device and medium based on big data

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 21, 2020, with application number 202010108092.0 and the invention title "Reading model optimization method, apparatus, device and medium based on big data", the entire content of which is incorporated herein by reference.
Technical Field

This application relates to the field of information technology, and in particular to a method, apparatus, device, and medium for optimizing a reading model based on big data.

Background Art

In artificial intelligence, reading comprehension is a challenging and widely used information processing technology in the field of natural language processing. Reading comprehension technology aims to find the corresponding answer in a given article or document according to a raised question, and can even judge whether the question can be answered at all. An excellent reading comprehension model needs human-like language understanding and knowledge reasoning abilities in order to deeply mine and analyze an article and, according to a specific question, focus on different parts or viewpoints of the article to find the correct answer; the task is therefore highly difficult. Current state-of-the-art reading comprehension models are all built on complex deep learning structures, which require a huge amount of training data for the model to learn. From the definition of reading comprehension technology, its training data needs to be pre-labeled to locate the article information, question information, and answer information. However, the inventor realized that labeling the training data is very difficult, because an annotator must first read the entire article and then generate the corresponding answers according to the given questions, so neither efficiency nor accuracy can be well guaranteed. Due to the high cost of acquiring labeled data, in actual use reading comprehension models are often trained on small-scale training data and cannot find a better solution in the parameter space, which limits the accuracy of the model.

Therefore, finding a way to solve the problems of small training samples and low model accuracy caused by the high cost of acquiring labeled data in existing reading comprehension technology has become an urgent technical problem for those skilled in the art.
Summary of the Invention

The embodiments of the present application provide a reading model optimization method, apparatus, device, and medium based on big data, to solve the problems of small training samples and low model accuracy caused by the high cost of acquiring labeled data in existing reading comprehension technology.
A reading model optimization method based on big data includes:

acquiring a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;

acquiring an unlabeled data set, and predicting the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;

predicting the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;

filtering the ternary data pairs through the pre-trained first reading comprehension model;

screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;

performing optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
A reading model optimization apparatus based on big data includes:

a pre-training module, configured to acquire a labeled data set and pre-train a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;

a binary data pair generation module, configured to acquire an unlabeled data set and predict the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;

a ternary data pair generation module, configured to predict the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;

a filtering module, configured to filter the ternary data pairs through the pre-trained first reading comprehension model;

a screening module, configured to screen the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;

an optimization training module, configured to perform optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:

acquiring a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;

acquiring an unlabeled data set, and predicting the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;

predicting the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;

filtering the ternary data pairs through the pre-trained first reading comprehension model;

screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;

performing optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
One or more non-volatile readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:

acquiring a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;

acquiring an unlabeled data set, and predicting the unlabeled data set through the pre-trained second reading comprehension model to obtain article-answer binary data pairs of the unlabeled data set;

predicting the binary data pairs through the pre-trained question generation model to obtain article-question-answer ternary data pairs of the unlabeled data set;

filtering the ternary data pairs through the pre-trained first reading comprehension model;

screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;

performing optimization training on the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
The details of one or more embodiments of the present application are set forth in the following drawings and description; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.

Description of the Drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
FIG. 1 is a flowchart of a method for optimizing a reading model based on big data in an embodiment of the present application;

FIG. 2 is a flowchart of step S101 in a method for optimizing a reading model based on big data in another embodiment of the present application;

FIG. 3 is a flowchart of step S102 in a method for optimizing a reading model based on big data in another embodiment of the present application;

FIG. 4 is a flowchart of step S104 in a method for optimizing a reading model based on big data in another embodiment of the present application;

FIG. 5 is a flowchart of step S105 in a method for optimizing a reading model based on big data in another embodiment of the present application;

FIG. 6 is a schematic block diagram of a reading model optimization apparatus based on big data in an embodiment of the present application;

FIG. 7 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The big-data-based reading model optimization method provided in the embodiments of this application is a self-training method for improving a reading comprehension model; its purpose is to overcome the problem that a better solution cannot be found in the parameter space due to insufficient labeled reading comprehension data. The method is applied to a server, which can be implemented as an independent server or a server cluster composed of multiple servers. In an embodiment, as shown in FIG. 1, a method for optimizing a reading model based on big data is provided, including the following steps:

In step S101, a labeled data set is acquired, and a preset first reading comprehension model, question generation model, and second reading comprehension model are pre-trained according to the labeled data set.

Here, the embodiment of the present application first uses a small amount of existing labeled data to pre-train the reading comprehension models, paving the way for obtaining pseudo-labeled data from the unlabeled data set. Optionally, as shown in FIG. 2, the acquiring of the labeled data set and the pre-training of the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set in step S101 includes:

In step S201, a labeled data set is acquired; the labeled data set includes several labeled data pairs, and each labeled data pair includes article information, question information, and corresponding answer information.

Here, each labeled data pair includes article information, question information, and corresponding answer information, and these have been manually labeled in a specified manner. Exemplarily, the ternary data pair can be implemented in a preset composition format; for example, each labeled data pair may be designed as a triplet: (Passage, Query, Answer). It can also be realized through identification information.

Taking a customer service robot as an example, the labeled data set is the collection of data pairs obtained by manually labeling historical customer service conversations; the article information is the historical customer service conversation, the question information is the question raised in that conversation, and the answer information is the answer to that question in the conversation. Because historical customer service conversation data is limited, the labeled data pairs in the labeled data set are also limited; and because manual labeling is expensive and difficult, even if the amount of historical conversation data is large enough, it is difficult to obtain enough training samples in a short period of time.
In step S202, the article information and question information in the labeled data set are used to pre-train the first reading comprehension model.

Here, the first reading comprehension model is a big data model that predicts answer information from article information and question information: its input is the article and question, and its output is the answer. Optionally, the first reading comprehension model includes but is not limited to the R-net and BERT machine reading comprehension models. The embodiment of the present application uses the article information and question information of each labeled data pair as the input of the first reading comprehension model to pre-train it.

In step S203, the article information and answer information in the labeled data set are used to pre-train the question generation model.

Here, the question generation model is a big data model that predicts question information from article information and answer information: its input is the article and answer, and its output is the question. Exemplarily, the question generation model may adopt a sequence-to-sequence architecture with a copy mechanism. The embodiment of the present application uses the article information and answer information of each labeled data pair as the input of the question generation model to pre-train it.

In step S204, the article information in the labeled data set is used to pre-train the second reading comprehension model.

Here, the second reading comprehension model is a big data model that predicts answer information from article information alone: its input is the article, and its output is the answer. The embodiment of the present application uses the article information of each labeled data pair as the input of the second reading comprehension model to pre-train it.
预训练完成后的所述第一阅读理解模型、问题生成模型、第二阅读理解模型用于后续根据无标注数据集生成完整的文章、问题、答案三元数据对,进而生成伪标注数据集。其中,问题生成模型用于根据无标注数据集生成问题信息,如前所述通常先用大量已标注的文章信息对问题生成模型进行预训练,使其达到输入是文章信息中的某一句、输出是输入的上一句的效果。第二阅读理解模型用于根据无标注数据集生成答案信息,作为问题信息的其中一个答案的来源。第一阅读理解模型则用于对问题信息和答案信息进行过滤操作。所述基于大数据的阅读模型优化方法还包括:The first reading comprehension model, question generation model, and second reading comprehension model after the pre-training is completed are used to subsequently generate a complete article, question, and answer ternary data pair based on the unlabeled data set, thereby generating a pseudo-labeled data set. Among them, the question generation model is used to generate question information based on the unlabeled data set. As mentioned above, the question generation model is usually pre-trained with a large amount of labeled article information, so that the input is a certain sentence in the article information, and the output is It is the effect of the previous sentence entered. The second reading comprehension model is used to generate answer information based on the unlabeled data set as the source of one of the answers to the question information. The first reading comprehension model is used to filter the question information and answer information. The method for optimizing a reading model based on big data further includes:
在步骤S102中,获取无标注数据集,通过预训练后的第二阅读理解模型对无标注数据集进行预测,得到所述无标注数据集关于文章和答案的二元数据对。In step S102, an unlabeled data set is obtained, and the unlabeled data set is predicted by the pre-trained second reading comprehension model to obtain a binary data pair of the unlabeled data set about the article and the answer.
Here, the unlabeled data set consists of article information obtained from the Internet, for example from Wikipedia or WeChat official accounts, by crawling or downloading, without any annotation; its scale is very large relative to the labeled data set. Taking a customer service robot as an example, the unlabeled data set comprises unlabeled article data obtained from the Internet and historical customer service conversation data that has not yet been labeled. In this embodiment, the unlabeled data set is fed as input to the pre-trained second reading comprehension model for prediction, yielding answer information for the unlabeled data set; the article information and answer information are then combined into (article, answer) binary data pairs.
Optionally, the second reading comprehension model can also be combined with several other methods to predict answer information. As shown in FIG. 3, predicting on the unlabeled data set with the pre-trained second reading comprehension model in step S102 to obtain the (article, answer) binary data pairs includes:
In step S301, the unlabeled data set is input into the pre-trained second reading comprehension model, and the output of the pre-trained second reading comprehension model is taken as the first predicted answer.
Here, the first predicted answer is the answer information obtained by predicting on the unlabeled data set with the second reading comprehension model.
In step S302, named entity recognition is performed on the unlabeled data set to obtain the second predicted answer.
Here, the second predicted answer is the answer information obtained by predicting on the unlabeled data set with named entity recognition. Named entity recognition (NER) here refers to a rule-based method that extracts predicted answer information using regular expressions, a pre-built entity dictionary, and open-source syntax trees.
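A minimal sketch of this rule-based extraction follows; the regular expression and the entity dictionary below are hypothetical stand-ins for the pre-built resources mentioned above, and the syntax-tree step is omitted.

```python
# Sketch: rule-based candidate-answer extraction (second predicted answer).
import re

ENTITY_DICT = {"平安科技", "深圳"}                    # hypothetical pre-built dictionary
DATE_RE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")   # dates as one answer type

def rule_based_answers(article):
    answers = set(DATE_RE.findall(article))          # regex-matched spans
    answers |= {e for e in ENTITY_DICT if e in article}  # dictionary hits
    return sorted(answers)
```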
In step S303, a bidirectional long short-term memory network and conditional random field techniques are used to obtain the third predicted answer from the unlabeled data set.
Here, the third predicted answer is the answer information obtained by predicting on the unlabeled data set with a bidirectional long short-term memory network (LSTM) and a conditional random field (CRF). The combination of a bidirectional LSTM and a CRF is currently the mainstream model-based approach to entity recognition: the bidirectional LSTM extracts features, and the CRF layer then incorporates the dependencies between tags into those features and predicts the answer positions, yielding the answer information.
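A minimal BiLSTM-CRF tagger is sketched below, assuming the third-party pytorch-crf package (torchcrf); the vocabulary size, dimensions, and three-tag scheme are illustrative choices.

```python
# Sketch: BiLSTM-CRF tagger for the third predicted answer.
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=21128, embed_dim=128, hidden=256, num_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden // 2, bidirectional=True,
                            batch_first=True)       # BiLSTM feature extractor
        self.fc = nn.Linear(hidden, num_tags)        # per-token tag emissions
        self.crf = CRF(num_tags, batch_first=True)   # models tag dependencies

    def loss(self, tokens, tags):
        feats = self.fc(self.lstm(self.embed(tokens))[0])
        return -self.crf(feats, tags)                # negative log-likelihood

    def decode(self, tokens):
        feats = self.fc(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(feats)                # best tag sequence per sentence
```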
In step S304, the first predicted answer, the second predicted answer, and the third predicted answer are merged to obtain the (article, answer) binary data pairs of the unlabeled data set.
The prediction of the first answer by the second reading comprehension model, the extraction of the second predicted answer by named entity recognition, and the extraction of the third predicted answer by the bidirectional LSTM and CRF described above run synchronously in parallel. Each of the first, second, and third predicted answers is treated as one of the pieces of answer information corresponding to the unlabeled data set. By combining multiple methods to obtain the answer information for the unlabeled data set, this embodiment effectively increases the number and diversity of predicted answers, and thereby expands the number of binary and ternary data pairs.
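The merge itself can be as simple as the sketch below; deduplication by exact string match is an assumption, since the application does not specify how duplicate answers from the three sources are handled.

```python
# Sketch: merging the three answer sources into (article, answer) pairs.
def build_binary_pairs(article, mrc_answers, ner_answers, crf_answers):
    merged = set(mrc_answers) | set(ner_answers) | set(crf_answers)
    return [(article, ans) for ans in merged]
```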
In step S103, the pre-trained question generation model predicts on the binary data pairs to obtain (article, question, answer) ternary data pairs for the unlabeled data set.
Here, this embodiment takes the (article, answer) binary data pairs as input to the pre-trained question generation model, which predicts question information for each pair. The binary data pair and the question information are then combined to obtain the ternary data pairs of the unlabeled data set. Each ternary data pair includes article information, question information, and answer information, and can be represented in the format (article, question, answer).
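In code this expansion is a single pass over the binary pairs; `generate_question` below is a hypothetical wrapper around the question generation model's decoding step.

```python
# Sketch: expanding (article, answer) pairs into (article, question, answer)
# triples with the pre-trained question generation model.
def build_triples(binary_pairs, generate_question):
    return [(article, generate_question(article, answer), answer)
            for article, answer in binary_pairs]
```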
In step S104, the ternary data pairs are filtered by the pre-trained first reading comprehension model.
In this embodiment, the ternary data pairs produced by S103 are relatively noisy; if all of them were retained directly, the introduced noise would actually harm the result. This embodiment therefore further filters the ternary data pairs with the pre-trained first reading comprehension model, so as to reduce noise and improve data quality. As shown in FIG. 4, filtering the ternary data pairs with the pre-trained first reading comprehension model in step S104 includes:
In step S401, the ternary data pairs are traversed, and the pre-trained first reading comprehension model predicts on the article information and question information in each ternary data pair to obtain the predicted answer corresponding to that pair.
Here, this embodiment takes the article information and question information in the ternary data pair as input to the pre-trained first reading comprehension model, which predicts answer information from them, i.e., the predicted answer corresponding to the ternary data pair. Note that the answer information obtained in step S401 is predicted by the pre-trained first reading comprehension model from the article and question information in the ternary data, whereas the first predicted answer obtained in step S301 is predicted by the pre-trained second reading comprehension model from the article information in the unlabeled data set and is used as the answer information in the ternary data pair. The predicted answer corresponding to the ternary data pair may therefore be the same as, or different from, the answer information in the pair.
In step S402, the predicted answer corresponding to the ternary data pair is compared with the answer information in the ternary data pair.
Here, the comparison methods include, but are not limited to: whether the predicted answer corresponding to the ternary data pair is exactly identical to the answer information in the pair, whether the two overlap, and whether one contains the other. One of these methods, or any combination of them, can be selected according to the amount of data in the actual application and the training effect of the reading comprehension model, so as to decide which ternary data pairs to retain.
In step S403, if the predicted answer corresponding to the ternary data pair differs from the answer information in the ternary data pair, the ternary data pair is deleted.
In step S404, if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair, the ternary data pair is retained.
Depending on the comparison method adopted, the ternary data pairs that are deleted and retained are not exactly the same. When the comparison checks for exact identity, the predicted answer and the answer information in the ternary data pair are considered the same if and only if they are completely identical, and different otherwise. When the comparison checks for overlap, they are considered the same if they overlap, and different otherwise. When the comparison checks for containment, they are considered the same if one contains the other, and different otherwise. This embodiment deletes the ternary data pairs whose predicted answer differs from the answer information in the pair, and retains those whose predicted answer is the same as it. A sketch of these comparison modes follows.
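In the sketch below, character-set intersection stands in for the "overlap" test, which the application leaves open, so this is one plausible reading; `predict_answer` is a hypothetical wrapper around the pre-trained first reading comprehension model.

```python
# Sketch: the three comparison modes used to filter ternary data pairs.
def answers_match(predicted, labeled, mode="exact"):
    if mode == "exact":
        return predicted == labeled                       # complete identity
    if mode == "overlap":
        return len(set(predicted) & set(labeled)) > 0     # any shared characters
    if mode == "contains":
        return predicted in labeled or labeled in predicted
    raise ValueError(f"unknown mode: {mode}")

def filter_triples(triples, predict_answer, mode="exact"):
    # Keep a triple only when the first model's answer matches its answer field.
    return [(a, q, ans) for a, q, ans in triples
            if answers_match(predict_answer(a, q), ans, mode)]
```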
In step S105, the filtered ternary data pairs are screened according to the article topics in the labeled data set to generate a pseudo-labeled data set.
After filtering the ternary data pairs, this embodiment further screens the retained pairs for topic-relevant ones, based on the article topics of the labeled data set. Since these topic-relevant ternary data pairs are not manually annotated, they are referred to in this embodiment as a pseudo-labeled data set, to distinguish them from the labeled data set.
Optionally, as shown in FIG. 5, screening the filtered ternary data pairs according to the article topics in the labeled data set to generate the pseudo-labeled data set in step S105 includes:
In step S501, a Dirichlet distribution topic model is used to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, yielding the topic similarity between each ternary data pair and the labeled data set.
Here, the Dirichlet distribution topic model (latent Dirichlet allocation, LDA) is an unsupervised model and requires no labeled data. It can perform a distribution analysis on an input article and obtain the probability that the article belongs to each topic. In this embodiment, the article information in the ternary data pairs retained after filtering and the article information in the labeled data set are input into the topic model, yielding for each article the probability of belonging to each topic. The greater the probability, the greater the similarity, from which the topical similarity of the article information, i.e., the topic similarity between the ternary data pair and the labeled data set, is obtained.
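A minimal sketch of this similarity computation follows, assuming the gensim library; whitespace tokenization (standing in for a real Chinese tokenizer), the topic count, and averaging cosine similarity over the labeled articles are all illustrative simplifications.

```python
# Sketch: LDA topic similarity between a candidate article and the labeled set.
from gensim import corpora, matutils
from gensim.models import LdaModel

def topic_similarity(candidate_article, labeled_articles, num_topics=20):
    texts = [doc.split() for doc in labeled_articles + [candidate_article]]
    dictionary = corpora.Dictionary(texts)
    bows = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    candidate_vec = lda[bows[-1]]     # sparse (topic id, probability) mixture
    # Similarity to the labeled set = mean cosine similarity to its articles.
    sims = [matutils.cossim(lda[bow], candidate_vec) for bow in bows[:-1]]
    return sum(sims) / len(sims)
```

Ternary data pairs whose similarity exceeds the preset threshold of step S502 would then be kept as pseudo-labeled data.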
In step S502, the ternary data pairs whose topic similarity is higher than a preset threshold are obtained to construct the pseudo-labeled data set.
In this embodiment, a threshold is preset, the topic similarity of each ternary data pair to the labeled data set is compared against it, and the ternary data pairs whose topic similarity exceeds the preset threshold are selected as pseudo-labeled data to construct the pseudo-labeled data set.
Since the unlabeled data comes from many domains, and many of those domains differ greatly from the domain of the labeled data set, this embodiment filters by topic similarity and retains the ternary data pairs whose domain is the same as and/or similar to that of the labeled data set, which helps to reduce noise.
Here, the pseudo-labeled data set and the labeled data set together constitute new training samples for the reading comprehension model, which expands the labeled data for training the model, avoids manual annotation, and reduces the cost of acquiring labeled data.
In step S106, the pre-trained first reading comprehension model is optimized and trained according to the pseudo-labeled data set and the labeled data set.
In this embodiment, the pseudo-labeled data and the labeled data are used together to train the first reading comprehension model, which greatly enriches the labeled data for training the model and helps find better parameters during training, so that a model better than the previous reading comprehension model can be obtained.
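One simple realization of this step, sketched below, concatenates the two sets and fine-tunes on the union; the mixing strategy and epoch count are assumptions, since the application only states that both sets are used together. `model_train_step` is a hypothetical per-example training function such as the `pretrain_step` sketched earlier.

```python
# Sketch: optimization training on labeled + pseudo-labeled triples.
def optimize(model_train_step, labeled_set, pseudo_labeled_set, epochs=2):
    combined = list(labeled_set) + list(pseudo_labeled_set)
    for _ in range(epochs):
        for article, question, answer in combined:
            model_train_step(article, question, answer)
```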
This embodiment pre-trains the reading comprehension models using a small labeled data set; then uses the pre-trained models to predict on a large amount of unlabeled data in related domains, generating rough (article, question, answer) ternary data pairs; and then selects high-quality ternary data pairs from the rough ones to construct a pseudo-labeled data set, which is added to the original labeled data set for retraining the reading comprehension model. This greatly enriches the annotated data used for training the reading comprehension model, effectively solves the small-training-sample problem of existing reading comprehension technology caused by the high cost of acquiring annotated data, helps find better parameters during training, yields a model better than the previous reading comprehension model, and improves the model's accuracy.
The big-data-based reading model optimization method provided by the embodiments of the present application can alleviate the problem that, for lack of annotated data, the reading comprehension model cannot find a better solution in the parameter space, effectively improving the model's accuracy. The reading comprehension model trained through these embodiments is mainly applied to information extraction tasks, and information extraction is one of the most important modules of systems such as customer service robots and chatbots. A reading comprehension model trained according to this application can extract the answer to the question a user wants to ask from massive documents more accurately and quickly, answer user questions precisely, and reduce both the number of question rounds a user goes through with the customer service robot and the robot's overall load. If the customer service robot answers a user's question quickly and accurately, few rounds are needed; if the answer is inaccurate, the user will often keep asking, increasing the number of rounds.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a big-data-based reading model optimization apparatus is provided, corresponding one-to-one to the big-data-based reading model optimization method in the foregoing embodiments. As shown in FIG. 6, the apparatus includes a pre-training module 61, a binary data pair generation module 62, a ternary data pair generation module 63, a filtering module 64, a screening module 65, and an optimization training module 66. The functional modules are described in detail as follows:
The pre-training module 61 is configured to obtain a labeled data set, and pre-train a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set.
The binary data pair generation module 62 is configured to obtain an unlabeled data set, and predict on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set.
The ternary data pair generation module 63 is configured to predict on the binary data pairs with the pre-trained question generation model to obtain the (article, question, answer) ternary data pairs of the unlabeled data set.
The filtering module 64 is configured to filter the ternary data pairs with the pre-trained first reading comprehension model.
The screening module 65 is configured to screen the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set.
The optimization training module 66 is configured to optimize and train the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
Optionally, the pre-training module 61 includes:
an obtaining unit, configured to obtain a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
a first pre-training unit, configured to pre-train the first reading comprehension model using the article information and question information in the labeled data set;
a second pre-training unit, configured to pre-train the question generation model using the article information and answer information in the labeled data set;
a third pre-training unit, configured to pre-train the second reading comprehension model using the article information in the labeled data set.
Optionally, the binary data pair generation module 62 includes:
a first answer prediction unit, configured to input the unlabeled data set into the pre-trained second reading comprehension model and take its output as the first predicted answer;
a second answer prediction unit, configured to perform named entity recognition on the unlabeled data set to obtain the second predicted answer;
a third answer prediction unit, configured to obtain the third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques;
a binary data pair generation unit, configured to merge the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
Optionally, the filtering module 64 includes:
a prediction unit, configured to traverse the ternary data pairs and predict on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the pair;
a comparison unit, configured to compare the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
a filtering unit, configured to delete the ternary data pair if its corresponding predicted answer differs from the answer information in the pair, and to retain the ternary data pair if its corresponding predicted answer is the same as the answer information in the pair.
Optionally, the screening module 65 is configured to:
use a Dirichlet distribution topic model to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set; and
obtain the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
For specific limitations on the big-data-based reading model optimization apparatus, refer to the above limitations on the big-data-based reading model optimization method, which are not repeated here. Each module in the apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or be independent of, the processor of a computer device, or be stored in software form in the memory of a computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a big-data-based reading model optimization method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set;
predicting on the binary data pairs with the pre-trained question generation model to obtain the (article, question, answer) ternary data pairs of the unlabeled data set;
filtering the ternary data pairs with the pre-trained first reading comprehension model;
screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;
optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set;
predicting on the binary data pairs with the pre-trained question generation model to obtain the (article, question, answer) ternary data pairs of the unlabeled data set;
filtering the ternary data pairs with the pre-trained first reading comprehension model;
screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set;
optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
The computer-readable storage medium may be non-volatile or volatile.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example. In practical applications, the above functions can be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A big-data-based reading model optimization method, comprising:
    obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
    obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain (article, answer) binary data pairs of the unlabeled data set;
    predicting on the binary data pairs with the pre-trained question generation model to obtain (article, question, answer) ternary data pairs of the unlabeled data set;
    filtering the ternary data pairs with the pre-trained first reading comprehension model;
    screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set; and
    optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  2. The big-data-based reading model optimization method according to claim 1, wherein obtaining the labeled data set and pre-training the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set comprises:
    obtaining a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
    pre-training the first reading comprehension model using the article information and question information in the labeled data set;
    pre-training the question generation model using the article information and answer information in the labeled data set; and
    pre-training the second reading comprehension model using the article information in the labeled data set.
  3. The big-data-based reading model optimization method according to claim 1 or 2, wherein predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set comprises:
    inputting the unlabeled data set into the pre-trained second reading comprehension model, and taking the output of the pre-trained second reading comprehension model as a first predicted answer;
    performing named entity recognition on the unlabeled data set to obtain a second predicted answer;
    obtaining a third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques; and
    merging the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
  4. The big-data-based reading model optimization method according to claim 1 or 2, wherein filtering the ternary data pairs with the pre-trained first reading comprehension model comprises:
    traversing the ternary data pairs, and predicting on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the ternary data pair;
    comparing the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
    deleting the ternary data pair if the predicted answer corresponding to the ternary data pair differs from the answer information in the ternary data pair; and
    retaining the ternary data pair if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair.
  5. The big-data-based reading model optimization method according to claim 1 or 2, wherein screening the filtered ternary data pairs according to the article topics in the labeled data set to generate the pseudo-labeled data set comprises:
    using a Dirichlet distribution topic model to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set; and
    obtaining the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
  6. A big-data-based reading model optimization apparatus, comprising:
    a pre-training module, configured to obtain a labeled data set, and pre-train a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
    a binary data pair generation module, configured to obtain an unlabeled data set, and predict on the unlabeled data set with the pre-trained second reading comprehension model to obtain (article, answer) binary data pairs of the unlabeled data set;
    a ternary data pair generation module, configured to predict on the binary data pairs with the pre-trained question generation model to obtain (article, question, answer) ternary data pairs of the unlabeled data set;
    a filtering module, configured to filter the ternary data pairs with the pre-trained first reading comprehension model;
    a screening module, configured to screen the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set; and
    an optimization training module, configured to optimize and train the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  7. The big-data-based reading model optimization apparatus according to claim 6, wherein the pre-training module comprises:
    an obtaining unit, configured to obtain a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
    a first pre-training unit, configured to pre-train the first reading comprehension model using the article information and question information in the labeled data set;
    a second pre-training unit, configured to pre-train the question generation model using the article information and answer information in the labeled data set; and
    a third pre-training unit, configured to pre-train the second reading comprehension model using the article information in the labeled data set.
  8. The big-data-based reading model optimization apparatus according to claim 6 or 7, wherein the binary data pair generation module comprises:
    a first answer prediction unit, configured to input the unlabeled data set into the pre-trained second reading comprehension model and take its output as a first predicted answer;
    a second answer prediction unit, configured to perform named entity recognition on the unlabeled data set to obtain a second predicted answer;
    a third answer prediction unit, configured to obtain a third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques; and
    a binary data pair generation unit, configured to merge the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
  9. The big-data-based reading model optimization apparatus according to claim 6 or 7, wherein the filtering module comprises:
    a prediction unit, configured to traverse the ternary data pairs and predict on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the ternary data pair;
    a comparison unit, configured to compare the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair; and
    a filtering unit, configured to delete the ternary data pair if its corresponding predicted answer differs from the answer information in the ternary data pair, and to retain the ternary data pair if its corresponding predicted answer is the same as the answer information in the ternary data pair.
  10. The big-data-based reading model optimization apparatus according to claim 6 or 7, wherein the screening module is configured to:
    use a Dirichlet distribution topic model to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set; and
    obtain the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
    obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
    obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain (article, answer) binary data pairs of the unlabeled data set;
    predicting on the binary data pairs with the pre-trained question generation model to obtain (article, question, answer) ternary data pairs of the unlabeled data set;
    filtering the ternary data pairs with the pre-trained first reading comprehension model;
    screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set; and
    optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  12. The computer device according to claim 11, wherein obtaining the labeled data set and pre-training the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set comprises:
    obtaining a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
    pre-training the first reading comprehension model using the article information and question information in the labeled data set;
    pre-training the question generation model using the article information and answer information in the labeled data set; and
    pre-training the second reading comprehension model using the article information in the labeled data set.
  13. The computer device according to claim 11 or 12, wherein predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set comprises:
    inputting the unlabeled data set into the pre-trained second reading comprehension model, and taking the output of the pre-trained second reading comprehension model as a first predicted answer;
    performing named entity recognition on the unlabeled data set to obtain a second predicted answer;
    obtaining a third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques; and
    merging the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
  14. The computer device according to claim 11 or 12, wherein filtering the ternary data pairs with the pre-trained first reading comprehension model comprises:
    traversing the ternary data pairs, and predicting on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the ternary data pair;
    comparing the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
    deleting the ternary data pair if the predicted answer corresponding to the ternary data pair differs from the answer information in the ternary data pair; and
    retaining the ternary data pair if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair.
  15. The computer device according to claim 11 or 12, wherein screening the filtered ternary data pairs according to the article topics in the labeled data set to generate the pseudo-labeled data set comprises:
    using a Dirichlet distribution topic model to perform similarity analysis between the article information of the filtered ternary data pairs and the article information of the labeled data set, obtaining the topic similarity between each ternary data pair and the labeled data set; and
    obtaining the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
  16. One or more non-volatile readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a labeled data set, and pre-training a preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set;
    obtaining an unlabeled data set, and predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain (article, answer) binary data pairs of the unlabeled data set;
    predicting on the binary data pairs with the pre-trained question generation model to obtain (article, question, answer) ternary data pairs of the unlabeled data set;
    filtering the ternary data pairs with the pre-trained first reading comprehension model;
    screening the filtered ternary data pairs according to the article topics in the labeled data set to generate a pseudo-labeled data set; and
    optimizing and training the pre-trained first reading comprehension model according to the pseudo-labeled data set and the labeled data set.
  17. The non-volatile readable storage medium according to claim 16, wherein obtaining the labeled data set and pre-training the preset first reading comprehension model, question generation model, and second reading comprehension model according to the labeled data set comprises:
    obtaining a labeled data set, the labeled data set including a number of labeled data pairs, each labeled data pair including article information, question information, and corresponding answer information;
    pre-training the first reading comprehension model using the article information and question information in the labeled data set;
    pre-training the question generation model using the article information and answer information in the labeled data set; and
    pre-training the second reading comprehension model using the article information in the labeled data set.
  18. The non-volatile readable storage medium according to claim 16 or 17, wherein predicting on the unlabeled data set with the pre-trained second reading comprehension model to obtain the (article, answer) binary data pairs of the unlabeled data set comprises:
    inputting the unlabeled data set into the pre-trained second reading comprehension model, and taking the output of the pre-trained second reading comprehension model as a first predicted answer;
    performing named entity recognition on the unlabeled data set to obtain a second predicted answer;
    obtaining a third predicted answer from the unlabeled data set using a bidirectional long short-term memory network and conditional random field techniques; and
    merging the first predicted answer, the second predicted answer, and the third predicted answer to obtain the (article, answer) binary data pairs of the unlabeled data set.
  19. The non-volatile readable storage medium according to claim 16 or 17, wherein filtering the ternary data pairs with the pre-trained first reading comprehension model comprises:
    traversing the ternary data pairs, and predicting on the article information and question information in each ternary data pair with the pre-trained first reading comprehension model to obtain the predicted answer corresponding to the ternary data pair;
    comparing the predicted answer corresponding to the ternary data pair with the answer information in the ternary data pair;
    deleting the ternary data pair if the predicted answer corresponding to the ternary data pair differs from the answer information in the ternary data pair; and
    retaining the ternary data pair if the predicted answer corresponding to the ternary data pair is the same as the answer information in the ternary data pair.
  20. The non-volatile readable storage medium according to claim 16 or 17, wherein screening the filtered ternary data pairs according to the article topics in the labeled data set to generate the pseudo-labeled data set comprises:
    performing a similarity analysis, using a Dirichlet distribution topic model, between the article information of the filtered ternary data pairs and the article information of the labeled data set to obtain a topic similarity between each ternary data pair and the labeled data set;
    acquiring the ternary data pairs whose topic similarity is higher than a preset threshold to construct the pseudo-labeled data set.
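One plausible realization of claim 20's topic screen uses latent Dirichlet allocation. The sketch below assumes gensim as the LDA implementation; the topic count, the whitespace tokenization, the averaged reference vector, and the 0.8 threshold are illustrative assumptions rather than values fixed by the patent.

```python
# Sketch of claim 20's topic-similarity screen, assuming gensim's LDA.
# Threshold, topic count, and tokenization are illustrative assumptions.

from gensim import corpora, models
from gensim.matutils import cossim


def topic_filter(labeled_articles, candidate_triples,
                 threshold=0.8, num_topics=20):
    # Fit LDA on the labeled set's articles (whitespace tokens for brevity).
    texts = [a.lower().split() for a in labeled_articles]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)

    # Average topic distribution of the labeled set as a reference vector.
    reference = {}
    for bow in corpus:
        for topic_id, p in lda.get_document_topics(bow, minimum_probability=0.0):
            reference[topic_id] = reference.get(topic_id, 0.0) + p / len(corpus)
    ref_vec = sorted(reference.items())

    # Keep only triples whose article is topically close to the labeled set.
    pseudo_labeled = []
    for article, question, answer in candidate_triples:
        bow = dictionary.doc2bow(article.lower().split())
        doc_vec = lda.get_document_topics(bow, minimum_probability=0.0)
        if cossim(ref_vec, doc_vec) >= threshold:
            pseudo_labeled.append((article, question, answer))
    return pseudo_labeled
```

Cosine similarity between topic distributions is one assumed choice of similarity measure; the claim itself only requires a similarity analysis against a preset threshold.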
PCT/CN2020/123170 2020-02-21 2020-10-23 Reading model optimization method and apparatus based on big data, and device and medium WO2021164292A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010108092.0 2020-02-21
CN202010108092.0A CN111444677A (en) 2020-02-21 2020-02-21 Reading model optimization method, device, equipment and medium based on big data

Publications (1)

Publication Number Publication Date
WO2021164292A1 true WO2021164292A1 (en) 2021-08-26

Family

ID=71653936

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123170 WO2021164292A1 (en) 2020-02-21 2020-10-23 Reading model optimization method and apparatus based on big data, and device and medium

Country Status (2)

Country Link
CN (1) CN111444677A (en)
WO (1) WO2021164292A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117436551A (en) * 2023-12-18 2024-01-23 杭州宇谷科技股份有限公司 Training method and system for intelligent customer service model

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444677A (en) * 2020-02-21 2020-07-24 平安科技(深圳)有限公司 Reading model optimization method, device, equipment and medium based on big data
CN112711938B (en) * 2021-03-26 2021-07-06 北京沃丰时代数据科技有限公司 Reading understanding model construction method and device, electronic equipment and storage medium
CN114495130B (en) * 2021-12-27 2023-03-24 北京百度网讯科技有限公司 Cross-modal information-based document reading understanding model training method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059152A (en) * 2018-12-25 2019-07-26 阿里巴巴集团控股有限公司 A kind of training method, device and the equipment of text information prediction model
WO2019200748A1 (en) * 2018-04-17 2019-10-24 平安科技(深圳)有限公司 Transfer learning method, device, computer device, and storage medium
CN110457675A (en) * 2019-06-26 2019-11-15 平安科技(深圳)有限公司 Prediction model training method, device, storage medium and computer equipment
CN110633730A (en) * 2019-08-07 2019-12-31 中山大学 Deep learning machine reading understanding training method based on course learning
CN110717324A (en) * 2019-09-06 2020-01-21 暨南大学 Judgment document answer information extraction method, device, extractor, medium and equipment
CN111444677A (en) * 2020-02-21 2020-07-24 平安科技(深圳)有限公司 Reading model optimization method, device, equipment and medium based on big data


Also Published As

Publication number Publication date
CN111444677A (en) 2020-07-24

Similar Documents

Publication Publication Date Title
WO2021164292A1 (en) Reading model optimization method and apparatus based on big data, and device and medium
US11663409B2 (en) Systems and methods for training machine learning models using active learning
WO2021243828A1 (en) Text processing method and apparatus based on machine learning, and computer device and medium
TWI621077B (en) Character recognition method and server for claim documents
WO2021120543A1 (en) Natural language and knowledge graph-based method and device for representating learning
CN103207855B (en) For the fine granularity sentiment analysis system and method for product review information
US10878011B2 (en) Cognitive ranking of terms used during a conversation
US20190180196A1 (en) Systems and methods for generating and updating machine hybrid deep learning models
WO2021114810A1 (en) Graph structure-based official document recommendation method, apparatus, computer device, and medium
US20190179903A1 (en) Systems and methods for multi language automated action response
WO2019113122A1 (en) Systems and methods for improved machine learning for conversations
CN112287089B (en) Classification model training and automatic question-answering method and device for automatic question-answering system
WO2022227162A1 (en) Question and answer data processing method and apparatus, and computer device and storage medium
WO2020010834A1 (en) Faq question and answer library generalization method, apparatus, and device
US10885080B2 (en) Cognitive ranking of terms used during a conversation
WO2023050754A1 (en) Model training method and apparatus for private data set
CN111737432A (en) Automatic dialogue method and system based on joint training model
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
CN109614627A (en) A kind of text punctuate prediction technique, device, computer equipment and storage medium
CN114757178A (en) Core product word extraction method, device, equipment and medium
CN112445899B (en) Attribute matching method in knowledge base question and answer based on neural network
Cui et al. BiLSTM-Attention-CRF model for entity extraction in internet recruitment data
CN114969544A (en) Hot data-based recommended content generation method, device, equipment and medium
CN114722821A (en) Text matching method and device, storage medium and electronic equipment
CN114238798A (en) Search ranking method, system, device and storage medium based on neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20920623
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20920623
    Country of ref document: EP
    Kind code of ref document: A1