CN112836772A - Randomized controlled trial identification method integrating multiple BERT models based on LightGBM - Google Patents

Randomized controlled trial identification method integrating multiple BERT models based on LightGBM Download PDF

Info

Publication number
CN112836772A
CN112836772A (Application CN202110363597.6A)
Authority
CN
China
Prior art keywords
training
text
lightgbm
data
rct
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110363597.6A
Other languages
Chinese (zh)
Inventor
孙鑫
秦璇
李玲
刘佳利
王雨宁
刘艳梅
齐亚娜
邹康
邓可
马玉
刘梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West China Hospital of Sichuan University filed Critical West China Hospital of Sichuan University
Priority to CN202110363597.6A priority Critical patent/CN112836772A/en
Publication of CN112836772A publication Critical patent/CN112836772A/en
Priority to PCT/CN2021/116267 priority patent/WO2022205768A1/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The invention discloses a method for identifying randomized controlled trials (RCTs) by integrating multiple BERT models with LightGBM, comprising the following steps: step s1: segmenting pre-prepared initial RCT data into a training set, a development set and a test set, wherein the initial RCT data comprises texts and initial classification labels; step s2: converting the texts in the training set, the development set and the test set into position vectors, text vectors and word vectors; step s3: training the models; step s4: adjusting the hyper-parameters of the models; step s5: classifying the training-set and development-set texts with the trained models; step s6: training a LightGBM model; step s7: obtaining the final classification result. The invention integrates 4 different models with the ensemble learning algorithm LightGBM, trains on RCT data provided by Cochrane, and automatically screens titles and abstracts of the RCT class.

Description

Randomized controlled trial identification method integrating multiple BERT models based on LightGBM
Technical Field
The invention relates to the technical field of computer data processing, in particular to a randomized controlled trial identification method integrating a plurality of BERT models based on LightGBM.
Background
The randomized controlled trial (RCT) is generally considered the gold standard for evaluating the safety and efficacy of drugs. In recent years, how to evaluate drug effectiveness and safety using real-world evidence has become an increasingly prominent issue in drug development and regulatory decision-making at home and abroad.
For a single RCT, the sample size is limited. Meta-analysis is therefore often used to comprehensively collect the results of small-sample individual clinical RCTs on the various treatments of a given disease and to perform systematic review and statistical analysis on them, so as to provide society and clinicians with conclusions as close to the truth as possible in a timely manner, thereby promoting genuinely effective treatments and discarding ineffective or even harmful methods before they spread.
The literature, as an important channel for sharing scientific research, contains a wealth of research information. RCT-related publications are typically collected by researchers through literature searches.
However, when retrieving literature for a systematic review, the explosive yearly growth of publications and the limited specificity of search strategies mean that the number of retrieved citations is very large, so manually screening the retrieval results for RCT-related publications is time-consuming and labor-intensive.
Currently, some systematic-review software tools include an RCT classification function, including GAPScreener, Abstrackr and Rayyan, which are semi-automatic citation filtering and selection tools that classify documents using a support vector machine (SVM). The SVM was a successful machine learning model widely used in such text mining tools in the first decade of the 21st century. However, SVMs rely heavily on manually engineered sample features, which can be unstable and labor-intensive.
With the development of machine learning techniques and computer hardware, neural-network-based machine learning approaches have gained popularity due to their good performance on many problems, particularly in image recognition and natural language processing (NLP). Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained model proposed by Google that achieved the best results on 11 NLP tasks in October 2018. Thanks to its deep network and pre-training process, the BERT model performs well across different NLP tasks. During pre-training, the model learns background features of the language from a large pre-training corpus; with this broad foundation in place, learning on a specific downstream task is more effective. We therefore wish to use different medically relevant pre-trained BERT models as the base classifiers for the RCT classification task.
In the last two years, LightGBM has been widely used in machine learning tasks as an ensemble method for combining the outputs of different models. Besides saving training and prediction time, its performance is competitive with or superior to existing boosting algorithms.
Currently, the models that perform well in the field of text classification are supervised. Supervised text classification models require a training process, during which the model learns the relationship between citations and classification labels; the known screening labels are then used to predict labels for citations without known classifications. The accuracy of the screened citations therefore directly affects the classification performance of the model. Cochrane is a well-recognized project in the field of systematic review, with health science researchers from 158 countries participating in the classification of texts. Panelists trained in the study methodology worked in pairs and screened titles/abstracts independently; reviewers resolved disagreements by discussion or, if necessary, by consulting a third reviewer.
Disclosure of Invention
The invention aims to provide a randomized controlled trial identification method integrating a plurality of BERT models based on LightGBM, for automatically screening titles and abstracts of the RCT class.
To achieve this purpose, the invention adopts the following technical scheme:
a random control test identification method integrating a plurality of BERT models based on LightGBM comprises the following steps:
step s 1: segmenting initial RCT data prepared in advance into a training set, a development set and a test set, wherein the initial RCT data comprises texts and initial classification labels;
step s 2: respectively converting texts in the training set, the development set and the test set into a position vector, a text vector and a word vector;
step s 3: respectively training 4 BERT models by using the position vector, the text vector, the word vector and the initial classification label after the text conversion in the training set;
step s 4: adjusting hyper-parameters of the 4 BERT models using the converted text position vectors, text vectors, word vectors and initial classification tags in the development set;
step s 5: classifying the texts of the training set and the development set into RCT classes and non-RCT classes by using the trained 4 BERT models;
step s 6: training a LightGBM model;
step s 7: and classifying the data of the test set by using 4 BERT models to obtain a classification result, and synthesizing the classification results of the 4 BERT models by using the LightGBM model to obtain a final classification result of the test set.
Preferably, each text comprises a title and an abstract, and the initial classification labels comprise an RCT class and a non-RCT class.
Preferably, in step s1, the segmentation comprises the following steps:
step s101: dividing the initial RCT data into 5 disjoint subsets;
step s102: selecting each of the 5 subsets from s101 in turn as the test set, with the remaining 4 subsets as training data, thereby obtaining 5 groups of data, each comprising 1 part of training data and 1 test set, the sample-count ratio of test set to training data being 1:4;
step s103: for each of the 5 groups, randomly dividing the training data into a training set and a development set at a ratio of 3:1, so that each group consists of a training set, a development set and a test set containing samples at a ratio of 3:1:1.
Preferably, the 4 BERT models are BIO-BBUPC, BIO-BBUP, SCI-BBU and BBU respectively, and the 4 BERT models serve as base classifiers.
Preferably, in step s5, classifying each text in the training set and the development set with one BERT model yields a 2-dimensional vector as its classification result, so classifying each text with the 4 BERT models yields an 8-dimensional vector.
Further, in step s6, the LightGBM model is trained using the 8-dimensional vectors obtained from the training-set and development-set texts together with the corresponding initial classification labels, and the hyper-parameters of the LightGBM model are tuned step by step using five-fold cross-validation.
The invention has the following beneficial effects:
according to the method, the lightGBM models of 4 different BERT models are integrated, the questions and the abstracts of the RCT are automatically screened, the accuracy, the sensitivity and the specificity of the screening result are higher, the method is faster and more accurate, and the manual workload is reduced.
Drawings
Fig. 1 is a workflow diagram of the overall framework of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
A randomized controlled trial identification method integrating a plurality of BERT models based on LightGBM comprises the following steps:
step s 1: segmenting pre-prepared initial RCT data into a training set, a development set and a test set, wherein the initial RCT data comprises text and initial classification labels.
The initial RCT data is derived from Cochrane. Cochrane is a well-recognized project in the field of systematic review, with health science researchers from 158 countries participating in the classification of texts. Panelists trained in the study methodology worked in pairs and screened titles/abstracts independently, and reviewers resolved disagreements by discussion or, if necessary, by consulting a third reviewer.
Each text comprises a title and an abstract, and the initial classification labels comprise an RCT class and a non-RCT class.
In step s1, the segmentation comprises the following steps:
step s101: dividing the initial RCT data into 5 disjoint subsets;
step s102: selecting each of the 5 subsets from s101 in turn as the test set, with the remaining 4 subsets as training data, thereby obtaining 5 groups of data, each comprising 1 part of training data and 1 test set, the ratio of training data to test set being 4:1;
step s103: randomly dividing the training data of each group into a training set and a development set at a ratio of 3:1, thereby obtaining 5 new groups of data, each comprising a training set, a development set and a test set at a sample ratio of 3:1:1.
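The five-fold segmentation of steps s101 to s103 can be sketched in Python as follows. This is a minimal sketch, assuming the records fit in memory; the function name and the fixed seed are illustrative and not from the patent.

```python
import random

def five_fold_split(records, seed=42):
    """Split records into 5 groups of (train, dev, test):
    each of 5 disjoint folds serves once as the test set (s101-s102);
    the remaining 4 folds are pooled and split 3:1 into a training
    set and a development set (s103), giving a 3:1:1 sample ratio."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    # s101: 5 disjoint subsets
    folds = [shuffled[i::5] for i in range(5)]
    groups = []
    for i in range(5):
        # s102: fold i is the test set, the other 4 folds are training data
        test = folds[i]
        rest = [r for j, fold in enumerate(folds) if j != i for r in fold]
        rng.shuffle(rest)
        # s103: split the pooled training data 3:1 into train and dev
        cut = len(rest) * 3 // 4
        groups.append((rest[:cut], rest[cut:], test))
    return groups
```

Each record appears in exactly one test set across the 5 groups, which is what makes the later five-fold cross-validation possible.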
Step s 2: and respectively converting the texts in the training set, the development set and the test set into a position vector, a text vector and a word vector.
Step s 3: the 4 BERT models are trained using the text-converted position vectors, text vectors, word vectors, and initial classification labels in the training set, respectively.
The 4 BERT models are SCI-BBU, BIO-BBUP, BBU and BIO-BBUPC respectively, and the 4 BERT models serve as base classifiers.
The 4 BERT models — BIO-BBUPC, BIO-BBUP, SCI-BBU and BBU — share the same BERT-Base model architecture but have different initial parameters. BIO-BBUPC was pre-trained in 2018 on abstracts from the PubMed database together with clinical notes; BIO-BBUP was pre-trained in 2018 on abstracts from the PubMed database; SCI-BBU was pre-trained on the Semantic Scholar corpus of 1.14 million papers and 3.1 billion tokens, using the full text of the papers rather than only the abstracts; BBU was pre-trained on Wikipedia data in 2018. Different pre-training corpora imply different initial model parameters.
Step s 4: the hyper-parameters of the 4 BERT models are adjusted using the text-converted position vectors, text vectors, word vectors, and the initial classification tags in the development set. The adjustment of the hyper-parameters mainly adjusts the maximum length and the learning rate of the input text.
Step s 5: and classifying the texts of the training set and the development set into RCT classes and non-RCT classes by using the trained 4 BERT models.
In step s5, classifying each text in the training set and the development set with one BERT model yields a 2-dimensional vector as its classification result, so classifying each text with the 4 BERT models yields an 8-dimensional vector.
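The 2-dimensional per-model results and their concatenation into the 8-dimensional vector can be expressed as follows; the helper names are illustrative, not from the patent.

```python
def encode_label(label):
    """One-hot encoding used throughout: [1, 0] = RCT, [0, 1] = non-RCT."""
    return [1, 0] if label == "RCT" else [0, 1]

def stack_base_outputs(per_model_results):
    """Concatenate the 2-dimensional result of each of the 4 BERT base
    classifiers into the 8-dimensional feature vector that will be fed
    to the LightGBM meta-model in step s6."""
    assert len(per_model_results) == 4, "expects one result per base classifier"
    return [v for result in per_model_results for v in result]
```

Applied to every text in the training and development sets, this produces the 8-dimensional training data for the LightGBM model.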
Step s 6: the LightGBM model is trained.
In step s6, the LightGBM model is trained using the 8-dimensional vectors obtained from the training-set and development-set texts together with the corresponding initial classification labels, and the hyper-parameters of the LightGBM model are tuned step by step using five-fold cross-validation.
As shown in fig. 1, the working process by which the trained model identifies whether a text belongs to the RCT class is as follows: a text is passed through the 4 base classifiers (BIO-BBUP, BIO-BBUPC, SCI-BBU and BBU) to obtain 4 classification results, which are spliced by the Concat layer shown in fig. 1; the merged results are used as the input of the LightGBM model, which yields the final classification result, i.e., the RCT class or the non-RCT class. The classification result produced by each base classifier or by the LightGBM model for a text is a 2-dimensional vector ([0,1] or [1,0]), where [0,1] denotes the non-RCT class and [1,0] denotes the RCT class.
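The fig. 1 pipeline can be sketched end to end. Note the meta-model below is a majority-vote stand-in purely for illustration; in the patent the meta-model is the trained LightGBM, so `majority_vote` is an assumption made to keep the sketch self-contained.

```python
def classify_citation(text, base_classifiers, meta_model):
    """Inference sketch matching fig. 1: each base classifier maps the
    text to a 2-dim vector ([1,0] = RCT, [0,1] = non-RCT); the Concat
    layer flattens the four vectors into 8 dimensions, and the
    meta-model returns the final 2-dim vector."""
    outputs = [clf(text) for clf in base_classifiers]   # 4 results
    features = [v for out in outputs for v in out]      # Concat -> 8 dims
    final = meta_model(features)
    return "RCT" if final == [1, 0] else "non-RCT"

def majority_vote(features):
    """Stand-in meta-model: counts [1,0] pairs among the four 2-dim
    slices; a real system would call the trained LightGBM's predict
    here. Ties go to RCT, favoring sensitivity."""
    rct_votes = sum(1 for i in range(0, 8, 2) if features[i] == 1)
    return [1, 0] if rct_votes >= 2 else [0, 1]
```

The learned LightGBM meta-model differs from a plain vote in that it can weight the four base classifiers unevenly, which is precisely what the ensemble step is for.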
Step s 7: and classifying the data of the test set by using 4 BERT models to obtain a classification result, and synthesizing the classification results of the 4 BERT models by using the lightGBM model to obtain a final classification result of the test set, namely a screening result.
The technical effect of the invention is illustrated below using five-fold cross-validation:
indicators of evaluation method performance have accuracy, sensitivity, specificity, missed studies, and workload savings. The citation of the RCT class is a qualified citation, and the citation of the non-RCT class is a disqualified citation. Accuracy is the ratio of the number of correctly predicted quotations to the total number of quotations. Sensitivity is the ratio of the number of qualified citations correctly predicted as qualified citations to the total number of qualified citations. Specificity is the ratio of the number of quotations correctly predicted as ineligible to the total number of ineligible quotations.
The five-fold cross-validation mainly serves to demonstrate the robustness and stability of the model; the invention shows consistently high sensitivity and specificity on each test set. The test set contained 1,472 citations of the RCT class and 15,323 citations of the non-RCT class, totaling 16,794 documents.
In the case-study evaluation set, accuracy was 95%, sensitivity was 93% and specificity was 95%. The 93% sensitivity in the case study was better than that of each individual BERT model. Without further measures, and fully accepting the invention's predictions, the invention would avoid manual screening of 14,650 of the 16,794 citations, an 87% reduction in workload. The final model parameters are obtained by training on all the data, and the model's evaluation metrics take the averages over the five-fold cross-validation as the final evaluation metrics.
The mean values of the five-fold cross-validation results for identifying RCT classes with different NLP methods are shown in Table 1:
Table 1: Mean five-fold cross-validation results for identifying RCT classes with different NLP methods
[Table 1 is provided as an image in the original publication and is not reproduced here.]
The five-fold cross-validation results for identifying RCT classes are shown in Table 2:
Table 2: Five-fold cross-validation results for identifying RCT classes
[Table 2 is provided as an image in the original publication and is not reproduced here.]
The present invention is capable of other embodiments, and various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims (6)

1. A randomized controlled trial identification method integrating a plurality of BERT models based on LightGBM, characterized by comprising the following steps:
step s1: segmenting pre-prepared initial RCT data into a training set, a development set and a test set, wherein the initial RCT data comprises texts and initial classification labels;
step s2: converting the texts in the training set, the development set and the test set into position vectors, text vectors and word vectors;
step s3: training 4 BERT models using the position vectors, text vectors, word vectors and initial classification labels converted from the training-set texts;
step s4: adjusting the hyper-parameters of the 4 BERT models using the position vectors, text vectors, word vectors and initial classification labels converted from the development-set texts;
step s5: classifying the training-set texts and the development-set texts into an RCT class and a non-RCT class using the trained 4 BERT models;
step s6: training a LightGBM model;
step s7: classifying the test-set data with the 4 BERT models to obtain classification results, and synthesizing the classification results of the 4 BERT models with the LightGBM model to obtain the final classification result for the test set.
2. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 1, wherein: each text comprises a title and an abstract, and the initial classification labels comprise an RCT class and a non-RCT class.
3. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 1, wherein:
in step s1, the segmentation comprises the following steps:
step s101: dividing the initial RCT data into 5 disjoint subsets;
step s102: selecting each of the 5 subsets from s101 in turn as the test set, with the remaining 4 subsets as training data, thereby obtaining 5 groups of data, each comprising 1 part of training data and 1 test set, the sample-count ratio of test set to training data being 1:4;
step s103: for each of the 5 groups, randomly dividing the training data into a training set and a development set at a ratio of 3:1, so that each group consists of a training set, a development set and a test set containing samples at a ratio of 3:1:1.
4. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 1, wherein: the 4 BERT models are BIO-BBUPC, BIO-BBUP, SCI-BBU and BBU respectively, and the 4 BERT models serve as base classifiers.
5. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 1, wherein: in step s5, classifying each text in the training set and the development set with one BERT model yields a 2-dimensional vector as its classification result, and classifying each text with the 4 BERT models yields an 8-dimensional vector.
6. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 5, wherein: in step s6, the LightGBM model is trained using the 8-dimensional vectors converted from the training-set and development-set texts together with the corresponding initial classification labels, and the hyper-parameters of the LightGBM model are tuned step by step using five-fold cross-validation.
CN202110363597.6A 2021-04-02 2021-04-02 Random contrast test identification method integrating multiple BERT models based on LightGBM Pending CN112836772A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110363597.6A CN112836772A (en) 2021-04-02 2021-04-02 Random contrast test identification method integrating multiple BERT models based on LightGBM
PCT/CN2021/116267 WO2022205768A1 (en) 2021-04-02 2021-09-02 Random contrast test identification method for integrating multiple bert models on the basis of lightgbm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110363597.6A CN112836772A (en) 2021-04-02 2021-04-02 Random contrast test identification method integrating multiple BERT models based on LightGBM

Publications (1)

Publication Number Publication Date
CN112836772A true CN112836772A (en) 2021-05-25

Family

ID=75930701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110363597.6A Pending CN112836772A (en) 2021-04-02 2021-04-02 Random contrast test identification method integrating multiple BERT models based on LightGBM

Country Status (2)

Country Link
CN (1) CN112836772A (en)
WO (1) WO2022205768A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022205768A1 (en) * 2021-04-02 2022-10-06 四川大学华西医院 Random contrast test identification method for integrating multiple bert models on the basis of lightgbm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829810A (en) * 2018-06-08 2018-11-16 东莞迪赛软件技术有限公司 File classification method towards healthy public sentiment
CN109753564A (en) * 2018-12-13 2019-05-14 四川大学 The construction method of Chinese RCT Intelligence Classifier based on machine learning
CN110210037A (en) * 2019-06-12 2019-09-06 四川大学 Category detection method towards evidence-based medicine EBM field
CN110347825A (en) * 2019-06-14 2019-10-18 北京物资学院 The short English film review classification method of one kind and device
CN112131389A (en) * 2020-10-26 2020-12-25 四川大学华西医院 Method for integrating multiple BERT models by LightGBM to accelerate system evaluation updating

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9087303B2 (en) * 2012-02-19 2015-07-21 International Business Machines Corporation Classification reliability prediction
CN112836772A (en) * 2021-04-02 2021-05-25 四川大学华西医院 Random contrast test identification method integrating multiple BERT models based on LightGBM


Also Published As

Publication number Publication date
WO2022205768A1 (en) 2022-10-06

Similar Documents

Publication Publication Date Title
WO2021047186A1 (en) Method, apparatus, device, and storage medium for processing consultation dialogue
WO2018218708A1 (en) Deep-learning-based public opinion hotspot category classification method
WO2018000269A1 (en) Data annotation method and system based on data mining and crowdsourcing
CN106528528A (en) A text emotion analysis method and device
CN109933664A (en) A kind of fine granularity mood analysis improved method based on emotion word insertion
Harfoushi et al. Sentiment analysis algorithms through azure machine learning: Analysis and comparison
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
Sathiyanarayanan et al. Identification of breast cancer using the decision tree algorithm
CN109036577A (en) Diabetic complication analysis method and device
CN102156885A (en) Image classification method based on cascaded codebook generation
Jatav An algorithm for predictive data mining approach in medical diagnosis
Borovsky et al. Moving towards accurate and early prediction of language delay with network science and machine learning approaches
CN107194617A (en) A kind of app software engineers soft skill categorizing system and method
CN109492105A (en) A kind of text sentiment classification method based on multiple features integrated study
Liu et al. Patent analysis and classification prediction of biomedicine industry: SOM-KPCA-SVM model
Livieris et al. Identification of blood cell subtypes from images using an improved SSL algorithm
Tran et al. Automated curation of CNMF-E-extracted ROI spatial footprints and calcium traces using open-source AutoML tools
Orosoo et al. Performance analysis of a novel hybrid deep learning approach in classification of quality-related English text
CN112836772A (en) Random contrast test identification method integrating multiple BERT models based on LightGBM
CN112131389B (en) Method for integrating multiple BERT models through LightGBM to accelerate system evaluation updating
CN113886562A (en) AI resume screening method, system, equipment and storage medium
Hardaya et al. Application of text mining for classification of community complaints and proposals
CN104331507B (en) Machine data classification is found automatically and the method and device of classification
CN116451114A (en) Internet of things enterprise classification system and method based on enterprise multisource entity characteristic information
CN116775897A (en) Knowledge graph construction and query method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210525
