CN112836772A - Randomized controlled trial identification method integrating multiple BERT models based on LightGBM - Google Patents

Randomized controlled trial identification method integrating multiple BERT models based on LightGBM Download PDF

Info

Publication number
CN112836772A
CN112836772A (Application CN202110363597.6A)
Authority
CN
China
Prior art keywords
training
text
lightgbm
data
rct
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110363597.6A
Other languages
Chinese (zh)
Inventor
孙鑫
秦璇
李玲
刘佳利
王雨宁
刘艳梅
齐亚娜
邹康
邓可
马玉
刘梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West China Hospital of Sichuan University filed Critical West China Hospital of Sichuan University
Priority to CN202110363597.6A priority Critical patent/CN112836772A/en
Publication of CN112836772A publication Critical patent/CN112836772A/en
Priority to PCT/CN2021/116267 priority patent/WO2022205768A1/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The invention discloses a method for identifying randomized controlled trials (RCTs) by integrating multiple BERT models with LightGBM, comprising the following steps: step s1: segmenting pre-prepared initial RCT data into a training set, a development set and a test set, wherein the initial RCT data comprises texts and initial classification labels; step s2: converting the texts in the training set, the development set and the test set into position vectors, text vectors and word vectors; step s3: training the models; step s4: adjusting the hyper-parameters of the models; step s5: classifying the training-set and development-set texts with the trained models; step s6: training a LightGBM model; step s7: obtaining the final classification result. The invention integrates 4 different models with the ensemble learning algorithm LightGBM, trains on RCT data provided by Cochrane, and automatically screens titles and abstracts of the RCT class.

Description

Randomized controlled trial identification method integrating multiple BERT models based on LightGBM
Technical Field
The invention relates to the technical field of computer data processing, in particular to a randomized controlled trial identification method integrating a plurality of BERT models based on LightGBM.
Background
The randomized controlled trial (RCT) is generally considered the gold standard for evaluating the safety and efficacy of drugs. In recent years, how to evaluate drug effectiveness and safety using real-world evidence has become an increasingly prominent issue in drug development and regulatory decision-making at home and abroad.
For a single RCT, the sample size is limited. Meta-analysis is therefore often used to comprehensively collect the results of small-sample individual clinical RCTs on the various treatments of a given disease and to perform systematic review and statistical analysis on them, so as to provide society and clinicians with conclusions as close to the truth as possible in a timely manner, thereby promoting genuinely effective treatments and discarding ineffective or even harmful methods before they spread.
The literature, as an important channel for sharing scientific research, contains a wealth of research information. RCT-related publications are typically collected by researchers through literature searches.
However, when retrieving literature for a systematic review, the explosive yearly growth of publications and the limited specificity of search strategies mean that the number of retrieved citations is very large, so manually screening the retrieval results for RCT-related publications is time-consuming and labor-intensive.
Currently, some systematic-review software tools include an RCT classification function, including GAPScreener, Abstrackr and Rayyan, which are semi-automatic citation filtering and selection tools that classify documents using a support vector machine (SVM). The SVM was a successful machine learning model widely used in such text mining tools in the first decade of the 21st century. However, SVMs rely heavily on manually engineered sample features, which can be unstable and labor-intensive.
With the development of machine learning techniques and computer hardware, neural-network-based machine learning approaches have gained popularity due to their good performance on many problems, particularly in image recognition and natural language processing (NLP). Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained model proposed by Google that achieved the best results on 11 NLP tasks in October 2018. Thanks to its deep network and pre-training process, the BERT model performs well across different NLP tasks. During pre-training, the model learns background features of the language from a large pre-training corpus; with this broad foundation in place, learning on a specific downstream task is more effective. We therefore wish to use different medically relevant pre-trained BERT models as the base classifiers for the RCT classification task.
In the last two years, LightGBM has been widely used in machine learning tasks as an ensemble method for combining the outputs of different models. Besides saving training and prediction time, its performance is competitive with or superior to existing boosting algorithms.
Currently, the models that perform well in the field of text classification are supervised. Supervised text classification models require a training process, during which the model learns the relationship between citations and classification labels; the known screening labels are then used to predict labels for citations without known classifications. The accuracy of the screened citations therefore directly affects the classification performance of the model. Cochrane is a well-recognized project in the field of systematic review, with health science researchers from 158 countries participating in the classification of texts. Panelists trained in the study methodology worked in pairs and screened titles/abstracts independently; reviewers resolved disagreements by discussion or, if necessary, by consulting a third reviewer.
Disclosure of Invention
The invention aims to provide a randomized controlled trial identification method integrating a plurality of BERT models based on LightGBM, for automatically screening titles and abstracts of the RCT class.
To achieve this purpose, the invention adopts the following technical scheme:
a random control test identification method integrating a plurality of BERT models based on LightGBM comprises the following steps:
step s 1: segmenting initial RCT data prepared in advance into a training set, a development set and a test set, wherein the initial RCT data comprises texts and initial classification labels;
step s 2: respectively converting texts in the training set, the development set and the test set into a position vector, a text vector and a word vector;
step s 3: respectively training 4 BERT models by using the position vector, the text vector, the word vector and the initial classification label after the text conversion in the training set;
step s 4: adjusting hyper-parameters of the 4 BERT models using the converted text position vectors, text vectors, word vectors and initial classification tags in the development set;
step s 5: classifying the texts of the training set and the development set into RCT classes and non-RCT classes by using the trained 4 BERT models;
step s 6: training a LightGBM model;
step s 7: and classifying the data of the test set by using 4 BERT models to obtain a classification result, and synthesizing the classification results of the 4 BERT models by using the LightGBM model to obtain a final classification result of the test set.
Preferably, each text comprises a title and an abstract, and the initial classification labels comprise an RCT class and a non-RCT class.
Preferably, in step s1, the segmentation comprises the following steps:
step s101: dividing the initial RCT data into 5 disjoint subsets;
step s102: selecting each of the 5 subsets from s101 in turn as the test set, with the remaining 4 subsets as training data, thereby obtaining 5 groups of data, each comprising 1 part of training data and 1 test set, the sample-count ratio of test set to training data being 1:4;
step s103: for each of the 5 groups, randomly dividing the training data into a training set and a development set at a ratio of 3:1, so that each group consists of a training set, a development set and a test set containing samples at a ratio of 3:1:1.
Preferably, the 4 BERT models are BIO-BBUPC, BIO-BBUP, SCI-BBU and BBU respectively, and the 4 BERT models serve as base classifiers.
Preferably, in step s5, classifying each text in the training set and the development set with one BERT model yields a 2-dimensional vector as its classification result, so classifying each text with the 4 BERT models yields an 8-dimensional vector.
Further, in step s6, the LightGBM model is trained using the 8-dimensional vectors obtained from the training-set and development-set texts together with the corresponding initial classification labels, and the hyper-parameters of the LightGBM model are tuned step by step using five-fold cross-validation.
The invention has the following beneficial effects:
according to the method, the lightGBM models of 4 different BERT models are integrated, the questions and the abstracts of the RCT are automatically screened, the accuracy, the sensitivity and the specificity of the screening result are higher, the method is faster and more accurate, and the manual workload is reduced.
Drawings
Fig. 1 is a workflow diagram of the overall framework of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
A randomized controlled trial identification method integrating a plurality of BERT models based on LightGBM comprises the following steps:
step s 1: segmenting pre-prepared initial RCT data into a training set, a development set and a test set, wherein the initial RCT data comprises text and initial classification labels.
The initial RCT data is derived from Cochrane. Cochrane is a well-recognized project in the field of systematic review, with health science researchers from 158 countries participating in the classification of texts. Panelists trained in the study methodology worked in pairs and screened titles/abstracts independently, and reviewers resolved disagreements by discussion or, if necessary, by consulting a third reviewer.
Each text comprises a title and an abstract, and the initial classification labels comprise an RCT class and a non-RCT class.
In step s1, the segmentation comprises the following steps:
step s101: dividing the initial RCT data into 5 disjoint subsets;
step s102: selecting each of the 5 subsets from s101 in turn as the test set, with the remaining 4 subsets as training data, thereby obtaining 5 groups of data, each comprising 1 part of training data and 1 test set, the ratio of training data to test set being 4:1;
step s103: randomly dividing the training data of each group into a training set and a development set at a ratio of 3:1, thereby obtaining 5 new groups of data, each comprising a training set, a development set and a test set at a sample ratio of 3:1:1.
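The five-fold segmentation of steps s101 to s103 can be sketched in Python as follows. This is a minimal sketch, assuming the records fit in memory; the function name and the fixed seed are illustrative and not from the patent.

```python
import random

def five_fold_split(records, seed=42):
    """Split records into 5 groups of (train, dev, test):
    each of 5 disjoint folds serves once as the test set (s101-s102);
    the remaining 4 folds are pooled and split 3:1 into a training
    set and a development set (s103), giving a 3:1:1 sample ratio."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    # s101: 5 disjoint subsets
    folds = [shuffled[i::5] for i in range(5)]
    groups = []
    for i in range(5):
        # s102: fold i is the test set, the other 4 folds are training data
        test = folds[i]
        rest = [r for j, fold in enumerate(folds) if j != i for r in fold]
        rng.shuffle(rest)
        # s103: split the pooled training data 3:1 into train and dev
        cut = len(rest) * 3 // 4
        groups.append((rest[:cut], rest[cut:], test))
    return groups
```

Each record appears in exactly one test set across the 5 groups, which is what makes the later five-fold cross-validation possible.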
Step s 2: and respectively converting the texts in the training set, the development set and the test set into a position vector, a text vector and a word vector.
Step s 3: the 4 BERT models are trained using the text-converted position vectors, text vectors, word vectors, and initial classification labels in the training set, respectively.
The 4 BERT models are SCI-BBU, BIO-BBUP, BBU and BIO-BBUPC respectively, and the 4 BERT models serve as base classifiers.
The 4 BERT models — BIO-BBUPC, BIO-BBUP, SCI-BBU and BBU — share the same BERT-Base model architecture but have different initial parameters. BIO-BBUPC was pre-trained in 2018 on abstracts from the PubMed database together with clinical notes; BIO-BBUP was pre-trained in 2018 on abstracts from the PubMed database; SCI-BBU was pre-trained on the Semantic Scholar corpus of 1.14 million papers and 3.1 billion tokens, using the full text of the papers rather than only the abstracts; BBU was pre-trained on Wikipedia data in 2018. Different pre-training corpora imply different initial model parameters.
Step s 4: the hyper-parameters of the 4 BERT models are adjusted using the text-converted position vectors, text vectors, word vectors, and the initial classification tags in the development set. The adjustment of the hyper-parameters mainly adjusts the maximum length and the learning rate of the input text.
Step s 5: and classifying the texts of the training set and the development set into RCT classes and non-RCT classes by using the trained 4 BERT models.
In step s5, classifying each text in the training set and the development set with one BERT model yields a 2-dimensional vector as its classification result, so classifying each text with the 4 BERT models yields an 8-dimensional vector.
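The 2-dimensional per-model results and their concatenation into the 8-dimensional vector can be expressed as follows; the helper names are illustrative, not from the patent.

```python
def encode_label(label):
    """One-hot encoding used throughout: [1, 0] = RCT, [0, 1] = non-RCT."""
    return [1, 0] if label == "RCT" else [0, 1]

def stack_base_outputs(per_model_results):
    """Concatenate the 2-dimensional result of each of the 4 BERT base
    classifiers into the 8-dimensional feature vector that will be fed
    to the LightGBM meta-model in step s6."""
    assert len(per_model_results) == 4, "expects one result per base classifier"
    return [v for result in per_model_results for v in result]
```

Applied to every text in the training and development sets, this produces the 8-dimensional training data for the LightGBM model.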
Step s 6: the LightGBM model is trained.
In step s6, the LightGBM model is trained using the 8-dimensional vectors obtained from the training-set and development-set texts together with the corresponding initial classification labels, and the hyper-parameters of the LightGBM model are tuned step by step using five-fold cross-validation.
As shown in fig. 1, the working process by which the trained model identifies whether a text belongs to the RCT class is as follows: a text is passed through the 4 base classifiers (BIO-BBUP, BIO-BBUPC, SCI-BBU and BBU) to obtain 4 classification results, which are spliced by the Concat layer shown in fig. 1; the merged results are used as the input of the LightGBM model, which yields the final classification result, i.e., the RCT class or the non-RCT class. The classification result produced by each base classifier or by the LightGBM model for a text is a 2-dimensional vector ([0,1] or [1,0]), where [0,1] denotes the non-RCT class and [1,0] denotes the RCT class.
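The fig. 1 pipeline can be sketched end to end. Note the meta-model below is a majority-vote stand-in purely for illustration; in the patent the meta-model is the trained LightGBM, so `majority_vote` is an assumption made to keep the sketch self-contained.

```python
def classify_citation(text, base_classifiers, meta_model):
    """Inference sketch matching fig. 1: each base classifier maps the
    text to a 2-dim vector ([1,0] = RCT, [0,1] = non-RCT); the Concat
    layer flattens the four vectors into 8 dimensions, and the
    meta-model returns the final 2-dim vector."""
    outputs = [clf(text) for clf in base_classifiers]   # 4 results
    features = [v for out in outputs for v in out]      # Concat -> 8 dims
    final = meta_model(features)
    return "RCT" if final == [1, 0] else "non-RCT"

def majority_vote(features):
    """Stand-in meta-model: counts [1,0] pairs among the four 2-dim
    slices; a real system would call the trained LightGBM's predict
    here. Ties go to RCT, favoring sensitivity."""
    rct_votes = sum(1 for i in range(0, 8, 2) if features[i] == 1)
    return [1, 0] if rct_votes >= 2 else [0, 1]
```

The learned LightGBM meta-model differs from a plain vote in that it can weight the four base classifiers unevenly, which is precisely what the ensemble step is for.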
Step s 7: and classifying the data of the test set by using 4 BERT models to obtain a classification result, and synthesizing the classification results of the 4 BERT models by using the lightGBM model to obtain a final classification result of the test set, namely a screening result.
The technical effect of the invention is illustrated below using five-fold cross-validation:
indicators of evaluation method performance have accuracy, sensitivity, specificity, missed studies, and workload savings. The citation of the RCT class is a qualified citation, and the citation of the non-RCT class is a disqualified citation. Accuracy is the ratio of the number of correctly predicted quotations to the total number of quotations. Sensitivity is the ratio of the number of qualified citations correctly predicted as qualified citations to the total number of qualified citations. Specificity is the ratio of the number of quotations correctly predicted as ineligible to the total number of ineligible quotations.
The five-fold cross-validation mainly serves to demonstrate the robustness and stability of the model; the invention shows consistently high sensitivity and specificity on each test set. The test set contained 1,472 citations of the RCT class and 15,323 citations of the non-RCT class, totaling 16,794 documents.
In the case-study evaluation set, accuracy was 95%, sensitivity was 93% and specificity was 95%. The 93% sensitivity in the case study was better than that of each individual BERT model. Without further measures, and fully accepting the invention's predictions, the invention would avoid manual screening of 14,650 of the 16,794 citations, an 87% reduction in workload. The final model parameters are obtained by training on all the data, and the model's evaluation metrics take the averages over the five-fold cross-validation as the final evaluation metrics.
The mean values of the five-fold cross-validation results for identifying RCT classes with different NLP methods are shown in Table 1:
Table 1: Mean five-fold cross-validation results for identifying RCT classes with different NLP methods
[Table 1 is provided as an image in the original publication and is not reproduced here.]
The five-fold cross-validation results for identifying RCT classes are shown in Table 2:
Table 2: Five-fold cross-validation results for identifying RCT classes
[Table 2 is provided as an image in the original publication and is not reproduced here.]
The present invention is capable of other embodiments, and various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims (6)

1. A randomized controlled trial identification method integrating a plurality of BERT models based on LightGBM, characterized by comprising the following steps:
step s1: segmenting pre-prepared initial RCT data into a training set, a development set and a test set, wherein the initial RCT data comprises texts and initial classification labels;
step s2: converting the texts in the training set, the development set and the test set into position vectors, text vectors and word vectors;
step s3: training 4 BERT models using the position vectors, text vectors, word vectors and initial classification labels converted from the training-set texts;
step s4: adjusting the hyper-parameters of the 4 BERT models using the position vectors, text vectors, word vectors and initial classification labels converted from the development-set texts;
step s5: classifying the training-set texts and the development-set texts into an RCT class and a non-RCT class using the trained 4 BERT models;
step s6: training a LightGBM model;
step s7: classifying the test-set data with the 4 BERT models to obtain classification results, and synthesizing the classification results of the 4 BERT models with the LightGBM model to obtain the final classification result for the test set.
2. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 1, wherein: each text comprises a title and an abstract, and the initial classification labels comprise an RCT class and a non-RCT class.
3. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 1, wherein:
in step s1, the segmentation comprises the following steps:
step s101: dividing the initial RCT data into 5 disjoint subsets;
step s102: selecting each of the 5 subsets from s101 in turn as the test set, with the remaining 4 subsets as training data, thereby obtaining 5 groups of data, each comprising 1 part of training data and 1 test set, the sample-count ratio of test set to training data being 1:4;
step s103: for each of the 5 groups, randomly dividing the training data into a training set and a development set at a ratio of 3:1, so that each group consists of a training set, a development set and a test set containing samples at a ratio of 3:1:1.
4. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 1, wherein: the 4 BERT models are BIO-BBUPC, BIO-BBUP, SCI-BBU and BBU respectively, and the 4 BERT models serve as base classifiers.
5. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 1, wherein: in step s5, classifying each text in the training set and the development set with one BERT model yields a 2-dimensional vector as its classification result, and classifying each text with the 4 BERT models yields an 8-dimensional vector.
6. The LightGBM-based randomized controlled trial identification method integrating a plurality of BERT models according to claim 5, wherein: in step s6, the LightGBM model is trained using the 8-dimensional vectors converted from the training-set and development-set texts together with the corresponding initial classification labels, and the hyper-parameters of the LightGBM model are tuned step by step using five-fold cross-validation.
CN202110363597.6A 2021-04-02 2021-04-02 Random contrast test identification method integrating multiple BERT models based on LightGBM Pending CN112836772A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110363597.6A CN112836772A (en) 2021-04-02 2021-04-02 Random contrast test identification method integrating multiple BERT models based on LightGBM
PCT/CN2021/116267 WO2022205768A1 (en) 2021-04-02 2021-09-02 Random contrast test identification method for integrating multiple bert models on the basis of lightgbm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110363597.6A CN112836772A (en) 2021-04-02 2021-04-02 Random contrast test identification method integrating multiple BERT models based on LightGBM

Publications (1)

Publication Number Publication Date
CN112836772A true CN112836772A (en) 2021-05-25

Family

ID=75930701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110363597.6A Pending CN112836772A (en) 2021-04-02 2021-04-02 Random contrast test identification method integrating multiple BERT models based on LightGBM

Country Status (2)

Country Link
CN (1) CN112836772A (en)
WO (1) WO2022205768A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022205768A1 (en) * 2021-04-02 2022-10-06 四川大学华西医院 Random contrast test identification method for integrating multiple bert models on the basis of lightgbm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829810A (en) * 2018-06-08 2018-11-16 东莞迪赛软件技术有限公司 File classification method towards healthy public sentiment
CN109753564A (en) * 2018-12-13 2019-05-14 四川大学 The construction method of Chinese RCT Intelligence Classifier based on machine learning
CN110210037A (en) * 2019-06-12 2019-09-06 四川大学 Category detection method towards evidence-based medicine EBM field
CN110347825A (en) * 2019-06-14 2019-10-18 北京物资学院 The short English film review classification method of one kind and device
CN112131389A (en) * 2020-10-26 2020-12-25 四川大学华西医院 Method for integrating multiple BERT models by LightGBM to accelerate system evaluation updating

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9087303B2 (en) * 2012-02-19 2015-07-21 International Business Machines Corporation Classification reliability prediction
CN112836772A (en) * 2021-04-02 2021-05-25 四川大学华西医院 Random contrast test identification method integrating multiple BERT models based on LightGBM


Also Published As

Publication number Publication date
WO2022205768A1 (en) 2022-10-06

Similar Documents

Publication Publication Date Title
WO2021047186A1 (en) Method, apparatus, device, and storage medium for processing consultation dialogue
WO2018218708A1 (en) Deep-learning-based public opinion hotspot category classification method
WO2018000269A1 (en) Data annotation method and system based on data mining and crowdsourcing
CN106528528A (en) A text emotion analysis method and device
CN109933664A (en) A kind of fine granularity mood analysis improved method based on emotion word insertion
Harfoushi et al. Sentiment analysis algorithms through azure machine learning: Analysis and comparison
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
Sathiyanarayanan et al. Identification of breast cancer using the decision tree algorithm
CN109036577A (en) Diabetic complication analysis method and device
CN102156885A (en) Image classification method based on cascaded codebook generation
Jatav An algorithm for predictive data mining approach in medical diagnosis
Borovsky et al. Moving towards accurate and early prediction of language delay with network science and machine learning approaches
CN107194617A (en) A kind of app software engineers soft skill categorizing system and method
CN109492105A (en) A kind of text sentiment classification method based on multiple features integrated study
Liu et al. Patent analysis and classification prediction of biomedicine industry: SOM-KPCA-SVM model
Livieris et al. Identification of blood cell subtypes from images using an improved SSL algorithm
Tran et al. Automated curation of CNMF-E-extracted ROI spatial footprints and calcium traces using open-source AutoML tools
Orosoo et al. Performance analysis of a novel hybrid deep learning approach in classification of quality-related English text
CN112836772A (en) Random contrast test identification method integrating multiple BERT models based on LightGBM
CN112131389B (en) Method for integrating multiple BERT models through LightGBM to accelerate system evaluation updating
CN113886562A (en) AI resume screening method, system, equipment and storage medium
Hardaya et al. Application of text mining for classification of community complaints and proposals
CN104331507B (en) Machine data classification is found automatically and the method and device of classification
CN116451114A (en) Internet of things enterprise classification system and method based on enterprise multisource entity characteristic information
CN116775897A (en) Knowledge graph construction and query method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210525
