CN106933805A - Method for identifying biological event trigger words in a large data set - Google Patents

Method for identifying biological event trigger words in a large data set

Info

Publication number
CN106933805A
Authority
CN
China
Prior art keywords
sample
classification
trigger word
data
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710148320.5A
Other languages
Chinese (zh)
Other versions
CN106933805B (en)
Inventor
陈飞
陈一飞
刘峰
韩冰青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201710148320.5A priority Critical patent/CN106933805B/en
Publication of CN106933805A publication Critical patent/CN106933805A/en
Application granted granted Critical
Publication of CN106933805B publication Critical patent/CN106933805B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the technical field of biological event trigger word recognition, and in particular to a method for identifying biological event trigger words in a large data set. The method is a parallel under-sampling method (PUS) comprising data splitting, boundary factor calculation, sample under-sampling, boundary set merging and a final pruning step. It can handle large training data sets with a significant distribution bias between classes, and does so by reducing, in parallel, the sample instances that belong to the majority class. The method selects data according to a boundary factor, which measures how important the information carried by each sample instance is for classification. The method provided by this technical solution addresses large data volume and imbalanced sample distribution between classes at the same time, and thereby achieves better recognition of biological event trigger words.

Description

Method for identifying biological event trigger words in a large data set
Technical field
The present invention relates to the technical field of biological event trigger word recognition, and in particular to a method for identifying biological event trigger words in a large data set.
Background
With advances in information technology and the spread of the Internet, the biomedical electronic literature produced by scientific research is growing exponentially. These online documents contain valuable biomedical knowledge that systems biology research urgently needs. Faced with this surge of biomedical text, text mining, as a technology for extracting important knowledge hidden in documents, has been widely applied in the biomedical field.
Biological event extraction is the process of automatically detecting, in large volumes of medical research literature, descriptions of interactions between biomolecules such as genes and proteins, so as to extract structured information for predefined event types. If biological event trigger words can be identified accurately, event extraction performance improves greatly. Trigger word recognition is the first step of biological event extraction: the recognized trigger words are the basis for event argument recognition and the core of the whole event. Trigger word recognition must also determine the class of each trigger word, which is the class of the whole event. If a trigger word is misidentified, the subsequent work is meaningless, so trigger word recognition is the key to biomedical event extraction. Methods based on support vector machines (SVM) with rich feature representations are the most commonly used machine learning models for event trigger word recognition and give the best results. In practical applications, however, the data pose two key problems. First, the data are distributed unevenly between classes. Second, the training data set is large. Many classification algorithms have severe limitations on large data sets and lose performance. For example, the training complexity of an SVM depends heavily on the size of the data set, and training on large data sets is time-consuming. The combination of a large data set and a highly imbalanced data distribution therefore poses a great challenge for event trigger word recognition.
For large data sets, under-sampling is the most effective technique. It builds a balanced data set by removing some of the sample instances in the majority class, which also reduces computational complexity, so under-sampling remains effective on big data. Many more efficient under-sampling methods have therefore been proposed. Cluster-based under-sampling attempts to solve the imbalanced distribution problem by clustering the data set: the training data are divided into several clusters, representative sample instances are selected proportionally from the majority-class clusters, and these are combined with the minority-class instances to form a balanced data set. Combining cluster-based under-sampling with ensemble learning can effectively address the imbalanced data problem. In addition, a newer method, inverse random under-sampling (IRUS), constructs a composite decision boundary between classes by repeatedly and heavily under-sampling the majority-class data set at random. Although these methods alleviate the imbalanced learning problem to some extent, they still need a great deal of time to cluster iteratively or to find nearest-neighbor boundaries. On large data sets they are therefore not truly efficient.
For large data sets, various methods have also been proposed to overcome the bottleneck of SVM training complexity. For example, sequential minimal optimization (SMO) decomposes a large QP problem into a series of smallest possible QP subproblems, which allows SMO to handle large training sets. Another approach partitions the training data by minimum enclosing ball (MEB) clustering and uses the cluster centers for SVM classification. However, these methods do not help with the classification of imbalanced data.
None of the existing methods solves well, at the same time, the two problems that coexist in this classification task: large data volume and imbalanced sample distribution between classes. Solving both is an important step toward biological event trigger word recognition.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for identifying biological event trigger words in a large data set that addresses large data volume and imbalanced sample distribution between classes at the same time, that is, the imbalanced classification problem on large data sets, and thereby achieves better recognition of biological event trigger words.
To solve the above technical problem, the present invention adopts the following technical solution: a method for identifying biological event trigger words in a large data set, which is a parallel under-sampling method (Parallel Under-Sampling, PUS) comprising the following steps:
Step 1, data splitting. Define the data set $D=\{(x_1,y_1),\ldots,(x_n,y_n)\}$ as the training data set, where $x_i\in R^m$ is a sample instance and $y_i\in\{0,1,\ldots,l\}$ is the class of that sample instance, so there are $1+l$ class labels. Define $D_\alpha$ as the majority-class data set, containing the $n_0$ sample instances that belong to class $y=0$, and let $\alpha=n_0$. Randomly divide $D_\alpha$ into $K$ mutually disjoint majority-class sub-data sets $D_\alpha^k$, $k=1,2,\ldots,K$, and let $\alpha_k$ denote the number of sample instances in each $D_\alpha^k$, so that $\alpha=\sum_{k=1}^{K}\alpha_k$. Define $D_\beta$ as the minority-class data set, i.e. $D_\beta=\bigcup_j D_j$, $j=1,2,\ldots,l$, where $\beta$ denotes the number of samples in all minority-class data sets, $\beta=\sum_{j=1}^{l}n_j$. It follows that $\alpha\gg\beta$;
Step 2, boundary factor calculation. Define each data set $S_k$ as containing the sample instances of the corresponding majority-class sub-data set $D_\alpha^k$ and of the minority-class data set $D_\beta$, expressed as $S_k=D_\alpha^k\cup D_\beta$, $k=1,2,\ldots,K$;
After feature extraction, $S_k$ is represented by $m$-dimensional features $F=\{f_t\}$, $t=1,2,\ldots,m$. The boundary factor of each sample is computed from the uncertainty of its membership in all classes. This uncertainty is determined mainly by the distance $d(x,C_j)$ from each sample instance $x$ in $S_k$ to a given class $C_j$, which is calculated as follows:
$$d(x,C_j)=\sum_{t=1}^{m}\mathrm{dis}^2(x_{f_t},C_{f_t}) \qquad (1)$$
$\mathrm{dis}(x_{f_t},C_{f_t})$ is the distance component, in the $t$-th feature dimension, from sample instance $x$ to the given class $C_j$. Because the biological trigger word recognition data set is text, $\mathrm{dis}(x_{f_t},C_{f_t})$ is defined as the distance from the text vector to the centroid of class $C_j$, where the centroid is the mean of the term frequency $TF(f_t|C_j)$:
$$\mathrm{dis}(x_{f_t},C_{f_t})=\left|x_{f_t}-\frac{TF(f_t|C_j)}{n_j}\right| \qquad (2)$$
On the basis of $d(x,C_j)$, the membership degree $\mu_j(x)$ of each sample instance $x$ in class $C_j$ is defined as follows:
The boundary factor BoundF(x) of sample instance $x$ is defined as follows:
$$\mathrm{BoundF}(x)=\left(-\sum_{j=0}^{l}\mu_j(x)\log\mu_j(x)\right)\times\left(\sum_{j=0}^{l}\mu_j(x)\,d(x,C_j)\right) \qquad (4)$$
Step 3, sample under-sampling. Sort the calculated BoundF(x) values, and extract the $\alpha'_k$ samples with the largest BoundF(x) values as boundary sample instances to form a boundary set. The sampling number is $\alpha'_k=p\times\beta$, where $p$ is a tunable parameter of the PUS algorithm;
Step 4, boundary set merging. Merge all the boundary sets produced by the parallel under-sampling in steps 2 and 3 to obtain a new majority-class data set $D'_\alpha$, and put it together with all the minority classes to obtain the new training data set $D'=D'_\alpha\cup D_\beta$;
Step 5, pruning. Repeat under-sampling steps 2 and 3 on the training data set $D'$ to obtain the final training data set $D''$, so that $D''$ contains the $\alpha''$ samples with the largest BoundF(x) values and the number of majority-class samples balances the number of minority-class samples, i.e. $\alpha''=\beta$.
The method provided by the above technical solution targets two problems in biological event recognition tasks: large data sets and imbalanced distribution between sample classes. It proposes a parallel under-sampling method (PUS) and combines it with an SVM classifier to build the PUS-SVM trigger word recognition system, which effectively improves trigger word recognition performance and efficiency. PUS reduces data imbalance by sampling based on class boundaries, and the under-sampling can be carried out with parallel distributed computing, which effectively reduces the computational complexity on large data sets.
Brief description of the drawings
Fig. 1 is the workflow diagram of the method for identifying biological event trigger words in a large data set according to the present invention;
Fig. 2 compares the time consumption of the under-sampling methods in the biological trigger word recognition system on the same data set Data_BioNLP09;
Fig. 3 compares the time consumption of the under-sampling methods in the biological trigger word recognition system on the same data set Data_BioNLP11.
Detailed description
To make the objects and advantages of the present invention clearer, the present invention is described in detail below with reference to embodiments. It should be understood that the following text only describes one or more specific embodiments of the present invention and does not strictly limit the scope of protection specifically claimed.
The technical solution adopted by the present invention is shown in Fig. 1. The proposed method can handle large training data sets with a significant distribution bias between classes. It is an under-sampling method that adjusts the data distribution between classes by reducing the sample instances belonging to the majority class. The method selects data according to a boundary factor (Boundary Factor, BoundF), which measures how important the information carried by each sample instance is for classification. After under-sampling, sample instances at the center of a class are discarded, while sample instances at the class boundary are retained. In the subsequent biological event trigger word recognition, an SVM is used as the classifier, and its computation of the separating hyperplane depends on these boundary sample instances, i.e. the support vectors. The parallel under-sampling method (PUS) therefore retains the sample instances that are most likely to help SVM classification and that carry the most classification information.
The method for identifying biological event trigger words in a large data set according to the present invention is a parallel under-sampling method (PUS) comprising the following steps:
Step 1, data splitting. Define the data set $D=\{(x_1,y_1),\ldots,(x_n,y_n)\}$ as the training data set, where $x_i\in R^m$ is a sample instance and $y_i\in\{0,1,\ldots,l\}$ is the class of that sample instance, so there are $1+l$ class labels. Define $D_\alpha$ as the majority-class data set, containing the $n_0$ sample instances that belong to class $y=0$, and let $\alpha=n_0$. Randomly divide $D_\alpha$ into $K$ mutually disjoint majority-class sub-data sets $D_\alpha^k$, $k=1,2,\ldots,K$, and let $\alpha_k$ denote the number of sample instances in each $D_\alpha^k$, so that $\alpha=\sum_{k=1}^{K}\alpha_k$. $D_j$ is one of the minority-class data sets, containing the $n_j$ sample instances that belong to class $y=j$, $j=1,\ldots,l$ (i.e. one type of event trigger word). The data set $D$ has a significant distribution bias, so $n_0\gg n_j$. Let $\beta$ denote the number of samples in all minority-class data sets; then $\beta=\sum_{j=1}^{l}n_j$ and $D_\beta=\bigcup_j D_j$, $j=1,2,\ldots,l$, which gives $\alpha\gg\beta$. In step 1, all the sub-data sets $D_\alpha^k$ have the same size.
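The random split in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_majority` and the `seed` parameter are hypothetical, and when K does not divide the number of majority samples the subset sizes differ by at most one rather than being exactly equal.

```python
import random

def split_majority(majority, k, seed=0):
    """Randomly partition `majority` into k disjoint, near-equal subsets."""
    if k < 1:
        raise ValueError("k must be at least 1")
    shuffled = list(majority)
    random.Random(seed).shuffle(shuffled)
    # Striding deals the shuffled samples round-robin, so the subsets are
    # disjoint and their sizes differ by at most one.
    return [shuffled[i::k] for i in range(k)]

subsets = split_majority(range(10), k=3)
sizes = sorted(len(s) for s in subsets)
```

Each subset would then be paired with the full minority-class data set to form one $S_k$ for step 2.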
Step 2, boundary factor calculation. Define each data set $S_k$ as containing the sample instances of the corresponding majority-class sub-data set $D_\alpha^k$ and of the minority-class data set $D_\beta$, expressed as $S_k=D_\alpha^k\cup D_\beta$, $k=1,2,\ldots,K$;
After feature extraction, $S_k$ is represented by $m$-dimensional features $F=\{f_t\}$, $t=1,2,\ldots,m$. The boundary factor of each sample is computed from the uncertainty of its membership in all classes. This uncertainty is determined mainly by the distance $d(x,C_j)$ from each sample instance $x$ in $S_k$ to a given class $C_j$, which is calculated as follows:
$$d(x,C_j)=\sum_{t=1}^{m}\mathrm{dis}^2(x_{f_t},C_{f_t}) \qquad (1)$$
$\mathrm{dis}(x_{f_t},C_{f_t})$ is the distance component, in the $t$-th feature dimension, from sample instance $x$ to the given class $C_j$. To reduce the amount of computation, $d(x,C_j)$ is built on the centroid of class $C_j$ rather than on all the samples of $C_j$. Because the biological trigger word recognition data set is text, $\mathrm{dis}(x_{f_t},C_{f_t})$ is defined as the distance from the text vector to the centroid of class $C_j$, where the centroid is the mean of the term frequency $TF(f_t|C_j)$:
$$\mathrm{dis}(x_{f_t},C_{f_t})=\left|x_{f_t}-\frac{TF(f_t|C_j)}{n_j}\right| \qquad (2)$$
In formula (2), the term frequency $TF(f_t|C_j)$ of class $C_j$ is averaged over the number of samples $n_j$ in $C_j$, which gives the centroid of class $C_j$. On the basis of $d(x,C_j)$, the membership degree $\mu_j(x)$ of each sample instance $x$ in class $C_j$ is defined as follows:
From formula (3), the smaller the distance from $x$ to the centroid, the larger the membership degree of $x$ in $C_j$; conversely, the larger the distance from $x$ to the centroid, the smaller the membership degree of $x$ in $C_j$. The boundary factor BoundF(x) of sample instance $x$ is defined as follows:
$$\mathrm{BoundF}(x)=\left(-\sum_{j=0}^{l}\mu_j(x)\log\mu_j(x)\right)\times\left(\sum_{j=0}^{l}\mu_j(x)\,d(x,C_j)\right) \qquad (4)$$
The boundary factor BoundF(x) is the product of two parts. The first part is the entropy expressing the uncertainty of the sample instance's membership in each class: the closer a sample instance is to a class boundary, the larger the entropy of its membership degrees. The second part is the average distance from the sample instance to all classes: the deeper a sample instance lies inside a class, the smaller its average distance, and the closer it is to a class boundary, the larger its average distance. From these two parts it follows that sample instances at class boundaries have larger BoundF(x) values and, compared with samples inside a class, carry more classification information.
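The behaviour described above can be checked numerically with a small sketch. The centroid and distance follow formulas (1) and (2) as they appear in this text (a sum of squared per-feature differences to the class mean), and the boundary factor is the entropy term times the membership-weighted distance term. The membership function is an assumption: the text of formula (3) is not available here, so a normalized inverse distance is used only because it satisfies the stated property that a smaller distance gives a larger membership degree. All function names are hypothetical.

```python
import math

def centroid(samples):
    """Class centroid: per-feature term frequency summed over the class, divided by n_j."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def distance(x, c):
    """d(x, C_j): sum over features of the squared distance component |x_t - c_t|."""
    return sum(abs(xt - ct) ** 2 for xt, ct in zip(x, c))

def membership(dists, eps=1e-12):
    """ASSUMED stand-in for formula (3): normalized inverse distance, sums to 1."""
    inv = [1.0 / (d + eps) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def bound_factor(x, centroids):
    """BoundF(x): membership entropy times membership-weighted distance."""
    dists = [distance(x, c) for c in centroids]
    mu = membership(dists)
    entropy = -sum(m * math.log(m) for m in mu if m > 0.0)
    weighted_dist = sum(m * d for m, d in zip(mu, dists))
    return entropy * weighted_dist

class0 = [[0.0, 0.0], [2.0, 0.0]]    # centroid (1, 0)
class1 = [[8.0, 0.0], [10.0, 0.0]]   # centroid (9, 0)
centroids = [centroid(class0), centroid(class1)]
interior = bound_factor([1.5, 0.0], centroids)  # deep inside class 0
boundary = bound_factor([5.0, 0.0], centroids)  # midway between the classes
```

Under this assumed membership function the midway point scores far higher than the interior point, which matches the qualitative claim that boundary samples receive larger BoundF(x) values.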
Step 3, sample under-sampling. Sort the calculated BoundF(x) values, and extract the $\alpha'_k$ samples with the largest BoundF(x) values as boundary sample instances to form a boundary set. The sampling number is $\alpha'_k=p\times\beta$, where $p$ is a tunable parameter of the PUS algorithm. In step 3, if the data set contains noisy data, the noisy data are deleted before the boundary sample instances are sampled.
Step 4, boundary set merging. Merge all the boundary sets produced by the parallel under-sampling in steps 2 and 3 to obtain a new majority-class data set $D'_\alpha$, and put it together with all the minority classes to obtain the new training data set $D'=D'_\alpha\cup D_\beta$;
Step 5, pruning. Repeat under-sampling steps 2 and 3 on the training data set $D'$ to obtain the final training data set $D''$, so that $D''$ contains the $\alpha''$ samples with the largest BoundF(x) values and the number of majority-class samples balances the number of minority-class samples, i.e. $\alpha''=\beta$.
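The selection in step 3 amounts to a top-k query over the BoundF(x) scores. A minimal sketch, assuming the scores have already been computed; `select_boundary` is a hypothetical name, and the noise-removal pre-step mentioned in the text is not implemented here.

```python
import heapq

def select_boundary(samples, scores, p, beta):
    """Keep the p * beta samples with the largest BoundF scores."""
    k = min(len(samples), int(p * beta))
    # nlargest returns the indices ordered from highest to lowest score.
    top = heapq.nlargest(k, range(len(samples)), key=lambda i: scores[i])
    return [samples[i] for i in top]

samples = ["a", "b", "c", "d", "e", "f"]
scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8]
picked = select_boundary(samples, scores, p=2, beta=2)
```

Using `heapq.nlargest` avoids a full sort when only a small fraction of a large subset is kept; a plain sort, as the text describes, gives the same result.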
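The sizes implied by the sampling and merging steps can be worked out directly. The derivation below assumes every sub-data set holds at least $p\beta$ majority samples, and the numbers $K=8$, $p=2$, $\beta=1000$ are purely illustrative values, not figures from the patent.

```latex
\begin{aligned}
|D'_\alpha| &= \sum_{k=1}^{K} \alpha'_k = K\,p\,\beta, \\
|D'| &= |D'_\alpha| + |D_\beta| = (Kp + 1)\,\beta. \\
\text{With } K = 8,\ p = 2,\ \beta = 1000:\quad
|D'_\alpha| &= 16000, \qquad |D'| = 17000.
\end{aligned}
```

The merged majority set is therefore still $Kp$ times larger than the minority set, which is why a further reduction to $\beta$ majority samples is needed before the classes are balanced.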
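Steps 1 to 5 can be strung together as in the sketch below. It rests on several assumptions that are not confirmed by the text: the membership function is a normalized inverse distance standing in for formula (3), only the majority samples of each $S_k$ are ranked, class centroids are recomputed per subset, and a thread pool stands in for whatever parallel or distributed runtime an actual deployment would use (a process pool or cluster would be needed for real CPU parallelism in Python). The names `pus` and `undersample` are hypothetical.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def bound_factor(x, centroids, eps=1e-12):
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    inv = [1.0 / (d + eps) for d in dists]   # ASSUMED membership, see lead-in
    total = sum(inv)
    mu = [v / total for v in inv]
    entropy = -sum(m * math.log(m) for m in mu if m > 0.0)
    return entropy * sum(m * d for m, d in zip(mu, dists))

def undersample(majority, minority_by_class, keep):
    """Steps 2-3 on one subset: score majority samples, keep the `keep` largest."""
    centroids = [centroid(majority)] + [centroid(r) for r in minority_by_class]
    ranked = sorted(majority, key=lambda x: bound_factor(x, centroids), reverse=True)
    return ranked[:keep]

def pus(majority, minority_by_class, k, p, seed=0):
    beta = sum(len(rows) for rows in minority_by_class)
    shuffled = list(majority)
    random.Random(seed).shuffle(shuffled)
    subsets = [shuffled[i::k] for i in range(k)]             # step 1: split
    with ThreadPoolExecutor(max_workers=k) as pool:          # steps 2-3 per subset
        boundary_sets = list(pool.map(
            lambda s: undersample(s, minority_by_class, int(p * beta)), subsets))
    merged = [x for b in boundary_sets for x in b]           # step 4: merge
    return undersample(merged, minority_by_class, beta)      # step 5: prune

rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [[(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(10)]]
final_majority = pus(majority, minority, k=4, p=2)
```

In this toy run each of the 4 subsets contributes 20 boundary samples, the merged set holds 80, and pruning leaves 10 majority samples to match the 10 minority samples.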
To compare the method provided by the present invention with other methods, two corpora are used here, Data_BioNLP09 and Data_BioNLP11. In the BioNLP'09 and BioNLP'11 shared tasks, trigger word recognition is an intermediate step of the biomedical event extraction task, so no trigger word recognition results are available for the test sets. We therefore use the validation data set as the test data set.
Table 1. Types and counts of trigger and non-trigger words in Data_BioNLP09
Table 1 gives detailed statistics of the training and validation data sets in Data_BioNLP09. For the trigger recognition task there are 9 trigger word types (corresponding to 9 types of biomedical events), which fall into three groups: simple event triggers, binding event triggers and complex event triggers. Together with the negative class of non-trigger words, the task is a 10-class classification problem. The data in the table show a significant class imbalance: only about 3.74% of the training data belong to one of the trigger word classes, and the rest belong to the non-trigger class.
Table 2. Types and counts of trigger and non-trigger words in Data_BioNLP11
Table 2 gives detailed statistics of the training and validation data sets in Data_BioNLP11.
Table 3. Performance comparison between the present invention and other existing systems
Biological trigger word recognition system    P       R       F1
PUS-SVM                                       69.7    69.9    68.3
SystemCRF                                     65.0    30.2    41.2
SystemSVM                                     70.2    52.6    60.1
Turku                                         70.5    60.6    65.2
TrigNER                                       69.3    57.3    62.7
On the same data set Data_BioNLP09, the present invention is compared with other existing recognition systems. Table 3 lists three metrics for the present invention and four other recognition systems: P (precision), R (recall) and F1 (the F value, which is the harmonic mean of precision and recall). The results in Table 3 show that the biological trigger word recognition system PUS-SVM proposed in the present invention achieves the best overall performance and differs significantly from the other systems. Moreover, the improvement in system performance comes from the improvement in recall, which means that more candidate trigger words are identified, and this can further improve the performance of the next stage, event recognition.
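For reference, F1 is computed from precision and recall as their harmonic mean. The short sketch below applies the standard formula to two rows of Table 3; the function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_system_svm = round(f1_score(70.2, 52.6), 1)  # SystemSVM row
f1_turku = round(f1_score(70.5, 60.6), 1)       # Turku row
```

Because the harmonic mean is dominated by the smaller of the two values, a system with high precision but low recall still receives a low F1, which is why gains in recall translate strongly into F1.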
Table 4. Performance comparison of under-sampling methods in the biological trigger word recognition system on the same data set Data_BioNLP09
Table 5. Performance comparison of under-sampling methods in the biological trigger word recognition system on the same data set Data_BioNLP11
Tables 4 and 5 analyze the performance of different under-sampling methods in the SVM-based recognition system: PUS-SVM with the parallel under-sampling method proposed in the present invention, SVM without sampling, RUS-SVM with random sampling, and kUS-SVM with k-nearest-sample under-sampling. The comparative experiments were run separately on Data_BioNLP09 and Data_BioNLP11. On both data sets, the parallel under-sampling system proposed in the present invention obtains the best overall F1.
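As a point of comparison, the random sampling baseline can be sketched in a few lines. The exact procedure behind RUS-SVM is not described in the text, so this sketch assumes plain sampling without replacement of the majority class down to the minority size $\beta$; the function name is hypothetical.

```python
import random

def random_undersample(majority, beta, seed=0):
    """Draw beta majority samples uniformly at random, without replacement."""
    pool = list(majority)
    return random.Random(seed).sample(pool, min(beta, len(pool)))

kept = random_undersample(range(100), beta=5)
```

Unlike boundary-factor sampling, this baseline ignores where a sample lies relative to the class boundary, so it may discard exactly the instances an SVM would use as support vectors.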
To further analyze the efficiency of the present invention on large data sets, the running times of the above methods were compared on a machine with 8 processing cores @ 2.67 GHz and RAM: PUS-SVM with the parallel under-sampling method proposed in the present invention, SVM without sampling, RUS-SVM with random sampling, and kUS-SVM with k-nearest-sample under-sampling. The comparative experiments were run separately on Data_BioNLP09 and Data_BioNLP11, and the results are shown in Fig. 2 and Fig. 3. Figs. 2 and 3 show that the method proposed by the present invention obtains the best biological trigger word recognition results within an acceptable time.
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments. A person skilled in the art who has learned the content described herein may make equivalent changes and substitutions without departing from the principle of the present invention, and these equivalent changes and substitutions shall also be regarded as falling within the scope of protection of the present invention.

Claims (3)

1. A method for identifying biological event trigger words in a large data set, being a parallel under-sampling method, characterized by comprising the following steps:
Step 1, data splitting. Define the data set $D=\{(x_1,y_1),\ldots,(x_n,y_n)\}$ as the training data set, where $x_i\in R^m$ is a sample instance and $y_i\in\{0,1,\ldots,l\}$ is the class of the sample instance, so there are $1+l$ class labels. Define $D_\alpha$ as the majority-class data set, containing the $n_0$ sample instances that belong to class $y=0$, and let $\alpha=n_0$. Randomly divide $D_\alpha$ into $K$ mutually disjoint majority-class sub-data sets $D_\alpha^k$, $k=1,2,\ldots,K$, and let $\alpha_k$ denote the number of sample instances in each $D_\alpha^k$. Define $D_\beta$ as the minority-class data set, i.e. $D_\beta=\bigcup_j D_j$, $j=1,2,\ldots,l$, where $\beta$ denotes the number of samples in all minority-class data sets, $\beta=\sum_{j=1}^{l}n_j$. It follows that $\alpha\gg\beta$;
Step 2, boundary factor calculation. Define each data set $S_k$ as containing the sample instances of the corresponding majority-class sub-data set $D_\alpha^k$ and of the minority-class data set $D_\beta$, expressed as $S_k=D_\alpha^k\cup D_\beta$, $k=1,2,\ldots,K$;
After feature extraction, $S_k$ is represented by $m$-dimensional features $F=\{f_t\}$, $t=1,2,\ldots,m$. The boundary factor of each sample is computed from the uncertainty of its membership in all classes. This uncertainty is determined mainly by the distance $d(x,C_j)$ from each sample instance $x$ in $S_k$ to a given class $C_j$, which is calculated as follows:
$$d(x,C_j)=\sum_{t=1}^{m}\mathrm{dis}^2(x_{f_t},C_{f_t}) \qquad (1)$$
$\mathrm{dis}(x_{f_t},C_{f_t})$ is the distance component, in the $t$-th feature dimension, from sample instance $x$ to the given class $C_j$. Because the biological trigger word recognition data set is text, $\mathrm{dis}(x_{f_t},C_{f_t})$ is defined as the distance from the text vector to the centroid of class $C_j$, where the centroid is the mean of the term frequency $TF(f_t|C_j)$:
$$\mathrm{dis}(x_{f_t},C_{f_t})=\left|x_{f_t}-\frac{TF(f_t|C_j)}{n_j}\right| \qquad (2)$$
On the basis of $d(x,C_j)$, the membership degree $\mu_j(x)$ of each sample instance $x$ in class $C_j$ is defined as follows:
And
The boundary factor BoundF(x) of sample instance $x$ is defined as follows:
$$\mathrm{BoundF}(x)=\left(-\sum_{j=0}^{l}\mu_j(x)\log\mu_j(x)\right)\times\left(\sum_{j=0}^{l}\mu_j(x)\,d(x,C_j)\right) \qquad (4)$$
Step 3, sample under-sampling. Sort the calculated BoundF(x) values, and extract the $\alpha'_k$ samples with the largest BoundF(x) values as boundary sample instances to form a boundary set. The sampling number is $\alpha'_k=p\times\beta$, where $p$ is a tunable parameter of the PUS algorithm;
Step 4, boundary set merging. Merge all the boundary sets produced by the parallel under-sampling in steps 2 and 3 to obtain a new majority-class data set $D'_\alpha$, and put it together with all the minority classes to obtain the new training data set $D'=D'_\alpha\cup D_\beta$;
Step 5, pruning. Repeat under-sampling steps 2 and 3 on the training data set $D'$ to obtain the final training data set $D''$, so that $D''$ contains the $\alpha''$ samples with the largest BoundF(x) values and the number of majority-class samples balances the number of minority-class samples, i.e. $\alpha''=\beta$.
2. The method for identifying biological event trigger words in a large data set according to claim 1, characterized in that all the sub-data sets $D_\alpha^k$ in step 1 have the same size.
3. The method for identifying biological event trigger words in a large data set according to claim 1, characterized in that, in step 3, if the data set contains noisy data, the noisy data are deleted before the boundary sample instances are sampled.
CN201710148320.5A 2017-03-14 2017-03-14 Method for identifying biological event trigger words in big data set Expired - Fee Related CN106933805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710148320.5A CN106933805B (en) 2017-03-14 2017-03-14 Method for identifying biological event trigger words in big data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710148320.5A CN106933805B (en) 2017-03-14 2017-03-14 Method for identifying biological event trigger words in big data set

Publications (2)

Publication Number Publication Date
CN106933805A true CN106933805A (en) 2017-07-07
CN106933805B CN106933805B (en) 2020-04-28

Family

ID=59432925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710148320.5A Expired - Fee Related CN106933805B (en) 2017-03-14 2017-03-14 Method for identifying biological event trigger words in big data set

Country Status (1)

Country Link
CN (1) CN106933805B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897989A (en) * 2018-06-06 2018-11-27 大连理工大学 A kind of biological event abstracting method based on candidate events element attention mechanism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927874A (en) * 2014-04-29 2014-07-16 东南大学 Automatic incident detection method based on under-sampling and used for unbalanced data set
CN104965819A (en) * 2015-07-12 2015-10-07 大连理工大学 Biomedical event trigger word identification method based on syntactic word vector
CN105260361A (en) * 2015-10-28 2016-01-20 南京邮电大学 Trigger word tagging system and method for biomedical events
CN105512209A (en) * 2015-11-28 2016-04-20 大连理工大学 Biomedicine event trigger word identification method based on characteristic automatic learning

Also Published As

Publication number Publication date
CN106933805B (en) 2020-04-28

Similar Documents

Publication Publication Date Title
Agnihotri et al. Variable global feature selection scheme for automatic classification of text documents
CN106815369B (en) A kind of file classification method based on Xgboost sorting algorithm
Guo et al. Fast data selection for SVM training using ensemble margin
US20180165413A1 (en) Gene expression data classification method and classification system
CN110795564B (en) Text classification method lacking negative cases
CN108460421A (en) The sorting technique of unbalanced data
Bamakan et al. A novel feature selection method based on an integrated data envelopment analysis and entropy model
CN109784387A (en) Multi-level progressive classification method and system based on neural network and Bayesian model
CN109858518A (en) A kind of large data clustering method based on MapReduce
Untoro et al. Evaluation of decision tree, k-NN, Naive Bayes and SVM with MWMOTE on UCI dataset
Mizianty et al. Discretization as the enabling technique for the Naive Bayes and semi-Naive Bayes-based classification
CN111797267A (en) Medical image retrieval method and system, electronic device and storage medium
CN107679244A (en) File classification method and device
CN106844596A (en) One kind is based on improved SVM Chinese Text Categorizations
Remli et al. K-means clustering with infinite feature selection for classification tasks in gene expression data
Arora Classification of human metaspread images using convolutional neural networks
CN106933805A (en) The recognition methods of biological event trigger word in a kind of large data sets
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN110516741A (en) Classification based on dynamic classifier selection is overlapped unbalanced data classification method
Shi et al. Rough set based decision tree ensemble algorithm for text classification
CN103207893A (en) Classification method of two types of texts on basis of vector group mapping
Liu et al. Multi-class classification of support vector machines based on double binary tree
Mubaroq et al. Application of Discretization and Information Gain on Naïve Bayes to Diagnose Heart Disease
CN110532384A (en) A kind of multitask dictionary list classification method, system, device and storage medium
Liang et al. ASE: Anomaly Scoring Based Ensemble Learning for Imbalanced Datasets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200428

Termination date: 20210314
