CN106933805A - Method for identifying biological event trigger words in a large data set - Google Patents
Method for identifying biological event trigger words in a large data set
- Publication number
- CN106933805A (application number CN201710148320.5A)
- Authority
- CN
- China
- Prior art keywords
- sample
- classification
- trigger word
- data
- boundary
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the technical field of biological event trigger word identification, and specifically to a method for identifying biological event trigger words in large data sets. The method is a parallel under-sampling method (PUS) comprising data partitioning, boundary factor calculation, sample under-sampling, boundary set merging and a final pruning step. It can process large training data sets with significant distribution deviation between classes, and it does so by reducing, in parallel, the sample instances that belong to the majority class. The method selects data on the basis of a boundary factor, which measures how important the information carried by each sample instance is for classification. The method provided by this technical solution addresses large data volume and imbalanced sample distribution between classes at the same time, and thereby achieves better identification of biological event trigger words.
Description
Technical field
The present invention relates to the technical field of biological event trigger word identification, and in particular to a method for identifying biological event trigger words in large data sets.
Background
With advances in information technology and the growing popularity of the Internet, biomedical electronic literature, as a product of scientific research, is growing exponentially. These online literature resources contain valuable biomedical knowledge that is urgently needed for systems biology research. Faced with this continuous surge of biomedical text, text mining, as a technology for extracting the important knowledge hidden in the literature, has been widely applied in the biomedical field.
Biological event extraction refers to automatically detecting, in massive medical research literature, descriptions of interactions between biomolecules such as genes and proteins, so as to extract structured information for predefined event types. If biological event trigger words can be identified accurately in this process, the performance of event extraction improves greatly. Trigger word identification is the first step of biological event extraction; the trigger words it identifies are the basis for event argument identification and the core of the whole event. Trigger word identification must also determine the type of each trigger word, which is the type of the whole event. If a trigger word is identified incorrectly, the subsequent work loses its meaning, so trigger word identification is the key to biomedical event extraction. Methods based on support vector machines (SVM) with rich feature representations are the most commonly used machine learning models for event trigger word identification and give the best results. In practical trigger identification applications, however, the data raise two key problems: first, the distribution of data between classes is imbalanced; second, the training data set is large. Many classification algorithms are severely limited on large data sets, and their performance degrades. For example, the training complexity of an SVM depends heavily on the size of the data set, so training on a large data set is time-consuming. Large data sets with highly imbalanced distributions therefore pose a great challenge for event trigger word identification.
For large data sets, under-sampling is the most effective technique: it builds a balanced data set by removing some sample instances of the majority class, which also reduces computational complexity, so it remains effective for big data. Many more efficient under-sampling methods have therefore been proposed. Cluster-based under-sampling methods try to solve the imbalanced distribution problem by clustering the data set: the training data are divided into several clusters, and representative sample instances are then selected proportionally from the majority-class clusters and combined with the minority-class instances into a balanced data set. Combining cluster-based under-sampling with ensemble learning can effectively address the imbalanced data problem. In addition, a newer method, inverse random under-sampling (IRUS), builds a composite decision boundary between the classes by repeatedly and heavily under-sampling the majority-class data set at random. Although these methods alleviate the problem of learning from imbalanced data to some extent, they still spend a great deal of time iteratively clustering or searching for nearest-neighbour boundaries. They are therefore not truly efficient on large data sets.
Various methods have also been proposed to overcome the SVM training-complexity bottleneck on large data sets. For example, sequential minimal optimization (SMO) decomposes a large QP problem into a series of smallest-possible QP problems, which allows SMO to handle large training sets. Another approach clusters the data set with minimum enclosing balls (MEB): the training data are partitioned by the MEB method and the cluster centres are used for SVM classification. These methods, however, do not help with classifying imbalanced data.
None of the existing methods adequately solves both problems that coexist in this classification task, namely large data volume and imbalanced sample distribution between classes, although solving them is an important step towards identifying biological event trigger words.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for identifying biological event trigger words in large data sets that addresses large data volume and imbalanced sample distribution between classes at the same time, that is, the imbalanced classification problem on large data sets, and thereby achieves better identification of biological event trigger words.
To solve this technical problem, the present invention adopts the following technical solution: a method for identifying biological event trigger words in large data sets, which is a parallel under-sampling method (Parallel Under-Sampling, PUS) comprising the following steps:
Step 1, data partitioning: define the data set D = {(x_1, y_1), ..., (x_n, y_n)} as the training data set, where x_i ∈ R^m is a sample instance and y_i ∈ {0, 1, ..., l} is the class of that sample instance, giving 1 + l class labels in total. Define D_α as the majority-class data set, which contains the n_0 sample instances belonging to class y = 0, and let α = n_0. Randomly divide D_α into K mutually disjoint majority-class sub data sets, indexed k = 1, 2, ..., K, and let α_k denote the number of sample instances in the k-th sub data set, so that α = α_1 + ... + α_K. Define D_β as the minority-class data set, i.e. D_β = ∪ D_j, j = 1, 2, ..., l, where β denotes the total number of samples in all minority-class data sets, β = n_1 + ... + n_l with n_j the number of samples in D_j; thus α >> β;
Step 2, boundary factor calculation: define each data set S_k as containing the sample instances of the corresponding majority-class sub data set and of the minority-class data set D_β, that is, S_k is the union of the k-th majority-class sub data set and D_β, k = 1, 2, ..., K;
After feature extraction, S_k is represented by m-dimensional features F = {f_t}, t = 1, 2, ..., m. The boundary factor of each sample is computed from the uncertainty of its membership in each class. This uncertainty is determined mainly by the distance d(x, C_j) from each sample instance x in S_k to a given class C_j, which is defined as follows:
[Formula (1), the definition of d(x, C_j), is not reproduced in the source text.]
dis(x_ft, C_ft) computes, in the t-th feature dimension, the component of the distance from sample instance x to the given class C_j. Because the biological trigger word identification data set is text, dis(x_ft, C_ft) is defined as the distance from the text vector to the centroid of class C_j, where the centroid is the mean of the term frequency TF(f_t | C_j):
[Formula (2), the definition of the class centroid, is not reproduced in the source text.]
On the basis of d(x, C_j), the membership degree μ_j(x) of each sample instance x in class C_j is defined as follows:
[Formula (3), the definition of μ_j(x), is not reproduced in the source text.]
The boundary factor BoundF(x) of sample instance x is defined as follows:
[Formula (4), the definition of BoundF(x), is not reproduced in the source text.]
Step 3, sample under-sampling: sort the computed BoundF(x) values and extract the α'_k samples with the largest BoundF(x) values as boundary sample instances, which form a boundary set. The number of samples drawn is α'_k = p × β, where p is a tunable parameter of the PUS algorithm;
Step 4, boundary set merging: merge all boundary sets produced by the parallel under-sampling in steps 2 and 3 into a new majority-class data set D'_α, and combine it with all minority-class samples to obtain the new training data set D' = D'_α ∪ D_β;
Step 5, pruning: repeat under-sampling steps 2 and 3 on the training data set D' to obtain the final training data set D'', such that D'' contains the α'' majority-class samples with the largest BoundF(x) values and the number of majority-class samples balances the number of minority-class samples, i.e. α'' = β.
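The random partition in step 1 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `partition_majority`, the `seed` argument and the toy label list are assumptions made for illustration, and striding over a shuffled index list is only one of several ways to obtain K disjoint subsets of near-equal size.

```python
import random

def partition_majority(indices, K, seed=0):
    """Randomly split majority-class indices into K disjoint subsets.

    Hypothetical helper: shuffling and then striding places every index
    in exactly one subset, and subset sizes differ by at most one.
    """
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[k::K] for k in range(K)]

# Toy labels: class 0 is the majority class, classes 1 and 2 are minority classes.
labels = [0] * 10 + [1, 2, 1]
majority = [i for i, y in enumerate(labels) if y == 0]
parts = partition_majority(majority, K=3)
```

Each element of `parts` would then be combined with all minority-class samples to form one S_k for step 2.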
The method provided by this technical solution targets two problems in biological event identification tasks: large data sets and imbalanced distribution between sample classes. It proposes a parallel under-sampling method (PUS) and combines it with an SVM classifier to build the PUS-SVM trigger word identification system, which effectively improves both the performance and the efficiency of trigger word identification. PUS uses class-boundary-based sampling to reduce the imbalance of the data and can carry out the under-sampling with parallel distributed computing, which effectively reduces the computational cost on large data sets.
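The parallel structure of PUS (per-subset selection, merging, then pruning) can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: `top_k`, `pus` and `score_fn` are hypothetical names, the score is passed in as a plain function of one sample (in the method itself the boundary factor is computed relative to each S_k), and a thread pool stands in for whatever distributed framework would be used in practice.

```python
from concurrent.futures import ThreadPoolExecutor

def top_k(samples, score_fn, k):
    """Keep the k samples with the largest score."""
    return sorted(samples, key=score_fn, reverse=True)[:k]

def pus(majority_parts, minority, score_fn, p=2):
    """Sketch of steps 3 to 5: select per subset in parallel, merge, prune."""
    beta = len(minority)
    per_part = p * beta  # alpha'_k = p * beta
    with ThreadPoolExecutor() as pool:
        boundary_sets = list(
            pool.map(lambda part: top_k(part, score_fn, per_part), majority_parts)
        )
    merged = [x for b in boundary_sets for x in b]  # new majority set D'_alpha
    kept = top_k(merged, score_fn, beta)            # pruning: alpha'' = beta
    return kept, minority

# Toy 1-D data: a larger value is assumed to lie closer to the class boundary.
majority_parts = [[0.1, 0.9, 0.5, 0.2], [0.8, 0.3, 0.7, 0.05]]
minority = [1.2, 1.5]
kept, _ = pus(majority_parts, minority, score_fn=lambda x: x, p=1)
```

Because each subset is scored independently, the per-subset work scales with the subset size rather than with the full majority class.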
Brief description of the drawings
Fig. 1 is a workflow diagram of the method for identifying biological event trigger words in large data sets according to the present invention;
Fig. 2 compares the time consumption of the under-sampling methods in a biological trigger word identification system on the same data set Data_BioNLP09;
Fig. 3 compares the time consumption of the under-sampling methods in a biological trigger word identification system on the same data set Data_BioNLP11.
Detailed description of the embodiments
To make the objects and advantages of the present invention clearer, the invention is described in detail below with reference to embodiments. It should be understood that the following text only describes one or more specific embodiments of the invention and does not strictly limit the scope of protection specifically claimed.
The technical solution adopted by the present invention is shown in Fig. 1. The proposed method can process large training data sets with significant distribution deviation between classes. It is an under-sampling method that adjusts the data distribution between classes by reducing the sample instances belonging to the majority class. The method selects data on the basis of a boundary factor (Boundary Factor, BoundF), which measures how important the information carried by each sample instance is for classification. After under-sampling, sample instances near the class centres are discarded, while sample instances on the class boundaries are retained. In the subsequent biological event trigger word identification, an SVM serves as the classifier, and its computation of the separating hyperplane depends on exactly these boundary sample instances, i.e. the support vectors. The parallel under-sampling method (PUS) therefore retains the sample instances that carry the most classification information and are most likely to help the SVM classify.
The method for identifying biological event trigger words in large data sets according to the present invention is a parallel under-sampling method (PUS) comprising the following steps:
Step 1, data partitioning: define the data set D = {(x_1, y_1), ..., (x_n, y_n)} as the training data set, where x_i ∈ R^m is a sample instance and y_i ∈ {0, 1, ..., l} is the class of that sample instance, giving 1 + l class labels in total. Define D_α as the majority-class data set, which contains the n_0 sample instances belonging to class y = 0, and let α = n_0. Randomly divide D_α into K mutually disjoint majority-class sub data sets, indexed k = 1, 2, ..., K, and let α_k denote the number of sample instances in the k-th sub data set, so that α = α_1 + ... + α_K. D_j is one of the minority-class data sets and contains the n_j sample instances belonging to class y = j, j = 1, ..., l (i.e. one type of event trigger word). The data set D has a significant distribution deviation, so n_0 >> n_j. Let β denote the total number of samples in all minority-class data sets; then β = n_1 + ... + n_l and D_β = ∪ D_j, j = 1, 2, ..., l, so that α >> β. In step 1, all majority-class sub data sets have the same size.
Step 2, boundary factor calculation: define each data set S_k as containing the sample instances of the corresponding majority-class sub data set and of the minority-class data set D_β, that is, S_k is the union of the k-th majority-class sub data set and D_β, k = 1, 2, ..., K;
After feature extraction, S_k is represented by m-dimensional features F = {f_t}, t = 1, 2, ..., m. The boundary factor of each sample is computed from the uncertainty of its membership in each class. This uncertainty is determined mainly by the distance d(x, C_j) from each sample instance x in S_k to a given class C_j, which is defined as follows:
[Formula (1), the definition of d(x, C_j), is not reproduced in the source text.]
dis(x_ft, C_ft) computes, in the t-th feature dimension, the component of the distance from sample instance x to the given class C_j. To reduce computation, d(x, C_j) is based on the centroid of class C_j rather than on all samples of C_j. Because the biological trigger word identification data set is text, dis(x_ft, C_ft) is defined as the distance from the text vector to the centroid of class C_j, where the centroid is the mean of the term frequency TF(f_t | C_j):
[Formula (2), the definition of the class centroid, is not reproduced in the source text.]
In formula (2), the term frequency TF(f_t | C_j) of class C_j is averaged over the number of samples n_j in C_j, which gives the centroid of class C_j. On the basis of d(x, C_j), the membership degree μ_j(x) of each sample instance x in class C_j is defined as follows:
[Formula (3), the definition of μ_j(x), is not reproduced in the source text.]
Formula (3) shows that the smaller the distance from x to the centroid, the larger the membership degree of x in C_j; conversely, the larger the distance from x to the centroid, the smaller the membership degree of x in C_j. The boundary factor BoundF(x) of sample instance x is defined as follows:
[Formula (4), the definition of BoundF(x), is not reproduced in the source text.]
The boundary factor BoundF(x) is the product of two parts. The first part is the entropy that expresses the uncertainty of the sample instance's membership in each class: the closer a sample instance is to a class boundary, the larger the entropy of its membership degrees. The second part is the average distance from the sample instance to all classes: the deeper a sample instance lies inside a class, the smaller its average distance; conversely, the closer it is to a class boundary, the larger its average distance. From these two parts it follows that sample instances on class boundaries have larger BoundF(x) values and carry more classification information than samples inside a class.
Step 3, sample under-sampling: sort the computed BoundF(x) values and extract the α'_k samples with the largest BoundF(x) values as boundary sample instances, which form a boundary set. The number of samples drawn is α'_k = p × β, where p is a tunable parameter of the PUS algorithm. In step 3, if the data set contains noisy data, the noisy data are deleted before the boundary sample instances are sampled.
Step 4, boundary set merging: merge all boundary sets produced by the parallel under-sampling in steps 2 and 3 into a new majority-class data set D'_α, and combine it with all minority-class samples to obtain the new training data set D' = D'_α ∪ D_β;
Step 5, pruning: repeat under-sampling steps 2 and 3 on the training data set D' to obtain the final training data set D'', such that D'' contains the α'' majority-class samples with the largest BoundF(x) values and the number of majority-class samples balances the number of minority-class samples, i.e. α'' = β.
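Since formulas (1) to (4) are not reproduced in this text, the boundary factor can only be illustrated under assumptions. The sketch below assumes a Euclidean distance to the class centroid for d(x, C_j), a normalized inverse-distance membership for μ_j(x), and BoundF(x) as the Shannon entropy of the memberships multiplied by the average distance, which matches the verbal description above but may differ from the patent's exact definitions. All function names and the toy data are hypothetical.

```python
import math

def centroid(rows):
    """Per-feature mean of a class's term-frequency vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def distance(x, c):
    """Assumed form of d(x, C_j): Euclidean distance to the class centroid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def bound_f(x, centroids, eps=1e-12):
    """Assumed BoundF(x): membership entropy times average class distance."""
    d = [distance(x, c) + eps for c in centroids]  # eps avoids division by zero
    inv = [1.0 / v for v in d]
    total = sum(inv)
    mu = [v / total for v in inv]  # smaller distance gives larger membership
    entropy = -sum(m * math.log(m) for m in mu)
    avg_distance = sum(d) / len(d)
    return entropy * avg_distance

# Toy data: two classes with centroids (0, 1) and (4, 1).
class_rows = {0: [[0.0, 0.0], [0.0, 2.0]], 1: [[4.0, 0.0], [4.0, 2.0]]}
centroids = [centroid(class_rows[j]) for j in sorted(class_rows)]
inner = [0.0, 1.0]   # sits on the class-0 centroid
border = [2.0, 1.0]  # equidistant from both centroids
ranked = sorted([inner, border], key=lambda x: bound_f(x, centroids), reverse=True)
```

Under these assumptions the equidistant sample ranks first, which is the behaviour step 3 relies on when it keeps the samples with the largest BoundF(x) values.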
To compare the method provided by the present invention with other methods, two corpora are used here, Data_BioNLP09 and Data_BioNLP11. In the BioNLP'09 and BioNLP'11 shared tasks, trigger word identification is an intermediate step of the biomedical event extraction task, so the test sets provide no trigger word identification results. We therefore use the validation data set as the test data set.
Table 1. Types and numbers of trigger and non-trigger words in Data_BioNLP09
Table 1 gives detailed statistics of the training and validation data sets in Data_BioNLP09. For the trigger identification task there are 9 trigger word types (corresponding to 9 types of biomedical event), which fall into three groups: simple event trigger words, binding event trigger words and complex event trigger words. Together with the negative class of non-trigger words, the task is a 10-class classification problem. The data in the table show a significant class imbalance: only about 3.74% of the training data belong to one of the trigger word classes, and the rest belong to the non-trigger class.
Table 2. Types and numbers of trigger and non-trigger words in Data_BioNLP11
Table 2 gives detailed statistics of the training and validation data sets in Data_BioNLP11.
Table 3. Performance of the present invention compared with other existing systems
Biological trigger word identification system | P | R | F1 |
---|---|---|---|
PUS-SVM | 69.7 | 69.9 | 68.3 |
SystemCRF | 65.0 | 30.2 | 41.2 |
SystemSVM | 70.2 | 52.6 | 60.1 |
Turku | 70.5 | 60.6 | 65.2 |
TrigNER | 69.3 | 57.3 | 62.7 |
The performance of the present invention is compared with that of other existing identification systems on the same data set, Data_BioNLP09. Table 3 lists three metrics for the present invention and four other identification systems: P (precision), R (recall) and F1 (the F-score, which is the harmonic mean of precision and recall). The results in Table 3 show that the biological trigger word identification system PUS-SVM proposed in the present invention achieves the best overall performance and differs significantly from the other systems. Moreover, the improvement in system performance comes from improved recall, which means that more candidate trigger words are identified; this can further improve the performance of the next stage, event identification.
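As a quick check of how F1 relates to P and R, the short sketch below computes the harmonic mean for four rows of Table 3. The function name `f1` is an illustrative choice. The printed PUS-SVM row (69.7, 69.9, 68.3) does not reproduce under this formula, which may point to a transcription error in the source, so it is left out here.

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R) pairs copied from Table 3.
rows = {
    "SystemCRF": (65.0, 30.2),
    "SystemSVM": (70.2, 52.6),
    "Turku": (70.5, 60.6),
    "TrigNER": (69.3, 57.3),
}
scores = {name: round(f1(p, r), 1) for name, (p, r) in rows.items()}
```

The harmonic mean penalizes a large gap between P and R, which is why SystemCRF scores far below its precision.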
Table 4. Performance comparison of under-sampling methods in a biological trigger word identification system on the same data set Data_BioNLP09
Table 5. Performance comparison of under-sampling methods in a biological trigger word identification system on the same data set Data_BioNLP11
Tables 4 and 5 analyse the performance of different under-sampling methods in the SVM-based identification system: PUS-SVM, which uses the parallel under-sampling method proposed in the present invention; SVM without sampling; RUS-SVM, which uses random under-sampling; and kUS-SVM, which under-samples using the k nearest samples. The comparative experiments were run on Data_BioNLP09 and Data_BioNLP11 separately. On both data sets, the parallel under-sampling system proposed in the present invention obtains the best overall F1.
To further analyse the efficiency of the present invention on large data sets, the running time of the above methods was compared on a machine with 8 processing cores @ 2.67 GHz and RAM. The methods compared are PUS-SVM with the parallel under-sampling method proposed in the present invention, SVM without sampling, RUS-SVM with random under-sampling, and kUS-SVM with k-nearest-sample under-sampling. The comparative experiments were run on Data_BioNLP09 and Data_BioNLP11 separately, and the results are shown in Figs. 2 and 3. Figs. 2 and 3 show that the method proposed in the present invention obtains the best biological trigger word identification results within an acceptable time.
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments. A person skilled in the art who has learned the content described herein may make equivalent changes and substitutions without departing from the principle of the invention, and such equivalent changes and substitutions shall also be regarded as falling within the scope of protection of the present invention.
Claims (3)
1. A method for identifying biological event trigger words in large data sets, being a parallel under-sampling method, characterized by comprising the following steps:
Step 1, data partitioning: define the data set D = {(x_1, y_1), ..., (x_n, y_n)} as the training data set, where x_i ∈ R^m is a sample instance and y_i ∈ {0, 1, ..., l} is the class of the sample instance, giving 1 + l class labels in total; define D_α as the majority-class data set, which contains the n_0 sample instances belonging to class y = 0, and let α = n_0; randomly divide the majority-class data set D_α into K mutually disjoint majority-class sub data sets, and let α_k denote the number of sample instances in the k-th majority-class sub data set; define D_β as the minority-class data set, i.e. D_β = ∪ D_j, j = 1, 2, ..., l, where β denotes the total number of samples in all minority-class data sets, β = n_1 + ... + n_l with n_j the number of samples in D_j, so that α >> β;
Step 2, boundary factor calculation: define each data set S_k as containing the sample instances of the corresponding majority-class sub data set and of the minority-class data set D_β, that is, S_k is the union of the k-th majority-class sub data set and D_β;
after feature extraction, S_k is represented by m-dimensional features F = {f_t}, t = 1, 2, ..., m; the boundary factor of each sample is computed from the uncertainty of its membership in each class, and this uncertainty is determined mainly by the distance d(x, C_j) from each sample instance x in S_k to a given class C_j, which is defined as follows:
[Formula (1), the definition of d(x, C_j), is not reproduced in the source text.]
dis(x_ft, C_ft) computes, in the t-th feature dimension, the component of the distance from sample instance x to the given class C_j; because the biological trigger word identification data set is text, dis(x_ft, C_ft) is defined as the distance from the text vector to the centroid of class C_j, the centroid being the mean of the term frequency TF(f_t | C_j):
[Formula (2), the definition of the class centroid, is not reproduced in the source text.]
on the basis of d(x, C_j), the membership degree μ_j(x) of each sample instance x in class C_j is defined as follows:
[Formula (3), the definition of μ_j(x), is not reproduced in the source text.]
and
the boundary factor BoundF(x) of sample instance x is defined as follows:
[Formula (4), the definition of BoundF(x), is not reproduced in the source text.]
Step 3, sample under-sampling: sort the computed BoundF(x) values and extract the α'_k samples with the largest BoundF(x) values as boundary sample instances, which form a boundary set; the number of samples drawn is α'_k = p × β, where p is a tunable parameter of the PUS algorithm;
Step 4, boundary set merging: merge all boundary sets produced by the parallel under-sampling in steps 2 and 3 into a new majority-class data set D'_α, and combine it with all minority-class samples to obtain the new training data set D' = D'_α ∪ D_β;
Step 5, pruning: repeat under-sampling steps 2 and 3 on the training data set D' to obtain the final training data set D'', such that D'' contains the α'' majority-class samples with the largest BoundF(x) values and the number of majority-class samples balances the number of minority-class samples, i.e. α'' = β.
2. The method for identifying biological event trigger words in large data sets according to claim 1, characterized in that: in step 1, all majority-class sub data sets have the same size.
3. The method for identifying biological event trigger words in large data sets according to claim 1, characterized in that: in step 3, if the data set contains noisy data, the noisy data are deleted before the boundary sample instances are sampled.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710148320.5A CN106933805B (en) | 2017-03-14 | 2017-03-14 | Method for identifying biological event trigger words in big data set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106933805A true CN106933805A (en) | 2017-07-07 |
CN106933805B CN106933805B (en) | 2020-04-28 |
Family
ID=59432925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710148320.5A Expired - Fee Related CN106933805B (en) | 2017-03-14 | 2017-03-14 | Method for identifying biological event trigger words in big data set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106933805B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108897989A (en) * | 2018-06-06 | 2018-11-27 | 大连理工大学 | A kind of biological event abstracting method based on candidate events element attention mechanism |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103927874A (en) * | 2014-04-29 | 2014-07-16 | 东南大学 | Automatic incident detection method based on under-sampling and used for unbalanced data set |
CN104965819A (en) * | 2015-07-12 | 2015-10-07 | 大连理工大学 | Biomedical event trigger word identification method based on syntactic word vector |
CN105260361A (en) * | 2015-10-28 | 2016-01-20 | 南京邮电大学 | Trigger word tagging system and method for biomedical events |
CN105512209A (en) * | 2015-11-28 | 2016-04-20 | 大连理工大学 | Biomedicine event trigger word identification method based on characteristic automatic learning |
Also Published As
Publication number | Publication date |
---|---|
CN106933805B (en) | 2020-04-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200428; Termination date: 20210314 |