CN106933805A

CN106933805A - A method for recognizing biological event trigger words in large data sets

Info

Publication number: CN106933805A
Application number: CN201710148320.5A
Authority: CN
Inventors: 陈飞; 陈一飞; 刘峰; 韩冰青
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-03-14
Filing date: 2017-03-14
Publication date: 2017-07-07
Anticipated expiration: 2037-03-14
Also published as: CN106933805B

Abstract

The invention relates to the technical field of biological event trigger word recognition, and in particular to a method for recognizing biological event trigger words in large data sets. The method is a parallel under-sampling method (PUS) comprising data partitioning, boundary factor calculation, sample under-sampling, boundary set merging, and a final pruning step. It can process large training data sets with a significant distribution bias between classes, and does so by reducing, in parallel, the sample instances that belong to the majority class. The method selects data by computing a boundary factor, which measures how important the information carried by each sample instance is for classification. The recognition method provided by this technical solution addresses large data volume and between-class sample imbalance at the same time, and thereby achieves better recognition of biological event trigger words.

Description

A method for recognizing biological event trigger words in large data sets

Technical field

The invention relates to the technical field of biological event trigger word recognition, and in particular to a method for recognizing biological event trigger words in large data sets.

Background art

With advances in information technology and the growing reach of the internet, biomedical electronic literature, a product of scientific research, is growing exponentially. These online document resources contain valuable biomedical knowledge that systems biology research urgently needs. Faced with this continuing surge of biomedical text, text mining, as a technology for extracting the important knowledge hidden in documents, has been widely applied in the biomedical field.

Biological event extraction is the process of automatically detecting, in large volumes of medical research literature, descriptions of interactions between biomolecules such as genes and proteins, so as to extract structured information for predefined event types. If biological event trigger words can be identified accurately in this process, event extraction performance improves considerably. Trigger word recognition is the first step of biological event extraction; the trigger words it identifies are the basis for event argument recognition and the core of the whole event. Trigger word recognition must also determine the class of each trigger word, which is the class of the whole event. If a trigger word is recognized incorrectly, the subsequent work loses its meaning, so trigger word recognition is the key to biomedical event extraction. Methods based on the support vector machine (SVM) with rich feature representations are the most commonly used and best performing machine learning models for event trigger word recognition. In practical trigger recognition applications, however, the data raise two key problems. First, the data are imbalanced across classes. Second, the training data set is large. On large data sets many classification algorithms are severely limited and their performance degrades. For example, the training complexity of an SVM depends heavily on the size of the data set, and training on large data sets is time-consuming. A large data set with a highly imbalanced data distribution therefore poses a major challenge to event trigger word recognition.

For large data sets, under-sampling is the most efficient technique: it builds a balanced data set by removing some sample instances of the majority class, which also reduces computational complexity. Under-sampling therefore remains effective on big data, and many more efficient under-sampling methods have been proposed. Cluster-based under-sampling methods attempt to solve the imbalanced distribution problem by clustering the data set. In these methods the training data are divided into several clusters, representative sample instances are then selected from the majority-class clusters according to a ratio, and these are combined with the minority-class instances to form a balanced data set. Combining cluster-based under-sampling with ensemble learning can handle imbalanced data effectively. In addition, a newer method, inverse random under-sampling (IRUS), heavily under-samples the majority-class data set at random and constructs a composite decision boundary between the classes. Although these methods alleviate the imbalanced learning problem to some extent, they still spend a great deal of time iteratively clustering or searching for nearest-neighbor boundaries. On large data sets, therefore, these methods are not truly efficient.

Various methods have also been proposed to overcome the SVM training complexity bottleneck on large data sets. For example, sequential minimal optimization (SMO) decomposes the large QP problem into a series of smallest possible QP problems, which allows an SVM to handle large training sets. Another approach clusters the data set with minimum enclosing balls (MEB): the training data are partitioned by the MEB method and the cluster centers are used for SVM classification. However, these methods do not help with the classification of imbalanced data.

None of the existing methods handles well both problems that coexist in this classification task, namely large data volume and between-class sample imbalance, yet solving them together is an important step toward biological event trigger word recognition.

Summary of the invention

The technical problem to be solved by the invention is to provide a method for recognizing biological event trigger words in large data sets that addresses large data volume and between-class sample imbalance at the same time. It solves the imbalanced classification problem on large data sets and achieves better recognition of biological event trigger words.

To solve the above technical problem, the invention adopts the following technical solution: a method for recognizing biological event trigger words in large data sets, which is a parallel under-sampling method (Parallel Under-Sampling, PUS) comprising the following steps:

Step 1, data partitioning. Define the data set D = {(x₁, y₁), ..., (x_n, y_n)} as the training data set, where x_i ∈ R^m is a sample instance and y_i ∈ {0, 1, ..., l} is the class of that sample instance, giving 1 + l class labels. Define D_α as the majority-class data set, which contains the n₀ sample instances of class y = 0, and let α = n₀. Randomly partition D_α into K mutually disjoint majority-class subsets D_α^k, k = 1, 2, ..., K, and let α_k denote the number of sample instances in each subset D_α^k, so that α = Σ_{k=1}^{K} α_k. Define D_β as the minority-class data set, D_β = ∪ D_j, j = 1, 2, ..., l, where D_j contains the n_j sample instances of class y = j and β denotes the total number of samples in all minority-class data sets, β = Σ_{j=1}^{l} n_j; hence α ≫ β;

Step 2, boundary factor calculation. Define each data set S^k as containing the sample instances of the corresponding majority-class subset D_α^k and of the minority-class data set D_β, i.e. S^k = D_α^k ∪ D_β, k = 1, 2, ..., K;

After the feature extraction step, S^k is represented by m-dimensional features F = {f_t}, t = 1, 2, ..., m. The boundary factor of each sample is computed from the uncertainty of its membership in all classes. This uncertainty is determined mainly by the distance d(x, C_j) from each sample instance x in S^k to a given class C_j, which is defined as follows:

$$d(x, C_j) = \sqrt{\sum_{t=1}^{m} \mathrm{dis}^2(x_{f_t}, C_{f_t})} \tag{1}$$

dis(x_{f_t}, C_{f_t}) computes, in the t-th feature dimension, the component of the distance from sample instance x to the given class C_j. Because the biological trigger word recognition data set is text, dis(x_{f_t}, C_{f_t}) is defined as the distance from the text vector to the centroid of class C_j, where the centroid is the mean of the term frequency TF(f_t | C_j):

$$\mathrm{dis}(x_{f_t}, C_{f_t}) = \left| x_{f_t} - \frac{TF(f_t \mid C_j)}{n_j} \right| \tag{2}$$

On the basis of d(x, C_j), the membership degree μ_j(x) of each sample instance x in class C_j is defined as follows:

[Formula (3), the definition of μ_j(x), is not reproduced in this text.]

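The boundary factor BoundF(x) of sample instance x is defined as follows:

$$BoundF(x) = \left(-\sum_{j=0}^{l} \mu_j(x) \log \mu_j(x)\right) \times \left(\sum_{j=0}^{l} \mu_j(x)\, d(x, C_j)\right) \tag{4}$$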

Step 3, sample under-sampling. Sort the computed BoundF(x) values, and extract the α'_k samples with the largest BoundF(x) values as boundary sample instances to form a boundary set. The sampling number is α'_k = p × β, where p is a tunable parameter of the PUS algorithm;

Step 4, boundary set merging. Merge all the boundary sets produced by the parallel under-sampling of steps 2 and 3 into a new majority-class data set D'_α, and combine it with all the minority classes to obtain a new training data set D' = D'_α ∪ D_β;

Step 5, pruning. Repeat under-sampling steps 2 and 3 on the training data set D' to obtain the final training data set D'', such that D'' contains the α'' majority-class samples with the largest BoundF(x) values and the number of majority-class samples balances the number of minority-class samples, i.e. α'' = β.
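The five steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function `pus`, its helper `top`, and the scoring callback `score_fn` are hypothetical names, the subsets are processed in a plain loop where the method would use parallel workers, and `toy_score` is a stand-in for BoundF(x) that simply favors majority samples lying close to the minority data.

```python
import random

def pus(majority, minority, score_fn, K, p, seed=0):
    """Sketch of steps 1-5. score_fn(candidates, minority) returns one
    score per candidate; it stands in for the boundary factor BoundF."""
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    # Step 1: K mutually disjoint majority-class subsets.
    subsets = [shuffled[k::K] for k in range(K)]
    beta = len(minority)

    def top(candidates, n):
        # Step 2: score every candidate; step 3: keep the n largest scores.
        scores = score_fn(candidates, minority)
        order = sorted(range(len(candidates)),
                       key=scores.__getitem__, reverse=True)
        return [candidates[i] for i in order[:n]]

    # Step 4: merge the K boundary sets (alpha'_k = p * beta each;
    # truncating p * beta to an integer is an assumption of this sketch).
    merged = [x for s in subsets for x in top(s, int(p * beta))]
    # Step 5: prune the merged set down to beta samples.
    return top(merged, beta)

def toy_score(candidates, minority):
    # Hypothetical 1-D score: closer to the minority mean = more "boundary".
    center = sum(minority) / len(minority)
    return [-abs(c - center) for c in candidates]

majority = list(range(100))
minority = [100.0, 101.0, 102.0, 103.0]
balanced = pus(majority, minority, toy_score, K=4, p=2)  # [99, 98, 97, 96]
```

In this toy setting the result does not depend on the random partition, because each of the four majority samples nearest the minority data survives the per-subset selection wherever it lands.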

The recognition method provided by the above technical solution targets the problems of large data sets and imbalanced between-class sample distribution in the biological event recognition task. It proposes a parallel under-sampling method (PUS) and, in combination with an SVM classifier, builds the PUS-SVM trigger word recognition system, which effectively improves both the performance and the efficiency of trigger word recognition. PUS uses a sampling method based on class boundaries to reduce the imbalance of the data, and the under-sampling can be carried out with parallel distributed computing, which effectively reduces the computational complexity on large data sets.
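Because each subset is sampled independently, the per-subset work maps naturally onto a worker pool. The sketch below shows that shape with the standard library; `sample_subset` and `parallel_sample` are hypothetical names, the scores are assumed to be precomputed, and a thread pool is used only to keep the example self-contained (for CPU-bound scoring a process pool or a cluster would be the more realistic choice).

```python
from concurrent.futures import ThreadPoolExecutor

def sample_subset(job):
    """Keep the n highest-scoring samples of one subset."""
    subset, scores, n = job
    order = sorted(range(len(subset)), key=lambda i: scores[i], reverse=True)
    return [subset[i] for i in order[:n]]

def parallel_sample(subsets, score_lists, n, workers=4):
    """Run the per-subset selection concurrently, then merge the results."""
    jobs = [(s, sc, n) for s, sc in zip(subsets, score_lists)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the submitted jobs.
        boundary_sets = list(pool.map(sample_subset, jobs))
    return [x for b in boundary_sets for x in b]

merged = parallel_sample([[1, 2, 3], [4, 5, 6]],
                         [[0.1, 0.3, 0.2], [0.9, 0.8, 0.7]], n=2)
```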

Brief description of the drawings

Fig. 1 is the workflow diagram of the method of the invention for recognizing biological event trigger words in large data sets;

Fig. 2 compares the time consumption of the under-sampling methods in the biological trigger word recognition system on the same data set, Data_BioNLP09;

Fig. 3 compares the time consumption of the under-sampling methods in the biological trigger word recognition system on the same data set, Data_BioNLP11.

Detailed description of the embodiments

To make the objects and advantages of the invention clearer, the invention is described in detail below with reference to embodiments. The following text only describes one or more specific embodiments of the invention and does not strictly limit the scope of protection claimed.

The technical solution adopted by the invention is shown in Fig. 1. The proposed method can process large training data sets with a significant distribution bias between classes. It is an under-sampling method that adjusts the data distribution between classes by reducing the sample instances of the majority class. The method selects data by computing a boundary factor (Boundary Factor, BoundF), which measures how important the information carried by each sample instance is for classification. After under-sampling, sample instances at the center of a class are discarded, while sample instances on the class boundary are retained. In the subsequent biological event trigger word recognition an SVM serves as the classifier, and its computation of the separating hyperplane depends on these boundary sample instances, i.e. the support vectors. PUS therefore retains the sample instances that are most likely to help SVM classification and that carry the most class information.

The method of the invention for recognizing biological event trigger words in large data sets is a parallel under-sampling method (PUS) and comprises the following steps:

Step 1, data partitioning. Define the data set D = {(x₁, y₁), ..., (x_n, y_n)} as the training data set, where x_i ∈ R^m is a sample instance and y_i ∈ {0, 1, ..., l} is the class of that sample instance, giving 1 + l class labels. Define D_α as the majority-class data set, which contains the n₀ sample instances of class y = 0, and let α = n₀. Randomly partition D_α into K mutually disjoint majority-class subsets D_α^k, k = 1, 2, ..., K, and let α_k denote the number of sample instances in each subset D_α^k, so that α = Σ_{k=1}^{K} α_k. D_j is one of the minority-class data sets; it contains the n_j sample instances of class y = j, j = 1, ..., l (i.e. one class of event trigger word). The data set D has a significant distribution bias, so n₀ ≫ n_j. Let β denote the total number of samples in all minority-class data sets; then β = Σ_{j=1}^{l} n_j and D_β = ∪ D_j, j = 1, 2, ..., l, and hence α ≫ β. In step 1, all subsets D_α^k have the same size.
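A minimal sketch of the random partition in step 1, assuming the majority-class samples fit in a list. The name `split_majority` and the `seed` parameter are illustrative. Round-robin assignment after a shuffle gives disjoint subsets whose sizes differ by at most one, and exactly equal sizes when K divides α.

```python
import random

def split_majority(majority, K, seed=0):
    """Randomly split the majority-class samples into K disjoint subsets."""
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    # Every K-th element goes to the same subset, so no sample is reused.
    return [shuffled[k::K] for k in range(K)]

subsets = split_majority(range(10), K=3)
sizes = [len(s) for s in subsets]  # the alpha_k values; they sum to alpha
```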

dis(x_{f_t}, C_{f_t}) computes, in the t-th feature dimension, the component of the distance from sample instance x to the given class C_j. To reduce the amount of computation, d(x, C_j) is based on the centroid of class C_j rather than on all samples of C_j. Because the biological trigger word recognition data set is text, dis(x_{f_t}, C_{f_t}) is defined as the distance from the text vector to the centroid of class C_j, where the centroid is the mean of the term frequency TF(f_t | C_j):

$$\mathrm{dis}(x_{f_t}, C_{f_t}) = \left| x_{f_t} - \frac{TF(f_t \mid C_j)}{n_j} \right| \tag{2}$$
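The centroid-based distance of formulas (1) and (2) can be sketched as below. The sketch assumes that TF(f_t | C_j) is the term frequency of feature f_t summed over the n_j samples of class C_j, so that dividing by n_j gives the per-feature class mean; the function names are illustrative.

```python
import math

def centroid(tf_sums, n_j):
    """Class centroid: summed term frequency per feature, averaged over n_j."""
    return [tf / n_j for tf in tf_sums]

def distance(x, tf_sums, n_j):
    """d(x, C_j): Euclidean norm of the per-feature components |x_t - c_t|."""
    c = centroid(tf_sums, n_j)
    return math.sqrt(sum(abs(xt - ct) ** 2 for xt, ct in zip(x, c)))

# Two features, class with n_j = 2 samples and summed frequencies [4, 2],
# so the centroid is [2, 1] and the sample [1, 0] lies sqrt(2) away.
d = distance([1.0, 0.0], [4.0, 2.0], 2)
```

Working from the centroid keeps the cost per sample proportional to the number of classes rather than to the number of training samples, which is the reduction in computation that the text describes.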

In formula (2), the term frequency TF(f_t | C_j) of class C_j is averaged over the number of samples n_j in C_j and serves as the centroid of class C_j. On the basis of d(x, C_j), the membership degree μ_j(x) of each sample instance x in class C_j is defined as follows:

[Formula (3), the definition of μ_j(x), is not reproduced in this text.]

Formula (3) shows that the smaller the distance from x to the centroid, the larger the membership degree of x in C_j; conversely, the larger the distance from x to the centroid, the smaller the membership degree of x in C_j. The boundary factor BoundF(x) of sample instance x is defined as follows:

$$BoundF(x) = \left(-\sum_{j=0}^{l} \mu_j(x) \log \mu_j(x)\right) \times \left(\sum_{j=0}^{l} \mu_j(x)\, d(x, C_j)\right) \tag{4}$$

The boundary factor BoundF(x) is the product of two parts. The first part is the entropy that expresses the uncertainty of the sample instance's membership in each class: the closer a sample instance is to a class boundary, the larger the entropy of its membership degrees. The second part is the average distance from the sample instance to all classes. The deeper a sample instance lies inside a class, the smaller its average distance; conversely, the closer it is to a class boundary, the larger its average distance. From these two parts it follows that sample instances on a class boundary have larger BoundF(x) values and, compared with samples inside a class, carry more class information.
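The behavior described above can be checked with a small sketch of formula (4). Formula (3) is not available in this text, so the `memberships` function below uses normalized inverse distance purely as an assumption; it shares only the stated property that a smaller distance to a centroid gives a larger membership. The names `memberships` and `bound_f` are illustrative.

```python
import math

def memberships(dists, eps=1e-12):
    """ASSUMED stand-in for formula (3): normalized inverse distance."""
    inv = [1.0 / (d + eps) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def bound_f(dists):
    """Formula (4): membership entropy times membership-weighted distance."""
    mu = memberships(dists)
    entropy = -sum(m * math.log(m) for m in mu if m > 0.0)
    weighted_distance = sum(m * d for m, d in zip(mu, dists))
    return entropy * weighted_distance

interior = bound_f([0.1, 5.0])  # very close to one centroid, far from the other
border = bound_f([2.5, 2.6])    # roughly equidistant from both centroids
```

With these inputs `border` is much larger than `interior`, which matches the claim that boundary samples receive larger boundary factors; the exact values depend on the assumed membership function.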

Step 3, sample under-sampling. Sort the computed BoundF(x) values, and extract the α'_k samples with the largest BoundF(x) values as boundary sample instances to form a boundary set. The sampling number is α'_k = p × β, where p is a tunable parameter of the PUS algorithm. In step 3, if the data set contains noisy data, the noisy data are deleted before the boundary sample instances are sampled.
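Step 3 amounts to a top-k selection by boundary factor. In the sketch below the scores are assumed to be given, `select_boundary` is an illustrative name, and truncating p × β to an integer (and capping it at the subset size) is an assumption, since the text does not say how a fractional value is handled.

```python
def select_boundary(samples, scores, p, beta):
    """Keep the alpha'_k = p * beta samples with the largest scores."""
    n_keep = min(len(samples), int(p * beta))
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in order[:n_keep]]

boundary_set = select_boundary(["a", "b", "c", "d"],
                               [0.1, 0.9, 0.5, 0.7], p=1.0, beta=2)
```

A larger p keeps more majority samples per subset before the final pruning step, trading extra computation for a lower risk of discarding useful boundary samples early.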

To compare the recognition method provided by the invention with other methods, two corpora are used here, Data_BioNLP09 and Data_BioNLP11. In the BioNLP'09 and BioNLP'11 shared tasks, trigger word recognition is an intermediate step of the biomedical event extraction task, so the test sets contain no trigger word recognition results. We therefore use the validation data set as the test data set.

Table 1: Types and counts of trigger and non-trigger words in Data_BioNLP09

Table 1 gives detailed statistics for the training and validation data sets in Data_BioNLP09. In this trigger recognition task there are 9 trigger word types (corresponding to 9 kinds of biomedical event), which fall into three groups: simple event trigger words, binding event trigger words, and complex event trigger words. Adding the negative class of non-trigger words, the task is a 10-class classification problem. The data in the table show a marked class imbalance: only about 3.74% of the training data belong to one of the trigger word classes, and the rest belong to the non-trigger class.

Table 2: Types and counts of trigger and non-trigger words in Data_BioNLP11

Table 2 gives detailed statistics for the training and validation data sets in Data_BioNLP11.

Table 3: Performance of the invention compared with other existing systems

Biological trigger word recognition system	P	R	F1
PUS-SVM	69.7	69.9	68.3
SystemCRF	65.0	30.2	41.2
SystemSVM	70.2	52.6	60.1
Turku	70.5	60.6	65.2
TrigNER	69.3	57.3	62.7

The performance of the invention is compared with that of other existing recognition systems on the same data set, Data_BioNLP09. Table 3 lists three metrics for the invention and four other recognition systems: P (precision), R (recall), and F1 (the F value, i.e. the harmonic mean of precision and recall). The results in Table 3 show that the biological trigger word recognition system PUS-SVM proposed in the invention achieves the best overall performance and differs significantly from the other systems. Moreover, the improvement in system performance comes from an improvement in recall, which means that more candidate trigger words are identified; this can further improve the performance of the next stage, event recognition.
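For reference, F1 is conventionally computed as the harmonic mean of precision and recall. The helper below is a minimal sketch with an illustrative name; it is applied to two rows of Table 3. The reported figures may have been aggregated differently (for example per class), so not every row need reproduce exactly under this formula.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

turku_f1 = f1_score(70.5, 60.6)       # Turku row of Table 3
system_svm_f1 = f1_score(70.2, 52.6)  # SystemSVM row of Table 3
```

Rounded to one decimal place these give 65.2 and 60.1, which agree with the F1 column for those two rows.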

Table 4: Performance comparison of the under-sampling methods in the biological trigger word recognition system on the same data set, Data_BioNLP09

Table 5: Performance comparison of the under-sampling methods in the biological trigger word recognition system on the same data set, Data_BioNLP11

Tables 4 and 5 analyze the performance of different under-sampling methods within the SVM-based recognition system: PUS-SVM, which uses the parallel under-sampling method proposed in the invention; SVM without sampling; RUS-SVM, which uses random under-sampling; and kUS-SVM, which uses k-nearest-neighbor under-sampling. The comparative tests were run separately on Data_BioNLP09 and Data_BioNLP11. On both data sets the parallel under-sampling system proposed in the invention obtains the best overall F1 performance.
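The random under-sampling baseline (RUS) named above can be sketched in a few lines for comparison. Reducing the majority class to exactly the minority-class size is an assumption of this sketch; the text says only that the baseline samples at random. The function name is illustrative.

```python
import random

def random_under_sample(majority, minority, seed=0):
    """Keep a uniform random subset of the majority class, sized to match
    the minority class, and append all minority samples unchanged."""
    rng = random.Random(seed)
    n_keep = min(len(majority), len(minority))
    kept = rng.sample(list(majority), k=n_keep)
    return kept + list(minority)

balanced = random_under_sample(list(range(50)), ["a", "b", "c"])
```

Unlike the boundary-factor selection, this baseline ignores where a sample lies relative to the class boundary, so it may discard the very samples an SVM would use as support vectors.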

To further analyze the efficiency of the invention on large data sets, the running times of the above methods were compared on a machine with 8 processing cores @ 2.67 GHz and RAM. The methods are PUS-SVM, which uses the parallel under-sampling method proposed in the invention; SVM without sampling; RUS-SVM, which uses random under-sampling; and kUS-SVM, which uses k-nearest-neighbor under-sampling. The comparative tests were run separately on Data_BioNLP09 and Data_BioNLP11, and the results are shown in Fig. 2 and Fig. 3. Figs. 2 and 3 show that the proposed method obtains the best biological trigger word recognition results within an acceptable time.

Embodiments of the invention have been explained in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments. A person of ordinary skill in the art who has learned the content described herein may make equivalent changes and substitutions without departing from the principle of the invention, and such equivalent changes and substitutions shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A method for recognizing biological event trigger words in large data sets, being a parallel under-sampling method, characterized by comprising the following steps:

Step 1, data partitioning: define the data set D = {(x₁, y₁), ..., (x_n, y_n)} as the training data set, where x_i ∈ R^m is a sample instance and y_i ∈ {0, 1, ..., l} is the class of the sample instance, giving 1 + l class labels; define D_α as the majority-class data set, which contains the n₀ sample instances of class y = 0, and let α = n₀; randomly partition the majority-class data set D_α into K mutually disjoint majority-class subsets D_α^k, k = 1, 2, ..., K, and let α_k denote the number of sample instances in each majority-class subset D_α^k; define D_β as the minority-class data set, D_β = ∪ D_j, j = 1, 2, ..., l, where D_j contains the n_j sample instances of class y = j and β denotes the total number of samples in all minority-class data sets, β = Σ_{j=1}^{l} n_j; hence α ≫ β;

Step 2, boundary factor calculation: define each data set S^k as containing the sample instances of the corresponding majority-class subset D_α^k and of the minority-class data set D_β, i.e. S^k = D_α^k ∪ D_β, k = 1, 2, ..., K;

after the feature extraction step, S^k is represented by m-dimensional features F = {f_t}, t = 1, 2, ..., m; the boundary factor of each sample is computed from the uncertainty of its membership in all classes; this uncertainty is determined mainly by the distance d(x, C_j) from each sample instance x in S^k to a given class C_j, which is defined as follows:

$$d(x, C_j) = \sqrt{\sum_{t=1}^{m} \mathrm{dis}^2(x_{f_t}, C_{f_t})} \tag{1}$$

dis(x_{f_t}, C_{f_t}) computes, in the t-th feature dimension, the component of the distance from sample instance x to the given class C_j; because the biological trigger word recognition data set is text, dis(x_{f_t}, C_{f_t}) is defined as the distance from the text vector to the centroid of class C_j, the centroid being the mean of the term frequency TF(f_t | C_j):

$$\mathrm{dis}(x_{f_t}, C_{f_t}) = \left| x_{f_t} - \frac{TF(f_t \mid C_j)}{n_j} \right| \tag{2}$$

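and, on the basis of d(x, C_j), the membership degree μ_j(x) of each sample instance x in class C_j is defined by formula (3) [formula (3) is not reproduced in this text];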

the boundary factor BoundF(x) of sample instance x is defined as follows:

$$BoundF(x) = \left(-\sum_{j=0}^{l} \mu_j(x) \log \mu_j(x)\right) \times \left(\sum_{j=0}^{l} \mu_j(x)\, d(x, C_j)\right) \tag{4}$$

Step 3, sample under-sampling: sort the computed BoundF(x) values, and extract the α'_k samples with the largest BoundF(x) values as boundary sample instances to form a boundary set; the sampling number is α'_k = p × β, where p is a tunable parameter of the PUS algorithm;

Step 4, boundary set merging: merge all the boundary sets produced by the parallel under-sampling of steps 2 and 3 into a new majority-class data set D'_α, and combine it with all the minority classes to obtain a new training data set D' = D'_α ∪ D_β;

Step 5, pruning: repeat under-sampling steps 2 and 3 on the training data set D' to obtain the final training data set D'', such that D'' contains the α'' majority-class samples with the largest BoundF(x) values and the number of majority-class samples balances the number of minority-class samples, i.e. α'' = β.

2. The method for recognizing biological event trigger words in large data sets according to claim 1, characterized in that: in step 1, all majority-class subsets D_α^k have the same size.

3. The method for recognizing biological event trigger words in large data sets according to claim 1, characterized in that: in step 3, if the data set contains noisy data, the noisy data are deleted before the boundary sample instances are sampled.