CN107292348A - Bagging_BSJ short text classification method - Google Patents

Bagging_BSJ short text classification method

Info

Publication number
CN107292348A
CN107292348A (application CN201710554325.8A)
Authority
CN
China
Prior art keywords
short text
term
feature
bagging
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710554325.8A
Other languages
Chinese (zh)
Inventor
赵德新
张德干
常智
杜娜娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology
Priority to CN201710554325.8A
Publication of CN107292348A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

A Bagging_BSJ short text classification method. Short texts are highly sparse, produced in real time, and often non-standard in form; when traditional text classification algorithms are applied to short text classification they are strongly affected by singular data and rarely achieve good results. Targeting the high sparsity and real-time nature of short texts, the method of the invention proposes an ensemble-based short text classification method. The method applies the Bagging ensemble idea: it performs semantic feature expansion on the short texts and then combines the Bayesian algorithm, the support vector machine algorithm and the J48 algorithm to classify the semantically expanded short texts, obtaining a better classification result. The Bagging_BSJ method proposed here improves accuracy by 12%, recall by 28%, and the F value by 20%.

Description

Bagging_BSJ short text classification method
Technical field
The invention belongs to the technical field at the intersection of computer applications and natural language processing.
Background technology
Short text classification refers to techniques for classifying texts of about 160 words or less that have sparse characteristics. In general, short text messages arrive in real time, use concise language, and contain a great deal of noise. For such extremely sparse texts, traditional text classification techniques, which judge the similarity between documents from the frequency of terms within a document and the number of terms two documents share, achieve low accuracy. Improving the accuracy and recall of classification algorithms in the face of the real-time, concise and noisy character peculiar to short texts therefore has important applications.
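As a minimal illustration of the sparsity problem just described (a sketch added for clarity, assuming simple whitespace tokenisation; it is not part of the original filing): two short texts with the same meaning can share no terms at all, in which case any similarity measure built purely on term frequencies and shared-term counts scores them as unrelated.

```python
from collections import Counter
import math

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over raw term-frequency vectors."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[w] * tb[w] for w in set(ta) & set(tb))
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Same topic, zero shared terms: the similarity collapses to 0.
print(cosine_sim("movie tonight?", "film this evening?"))  # 0.0
```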
At present, the classification algorithms commonly used for short texts fall into two broad classes: one improves part of the classification process through rule-based refinements; the other expands the content of the short text with external semantic information and thereby improves classification performance.
Rule-based improvement methods mainly process the short text data set itself, proposing refinements at several stages such as feature extraction, text representation and classifier construction. However, because the data are sparse, classifiers built on local features routinely run into the semantic-gap problem when representing short texts and cannot effectively distinguish the semantic information of different short texts.
Classification algorithms that expand short texts with semantic information mainly rely on textual context or an external semantic knowledge base, enriching the content of the short text according to certain rules. Such algorithms alleviate the impact of data sparsity to some extent, but as the amount of training data grows the benefit brought by the auxiliary data gradually weakens and classification performance declines. Targeting the sparsity of short text features, the present invention uses Wikipedia as the external semantic knowledge base to expand short text features.
Wikipedia contains a large and constantly growing set of concepts, which provides a very effective platform for expanding the content of short texts. Semantic similarity computation here is a quantitative model of semantic relatedness based on Wikipedia article text and link-structure information: the model computes the semantic similarity between candidate expansion features and topic features and selects the features with the highest similarity as expansion features. This process is referred to as semantic expansion.
The main process of Wikipedia-based short text feature expansion is as follows: (1) preprocess the given short text data to obtain the corresponding term vector; (2) map each feature term in the vector (called a topic feature term) to its corresponding Wikipedia topic page, take the text of the abstract section of that page, and apply word segmentation and denoising preprocessing to obtain the Wikipedia expansion feature vector of each topic feature term; (3) quantify the semantic relations with the WLA (Wikipedia Links and Abstract) algorithm, i.e. quantitatively describe the degree of semantic association between a given term and its candidate expansion terms; because the candidate expansion terms in the expansion vocabulary are related to the topic term to different degrees, their ability to supplement the semantic information of the topic feature also differs, so the degree of semantic association between the given feature term and the candidate expansion terms obtained in step (2) is described quantitatively; (4) merge and aggregate the expansion terms of all topic features of the short text; the resulting vector is the feature vector of the short text expanded with Wikipedia text information.
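The pipeline can be sketched as follows, under stated assumptions: fetch_abstract and wla_sim are hypothetical stand-ins for the Wikipedia page lookup (with disambiguation and redirect handling) and the WLA relatedness score defined below, and the tokenizer is a toy replacement for the word segmentation and stop-word preprocessing.

```python
import re

STOP_WORDS = {"the", "a", "of", "in", "and", "is", "to"}  # toy stop list

def preprocess(text):
    """Word segmentation plus denoising: lowercase, tokenise, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def expand_short_text(text, fetch_abstract, wla_sim, top_k=5):
    """Steps (1)-(4): preprocess, map each topic term to its Wikipedia
    abstract, score candidates with WLA, merge into one expanded vector."""
    expanded = {}
    for t in preprocess(text):                                # step (1)
        candidates = set(preprocess(fetch_abstract(t)))       # step (2)
        scored = {c: wla_sim(t, c) for c in candidates}       # step (3)
        best = sorted(scored.items(), key=lambda kv: -kv[1])[:top_k]
        for c, r in best:                                     # step (4): merge
            expanded[c] = max(r, expanded.get(c, 0.0))
    return sorted(expanded.items(), key=lambda kv: -kv[1])
```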
For processing short text data sets, the classical text classification models are naive Bayes (NB), the support vector machine (SVM) and the decision tree (J48) algorithm. The naive Bayes model originates in classical mathematical theory; it has a solid mathematical foundation and stable classification performance. At the same time, NB requires very few estimated parameters, is relatively insensitive to missing data, and the algorithm itself is simple. NB assumes that attributes are mutually independent; this assumption often does not hold in practice, which affects the correctness of NB classification. When the number of attributes is large or the correlation between attributes is strong, the classification performance of NB falls below that of decision-tree models, whereas when attribute correlation is weak NB performs best. The support vector machine is a supervised learning model commonly used for pattern recognition, classification and regression analysis. The decision tree (J48) algorithm is a method for approximating discrete-valued functions and a typical classification technique: it first processes the data, generates readable rules and a decision tree with an inductive algorithm, and then uses the tree to analyse new data; in essence a decision tree classifies data through a series of rules. The algorithms proposed in the studies above still have many shortcomings and handle short texts poorly; for example, the feature vectors obtained after Wikipedia expansion of short texts may cause a dimension-disaster problem during classification. A single classifier cannot obtain a good classification result: for instance, the term-independence assumption of the naive Bayes algorithm is weak, and the J48 classification algorithm is strongly affected by singular data. We use an ensemble learning algorithm to address these problems.
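For illustration, a minimal sketch of the three classical base models on bag-of-words features, assuming scikit-learn as a stand-in for Weka (DecisionTreeClassifier replaces J48, which implements the C4.5 algorithm, and LinearSVC replaces the SVM):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

texts = ["great movie tonight", "stock prices fell",
         "the film was superb", "the market crashed today"]
labels = ["entertainment", "finance", "entertainment", "finance"]

X = CountVectorizer().fit_transform(texts)   # sparse term-frequency vectors
for clf in (MultinomialNB(), LinearSVC(), DecisionTreeClassifier()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))
```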
The basic idea of ensemble learning is: when classifying a new instance, combine several independently trained classifiers and merge the results of the individual classifiers with certain weights to obtain the final ensemble classification result. The literature shows that the performance of an ensemble classifier is better than the classification performance of a single classifier.
At present, ensemble learning classification algorithms divide broadly into two classes. One class, represented by the Bagging algorithm, generates component classifiers in parallel and requires the dependence between them to be relatively weak; the other, represented by the Boosting algorithm, generates them serially, with strong dependence between individuals. In practice the Boosting algorithm suffers from overfitting, which can make its classification performance weaker than that of a single classifier. The present invention adopts the idea of the Bagging algorithm: given a training set and a group of weak classifiers, draw M samples from the training set with replacement to form a training subset, and repeat the draw N times to obtain N training subsets. Training N classifiers on these N subsets yields N prediction functions; the sample set is then predicted and the final result obtained by a majority voting mechanism.
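One consequence of drawing M samples with replacement is that each training subset contains, on average, only about 1 - 1/e, roughly 63.2%, distinct samples, which is what keeps the component classifiers weakly dependent. A two-line check:

```python
import numpy as np

M = 1000
rng = np.random.default_rng(0)
subset = rng.integers(0, M, size=M)   # one bootstrap draw of size M
print(len(np.unique(subset)) / M)     # ~0.632 distinct samples on average
```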
Recall (Recall Rate, also called the recall ratio) is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the document library; it measures the completeness of a retrieval system. Precision (accuracy) is the ratio of the number of relevant documents retrieved to the total number of documents retrieved; it measures the exactness of a retrieval system. In general terms: precision = (relevant items retrieved / total items retrieved) * 100%. Recall and precision are two metrics widely used in information retrieval and statistical classification to evaluate the quality of results. The F value combines the precision and recall indices into a single overall measure; for assessing classification quality it is more convincing than precision or recall used alone. We use the F1 value (i.e. F_beta with beta = 1) as the comprehensive evaluation index.
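Written out explicitly (standard definitions consistent with the description above):

```latex
P = \frac{\text{relevant retrieved}}{\text{total retrieved}}, \qquad
R = \frac{\text{relevant retrieved}}{\text{total relevant}}, \qquad
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}, \qquad
F_1 = \frac{2PR}{P + R}
```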
The content of the invention
The present invention aims to solve the problem of low short text classification accuracy by providing a Bagging_BSJ short text classification method that improves the accuracy, recall and F value of short text classification.
Targeting the high sparsity, real-time nature and non-standard form of short texts, the present invention uses Wikipedia as a knowledge base and proposes performing WLA semantic expansion on short texts; on the basis of the Bayesian algorithm, the support vector machine algorithm and the J48 algorithm, combined with the Bagging ensemble idea, it proposes the Bagging_BSJ integrated short text classification algorithm. Applied to short text classification, the method performs semantic feature expansion on the short texts and classifies the semantically expanded texts with the Bagging_BSJ algorithm, obtaining better classification accuracy, recall and F values than conventional methods.
Technical scheme
A Bagging_BSJ short text classification method, mainly comprising the following key steps:
1. WLA short text semantic feature expansion based on Wikipedia features;
1.1. Feature extraction. For a given feature term, map the feature term through disambiguation and redirection techniques to its corresponding Wikipedia page, extract the page text, and apply word segmentation, stop-word removal and other denoising to that text, obtaining a feature vector composed of terms; the elements of this vector are the candidate expansion terms of the feature term;
1.2. Semantic relation quantification. Semantic relations are computed with our proposed WLA (Wikipedia Links and Abstract) algorithm, which quantitatively describes the degree of semantic association between the given feature term and the candidate expansion terms obtained in step 1.1;
1.3. Feature expansion set construction. After related-feature extraction and term semantic-relation quantification, build for each given feature term the corresponding feature expansion term vector C_t = {(c_1, r_1), (c_2, r_2), ..., (c_k, r_k)}, where c_i (i = 1, 2, ..., k) is a candidate expansion term related to the topic feature term t and r_i (i = 1, 2, ..., k) is the semantic similarity between c_i and t; these term vectors serve as samples in the subsequent short text classification.
2. The Bagging_BSJ short text classification algorithm based on ensemble learning;
2.1. The training set S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_n)} contains m articles in n classes, where x_i is a training sample and y_j is the class label corresponding to x_i;
2.2. Using sampling with replacement, draw Z_1, Z_2 and Z_3 training sample subsets from the training set S, each subset containing g samples;
2.3. Train Bagging classifiers with naive Bayes as the base classifier on the first Z_1 subsets, and denote the trained models H_1^NB, ..., H_{Z_1}^NB. Similarly, train the middle Z_2 subsets and the last Z_3 subsets with the support vector machine and J48 as base classifiers respectively, and denote the resulting classification models H_1^SVM, ..., H_{Z_2}^SVM and H_1^J48, ..., H_{Z_3}^J48. Training in this way yields Z_1 + Z_2 + Z_3 classifiers;
2.4. The classification process applies the classification models H_i, i = 1, 2, ..., Z_1 + Z_2 + Z_3, obtained from the training in step 2.3 to the samples to be classified (i.e. the new sample data) and integrates the classification results with a voting algorithm, thereby deciding the class of a new sample x, i.e.:

H(x) = arg max_y Σ_{i=1}^{Z_1+Z_2+Z_3} I(H_i(x) = y)

where I(·) is the indicator function.
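A compact sketch of steps 2.1-2.4 under stated assumptions: scikit-learn stands in for Weka, DecisionTreeClassifier for J48, LinearSVC for the support vector machine, and the small Z_1 = Z_2 = Z_3 = 3 is only to keep a toy run fast (the experiments below use 15):

```python
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def train_bagging_bsj(X, y, z=(3, 3, 3), g=None, seed=0):
    """Steps 2.1-2.3: draw z[0]+z[1]+z[2] bootstrap subsets of size g
    (with replacement) and train each group on its own base classifier."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    g = g or len(y)
    bases = (MultinomialNB(), LinearSVC(), DecisionTreeClassifier())
    models = []
    for base, z_i in zip(bases, z):
        for _ in range(z_i):
            idx = rng.integers(0, len(y), size=g)  # sampling with replacement
            models.append(clone(base).fit(X[idx], y[idx]))
    return models

def predict_bagging_bsj(models, X):
    """Step 2.4: plurality vote over all Z1+Z2+Z3 component classifiers."""
    votes = np.array([m.predict(X) for m in models])
    return [Counter(votes[:, j]).most_common(1)[0][0]
            for j in range(votes.shape[1])]
```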
The formulas of the WLA (semantic relatedness) algorithm used in the semantic-relation quantification of step 1.2 are derived as follows:
First, the semantic relatedness of two entries is computed from the abstract sections of the Wikipedia topic pages corresponding to the two terms; denote this abstract-based similarity Sim_a(a, b). Here a and b are the two candidate topics, N_1 and N_2 are the word counts of the word groups T_1 and T_2 taken from the two abstracts, q is the number of words the two groups share, and MAX(N_1, N_2)/MIN(N_1, N_2) acts as a moderating parameter. The weight w_i of the i-th shared word in word group T_1 is computed from tf_i, the frequency with which the i-th word occurs in the document, normalised by V, the sum of the frequencies of the words shared by T_1 and T_2.
Secondly, semantic relatedness is computed using the in-link and out-link information of the Wikipedia topic pages pointed to by the terms. The in-link measure proposed by David Milne is computed as

Sim_in(a, b) = 1 - (log(max(|A|, |B|)) - log(|A ∩ B|)) / (log(W) - log(min(|A|, |B|)))

where a and b are the two candidate topics, A and B are the sets of in-links of the corresponding topic pages, and W is the number of Wikipedia topic pages. Because a Wikipedia topic page also has an out-link structure, the out-links are taken into account as well: Sim_out(a, b), the semantic relatedness computed from the out-links of the topic pages, is computed in the same way as Sim_in(a, b). The final link-structure semantic relatedness is

Sim_l(a, b) = β Sim_out(a, b) + (1 - β) Sim_in(a, b)

Combining the abstract-based and link-based measures, the WLA formula is

WLA_sim(a, b) = α Sim_a(a, b) + (1 - α) Sim_l(a, b)
In the present invention, α and β are the weights given to the Wikipedia text information and the link-structure information of a term, respectively. Taking α = 0.7 and β = 0.3 gives

WLA_sim(a, b) = 0.7 Sim_a(a, b) + 0.3 Sim_l(a, b)

where Sim_l(a, b) = 0.7 Sim_in(a, b) + 0.3 Sim_out(a, b).
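A sketch of the WLA combination under stated assumptions: milne_sim implements the in-link measure as reconstructed above and is reused unchanged for the out-links, while the abstract-based Sim_a is passed in precomputed, since the description gives only its ingredients:

```python
import math

def milne_sim(links_a, links_b, W):
    """Milne's link measure from two link sets and the total page count W."""
    inter = len(links_a & links_b)
    if inter == 0:
        return 0.0
    big = max(len(links_a), len(links_b))
    small = min(len(links_a), len(links_b))
    denom = math.log(W) - math.log(small)
    if denom <= 0:
        return 0.0
    return max(0.0, 1.0 - (math.log(big) - math.log(inter)) / denom)

def wla_sim(sim_a, in_a, in_b, out_a, out_b, W, alpha=0.7, beta=0.3):
    """WLA_sim = alpha*Sim_a + (1-alpha)*Sim_l, with
    Sim_l = beta*Sim_out + (1-beta)*Sim_in and alpha=0.7, beta=0.3."""
    sim_l = beta * milne_sim(out_a, out_b, W) + (1 - beta) * milne_sim(in_a, in_b, W)
    return alpha * sim_a + (1 - alpha) * sim_l
```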
The feature expansion set construction described in step 1.3 proceeds as follows:
As shown in Fig. 1, after related features have been extracted and the semantic relations between terms quantified, the corresponding feature expansion term vector C_t = {(c_1, r_1), (c_2, r_2), ..., (c_k, r_k)} is built for each given topic feature term, where c_i (i = 1, 2, ..., k) is a candidate expansion term related to the topic feature term t and r_i (i = 1, 2, ..., k) is the semantic similarity between c_i and t. To account for how often candidate expansion terms occur, the present invention integrates the semantic similarity between each candidate expansion term and the topic feature term with that term's occurrence frequency: r_i is the semantic similarity between the candidate expansion term and the topic feature term t, k is the number of elements in the candidate expansion term vector of t, and N_i is the occurrence frequency of term c_i. The elements of C_t are arranged in decreasing order of r_i.
The Bagging_BSJ algorithm proceeds as shown in the flow chart of Fig. 2: the parts connected by solid lines represent the classifier training process and the parts connected by dashed lines the testing process. To train the classifiers, Z_1 + Z_2 + Z_3 training sample subsets are first drawn by sampling with replacement. Bagging classifiers with naive Bayes as the base classifier are then trained on the first Z_1 subsets, and the trained models are denoted H_1^NB, ..., H_{Z_1}^NB. Similarly, the middle Z_2 subsets and the last Z_3 subsets are trained with the support vector machine and J48 as base classifiers respectively, and the resulting classification models are denoted H_1^SVM, ..., H_{Z_2}^SVM and H_1^J48, ..., H_{Z_3}^J48. Training in this way yields Z_1 + Z_2 + Z_3 classifiers.
Advantages and positive effects of the present invention
The present invention, applied to short text classification, performs WLA semantic expansion on short texts, carries out related-feature extraction, quantifies the semantic relations, and constructs the feature expansion set; based on the Bagging ensemble idea it combines the naive Bayes, support vector machine and J48 algorithms, overcoming the shortcomings of the three individual algorithms, and proposes the Bagging_BSJ algorithm, which performs feature expansion and classification of short texts better. Theory and experiment show that the method is more effective than traditional algorithms such as naive Bayes in many respects, e.g. accuracy, recall and F value.
The Bagging_BSJ method proposed by the present invention is applicable to many kinds of short text classification, such as QQ messages, WeChat messages, SMS messages and microblog posts. The invention effectively compensates for defects such as the sparsity and semantic scarcity of short text features and provides a reference means for fields such as public opinion analysis and the processing of instant social messages; it has the advantages of clear algorithm steps and high short text classification performance, and thus strong practical application value.
Brief description of the drawings
Fig. 1 is the model diagram of the Wikipedia-expanded short text feature term table.
Fig. 2 is the flow chart of the Bagging_BSJ algorithm of the present invention.
Fig. 3 shows the Wikipedia topic page corresponding to a term.
Fig. 4 shows the time cost of classifying the same data with several classification algorithms.
Fig. 5 shows the classification accuracy of several classification algorithms on different data sets.
Fig. 6 shows the classification recall of several classification algorithms on different data sets.
Fig. 7 shows the classification F values of several classification algorithms on different data sets.
Fig. 8 shows the classification time consumption of several classification algorithms on different data sets.
Embodiment
Embodiment 1: short text WLA semantic expansion and Bagging_BSJ classification
WLA semantic feature expansion based on Wikipedia is performed on the short texts, which are then classified with the Bagging_BSJ algorithm. The specific steps are as follows:
1. Perform WLA semantic expansion on the short text
(1) Given the term book, find the topic page corresponding to it, as shown in Fig. 3. After preprocessing with the Lucene word segmentation tool, a group of candidate expansion terms {write, printing, illustration, sheet, text, e-book, page, paper, ink, parchment, material, book, leaf} is obtained; this is the simple Wikipedia-based feature expansion term table of the term book.
(2) Using the proposed Wikipedia-based semantic similarity algorithm WLA,

WLA_sim(a, b) = α Sim_a(a, b) + (1 - α) Sim_l(a, b)

taking α = 0.7 and β = 0.3, i.e.

WLA_sim = 0.7 Sim_a + 0.3 Sim_l, where Sim_l = 0.7 Sim_in + 0.3 Sim_out,

compute the semantic relatedness between the topic feature term book and each candidate expansion term, obtaining the following results:
{(write, 0.74), (printing, 0.73), (illustration, 0.78), (sheet, 0.79), (text, 0.88), (book, 1), (e-book, 0.828), (page, 0.876), (paper, 0.86)}.
(3) Build the feature expansion term table.
Select the five top-ranked feature terms as the feature expansion terms of the term book, i.e. {book, text, page, paper, e-book}, and repeat steps (1) and (2) for these five terms, obtaining the Wikipedia expansion vector {e-book, information, source, physical, database, document, material, newspaper, digital, ...}.
The final term vector after semantic expansion is:
{(information, 0.82), (database, 0.798), (book, 0.796), ...}
2. Classify the terms after WLA semantic expansion with Bagging_BSJ
Using the Weka data mining tool, the terms obtained above are classified with the proposed Bagging_BSJ classification model, taking Z_1 = Z_2 = Z_3 = 15 and g = 1000.
From Fig. 4 it can be concluded that the proposed Bagging_BSJ algorithm requires slightly more time than SVM and NB but far less than the J48 algorithm; this is because the J48 classification model must be retrained for every group of experimental data.
Short texts of different types are classified, covering the following three data types: raw, unprocessed short text data; short texts after Wikipedia expansion; and short texts after the WLA semantic expansion proposed by this method. Each type is classified with NB, SVM, J48 and the proposed Bagging_BSJ algorithm, with the results shown in Fig. 5, Fig. 6, Fig. 7 and Fig. 8.
From Fig. 5 it can be concluded that the classification accuracy of each data set shows a consistent trend across the different classifiers: the accuracy on the data sets after Wikipedia expansion and WLA semantic expansion (94.6%) is far above that of classifying short texts without feature expansion, and the accuracy on short texts after Wikipedia expansion is slightly below that on short texts after WLA semantic expansion.
From Fig. 6 it can be concluded that the recall of the proposed Bagging_BSJ algorithm (93.3%) is the best; for the other classifiers, recall is highest when classifying texts after WLA semantic expansion and lowest on the original, unexpanded data set. Although the Bagging_BSJ classifier achieves equal recall on the Wikipedia-expanded and WLA semantically expanded data sets, both are far above its recall on the original data set.
From Fig. 7 it can be concluded that, when classification accuracy and recall are considered together as the F value, the expanded short texts show a better F value (94.1%) than the unexpanded short text data; compared with the original data and the Wikipedia-expanded data, classification of short texts after the proposed WLA semantic expansion performs best.
From Fig. 8 it can be concluded that classifying the raw short text data costs the least time; the classification processing time with the proposed WLA semantic expansion is slightly above that of the original data but below the time consumed by classifying short texts with simple Wikipedia expansion.
Combining Fig. 5, Fig. 6, Fig. 7 and Fig. 8, it can be concluded that, relative to the other classification methods, the proposed Bagging_BSJ classification of short texts after WLA semantic feature expansion shows better performance on indices such as accuracy, recall and F value. It effectively solves the low accuracy and low recall that occur when traditional text classification models are applied to short texts, while also shortening the time cost of short text classification.

Claims (1)

1. A Bagging_BSJ short text classification method, characterised in that the method mainly comprises the following key steps:
1. WLA short text semantic feature expansion based on the Wikipedia knowledge base;
1.1. Related-feature extraction: for a given feature term, map the feature term through disambiguation and redirection to the corresponding Wikipedia page, extract the page text, and apply denoising to these texts, obtaining a feature vector composed of terms whose elements are the candidate expansion terms of the feature term;
1.2. Semantic relation quantification: compute the semantic relations with the WLA (Wikipedia Links and Abstract) algorithm, quantitatively describing the degree of semantic association between the given feature term and the candidate expansion terms obtained in step 1.1;
1.3. After related-feature extraction and quantification of the semantic relations between terms, build for each given topic feature term the corresponding feature expansion term vector C_t = {(c_1, r_1), (c_2, r_2), ..., (c_k, r_k)}, where c_i (i = 1, 2, ..., k) is a candidate expansion term related to the topic feature term t and r_i (i = 1, 2, ..., k) is the semantic similarity between c_i and t; these term vectors serve as samples in the subsequent short text classification;
2. The Bagging_BSJ short text classification algorithm based on ensemble learning;
2.1. Assume the training set S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_n)} contains m articles in n classes, where x_i is a training sample and y_j is the class label corresponding to x_i;
2.2. Using sampling with replacement, draw Z_1, Z_2 and Z_3 training sample subsets from the training set S, each subset containing g samples;
2.3. Train Bagging classifiers with naive Bayes as the base classifier on the first Z_1 subsets, denoting the trained models H_1^NB, ..., H_{Z_1}^NB; similarly, train the middle Z_2 subsets and the last Z_3 subsets with the support vector machine and J48 as base classifiers respectively, denoting the resulting classification models H_1^SVM, ..., H_{Z_2}^SVM and H_1^J48, ..., H_{Z_3}^J48; training in this way yields Z_1 + Z_2 + Z_3 classifiers;
2.4. The classification process applies the classification models H_i, i = 1, 2, ..., Z_1 + Z_2 + Z_3, obtained from the training in step 2.3 to the samples to be classified and integrates the classification results with a voting algorithm, thereby deciding the class of a new sample x, i.e.:

H(x) = arg max_y Σ_{i=1}^{Z_1+Z_2+Z_3} I(H_i(x) = y)

where I(·) is the indicator function.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710554325.8A CN107292348A (en) 2017-07-10 2017-07-10 Bagging_BSJ short text classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710554325.8A CN107292348A (en) 2017-07-10 2017-07-10 Bagging_BSJ short text classification method

Publications (1)

Publication Number Publication Date
CN107292348A 2017-10-24

Family

ID=60100968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710554325.8A Pending CN107292348A (en) 2017-07-10 2017-07-10 Bagging_BSJ short text classification method

Country Status (1)

Country Link
CN (1) CN107292348A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955856A (en) * 2012-11-09 2013-03-06 北京航空航天大学 Chinese short text classification method based on characteristic extension
KR20160121999A (en) * 2015-04-13 2016-10-21 연세대학교 산학협력단 Apparatus and Method of Support Vector Machine Classifier Using Multistage Sub-Classifier

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
秦靓靓: "Research on short text feature expansion and classification algorithms based on Wikipedia", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162379A (en) * 2018-04-24 2019-08-23 腾讯云计算(北京)有限责任公司 Virtual machine migration method, device and computer equipment
CN110532540A (en) * 2018-05-25 2019-12-03 北京京东尚科信息技术有限公司 Method, system, computer system and readable storage medium for determining user preferences
CN110532540B (en) * 2018-05-25 2024-04-09 北京京东尚科信息技术有限公司 Method, system, computer system and readable storage medium for determining user preferences
CN108804622A (en) * 2018-08-20 2018-11-13 天津探数科技有限公司 Short text classifier construction method considering semantic background
CN111563165A (en) * 2020-05-11 2020-08-21 北京中科凡语科技有限公司 Statement classification method based on anchor word positioning and training statement augmentation
CN111563165B (en) * 2020-05-11 2020-12-18 北京中科凡语科技有限公司 Statement classification method based on anchor word positioning and training statement augmentation
CN112749756A (en) * 2021-01-21 2021-05-04 淮阴工学院 Short text classification method based on NB-Bagging
CN112749756B (en) * 2021-01-21 2023-10-13 淮阴工学院 Short text classification method based on NB-Bagging

Similar Documents

Publication Publication Date Title
CN107292348A Bagging_BSJ short text classification method
CN104391942B Short text feature expansion method based on a semantic map
CN104750844B TF-IGM-based text feature vector generation method and apparatus, and text classification method and apparatus
CN102789498B Method and system for sentiment classification of Chinese comment text based on ensemble learning
CN109558487A Document classification method based on hierarchical multi-attention networks
CN107451278A Chinese text classification method based on multi-hidden-layer extreme learning machines
CN106445919A Sentiment classification method and device
CN104573046A Comment analysis method and system based on term vectors
CN105868184A Chinese name recognition method based on recurrent neural networks
CN104834940A Medical image examination disease classification method based on support vector machines (SVM)
CN106815369A Text classification method based on the Xgboost classification algorithm
CN106844632A Product review sentiment classification method and device based on improved support vector machines
CN103473380B Computer text sentiment classification method
CN109189926A Construction method of a technical paper corpus
CN109446423B System and method for judging the sentiment of news text
CN104285224A Method for classifying text
Basha et al. A novel summarization-based approach for feature reduction enhancing text classification accuracy
CN113312480A Multi-label classification method and device for scientific papers based on graph convolutional networks
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
CN104794209B Chinese microblog sentiment classification method and system based on Markov logic networks
CN103268346A Semi-supervised classification method and semi-supervised classification system
CN106055596A Multi-label online news reader emotion prediction method
CN111859955A Public opinion data analysis model based on deep learning
Wei et al. The instructional design of Chinese text classification based on SVM
Handayani et al. Sentiment Analysis of Electric Cars Using Recurrent Neural Network Method in Indonesian Tweets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20171024