CN104462406A - Algorithm for extracting text model features to classify text models

Info

Publication number
CN104462406A
CN104462406A (Application CN201410765214.8A)
Authority
CN
China
Prior art keywords
feature
text
data
dif
same
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410765214.8A
Other languages
Chinese (zh)
Inventor
刘江
李健铨
李炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201410765214.8A priority Critical patent/CN104462406A/en
Publication of CN104462406A publication Critical patent/CN104462406A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses an algorithm that extracts text model features to classify text models. The algorithm first computes first-stage weights for the text model training data; from those weights it determines how each feature is distributed over the new and old data in the training set, recomputes the weights to obtain second-stage weights, and finally ranks the second-stage weights from smallest to largest to select the target features and complete the classification of the text models. With this algorithm, the features extracted from the text model neither lean excessively toward the old data in the training set nor are drawn solely from the small amount of new data, so better classification results are obtained.

Description

An algorithm for extracting text model features for classification
Technical field
The present invention relates to binary classification based on text model features, and in particular to an algorithm that extracts text model features for classification.
Background technology
1) Text mining
With the development of computer and network technology, the flood of incoming information can leave people at a loss; quickly and accurately retrieving the information one needs most from this vast ocean of data has become very difficult. Much of this mass of information is text, which gave rise to a new information processing technology: text mining. Text mining extracts implicit, useful knowledge from large volumes of text, a process also known as knowledge discovery in text databases. It draws on databases, machine learning, natural language processing, statistical data analysis, and other fields. Its research topics include text clustering, text classification, automatic summarization, and information extraction.
2) Text classification
Text classification is an important problem in text mining research. It refers to dividing a large number of texts into two or more categories under a given taxonomy. Classifying text by computer is not only fast but also relatively accurate. It has many real-life applications, such as classifying Web pages so that pages with the same kind of content are grouped into one class. Text classification mainly comprises six steps: obtaining a training document collection, preprocessing, feature extraction, text representation, choosing a classification method, and performance evaluation.
3) Transfer learning
In many practical applications, text is not only enormous in volume but also changes rapidly; for example, the content of Web pages often shifts topic. Traditional classification learning rests on the basic assumption that the data used to train the classification model and the data of the target task follow the same distribution. Because the target-task data change frequently, a trained model may already be out of date when it is applied to the target task. Repeatedly relabeling the target-task data is expensive and cannot be done in time. We call the target-task data "new data", and the large amount of previously accumulated, already-labeled data "old data". How to make the most of the classification knowledge in the old data when classifying new data has become a pressing problem. Transfer learning has become a hot topic in data mining in recent years; its key difference from conventional machine learning methods is that it does not assume the data are independent and identically distributed. In transfer learning, a small amount of new data is sampled and labeled manually as part of the training data. Training a model on this data alone is far from enough, so a large amount of labeled old data is added as a supplement. The old data and the new data may come from different domains and have different distributions.
Traditional feature extraction algorithms consider neither the different distributions of new and old data nor the skew in the training data. Because the new and old data are distributed differently, when the difference is large, representing the new data with features extracted from the old data leaves many feature weights at 0. And because the training set contains little new data, features extracted from the new data alone cannot represent all new data well. If the training data and the target-task data are represented on the basis of such features and then classified, it is difficult to obtain good results.
Summary of the invention
To address these problems in the prior art, the present invention provides an algorithm that extracts text model features for classification. With this algorithm, the features extracted from the text model neither lean excessively toward the old data in the training set nor are drawn solely from the small amount of new data, so good classification results can be obtained.
To solve the technical problems in the prior art, the present invention adopts the following technical scheme:
1. An algorithm for extracting text model features for classification, comprising the following steps (a code sketch of the full procedure follows step six):
Step one: compute the feature weights over the text model training data with the information gain (IG) algorithm:

$$IG(t) = -\sum_{i=1}^{m} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{m} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t}) \qquad (1)$$

In formula (1), $P(C_i)$ is the ratio of the number of texts in class $C_i$ to the total number of texts; $P(t)$ is the ratio of the number of texts containing feature $t$ to the total; $P(C_i \mid t)$ is the probability that a text belongs to $C_i$ when feature $t$ appears; $P(\bar{t})$ is the ratio of the number of texts not containing feature $t$ to the total; and $P(C_i \mid \bar{t})$ is the probability that a text belongs to $C_i$ when feature $t$ does not appear;
Step two: sort the IG weights obtained in step one and extract the first-stage α*K features;
Step three: for each of the first-stage α*K features, compute the distribution of feature t over the new data and the old data in the text model training set with formulas (2) and (3):

$$w_{same}(t, C_{same}) = f(t, C_{same}) \cdot n(t, C_{same}) / N(C_{same}) \qquad (2)$$

$$w_{dif}(t, C_{dif}) = f(t, C_{dif}) \cdot n(t, C_{dif}) / N(C_{dif}) \qquad (3)$$

where $C_{same}$ and $C_{dif}$ denote the new and the old data in the training set, respectively; $f(t, C_{same})$ and $f(t, C_{dif})$ are the numbers of occurrences of feature $t$ in the new and old data; $n(t, C_{same})$ and $n(t, C_{dif})$ are the numbers of texts containing feature $t$ in the new and old data; $N(C_{same})$ and $N(C_{dif})$ are the total numbers of texts in the new and old data; and $w_{same}(t, C_{same})$ and $w_{dif}(t, C_{dif})$ represent the distribution of feature $t$ in the new and old data;
Step four: from the distributions of feature t in the new and old data obtained in step three, compute the final weight of feature t with formula (4), extracting the second-stage α*K features:

$$\max\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \,/\, \min\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \qquad (4)$$
Step five: repeat steps two to four in turn, extracting the second-stage α*K features one after another;
Step six: sort the second-stage α*K features obtained through step five by weight from smallest to largest, and select the K features with the smallest weights to complete the text model classification.
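The claim leaves several details open (the log base, tie-breaking, and how to treat a feature absent from the new or the old side), and the step-five loop is read here as iterating formulas (2) to (4) over each first-stage feature. The following minimal Python sketch, with hypothetical names throughout (docs as token lists, labels, alpha, K), fixes those details arbitrarily to illustrate steps one to six; it is a sketch under these assumptions, not a definitive implementation.

```python
import math
from collections import Counter

def ig_weights(docs, labels):
    """Formula (1): information gain of every feature.
    docs: list of token lists; labels: parallel list of class ids."""
    n = len(docs)
    cls = Counter(labels)                      # texts per class
    df = Counter()                             # texts containing feature t
    df_c = {c: Counter() for c in cls}         # ... broken down per class
    for d, y in zip(docs, labels):
        for t in set(d):
            df[t] += 1
            df_c[y][t] += 1
    plogp = lambda p: p * math.log(p) if p > 0 else 0.0
    base = -sum(plogp(cls[c] / n) for c in cls)          # -sum P(Ci) log P(Ci)
    ig = {}
    for t in df:
        p_t = df[t] / n
        with_t = sum(plogp(df_c[c][t] / df[t]) for c in cls)
        n_no = n - df[t]
        wo_t = (sum(plogp((cls[c] - df_c[c][t]) / n_no) for c in cls)
                if n_no else 0.0)
        ig[t] = base + p_t * with_t + (1 - p_t) * wo_t
    return ig

def extract_features(new_docs, old_docs, new_labels, old_labels, K, alpha=2):
    """Steps two to six: two-stage selection of K features."""
    ig = ig_weights(new_docs + old_docs, new_labels + old_labels)
    stage1 = sorted(ig, key=ig.get, reverse=True)[:int(alpha * K)]  # step two

    def w(t, part):                            # formulas (2) and (3)
        f = sum(d.count(t) for d in part)      # occurrences of t
        m = sum(1 for d in part if t in d)     # texts containing t
        return f * m / len(part)

    def final_weight(t):                       # formula (4): ratio >= 1
        ws, wd = w(t, new_docs), w(t, old_docs)
        if min(ws, wd) == 0:                   # absent on one side: rank last
            return math.inf
        return max(ws, wd) / min(ws, wd)

    # Step six: ascending sort; the K weights closest to 1 are kept.
    return sorted(stage1, key=final_weight)[:K]
```

With K = 1000 and α = 2 (the value at which the description below reports peak accuracy), the first stage keeps the 2000 features with the largest IG and the second stage keeps the 1000 whose formula-(4) ratio lies closest to 1.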
Beneficial effects of the present invention:
First, the feature extraction method proposed by the present invention suits settings where text content changes constantly and transfer learning is used for text classification. Because manual labeling is costly, the training data can contain only a small amount of new data and must retain a large amount of old data.
Second, the proposed feature extraction method takes into account the realities of training-data skew and the differing distributions of new and old data, and can therefore represent the text data better.
Third, in the proposed method, the first-stage extraction filters out features that carry little domain information and discriminate poorly between texts. As α grows, more and more features pass into the second-stage extraction. The second stage looks at how similarly a feature is distributed in the new data and the old data: the more consistent the distributions, the closer its formula-(4) weight is to 1 and the earlier it appears in the ascending ranking. But as α keeps growing, many features that carry little domain information and discriminate poorly slip through the first stage and, because they happen to be distributed consistently across the new and old data, rank well in the final ordering; such features do not help text representation, hurt classification, and lower accuracy. Experiments show that classification accuracy peaks when α is 2.
When the method for application migration study solves text classification problem, algorithm of the present invention is applied in Text character extraction link, the feature extracted can be made neither too to be inclined to legacy data, also not obtain from a small amount of new data merely, thus improve the accuracy of text classification.
Brief description of the drawings
Fig. 1 is a flow chart of the algorithm of the present invention for extracting text model features for classification.
Detailed description of the embodiments:
The present invention is described in more detail below with reference to the accompanying drawing:
As shown at 101 in Fig. 1, when a transfer learning method is used to solve a binary text classification problem, the training data consist of new data and old data. The old data and the new data come from different domains and may have different distributions, and the training data are skewed: the new data are few while the old data are plentiful. Every text feature must fall into one of the following four cases with respect to the training data:
1) high probability of occurrence in the old data, low probability in the new data;
2) low probability of occurrence in the old data, high probability in the new data;
3) high probability of occurrence in the old data, and also high probability in the new data;
4) low probability of occurrence in the old data, and also low probability in the new data.
Features in case 1) are more inclined to represent the old data rather than the new data; features in case 2) are likely to be effective only for the small amount of new data; features in case 4) are mostly filtered out during extraction. Features in these three cases are unsuitable as text features.
By comparison, only the features in case 3) are suitable. They not only represent the old data but are also more likely to represent the new data well. The aim here is precisely to extract the case-3) features.
In the first stage, features should be extracted from all of the training data, which comprise a small amount of new data and a large amount of old data; the new data come from the target domain, the old data may come from other domains, and all of these data have been labeled.
Features that pass the first-stage extraction are suited to representing text in the broader, higher-level field that covers both the new-data domain and the old-data domain; they carry more domain information and discriminate better between texts. The first-stage extraction thus filters out features that carry little domain information and discriminate poorly between texts.
In the first feature extraction stage, some traditional feature extraction method is adopted. For example, the information gain (IG) method can compute feature weights; the IG value of a feature t may be defined as:
$$IG(t) = -\sum_{i=1}^{m} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{m} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t}) \qquad (1)$$

In formula (1), $P(C_i)$ is the ratio of the number of texts in class $C_i$ to the total number of texts; $P(t)$ is the ratio of the number of texts containing feature $t$ to the total; $P(C_i \mid t)$ is the probability that a text belongs to $C_i$ when feature $t$ appears; $P(\bar{t})$ is the ratio of the number of texts not containing feature $t$ to the total; and $P(C_i \mid \bar{t})$ is the probability that a text belongs to $C_i$ when feature $t$ does not appear.
For every candidate feature in the training data, compute its weight (for example, its IG value) and sort; the top-ranked features form the first-stage extraction. The number of features extracted in the first stage should exceed the number ultimately wanted: to end up with K features, extract α*K features in the first stage (α > 1; the specific value can be tuned experimentally).
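As a worked instance of formula (1) with hypothetical counts (natural logarithm assumed; the patent does not fix the log base): suppose m = 2 balanced classes $C_1, C_2$ of 5 texts each, and a feature $t$ occurring in 4 texts of $C_1$ and 1 text of $C_2$, so $t$ is absent from 1 text of $C_1$ and 4 texts of $C_2$. Then

$$P(C_1)=P(C_2)=\tfrac{1}{2},\qquad P(t)=P(\bar{t})=\tfrac{1}{2},\qquad P(C_1\mid t)=P(C_2\mid\bar{t})=\tfrac{4}{5},$$

$$IG(t)=\ln 2+\tfrac{1}{2}\left(\tfrac{4}{5}\ln\tfrac{4}{5}+\tfrac{1}{5}\ln\tfrac{1}{5}\right)+\tfrac{1}{2}\left(\tfrac{1}{5}\ln\tfrac{1}{5}+\tfrac{4}{5}\ln\tfrac{4}{5}\right)\approx 0.693-0.500=0.193,$$

a moderately informative feature; a feature split 5/0 between the classes would score the full $\ln 2 \approx 0.693$.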
As shown at 102 to 103 in Fig. 1, the second stage considers which features are better suited to representing the new data.
When extracting features in the second stage, the features that occur frequently and densely in the new portion of the training data cannot be considered in isolation, because such features most likely reflect only a very small aspect of the new data. At the same time, the first-stage features cannot all be used directly to represent the text, because the old data account for the majority of the training set, so those features lean toward representing the content of the old data.
In the second stage, from the α*K first-stage features, choose those whose distributions in the new and old data are similar; distribution here means how densely a feature occurs in the texts. A similar distribution across the new and old data indicates that the feature represents not only the old texts but also the new texts well.
As shown at 104 in Fig. 1, this criterion requires a measure of a feature's distribution over the new and old data. The measure is used to judge how important a feature is in each, and it should be as cheap to compute as possible, because in most cases the unextracted text features run to tens of thousands of dimensions. The present invention computes the distribution of feature t in the new and old data with formulas (2) and (3):
$$w_{same}(t, C_{same}) = f(t, C_{same}) \cdot n(t, C_{same}) / N(C_{same}) \qquad (2)$$

$$w_{dif}(t, C_{dif}) = f(t, C_{dif}) \cdot n(t, C_{dif}) / N(C_{dif}) \qquad (3)$$

where $C_{same}$ and $C_{dif}$ denote the new and the old data in the training set, respectively; $f(t, C_{same})$ and $f(t, C_{dif})$ are the numbers of occurrences of feature $t$ in the new and old data; $n(t, C_{same})$ and $n(t, C_{dif})$ are the numbers of texts containing feature $t$ in the new and old data; $N(C_{same})$ and $N(C_{dif})$ are the total numbers of texts in the new and old data; and $w_{same}(t, C_{same})$ and $w_{dif}(t, C_{dif})$ represent the distribution of feature $t$ in the new and old data.
In formulas (2) and (3), the larger the value of $w_{same}(t, C_{same})$ or $w_{dif}(t, C_{dif})$, the more important feature $t$ is in $C_{same}$ or $C_{dif}$. The features to retain are precisely those distributed similarly in the new and old data, where a similar distribution means that, for a feature $t$, the values of $w_{same}(t, C_{same})$ and $w_{dif}(t, C_{dif})$ are as close as possible. The present invention computes the final weight of feature $t$ with formula (4); the closer this weight is to 1, the more similar the distribution of feature $t$ in the new and old data:
$$\max\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \,/\, \min\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \qquad (4)$$
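To make formulas (2) to (4) concrete with assumed counts (not taken from the patent): suppose the new data $C_{same}$ hold 100 texts in which feature $t$ occurs 30 times across 20 texts, while the old data $C_{dif}$ hold 1000 texts in which $t$ occurs 280 times across 190 texts. Then

$$w_{same} = 30 \times 20 / 100 = 6, \qquad w_{dif} = 280 \times 190 / 1000 = 53.2, \qquad 53.2 / 6 \approx 8.87,$$

so $t$ is distributed very differently in the two parts and ranks late in the ascending order, whereas a feature with $w_{same} = 6$ and $w_{dif} = 7$ scores $7/6 \approx 1.17$ and ranks early.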
As shown at 102 to 104 in Fig. 1, these steps are repeated in turn, continually extracting the second-stage α*K features.
As shown at 105 in Fig. 1, in the second stage of feature extraction, the distribution of each of the α*K first-stage features is computed with formulas (2) and (3), and its weight with formula (4). The α*K features are then sorted by weight and the K features with the smallest weights are selected. These K features are the features extracted by the method of the invention.
The present invention mainly studies feature extraction for transfer learning methods. In the process of solving a binary text classification problem with transfer learning, it improves the existing feature extraction step: for the situation of a small amount of new data and a large amount of old data in the training set, it proposes a two-pass extraction method that effectively improves the precision and recall of classification.
The application of the present invention is described below along the main steps of text classification.
1) Obtain the training document collection. The quality of the training data bears directly on the quality of the final classification model, so training documents should be chosen by experts in the relevant field, in the hope of obtaining higher quality, or widely used public data sets should be adopted.
2) Preprocess the information. Research data are mostly drawn from real life, and such documents often contain material of no interest to researchers, such as advertisements embedded in Web documents and useless HTML markup. Text must therefore be preprocessed before feature extraction. Preprocessing includes removing noise data, removing stop words, and stemming English text. Chinese text requires one more crucial task: word segmentation. In Chinese, the basic unit of a sentence is the character rather than the word, and unlike English there is no fixed separator between words; to extract features, Chinese text must first be segmented, which not only inserts separators between words but also tags each word's part of speech. Many good Chinese segmentation tools now exist, such as ICTCLAS from the Chinese Academy of Sciences and the open-source IKAnalyzer, which satisfy most users' needs.
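The patent names ICTCLAS and IKAnalyzer as segmentation tools; purely as an illustration, the open-source jieba library (an assumption here, not mentioned in the patent) performs segmentation with part-of-speech tagging in a few lines of Python:

```python
import jieba.posseg as pseg  # open-source Chinese segmenter with POS tagging

# Segment a sentence and print each word with its part-of-speech tag.
for w in pseg.cut("文本挖掘是从大量文本信息中抽取有用知识的过程"):
    print(w.word, w.flag)
```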
3) Feature extraction. Some of the words in a text are extracted as its features; the feature extraction algorithm proposed by the present invention can be adopted here.
4) Text representation. Text is generally described in natural language, whose meaning a computer cannot understand, so the text must be converted into a form the computer can process; feature extraction lays the groundwork for this. The vector space model (VSM) is currently the widely applied and effective method.
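As a sketch of the VSM step under assumed inputs, scikit-learn's TfidfVectorizer (an implementation choice, not something the patent specifies) can build document vectors over exactly the K selected features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical output of the two-stage extraction: the K retained features.
selected = ["data", "feature", "text"]

# Restricting the vocabulary to them yields the VSM representation.
vectorizer = TfidfVectorizer(vocabulary=selected)
X = vectorizer.fit_transform(["text data and more text", "feature data"])
print(X.shape)  # (number of documents, K)
```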
5) Select the text classification method and classify the structured text representations. Transfer learning methods can be adopted; commonly used ones include the AdaBoost algorithm and the TrAdaBoost algorithm.
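The patent names AdaBoost and TrAdaBoost as common choices; as one illustrative option (TrAdaBoost has no standard scikit-learn implementation and would need custom code), plain AdaBoost over toy VSM vectors might look like this:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the structured (VSM) training texts and their labels.
train_texts = ["old domain text one", "old domain text two", "new domain text"]
train_labels = [0, 1, 0]

X = TfidfVectorizer().fit_transform(train_texts)
clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X, train_labels)
print(clf.predict(X))
```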
6) Performance evaluation. Commonly used evaluation metrics include precision, recall, macro-averaged precision, and macro-averaged recall.
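A minimal sketch of the evaluation step with scikit-learn, on hypothetical gold labels and predictions:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 1, 0, 1]   # hypothetical gold labels
y_pred = [0, 1, 0, 0, 1]   # hypothetical classifier output

print(precision_score(y_true, y_pred, average="macro"))  # macro-averaged precision
print(recall_score(y_true, y_pred, average="macro"))     # macro-averaged recall
```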
Adopting the text classification workflow above and using the feature extraction algorithm proposed by the present invention in transfer learning improves the precision and recall of text classification.
The invention has been described schematically above with reference to the drawing and the embodiment, and the description is not restrictive. Those of ordinary skill in the art will understand that in practical applications the arrangement of the components of the invention may change, and others may produce similar designs under its inspiration. It should be pointed out that all obvious changes and similar designs that do not depart from the design aim of the invention fall within the protection scope of the invention.

Claims (1)

1. An algorithm for extracting text model features for classification, comprising the following steps:
Step one: compute the feature weights over the text model training data with the information gain (IG) algorithm:

$$IG(t) = -\sum_{i=1}^{m} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{m} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t}) \qquad (1)$$

In formula (1), $P(C_i)$ is the ratio of the number of texts in class $C_i$ to the total number of texts; $P(t)$ is the ratio of the number of texts containing feature $t$ to the total; $P(C_i \mid t)$ is the probability that a text belongs to $C_i$ when feature $t$ appears; $P(\bar{t})$ is the ratio of the number of texts not containing feature $t$ to the total; and $P(C_i \mid \bar{t})$ is the probability that a text belongs to $C_i$ when feature $t$ does not appear;
Step two: sort the IG weights obtained in step one and extract the first-stage α*K features;
Step three: for each of the first-stage α*K features, compute the distribution of feature t over the new data and the old data in the text model training set with formulas (2) and (3):

$$w_{same}(t, C_{same}) = f(t, C_{same}) \cdot n(t, C_{same}) / N(C_{same}) \qquad (2)$$

$$w_{dif}(t, C_{dif}) = f(t, C_{dif}) \cdot n(t, C_{dif}) / N(C_{dif}) \qquad (3)$$

where $C_{same}$ and $C_{dif}$ denote the new and the old data in the training set, respectively; $f(t, C_{same})$ and $f(t, C_{dif})$ are the numbers of occurrences of feature $t$ in the new and old data; $n(t, C_{same})$ and $n(t, C_{dif})$ are the numbers of texts containing feature $t$ in the new and old data; $N(C_{same})$ and $N(C_{dif})$ are the total numbers of texts in the new and old data; and $w_{same}(t, C_{same})$ and $w_{dif}(t, C_{dif})$ represent the distribution of feature $t$ in the new and old data;
Step four: from the distributions of feature t in the new and old data obtained in step three, compute the final weight of feature t with formula (4), extracting the second-stage α*K features:

$$\max\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \,/\, \min\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \qquad (4)$$
Step five: repeat steps two to four in turn, extracting the second-stage α*K features one after another;
Step six: sort the second-stage α*K features obtained through step five by weight from smallest to largest, and select the K features with the smallest weights to complete the text model classification.
CN201410765214.8A 2014-12-10 2014-12-10 Algorithm for extracting text model features to classify text models Pending CN104462406A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410765214.8A CN104462406A (en) 2014-12-10 2014-12-10 Algorithm for extracting text model features to classify text models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410765214.8A CN104462406A (en) 2014-12-10 2014-12-10 Algorithm for extracting text model features to classify text models

Publications (1)

Publication Number Publication Date
CN104462406A true CN104462406A (en) 2015-03-25

Family

ID=52908441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410765214.8A Pending CN104462406A (en) 2014-12-10 2014-12-10 Algorithm for extracting text model features to classify text models

Country Status (1)

Country Link
CN (1) CN104462406A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553146A (en) * 2020-05-09 2020-08-18 杭州中科睿鉴科技有限公司 News writing style modeling method, writing style-influence analysis method and news quality evaluation method
CN112989032A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Entity relationship classification method, apparatus, medium and electronic device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012088972A (en) * 2010-10-20 2012-05-10 Nippon Telegr & Teleph Corp <Ntt> Data classification device, data classification method and data classification program
CN102750338A (en) * 2012-06-04 2012-10-24 天津大学 Text processing method facing transfer learning and text feature extraction method thereof


Similar Documents

Publication Publication Date Title
CN104199972B (en) A kind of name entity relation extraction and construction method based on deep learning
CN106294593B (en) In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN103984681B (en) News event evolution analysis method based on time sequence distribution information and topic model
CN110866117A (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN106709754A (en) Power user grouping method based on text mining
CN109885670A (en) A kind of interaction attention coding sentiment analysis method towards topic text
CN103020122A (en) Transfer learning method based on semi-supervised clustering
CN103617157A (en) Text similarity calculation method based on semantics
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN102289522A (en) Method of intelligently classifying texts
CN101127042A (en) Sensibility classification method based on language model
CN103699525A (en) Method and device for automatically generating abstract on basis of multi-dimensional characteristics of text
CN109086375A (en) A kind of short text subject extraction method based on term vector enhancing
CN106601235A (en) Semi-supervision multitask characteristic selecting speech recognition method
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN103390046A (en) Multi-scale dictionary natural scene image classification method based on latent Dirichlet model
CN103020167B (en) A kind of computer Chinese file classification method
CN109446423B (en) System and method for judging sentiment of news and texts
CN110705272A (en) Named entity identification method for automobile engine fault diagnosis
CN102880631A (en) Chinese author identification method based on double-layer classification model, and device for realizing Chinese author identification method
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN110728144B (en) Extraction type document automatic summarization method based on context semantic perception

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20150325)