CN104462406A - Algorithm for extracting text model features to classify text models

Info

Publication number
CN104462406A
CN104462406A (Application CN201410765214.8A)
Authority
CN
China
Prior art keywords
feature
text
data
dif
same
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410765214.8A
Other languages
Chinese (zh)
Inventor
刘江
李健铨
李炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201410765214.8A priority Critical patent/CN104462406A/en
Publication of CN104462406A publication Critical patent/CN104462406A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses an algorithm that extracts text model features to classify text models. The algorithm first computes first-stage weights for the text model training data; from those weights it determines how each feature is distributed over the new and old data in the training set, recomputes the weights to obtain second-stage weights, and finally ranks the second-stage weights from smallest to largest to select the target features and complete the classification of the text models. With this algorithm, the features extracted from the text model neither lean excessively toward the old data in the training set nor are drawn solely from the small amount of new data, so better classification results are obtained.

Description

An algorithm for extracting text model features for classification
Technical field
The present invention relates to binary classification based on text model features, and in particular to an algorithm that extracts text model features for classification.
Background technology
1) Text mining
With the development of computer and network technology, the flood of incoming information can leave people at a loss; quickly and accurately retrieving the information one needs most from this vast ocean of data has become very difficult. Much of this mass of information is text, which gave rise to a new information processing technology: text mining. Text mining extracts implicit, useful knowledge from large volumes of text, a process also known as knowledge discovery in text databases. It draws on databases, machine learning, natural language processing, statistical data analysis, and other fields. Its research topics include text clustering, text classification, automatic summarization, and information extraction.
2) Text classification
Text classification is an important problem in text mining research. It refers to dividing a large number of texts into two or more categories under a given taxonomy. Classifying text by computer is not only fast but also relatively accurate. It has many real-life applications, such as classifying Web pages so that pages with the same kind of content are grouped into one class. Text classification mainly comprises six steps: obtaining a training document collection, preprocessing, feature extraction, text representation, choosing a classification method, and performance evaluation.
3) Transfer learning
In many practical applications, text is not only enormous in volume but also changes rapidly; for example, the content of Web pages often shifts topic. Traditional classification learning rests on the basic assumption that the data used to train the classification model and the data of the target task follow the same distribution. Because the target-task data change frequently, a trained model may already be out of date when it is applied to the target task. Repeatedly relabeling the target-task data is expensive and cannot be done in time. We call the target-task data "new data", and the large amount of previously accumulated, already-labeled data "old data". How to make the most of the classification knowledge in the old data when classifying new data has become a pressing problem. Transfer learning has become a hot topic in data mining in recent years; its key difference from conventional machine learning methods is that it does not assume the data are independent and identically distributed. In transfer learning, a small amount of new data is sampled and labeled manually as part of the training data. Training a model on this data alone is far from enough, so a large amount of labeled old data is added as a supplement. The old data and the new data may come from different domains and have different distributions.
Traditional feature extraction algorithms consider neither the different distributions of new and old data nor the skew in the training data. Because the new and old data are distributed differently, when the difference is large, representing the new data with features extracted from the old data leaves many feature weights at 0. And because the training set contains little new data, features extracted from the new data alone cannot represent all new data well. If the training data and the target-task data are represented on the basis of such features and then classified, it is difficult to obtain good results.
Summary of the invention
To address these problems in the prior art, the present invention provides an algorithm that extracts text model features for classification. With this algorithm, the features extracted from the text model neither lean excessively toward the old data in the training set nor are drawn solely from the small amount of new data, so good classification results can be obtained.
To solve the technical problems in the prior art, the present invention adopts the following technical scheme:
1. An algorithm for extracting text model features for classification, comprising the following steps (a code sketch of the full procedure follows step six):
Step one: compute the feature weights over the text model training data with the information gain (IG) algorithm:

$$IG(t) = -\sum_{i=1}^{m} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{m} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t}) \qquad (1)$$

In formula (1), $P(C_i)$ is the ratio of the number of texts in class $C_i$ to the total number of texts; $P(t)$ is the ratio of the number of texts containing feature $t$ to the total; $P(C_i \mid t)$ is the probability that a text belongs to $C_i$ when feature $t$ appears; $P(\bar{t})$ is the ratio of the number of texts not containing feature $t$ to the total; and $P(C_i \mid \bar{t})$ is the probability that a text belongs to $C_i$ when feature $t$ does not appear;
Step two: sort the IG weights obtained in step one and extract the first-stage α*K features;
Step three: for each of the first-stage α*K features, compute the distribution of feature t over the new data and the old data in the text model training set with formulas (2) and (3):

$$w_{same}(t, C_{same}) = f(t, C_{same}) \cdot n(t, C_{same}) / N(C_{same}) \qquad (2)$$

$$w_{dif}(t, C_{dif}) = f(t, C_{dif}) \cdot n(t, C_{dif}) / N(C_{dif}) \qquad (3)$$

where $C_{same}$ and $C_{dif}$ denote the new and the old data in the training set, respectively; $f(t, C_{same})$ and $f(t, C_{dif})$ are the numbers of occurrences of feature $t$ in the new and old data; $n(t, C_{same})$ and $n(t, C_{dif})$ are the numbers of texts containing feature $t$ in the new and old data; $N(C_{same})$ and $N(C_{dif})$ are the total numbers of texts in the new and old data; and $w_{same}(t, C_{same})$ and $w_{dif}(t, C_{dif})$ represent the distribution of feature $t$ in the new and old data;
Step four: from the distributions of feature t in the new and old data obtained in step three, compute the final weight of feature t with formula (4), extracting the second-stage α*K features:

$$\max\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \,/\, \min\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \qquad (4)$$
Step five: repeat steps two to four in turn, extracting the second-stage α*K features one after another;
Step six: sort the second-stage α*K features obtained through step five by weight from smallest to largest, and select the K features with the smallest weights to complete the text model classification.
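The claim leaves several details open (the log base, tie-breaking, and how to treat a feature absent from the new or the old side), and the step-five loop is read here as iterating formulas (2) to (4) over each first-stage feature. The following minimal Python sketch, with hypothetical names throughout (docs as token lists, labels, alpha, K), fixes those details arbitrarily to illustrate steps one to six; it is a sketch under these assumptions, not a definitive implementation.

```python
import math
from collections import Counter

def ig_weights(docs, labels):
    """Formula (1): information gain of every feature.
    docs: list of token lists; labels: parallel list of class ids."""
    n = len(docs)
    cls = Counter(labels)                      # texts per class
    df = Counter()                             # texts containing feature t
    df_c = {c: Counter() for c in cls}         # ... broken down per class
    for d, y in zip(docs, labels):
        for t in set(d):
            df[t] += 1
            df_c[y][t] += 1
    plogp = lambda p: p * math.log(p) if p > 0 else 0.0
    base = -sum(plogp(cls[c] / n) for c in cls)          # -sum P(Ci) log P(Ci)
    ig = {}
    for t in df:
        p_t = df[t] / n
        with_t = sum(plogp(df_c[c][t] / df[t]) for c in cls)
        n_no = n - df[t]
        wo_t = (sum(plogp((cls[c] - df_c[c][t]) / n_no) for c in cls)
                if n_no else 0.0)
        ig[t] = base + p_t * with_t + (1 - p_t) * wo_t
    return ig

def extract_features(new_docs, old_docs, new_labels, old_labels, K, alpha=2):
    """Steps two to six: two-stage selection of K features."""
    ig = ig_weights(new_docs + old_docs, new_labels + old_labels)
    stage1 = sorted(ig, key=ig.get, reverse=True)[:int(alpha * K)]  # step two

    def w(t, part):                            # formulas (2) and (3)
        f = sum(d.count(t) for d in part)      # occurrences of t
        m = sum(1 for d in part if t in d)     # texts containing t
        return f * m / len(part)

    def final_weight(t):                       # formula (4): ratio >= 1
        ws, wd = w(t, new_docs), w(t, old_docs)
        if min(ws, wd) == 0:                   # absent on one side: rank last
            return math.inf
        return max(ws, wd) / min(ws, wd)

    # Step six: ascending sort; the K weights closest to 1 are kept.
    return sorted(stage1, key=final_weight)[:K]
```

With K = 1000 and α = 2 (the value at which the description below reports peak accuracy), the first stage keeps the 2000 features with the largest IG and the second stage keeps the 1000 whose formula-(4) ratio lies closest to 1.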
Beneficial effects of the present invention:
First, the feature extraction method proposed by the present invention suits settings where text content changes constantly and transfer learning is used for text classification. Because manual labeling is costly, the training data can contain only a small amount of new data and must retain a large amount of old data.
Second, the proposed feature extraction method takes into account the realities of training-data skew and the differing distributions of new and old data, and can therefore represent the text data better.
Third, in the proposed method, the first-stage extraction filters out features that carry little domain information and discriminate poorly between texts. As α grows, more and more features pass into the second-stage extraction. The second stage looks at how similarly a feature is distributed in the new data and the old data: the more consistent the distributions, the closer its formula-(4) weight is to 1 and the earlier it appears in the ascending ranking. But as α keeps growing, many features that carry little domain information and discriminate poorly slip through the first stage and, because they happen to be distributed consistently across the new and old data, rank well in the final ordering; such features do not help text representation, hurt classification, and lower accuracy. Experiments show that classification accuracy peaks when α is 2.
When the method for application migration study solves text classification problem, algorithm of the present invention is applied in Text character extraction link, the feature extracted can be made neither too to be inclined to legacy data, also not obtain from a small amount of new data merely, thus improve the accuracy of text classification.
Brief description of the drawings
Fig. 1 is a flow chart of the algorithm of the present invention for extracting text model features for classification.
Detailed description of the embodiments:
The present invention is described in more detail below with reference to the accompanying drawing:
As shown at 101 in Fig. 1, when a transfer learning method is used to solve a binary text classification problem, the training data consist of new data and old data. The old data and the new data come from different domains and may have different distributions, and the training data are skewed: the new data are few while the old data are plentiful. Every text feature must fall into one of the following four cases with respect to the training data:
1) high probability of occurrence in the old data, low probability in the new data;
2) low probability of occurrence in the old data, high probability in the new data;
3) high probability of occurrence in the old data, and also high probability in the new data;
4) low probability of occurrence in the old data, and also low probability in the new data.
Features in case 1) are more inclined to represent the old data rather than the new data; features in case 2) are likely to be effective only for the small amount of new data; features in case 4) are mostly filtered out during extraction. Features in these three cases are unsuitable as text features.
By comparison, only the features in case 3) are suitable. They not only represent the old data but are also more likely to represent the new data well. The aim here is precisely to extract the case-3) features.
In the first stage, features should be extracted from all of the training data, which comprise a small amount of new data and a large amount of old data; the new data come from the target domain, the old data may come from other domains, and all of these data have been labeled.
Features that pass the first-stage extraction are suited to representing text in the broader, higher-level field that covers both the new-data domain and the old-data domain; they carry more domain information and discriminate better between texts. The first-stage extraction thus filters out features that carry little domain information and discriminate poorly between texts.
In the first feature extraction stage, some traditional feature extraction method is adopted. For example, the information gain (IG) method can compute feature weights; the IG value of a feature t may be defined as:
$$IG(t) = -\sum_{i=1}^{m} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{m} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t}) \qquad (1)$$

In formula (1), $P(C_i)$ is the ratio of the number of texts in class $C_i$ to the total number of texts; $P(t)$ is the ratio of the number of texts containing feature $t$ to the total; $P(C_i \mid t)$ is the probability that a text belongs to $C_i$ when feature $t$ appears; $P(\bar{t})$ is the ratio of the number of texts not containing feature $t$ to the total; and $P(C_i \mid \bar{t})$ is the probability that a text belongs to $C_i$ when feature $t$ does not appear.
For every candidate feature in the training data, compute its weight (for example, its IG value) and sort; the top-ranked features form the first-stage extraction. The number of features extracted in the first stage should exceed the number ultimately wanted: to end up with K features, extract α*K features in the first stage (α > 1; the specific value can be tuned experimentally).
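As a worked instance of formula (1) with hypothetical counts (natural logarithm assumed; the patent does not fix the log base): suppose m = 2 balanced classes $C_1, C_2$ of 5 texts each, and a feature $t$ occurring in 4 texts of $C_1$ and 1 text of $C_2$, so $t$ is absent from 1 text of $C_1$ and 4 texts of $C_2$. Then

$$P(C_1)=P(C_2)=\tfrac{1}{2},\qquad P(t)=P(\bar{t})=\tfrac{1}{2},\qquad P(C_1\mid t)=P(C_2\mid\bar{t})=\tfrac{4}{5},$$

$$IG(t)=\ln 2+\tfrac{1}{2}\left(\tfrac{4}{5}\ln\tfrac{4}{5}+\tfrac{1}{5}\ln\tfrac{1}{5}\right)+\tfrac{1}{2}\left(\tfrac{1}{5}\ln\tfrac{1}{5}+\tfrac{4}{5}\ln\tfrac{4}{5}\right)\approx 0.693-0.500=0.193,$$

a moderately informative feature; a feature split 5/0 between the classes would score the full $\ln 2 \approx 0.693$.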
As shown at 102 to 103 in Fig. 1, the second stage considers which features are better suited to representing the new data.
When extracting features in the second stage, the features that occur frequently and densely in the new portion of the training data cannot be considered in isolation, because such features most likely reflect only a very small aspect of the new data. At the same time, the first-stage features cannot all be used directly to represent the text, because the old data account for the majority of the training set, so those features lean toward representing the content of the old data.
In the second stage, from the α*K first-stage features, choose those whose distributions in the new and old data are similar; distribution here means how densely a feature occurs in the texts. A similar distribution across the new and old data indicates that the feature represents not only the old texts but also the new texts well.
As shown at 104 in Fig. 1, this criterion requires a measure of a feature's distribution over the new and old data. The measure is used to judge how important a feature is in each, and it should be as cheap to compute as possible, because in most cases the unextracted text features run to tens of thousands of dimensions. The present invention computes the distribution of feature t in the new and old data with formulas (2) and (3):
$$w_{same}(t, C_{same}) = f(t, C_{same}) \cdot n(t, C_{same}) / N(C_{same}) \qquad (2)$$

$$w_{dif}(t, C_{dif}) = f(t, C_{dif}) \cdot n(t, C_{dif}) / N(C_{dif}) \qquad (3)$$

where $C_{same}$ and $C_{dif}$ denote the new and the old data in the training set, respectively; $f(t, C_{same})$ and $f(t, C_{dif})$ are the numbers of occurrences of feature $t$ in the new and old data; $n(t, C_{same})$ and $n(t, C_{dif})$ are the numbers of texts containing feature $t$ in the new and old data; $N(C_{same})$ and $N(C_{dif})$ are the total numbers of texts in the new and old data; and $w_{same}(t, C_{same})$ and $w_{dif}(t, C_{dif})$ represent the distribution of feature $t$ in the new and old data.
In formulas (2) and (3), the larger the value of $w_{same}(t, C_{same})$ or $w_{dif}(t, C_{dif})$, the more important feature $t$ is in $C_{same}$ or $C_{dif}$. The features to retain are precisely those distributed similarly in the new and old data, where a similar distribution means that, for a feature $t$, the values of $w_{same}(t, C_{same})$ and $w_{dif}(t, C_{dif})$ are as close as possible. The present invention computes the final weight of feature $t$ with formula (4); the closer this weight is to 1, the more similar the distribution of feature $t$ in the new and old data:
$$\max\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \,/\, \min\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \qquad (4)$$
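To make formulas (2) to (4) concrete with assumed counts (not taken from the patent): suppose the new data $C_{same}$ hold 100 texts in which feature $t$ occurs 30 times across 20 texts, while the old data $C_{dif}$ hold 1000 texts in which $t$ occurs 280 times across 190 texts. Then

$$w_{same} = 30 \times 20 / 100 = 6, \qquad w_{dif} = 280 \times 190 / 1000 = 53.2, \qquad 53.2 / 6 \approx 8.87,$$

so $t$ is distributed very differently in the two parts and ranks late in the ascending order, whereas a feature with $w_{same} = 6$ and $w_{dif} = 7$ scores $7/6 \approx 1.17$ and ranks early.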
As shown at 102 to 104 in Fig. 1, these steps are repeated in turn, continually extracting the second-stage α*K features.
As shown at 105 in Fig. 1, in the second stage of feature extraction, the distribution of each of the α*K first-stage features is computed with formulas (2) and (3), and its weight with formula (4). The α*K features are then sorted by weight and the K features with the smallest weights are selected. These K features are the features extracted by the method of the invention.
The present invention mainly studies feature extraction for transfer learning methods. In the process of solving a binary text classification problem with transfer learning, it improves the existing feature extraction step: for the situation of a small amount of new data and a large amount of old data in the training set, it proposes a two-pass extraction method that effectively improves the precision and recall of classification.
The application of the present invention is described below along the main steps of text classification.
1) Obtain the training document collection. The quality of the training data bears directly on the quality of the final classification model, so training documents should be chosen by experts in the relevant field, in the hope of obtaining higher quality, or widely used public data sets should be adopted.
2) Preprocess the information. Research data are mostly drawn from real life, and such documents often contain material of no interest to researchers, such as advertisements embedded in Web documents and useless HTML markup. Text must therefore be preprocessed before feature extraction. Preprocessing includes removing noise data, removing stop words, and stemming English text. Chinese text requires one more crucial task: word segmentation. In Chinese, the basic unit of a sentence is the character rather than the word, and unlike English there is no fixed separator between words; to extract features, Chinese text must first be segmented, which not only inserts separators between words but also tags each word's part of speech. Many good Chinese segmentation tools now exist, such as ICTCLAS from the Chinese Academy of Sciences and the open-source IKAnalyzer, which satisfy most users' needs.
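The patent names ICTCLAS and IKAnalyzer as segmentation tools; purely as an illustration, the open-source jieba library (an assumption here, not mentioned in the patent) performs segmentation with part-of-speech tagging in a few lines of Python:

```python
import jieba.posseg as pseg  # open-source Chinese segmenter with POS tagging

# Segment a sentence and print each word with its part-of-speech tag.
for w in pseg.cut("文本挖掘是从大量文本信息中抽取有用知识的过程"):
    print(w.word, w.flag)
```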
3) Feature extraction. Some of the words in a text are extracted as its features; the feature extraction algorithm proposed by the present invention can be adopted here.
4) Text representation. Text is generally described in natural language, whose meaning a computer cannot understand, so the text must be converted into a form the computer can process; feature extraction lays the groundwork for this. The vector space model (VSM) is currently the widely applied and effective method.
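As a sketch of the VSM step under assumed inputs, scikit-learn's TfidfVectorizer (an implementation choice, not something the patent specifies) can build document vectors over exactly the K selected features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical output of the two-stage extraction: the K retained features.
selected = ["data", "feature", "text"]

# Restricting the vocabulary to them yields the VSM representation.
vectorizer = TfidfVectorizer(vocabulary=selected)
X = vectorizer.fit_transform(["text data and more text", "feature data"])
print(X.shape)  # (number of documents, K)
```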
5) Select the text classification method and classify the structured text representations. Transfer learning methods can be adopted; commonly used ones include the AdaBoost algorithm and the TrAdaBoost algorithm.
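The patent names AdaBoost and TrAdaBoost as common choices; as one illustrative option (TrAdaBoost has no standard scikit-learn implementation and would need custom code), plain AdaBoost over toy VSM vectors might look like this:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the structured (VSM) training texts and their labels.
train_texts = ["old domain text one", "old domain text two", "new domain text"]
train_labels = [0, 1, 0]

X = TfidfVectorizer().fit_transform(train_texts)
clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X, train_labels)
print(clf.predict(X))
```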
6) Performance evaluation. Commonly used evaluation metrics include precision, recall, macro-averaged precision, and macro-averaged recall.
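A minimal sketch of the evaluation step with scikit-learn, on hypothetical gold labels and predictions:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 1, 0, 1]   # hypothetical gold labels
y_pred = [0, 1, 0, 0, 1]   # hypothetical classifier output

print(precision_score(y_true, y_pred, average="macro"))  # macro-averaged precision
print(recall_score(y_true, y_pred, average="macro"))     # macro-averaged recall
```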
Adopting the text classification workflow above and using the feature extraction algorithm proposed by the present invention in transfer learning improves the precision and recall of text classification.
The invention has been described schematically above with reference to the drawing and the embodiment, and the description is not restrictive. Those of ordinary skill in the art will understand that in practical applications the arrangement of the components of the invention may change, and others may produce similar designs under its inspiration. It should be pointed out that all obvious changes and similar designs that do not depart from the design aim of the invention fall within the protection scope of the invention.

Claims (1)

1. An algorithm for extracting text model features for classification, comprising the following steps:
Step one: compute the feature weights over the text model training data with the information gain (IG) algorithm:

$$IG(t) = -\sum_{i=1}^{m} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{m} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t}) \qquad (1)$$

In formula (1), $P(C_i)$ is the ratio of the number of texts in class $C_i$ to the total number of texts; $P(t)$ is the ratio of the number of texts containing feature $t$ to the total; $P(C_i \mid t)$ is the probability that a text belongs to $C_i$ when feature $t$ appears; $P(\bar{t})$ is the ratio of the number of texts not containing feature $t$ to the total; and $P(C_i \mid \bar{t})$ is the probability that a text belongs to $C_i$ when feature $t$ does not appear;
Step two: sort the IG weights obtained in step one and extract the first-stage α*K features;
Step three: for each of the first-stage α*K features, compute the distribution of feature t over the new data and the old data in the text model training set with formulas (2) and (3):

$$w_{same}(t, C_{same}) = f(t, C_{same}) \cdot n(t, C_{same}) / N(C_{same}) \qquad (2)$$

$$w_{dif}(t, C_{dif}) = f(t, C_{dif}) \cdot n(t, C_{dif}) / N(C_{dif}) \qquad (3)$$

where $C_{same}$ and $C_{dif}$ denote the new and the old data in the training set, respectively; $f(t, C_{same})$ and $f(t, C_{dif})$ are the numbers of occurrences of feature $t$ in the new and old data; $n(t, C_{same})$ and $n(t, C_{dif})$ are the numbers of texts containing feature $t$ in the new and old data; $N(C_{same})$ and $N(C_{dif})$ are the total numbers of texts in the new and old data; and $w_{same}(t, C_{same})$ and $w_{dif}(t, C_{dif})$ represent the distribution of feature $t$ in the new and old data;
Step four: from the distributions of feature t in the new and old data obtained in step three, compute the final weight of feature t with formula (4), extracting the second-stage α*K features:

$$\max\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \,/\, \min\{w_{same}(t, C_{same}),\, w_{dif}(t, C_{dif})\} \qquad (4)$$
Step five: repeat steps two to four in turn, extracting the second-stage α*K features one after another;
Step six: sort the second-stage α*K features obtained through step five by weight from smallest to largest, and select the K features with the smallest weights to complete the text model classification.
CN201410765214.8A 2014-12-10 2014-12-10 Algorithm for extracting text model features to classify text models Pending CN104462406A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410765214.8A CN104462406A (en) 2014-12-10 2014-12-10 Algorithm for extracting text model features to classify text models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410765214.8A CN104462406A (en) 2014-12-10 2014-12-10 Algorithm for extracting text model features to classify text models

Publications (1)

Publication Number Publication Date
CN104462406A true CN104462406A (en) 2015-03-25

Family

ID=52908441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410765214.8A Pending CN104462406A (en) 2014-12-10 2014-12-10 Algorithm for extracting text model features to classify text models

Country Status (1)

Country Link
CN (1) CN104462406A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553146A (en) * 2020-05-09 2020-08-18 杭州中科睿鉴科技有限公司 News writing style modeling method, writing style-influence analysis method and news quality evaluation method
CN112989032A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Entity relationship classification method, apparatus, medium and electronic device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012088972A (en) * 2010-10-20 2012-05-10 Nippon Telegr & Teleph Corp <Ntt> Data classification device, data classification method and data classification program
CN102750338A (en) * 2012-06-04 2012-10-24 天津大学 Text processing method facing transfer learning and text feature extraction method thereof


Similar Documents

Publication Publication Date Title
CN104199972B (en) A kind of name entity relation extraction and construction method based on deep learning
CN106294593B (en) In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN103984681B (en) News event evolution analysis method based on time sequence distribution information and topic model
CN110866117A (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN106709754A (en) Power user grouping method based on text mining
CN109885670A (en) A kind of interaction attention coding sentiment analysis method towards topic text
CN103020122A (en) Transfer learning method based on semi-supervised clustering
CN103617157A (en) Text similarity calculation method based on semantics
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN102289522A (en) Method of intelligently classifying texts
CN101127042A (en) Sensibility classification method based on language model
CN103699525A (en) Method and device for automatically generating abstract on basis of multi-dimensional characteristics of text
CN109086375A (en) A kind of short text subject extraction method based on term vector enhancing
CN106601235A (en) Semi-supervision multitask characteristic selecting speech recognition method
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN103390046A (en) Multi-scale dictionary natural scene image classification method based on latent Dirichlet model
CN103020167B (en) A kind of computer Chinese file classification method
CN109446423B (en) System and method for judging sentiment of news and texts
CN110705272A (en) Named entity identification method for automobile engine fault diagnosis
CN102880631A (en) Chinese author identification method based on double-layer classification model, and device for realizing Chinese author identification method
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN110728144B (en) Extraction type document automatic summarization method based on context semantic perception

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20150325)