CN109977206A - A kind of short text feature extracting method blended based on multiple features factor - Google Patents
A kind of short text feature extracting method blended based on multiple features factor
- Publication number: CN109977206A (application CN201910211517.8A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
A short-text feature extraction method based on the fusion of multiple feature factors. Short-text comments are segmented with the jieba segmentation tool and stop words are removed, producing a preliminary text feature-word vector matrix. Weights are then computed for the constructed feature-word vector matrix with the traditional TF-IDF algorithm, yielding a weight vector matrix. A feature-word position impact factor α and a part-of-speech feature factor β are introduced; the preliminary text feature words are POS-tagged one by one, and the α and β values of each feature word are computed. The obtained α and β values are multiplied by the corresponding weights of the traditional TF-IDF algorithm, finally yielding the weight vector matrix of the optimized TF-IDF algorithm. The technical solution provided by the present invention mitigates, to a certain extent, the feature-word weight imbalance of the traditional TF-IDF algorithm, thereby improving the accuracy of text feature extraction and providing effective support for sentiment classification tasks.
Description
Technical field
The present invention relates to the field of text mining technology, and in particular to a short-text feature extraction method based on the fusion of multiple feature factors.
Background technique
With the advance of the Web 3.0 era, Internet information has become ever more integrated into people's lives. Large numbers of users post their opinions about events or products online, and over time these comments can strongly influence people's thinking and behavior. At the same time, they carry a variety of emotional attitudes and sentiment information, such as joy, anger, sorrow, and happiness, or positive, neutral, and negative polarity. Through these comments, other users learn what the user community thinks of a certain event or product, so the information has enormous potential mining value. Moreover, as the Internet develops at high speed, the major network platforms generate tens of thousands of text comments every day, and the stream never stops. Mining this text purely by hand would consume enormous human and material resources, which makes technical methods such as text mining particularly important.
The key to text mining is text feature extraction. Many feature extraction methods exist, such as the vector space model (VSM), term frequency-inverse document frequency (TF-IDF), mutual information (MI), and chi-square statistics (Chi-square), and they have achieved good results in feature extraction research. Among them, TF-IDF is widely regarded as representative: the importance of a word increases in proportion to the number of times it appears in a text, but decreases in inverse proportion to its frequency in the corpus. Because the method measures importance by term frequency alone, some feature words reach meaninglessly high frequencies and are assigned relatively high weights, while some vocabulary features with real discriminative power are discarded because of their low frequency; this phenomenon is known as weight imbalance. Moreover, the method considers neither the position of a word within a sentence nor its part of speech, even though words appearing toward the front or back of a sentence carry their own importance, so some error is inevitable in the feature extraction process.
Summary of the invention
In view of the above, the object of the present invention is to provide a short-text feature extraction method based on the fusion of multiple feature factors that improves the accuracy of feature extraction.
To achieve the goals above, the technical solution adopted by the present invention is that:
A short-text feature extraction method based on the fusion of multiple feature factors, comprising the following steps:
1) Segment the short-text comments with the jieba segmentation tool and remove stop words, thereby constructing a preliminary text feature-word vector matrix. This includes:
preprocessing, such as extracting and filtering the user's product comments;
segmenting the short-text comment information with the jieba segmentation tool, and removing stop words from the segmented text with a Stopwords list;
assuming there are n comment sentences, preprocessing these n sentences yields the preliminary text feature-word vector matrix, defined as F = {wi1, wi2, …, wik | 1 ≤ i ≤ n, k ∈ N+};
2) Compute weights for the constructed feature-word vector matrix with the traditional TF-IDF algorithm, obtaining the weight vector matrix; that is, compute the TF value, the IDF value, and the corresponding weight Wtf of each feature word.
The term frequency of a feature word in document dj is the ratio of its number of occurrences to the total number of words in dj, and measures the importance of the word within that particular document:
TF(ti, dj) = f(i,j) / Σk f(k,j)
where f(i,j) is the number of times word ti occurs in document dj and the denominator is the total number of occurrences of all words in dj. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the result:
IDF(ti) = log( |D| / |{j : ti ∈ dj}| )
where |D| is the total number of documents in the corpus and |{j : ti ∈ dj}| is the number of documents containing the word; if the word does not occur in the corpus, the denominator would be 0, so |{j : ti ∈ dj}| + 1 is generally used. Finally, TF-IDF is calculated as:
Wtf(ti, dj) = TF(ti, dj) × IDF(ti);
3) Introduce the feature-word position impact factor α and the part-of-speech feature factor β, POS-tag the preliminary text feature words one by one, and compute the α and β values of each feature word; define the meaning of the two factors and solve for them. The preliminary text feature words F are POS-tagged with the jieba segmentation tool.
Definition 1 (position impact factor α): from the parts of speech of the feature words in a comment sentence it can be observed that feature words near the front or back of the sentence distinguish the sentence better than those in the middle. Moreover, most sentences have a subject-predicate-object composition, and subjects and objects tend to appear at the head and tail of the sentence, so this kind of feature word also shows better discriminative ability than the predicate. The half of the feature length Len of each preprocessed comment sentence is therefore taken as the measure: starting from the feature word at the middle position, whose default value l is 1, the factor extends toward both sides, where Loc denotes the position of the j-th feature word of the i-th row and Len denotes the feature-word length of each text comment sentence.
Definition 2 (part-of-speech feature factor β): by parsing the part-of-speech structure of a sentence and analyzing the compositional relations of Chinese sentences, the main part-of-speech hierarchy is defined, in order, as noun, verb, adjective, adverb, and all other parts of speech; the corresponding impact factors along this hierarchy are β = {5, 4, 3, 2, 1};
4) Multiply the obtained feature-word position impact factor α and part-of-speech feature factor β by the corresponding weight Wtf of the traditional TF-IDF algorithm, obtaining the weight vector matrix of the optimized TF-IDF algorithm. This comprises:
computing the position impact factor α and the part-of-speech feature factor β of each feature word;
building the feature-word weight vector matrix of the optimized TF-IDF algorithm.
Following Definitions 1 and 2, the two impact factors are introduced into the traditional TF-IDF algorithm: after the TF-IDF weight has been computed, it is combined with the position impact factor α and part-of-speech feature factor β of each feature word, giving the optimized weight Weight:
Weight = α * Wtf * β.
The TF-IDF algorithm optimization steps are:
1) build the preliminary text feature-word vector matrix from the preprocessed text data set;
2) obtain the feature-word vectors one by one and solve for the part-of-speech feature factor β with the WordFea_Factor function;
3) compute the feature-word vector length Len and the corresponding feature-word position Loc;
4) compute the position impact factor α of the corresponding feature word with the equation in Definition 1;
5) multiply the weight Wtf of the traditional TF-IDF algorithm by the α and β values, finally obtaining the weight Weight of the optimized TF-IDF algorithm.
Here WordLoc_Factor denotes the position-impact-factor calculation, which solves for α from the proportion of the sentence length at which the feature word appears, and WordFea_Factor denotes the part-of-speech-factor calculation, which assigns a value according to the part-of-speech rank of the feature word.
The beneficial effects of the present invention are:
Short-text comments are typically brief, thematically clear, and emotionally dense. Their sentences are usually built on a subject-predicate or subject-predicate-object structure, where subjects and objects are usually nouns, predicates are usually verbs, adjectives modify nouns, and adverbs modify verbs. The parts of speech of these phrases also influence, to a certain extent, the emotional expression of the sentence. Chinese parts of speech divide into content words and function words: nouns, adjectives, verbs, numerals, pronouns, and classifiers are content words, while adverbs, prepositions, interjections, modal particles, auxiliary words, and conjunctions are function words. The jieba segmentation tool provides a complete set of part-of-speech tagging rules, shown in Fig. 2, in which noun, verb, adjective, and adverb tags are encoded beginning with n, v, a, and d respectively.
The essence of the TF-IDF algorithm is to assign weights chiefly on the basis of term-frequency statistics. A Chinese comment sentence, however, usually takes a noun or noun phrase as its main body, with verbs, adjectives, and adverbs adding richness and complexity without changing the main structure. Meanwhile, the attribute features embedded in Chinese comment sentences mostly appear as nouns or noun phrases; in "the phone is pretty good, the appearance looks great, the system is smooth, and the running speed is very good", the words "phone, appearance, system, running speed" are all attribute feature words, which reflect, to a certain extent, the intrinsic features of the product itself. The words or phrases with sentiment tendency in a comment sentence express the emotion stated in the text, usually covering three orientations: positive, neutral, and negative. In the comment "the phone is genuine and runs very smoothly", the words "genuine, very smooth" are sentiment words, and such sentiment words are mostly adjectives or nouns. Early researchers defined sentiment words mainly as adjectives, later extending the research to nouns and verbs. Analysis of comment text shows that sentiment words are mainly adjectives, with a minority of nouns and verbs; if only a single part of speech were taken as the sentiment words, some words carrying emotion would be abandoned, the coverage of comment sentences containing emotional views and attitudes would be greatly reduced, and the accuracy of the sentiment feature extraction results would decrease accordingly.
Brief description of the drawings
Fig. 1 is the main flowchart of the short-text feature extraction based on the fusion of multiple feature factors provided by the present invention;
Fig. 2 shows the commonly used part-of-speech tags of the jieba segmentation tool provided by the present invention;
Fig. 3 shows the definition of the feature-word front/back position factor structure provided by the present invention;
Fig. 4 is the feature-word part-of-speech structure analysis diagram provided by the present invention;
Fig. 5 is the core flowchart of the optimized TF-IDF algorithm provided by the present invention.
Specific embodiment
The technical method of the application is described in detail below with reference to the accompanying drawings.
Addressing the deficiencies of the TF-IDF algorithm, the present invention introduces a feature-word position impact factor and a part-of-speech feature factor, improves the TF-IDF algorithm, and proposes a short-text feature extraction method based on the fusion of multiple feature factors, so as to alleviate problems such as the weight imbalance the TF-IDF algorithm exhibits when computing feature-word weights. For ease of understanding, the specific implementation is narrated in the form of assumptions.
The traditional TF-IDF algorithm is introduced first. The term frequency of a feature word in document dj is the ratio of its number of occurrences to the total number of words in dj, measuring the word's importance within that particular document:
TF(ti, dj) = f(i,j) / Σk f(k,j)
where f(i,j) is the number of times word ti occurs in document dj and the denominator is the total number of occurrences of all words in dj.
The inverse document frequency IDF measures the general importance of a word: the fewer documents contain the term, the larger the IDF, and the better the term's ability to discriminate between classes. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm:
IDF(ti) = log( |D| / |{j : ti ∈ dj}| )
where |D| is the total number of documents in the corpus and |{j : ti ∈ dj}| is the number of documents containing the word; if the word does not occur in the corpus, the denominator would be 0, so |{j : ti ∈ dj}| + 1 is generally used. Finally, TF-IDF is calculated as:
Wtf(ti, dj) = TF(ti, dj) × IDF(ti)
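A minimal sketch of the three formulas above, using the smoothed denominator |{j : ti ∈ dj}| + 1; the toy documents and words are illustrative only:

```python
import math

def tf(word, doc):
    # TF(ti, dj) = f(i,j) / sum_k f(k,j): occurrences over total words in dj
    return doc.count(word) / len(doc)

def idf(word, docs):
    # IDF(ti) = log(|D| / (|{j : ti in dj}| + 1)); the +1 guards against a
    # zero denominator when the word appears in no document
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (df + 1))

def w_tf(word, doc, docs):
    # Wtf(ti, dj) = TF(ti, dj) * IDF(ti)
    return tf(word, doc) * idf(word, docs)

docs = [["手机", "正品", "流畅"], ["外观", "好看"], ["系统", "流畅"]]
print(round(w_tf("正品", docs[0], docs), 4))
```

Note that with the +1 smoothing, a word contained in every document gets a non-positive IDF, which is consistent with the intuition that such a word has no discriminative power.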
Assume the text documents include C1, C2, and C3, and the feature words are denoted w1, w2, w3, w4, and w5; Table 1 shows the frequency with which the feature words occur in the different documents. The total number of text documents is specified as 50, the total word counts of C1, C2, and C3 are 30, 25, and 40 respectively, and the numbers of documents containing w1, w2, w3, w4, and w5 are 30, 13, 18, 25, and 40 respectively. First, TF, IDF, and Wtf are computed with the traditional TF-IDF algorithm; the results are shown in Table 2.
Table 1: occurrence counts of the feature words in the different documents
Table 2: feature-word weights computed by the TF-IDF algorithm
Assume there are n comment sentences; preprocessing these n sentences yields the preliminary text feature-word vector matrix, defined as F = {wi1, wi2, …, wik | 1 ≤ i ≤ n, k ∈ N+}. At the same time, the influence of feature-word front/back position and part-of-speech features is considered: the position impact factor is defined as α and the part-of-speech feature factor as β.
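The preprocessing that yields F can be sketched as follows. The patent segments with the jieba tool; to keep this example dependency-free, the comments below are assumed to be already segmented into token lists (as jieba.lcut would return), and the stop-word set is a small illustrative stand-in for a full Stopwords list:

```python
# Stand-in for a full Stopwords list (illustrative only)
STOPWORDS = {"的", "了", "是", "也"}

def build_feature_matrix(segmented_comments):
    """Drop stop words from each segmented comment, returning the
    preliminary feature-word matrix F = {w_i1, ..., w_ik | 1 <= i <= n}."""
    return [[w for w in tokens if w not in STOPWORDS]
            for tokens in segmented_comments]

comments = [
    ["手机", "是", "正品", "很", "流畅"],
    ["外观", "好看", "系统", "也", "流畅"],
]
F = build_feature_matrix(comments)
print(F)
```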
Definition 1 (position impact factor α): from the parts of speech of the feature words in a comment sentence it can be observed that feature words near the front or back of the sentence distinguish the sentence better than those in the middle. Moreover, most sentences have a subject-predicate-object composition, and subjects and objects tend to appear at the head and tail of the sentence, so this kind of feature word also shows better discriminative ability than the predicate. The half of the feature length Len of each preprocessed comment sentence is therefore taken as the measure: starting from the feature word at the middle position, whose default value l is 1, the factor extends toward both sides, as shown in Fig. 3, where Loc denotes the position of the j-th feature word of the i-th row and Len denotes the feature-word length of each text comment sentence.
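The exact expression for α is not reproduced in this text, so the following is only an assumed sketch consistent with Definition 1's description: the factor defaults to 1 at the middle position and grows as the feature word moves toward the head or tail of the sentence. Both the function name and the formula are illustrative assumptions:

```python
def word_loc_factor(loc, length):
    """Assumed form of the position impact factor alpha.

    loc is the 1-based position of the feature word; length is Len, the
    number of feature words in the sentence. The middle position defaults
    to alpha = 1; words nearer the ends receive a larger factor. This is
    NOT the patent's formula, which does not survive in this text.
    """
    if length <= 1:
        return 1.0
    mid = length / 2
    return 1.0 + abs(loc - mid) / mid  # 1 at the middle, up to 2 at the ends

factors = [round(word_loc_factor(p, 7), 3) for p in range(1, 8)]
print(factors)
```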
Definition 2 (part-of-speech feature factor β): by parsing the part-of-speech structure of a sentence and analyzing the compositional relations of Chinese sentences, as shown in Fig. 4, the main part-of-speech hierarchy is defined, in order, as noun (n), verb (v), adjective (a), adverb (d), and all other parts of speech. The corresponding impact factors along this hierarchy are β = {5, 4, 3, 2, 1}.
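Since jieba's noun, verb, adjective, and adverb tags begin with n, v, a, and d, Definition 2 can be realized as a simple prefix lookup; the function name mirrors the WordFea_Factor routine named later in the text:

```python
def word_fea_factor(flag):
    """Part-of-speech feature factor beta per Definition 2.

    flag is a jieba-style POS tag; tags beginning with n (noun), v (verb),
    a (adjective), and d (adverb) map to 5, 4, 3, 2, and all other parts
    of speech map to 1.
    """
    for prefix, beta in (("n", 5), ("v", 4), ("a", 3), ("d", 2)):
        if flag.startswith(prefix):
            return beta
    return 1

print([word_fea_factor(f) for f in ("n", "ns", "v", "ad", "d", "uj")])
```

Note that compound tags inherit the rank of their first letter (e.g. "ns", a place name, ranks as a noun), which matches the n/v/a/d prefix encoding the patent cites from Fig. 2.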
Through the above analysis and definitions, the two impact factors are introduced into the traditional TF-IDF algorithm: after the TF-IDF weight has been computed, it is combined with the position impact factor α and the part-of-speech feature factor β of each feature word, giving the optimized weight:
Weight = α * Wtf * β
Based on the above method, the application implements the optimized TF-IDF algorithm and proposes its core flow, shown in Fig. 5, whose main steps are:
build the preliminary text feature-word vector matrix from the preprocessed text data set;
obtain the feature-word vectors one by one and solve for the part-of-speech feature factor β with the WordFea_Factor function;
compute the feature-word vector length Len and the corresponding feature-word position Loc;
compute the position impact factor α of the corresponding feature word with the equation in Definition 1;
multiply the weight Wtf of the traditional TF-IDF algorithm by the α and β values, finally obtaining the weight Weight of the optimized TF-IDF algorithm.
Here WordLoc_Factor denotes the position-impact-factor calculation, which solves for α from the proportion of the sentence length at which the feature word appears, and WordFea_Factor denotes the part-of-speech-factor calculation, which assigns a value according to the part-of-speech rank of the feature word.
The algorithm is implemented in Python: Algorithm 1 is the pseudocode of the feature-word position-impact-factor calculation, Algorithm 2 the pseudocode of the part-of-speech-factor calculation, and Algorithm 3 the pseudocode of the optimized TF-IDF weight calculation. The calculation in Algorithm 3 relies on Algorithms 1 and 2.
WordLoc_Factor (feature-word position impact factor calculation)
The WordLoc_Factor algorithm describes the sub-process that calculates the feature-word position impact factor. It requires the preprocessed feature-word vectors, the length of each feature-word vector, and the position of the feature word within its vector; this position is computed in Algorithm 3.
Algorithm 1: WordLoc_Factor
Input: word, Len, Loc (feature word, feature-vector length, feature-word position)
Output: α (position impact factor)
Algorithm 2: WordFea_Factor
Input: word, flag (feature word, part-of-speech tag)
Output: β (part-of-speech feature factor)
Algorithm 3: Update_TF-IDF
Input: wordsMat (feature-word weight matrix)
Output: WeightMat (optimized feature-word weight matrix)
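Under the same assumptions as the earlier sketches, Algorithms 1-3 can be combined into one Update_TF-IDF pass. The α expression below is an assumed stand-in (the patent's exact position formula is not reproduced in this text); β follows Definition 2, and Wtf follows the TF-IDF formulas above, with each comment sentence treated as one document:

```python
import math

POS_BETA = {"n": 5, "v": 4, "a": 3, "d": 2}  # Definition 2; any other tag -> 1

def update_tf_idf(tagged_docs):
    """Sketch of Algorithm 3: Weight = alpha * Wtf * beta per feature word.

    tagged_docs holds sentences as lists of (word, pos_flag) pairs, e.g.
    as jieba.posseg would produce. The alpha expression is an assumed
    stand-in for the patent's (unreproduced) position formula.
    """
    docs = [[w for w, _ in sent] for sent in tagged_docs]
    n_docs = len(docs)
    weights = []
    for sent, words in zip(tagged_docs, docs):
        length = len(words)
        mid = length / 2 if length > 1 else 1.0
        row = []
        for loc, (word, flag) in enumerate(sent, start=1):
            tf = words.count(word) / length            # TF(ti, dj)
            df = sum(1 for d in docs if word in d)     # |{j : ti in dj}|
            w_tf = tf * math.log(n_docs / (df + 1))    # traditional TF-IDF
            alpha = 1.0 + abs(loc - mid) / mid         # position factor (assumed form)
            beta = POS_BETA.get(flag[:1], 1)           # POS feature factor
            row.append(alpha * w_tf * beta)
        weights.append(row)
    return weights

tagged = [
    [("手机", "n"), ("流畅", "a")],
    [("外观", "n"), ("好看", "a")],
    [("系统", "n"), ("流畅", "a")],
]
W = update_tf_idf(tagged)
```

As in the patent's flow, the outer routine drives per-word calls to the position and POS sub-calculations, so swapping in the patent's true Definition 1 formula only requires replacing the alpha line.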
Combining the above assumptions and definitions, computing the feature-word weights of the optimized TF-IDF algorithm requires assuming the part-of-speech composition of the text documents C1, C2, and C3, i.e. C1 = (n, v, n, a, d, …), C2 = (n, v, a, n, d, …), C3 = (n, d, v, a, n, …). The α value of each feature word is then computed with the formula in Definition 1; since only 5 feature words are cited, only the impact factors α and β of the first five feature words are computed. Table 3 shows the feature-word weights computed by the optimized TF-IDF algorithm.
Table 3: feature-word weights computed by the optimized TF-IDF algorithm
Preliminary inspection of the algorithm's hypothetical data, combined with the results in Table 3, shows an obvious gap between the result Wtf of the traditional TF-IDF algorithm and the result Weight of the optimized TF-IDF algorithm. Moreover, the optimized weights adjust with the positional and part-of-speech importance of the feature words.
The approach described above can remedy, to a certain extent, the deficiencies of the traditional TF-IDF algorithm. More importantly, the technical method provided by the present invention integrates factors that the traditional TF-IDF algorithm does not consider: it introduces the feature-word position impact factor and the part-of-speech feature factor and adjusts the weights computed by the TF-IDF algorithm, thereby improving the discriminative power of the feature words more accurately. The technical methods in this specification are described progressively, and the embodiments of the modules mentioned are closely related; the key technical methods mentioned in the claims are introduced in detail in this specification.
It should be pointed out that the invention is not limited to the technical methods described above and shown in the accompanying drawings. Those of ordinary skill in the art may make several modifications and supplements without departing from its engineering principles, and such modifications and supplements are to be regarded as within the protection scope of the present invention.
Claims (2)
1. A short-text feature extraction method based on the fusion of multiple feature factors, characterized by comprising the following steps:
1) segmenting the short-text comments with the jieba segmentation tool and removing stop words, thereby constructing a preliminary text feature-word vector matrix, including:
preprocessing, such as extracting and filtering the user's product comments;
segmenting the short-text comment information with the jieba segmentation tool, and removing stop words from the segmented text with a Stopwords list;
assuming there are n comment sentences, preprocessing these n sentences yields the preliminary text feature-word vector matrix, defined as F = {wi1, wi2, …, wik | 1 ≤ i ≤ n, k ∈ N+};
2) computing weights for the constructed feature-word vector matrix with the traditional TF-IDF algorithm, obtaining the weight vector matrix; that is, computing the TF value, the IDF value, and the corresponding weight Wtf of each feature word;
the term frequency of a feature word in document dj is the ratio of its number of occurrences to the total number of words in dj, measuring the word's importance within that particular document:
TF(ti, dj) = f(i,j) / Σk f(k,j)
where f(i,j) is the number of times word ti occurs in document dj and the denominator is the total number of occurrences of all words in dj; the IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm:
IDF(ti) = log( |D| / |{j : ti ∈ dj}| )
where |D| is the total number of documents in the corpus and |{j : ti ∈ dj}| is the number of documents containing the word; if the word does not occur in the corpus, the denominator would be 0, so |{j : ti ∈ dj}| + 1 is generally used; finally, TF-IDF is calculated as:
Wtf(ti, dj) = TF(ti, dj) × IDF(ti);
3) introducing the feature-word position impact factor α and the part-of-speech feature factor β, POS-tagging the preliminary text feature words one by one, computing the α and β values of each feature word, defining the meaning of the two factors, and solving for them; the preliminary text feature words F are POS-tagged with the jieba segmentation tool;
Definition 1 (position impact factor α): from the parts of speech of the feature words in a comment sentence it can be observed that feature words near the front or back of the sentence distinguish the sentence better than those in the middle; moreover, most sentences have a subject-predicate-object composition, and subjects and objects tend to appear at the head and tail of the sentence, so this kind of feature word also shows better discriminative ability than the predicate; the half of the feature length Len of each preprocessed comment sentence is therefore taken as the measure: starting from the feature word at the middle position, whose default value l is 1, the factor extends toward both sides, where Loc denotes the position of the j-th feature word of the i-th row and Len denotes the feature-word length of each text comment sentence;
Definition 2 (part-of-speech feature factor β): by parsing the part-of-speech structure of a sentence and analyzing the compositional relations of Chinese sentences, the main part-of-speech hierarchy is defined, in order, as noun, verb, adjective, adverb, and all other parts of speech; the corresponding impact factors along this hierarchy are β = {5, 4, 3, 2, 1};
4) multiplying the obtained feature-word position impact factor α and part-of-speech feature factor β by the corresponding weight Wtf of the traditional TF-IDF algorithm, obtaining the weight vector matrix of the optimized TF-IDF algorithm, including:
computing the position impact factor α and the part-of-speech feature factor β of each feature word;
building the feature-word weight vector matrix of the optimized TF-IDF algorithm;
following Definitions 1 and 2, the two impact factors are introduced into the traditional TF-IDF algorithm: after the TF-IDF weight has been computed, it is combined with the position impact factor α and part-of-speech feature factor β of each feature word, giving the optimized weight Weight:
Weight = α * Wtf * β.
2. The short text feature extraction method based on fusion of multiple feature factors according to claim 1, wherein the TF-IDF algorithm optimization steps are:
1) constructing a preliminary text feature term vector matrix from the preprocessed text data set;
2) obtaining the feature term vectors one by one, and solving the part-of-speech feature factor β value using the WordFea_Factor function;
3) calculating the feature word vector length Len and the corresponding feature word position Loc;
4) solving the position impact factor α value of the corresponding feature word by combining the equation of Definition 1;
5) multiplying the weight value W_tf solved by the traditional TF-IDF algorithm with the α and β values, finally obtaining the weight Weight of the optimized TF-IDF algorithm;
wherein WordLoc_Factor denotes the feature-word position impact factor calculation process, which solves α from the proportion of the entire sentence length accounted for by the position at which the feature word occurs, and WordFea_Factor denotes the part-of-speech feature factor calculation process, which performs assignment according to the part-of-speech grade of the feature word.
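Steps 1)-5) can be put together in a short end-to-end sketch. This is an illustrative reconstruction under the same assumptions noted above (an assumed α formula, since the filing gives it only as an image, and an assumed POS tag set), using a plain TF-IDF computation rather than any particular library:

```python
import math

POS_BETA = {"n": 5, "v": 4, "a": 3, "d": 2}  # Definition 2 hierarchy

def tf_idf(word, sentence, corpus):
    """Traditional TF-IDF weight W_tf of a word within one sentence."""
    tf = sentence.count(word) / len(sentence)
    df = sum(1 for s in corpus if word in s)
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

def optimized_weights(tagged_corpus):
    """Steps 1)-5): tagged_corpus is a list of sentences, each a list of
    (word, pos_tag) pairs; returns one optimized weight row per sentence."""
    words_only = [[w for w, _ in sent] for sent in tagged_corpus]  # step 1
    matrix = []
    for sent, words in zip(tagged_corpus, words_only):
        half = len(words) / 2.0                                    # step 3: Len / 2
        row = []
        for loc, (word, pos) in enumerate(sent):
            # step 4: assumed form of alpha (head/tail positions score higher)
            alpha = (1.0 + abs(loc - half) / half) if half else 1.0
            beta = POS_BETA.get(pos, 1)                            # step 2
            # step 5: Weight = alpha * W_tf * beta
            row.append(alpha * tf_idf(word, words, words_only) * beta)
        matrix.append(row)
    return matrix
```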
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910211517.8A CN109977206A (en) | 2019-03-20 | 2019-03-20 | A kind of short text feature extracting method blended based on multiple features factor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109977206A true CN109977206A (en) | 2019-07-05 |
Family
ID=67079596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910211517.8A Pending CN109977206A (en) | 2019-03-20 | 2019-03-20 | A kind of short text feature extracting method blended based on multiple features factor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977206A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103744953A (en) * | 2014-01-02 | 2014-04-23 | 中国科学院计算机网络信息中心 | Network hotspot mining method based on Chinese text emotion recognition |
CN105022805A (en) * | 2015-07-02 | 2015-11-04 | 四川大学 | Emotional analysis method based on SO-PMI (Semantic Orientation-Pointwise Mutual Information) commodity evaluation information |
CN108763477A (en) * | 2018-05-29 | 2018-11-06 | 厦门快商通信息技术有限公司 | A kind of short text classification method and system |
CN109284352A (en) * | 2018-09-30 | 2019-01-29 | 哈尔滨工业大学 | A kind of querying method of the assessment class document random length words and phrases based on inverted index |
Non-Patent Citations (4)
Title |
---|
Y. YANG: "Research and Realization of Internet Public Opinion Analysis Based on Improved TF - IDF Algorithm", 《2017 16TH INTERNATIONAL SYMPOSIUM ON DISTRIBUTED COMPUTING AND APPLICATIONS TO BUSINESS, ENGINEERING AND SCIENCE (DCABES)》 * |
Z. DE-YANG 等: "Research on micro-blog emotional tendency based on keyword extra", 《2018 37TH CHINESE CONTROL CONFERENCE (CCC)》 * |
WU Wei et al.: "Sentiment Analysis of Chinese Micro-blogs Based on Multi-features and Combined Classification", 《JOURNAL OF BEIJING INFORMATION SCIENCE AND TECHNOLOGY UNIVERSITY (NATURAL SCIENCE EDITION)》 * |
JU Chunhua et al.: "Research on a Cross-domain Sentiment Classification Model Based on Multi-feature Fusion", 《KNOWLEDGE MANAGEMENT FORUM》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472240A (en) * | 2019-07-26 | 2019-11-19 | 北京影谱科技股份有限公司 | Text feature and device based on TF-IDF |
CN110516249A (en) * | 2019-08-29 | 2019-11-29 | 新华三信息安全技术有限公司 | A kind of Sentiment orientation information obtaining method and device |
CN111046282A (en) * | 2019-12-06 | 2020-04-21 | 贝壳技术有限公司 | Text label setting method, device, medium and electronic equipment |
CN111046169A (en) * | 2019-12-24 | 2020-04-21 | 东软集团股份有限公司 | Method, device and equipment for extracting subject term and storage medium |
CN111046169B (en) * | 2019-12-24 | 2024-03-26 | 东软集团股份有限公司 | Method, device, equipment and storage medium for extracting subject term |
CN111444704A (en) * | 2020-03-27 | 2020-07-24 | 中南大学 | Network security keyword extraction method based on deep neural network |
CN111444704B (en) * | 2020-03-27 | 2023-09-19 | 中南大学 | Network safety keyword extraction method based on deep neural network |
CN111626040A (en) * | 2020-05-28 | 2020-09-04 | 数网金融有限公司 | Method for determining sentence similarity, related equipment and readable storage medium |
CN112380868A (en) * | 2020-12-10 | 2021-02-19 | 广东泰迪智能科技股份有限公司 | Petition-purpose multi-classification device based on event triples and method thereof |
CN112380868B (en) * | 2020-12-10 | 2024-02-13 | 广东泰迪智能科技股份有限公司 | Multi-classification device and method for interview destination based on event triplets |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977206A (en) | A kind of short text feature extracting method blended based on multiple features factor | |
CN111767741B (en) | Text emotion analysis method based on deep learning and TFIDF algorithm | |
Rychalska et al. | Samsung Poland NLP Team at SemEval-2016 Task 1: Necessity for diversity; combining recursive autoencoders, WordNet and ensemble methods to measure semantic similarity. | |
CN106383817A (en) | Paper title generation method capable of utilizing distributed semantic information | |
CN109960786A (en) | Chinese Measurement of word similarity based on convergence strategy | |
CN106503049A (en) | A kind of microblog emotional sorting technique for merging multiple affection resources based on SVM | |
CN102955772B (en) | A kind of similarity calculating method based on semanteme and device | |
CN105843897A (en) | Vertical domain-oriented intelligent question and answer system | |
CN104331394A (en) | Text classification method based on viewpoint | |
CN103226580A (en) | Interactive-text-oriented topic detection method | |
CN102122297A (en) | Semantic-based Chinese network text emotion extracting method | |
CN109002473A (en) | A kind of sentiment analysis method based on term vector and part of speech | |
CN108874896A (en) | A kind of humorous recognition methods based on neural network and humorous feature | |
CN110008465A (en) | The measure of sentence semantics distance | |
CN106446147A (en) | Emotion analysis method based on structuring features | |
Koptyra et al. | Clarin-emo: Training emotion recognition models using human annotation and chatgpt | |
Lee et al. | Who speaks like a style of vitamin: Towards syntax-aware dialogue summarization using multi-task learning | |
CN110324278A (en) | Account main body consistency detecting method, device and equipment | |
Qu et al. | Emotion Classification for Spanish with XLM-RoBERTa and TextCNN. | |
Wang et al. | YNU-HPCC at semeval-2018 task 2: Multi-ensemble Bi-GRU model with attention mechanism for multilingual emoji prediction | |
Rajendran et al. | Is something better than nothing? automatically predicting stance-based arguments using deep learning and small labelled dataset | |
CN111813927A (en) | Sentence similarity calculation method based on topic model and LSTM | |
CN115146031A (en) | Short text position detection method based on deep learning and assistant features | |
Tavan et al. | Identifying Ironic Content Spreaders on Twitter using Psychometrics, Contextual and Ironic Features with Gradient Boosting Classifier. | |
CN108959269B (en) | A kind of sentence auto ordering method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20190705 |