CN109977206A - A kind of short text feature extracting method blended based on multiple features factor - Google Patents
A kind of short text feature extracting method blended based on multiple features factor
- Publication number: CN109977206A (application CN201910211517.8A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
A short-text feature extraction method based on the fusion of multiple feature factors. Short-text comments are segmented with the jieba segmentation tool and stop words are removed, producing a preliminary text feature-word vector matrix. Weights are then computed for the constructed feature-word vector matrix with the traditional TF-IDF algorithm, yielding a weight vector matrix. A feature-word position impact factor α and a part-of-speech feature factor β are introduced; the preliminary text feature words are POS-tagged one by one, and the α and β values of each feature word are computed. The obtained α and β values are multiplied by the corresponding weights of the traditional TF-IDF algorithm, finally yielding the weight vector matrix of the optimized TF-IDF algorithm. The technical solution provided by the present invention mitigates, to a certain extent, the feature-word weight imbalance of the traditional TF-IDF algorithm, thereby improving the accuracy of text feature extraction and providing effective support for sentiment classification tasks.
Description
Technical field
The present invention relates to the field of text mining technology, and in particular to a short-text feature extraction method based on the fusion of multiple feature factors.
Background technique
With the advance of the Web 3.0 era, Internet information has become ever more integrated into people's lives. Large numbers of users post their opinions about events or products online, and over time these comments can strongly influence people's thinking and behavior. At the same time, they carry a variety of emotional attitudes and sentiment information, such as joy, anger, sorrow, and happiness, or positive, neutral, and negative polarity. Through these comments, other users learn what the user community thinks of a certain event or product, so the information has enormous potential mining value. Moreover, as the Internet develops at high speed, the major network platforms generate tens of thousands of text comments every day, and the stream never stops. Mining this text purely by hand would consume enormous human and material resources, which makes technical methods such as text mining particularly important.
The key to text mining is text feature extraction. Many feature extraction methods exist, such as the vector space model (VSM), term frequency-inverse document frequency (TF-IDF), mutual information (MI), and chi-square statistics (Chi-square), and they have achieved good results in feature extraction research. Among them, TF-IDF is widely regarded as representative: the importance of a word increases in proportion to the number of times it appears in a text, but decreases in inverse proportion to its frequency in the corpus. Because the method measures importance by term frequency alone, some feature words reach meaninglessly high frequencies and are assigned relatively high weights, while some vocabulary features with real discriminative power are discarded because of their low frequency; this phenomenon is known as weight imbalance. Moreover, the method considers neither the position of a word within a sentence nor its part of speech, even though words appearing toward the front or back of a sentence carry their own importance, so some error is inevitable in the feature extraction process.
Summary of the invention
In view of the above, the object of the present invention is to provide a short-text feature extraction method based on the fusion of multiple feature factors that improves the accuracy of feature extraction.
To achieve the goals above, the technical solution adopted by the present invention is that:
A short-text feature extraction method based on the fusion of multiple feature factors, comprising the following steps:
1) Segment the short-text comments with the jieba segmentation tool and remove stop words, thereby constructing a preliminary text feature-word vector matrix. This includes:
preprocessing, such as extracting and filtering the user's product comments;
segmenting the short-text comment information with the jieba segmentation tool, and removing stop words from the segmented text with a Stopwords list;
assuming there are n comment sentences, preprocessing these n sentences yields the preliminary text feature-word vector matrix, defined as F = {wi1, wi2, …, wik | 1 ≤ i ≤ n, k ∈ N+};
2) Compute weights for the constructed feature-word vector matrix with the traditional TF-IDF algorithm, obtaining the weight vector matrix; that is, compute the TF value, the IDF value, and the corresponding weight Wtf of each feature word.
The term frequency of a feature word in document dj is the ratio of its number of occurrences to the total number of words in dj, and measures the importance of the word within that particular document:
TF(ti, dj) = f(i,j) / Σk f(k,j)
where f(i,j) is the number of times word ti occurs in document dj and the denominator is the total number of occurrences of all words in dj. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the result:
IDF(ti) = log( |D| / |{j : ti ∈ dj}| )
where |D| is the total number of documents in the corpus and |{j : ti ∈ dj}| is the number of documents containing the word; if the word does not occur in the corpus, the denominator would be 0, so |{j : ti ∈ dj}| + 1 is generally used. Finally, TF-IDF is calculated as:
Wtf(ti, dj) = TF(ti, dj) × IDF(ti);
3) Introduce the feature-word position impact factor α and the part-of-speech feature factor β, POS-tag the preliminary text feature words one by one, and compute the α and β values of each feature word; define the meaning of the two factors and solve for them. The preliminary text feature words F are POS-tagged with the jieba segmentation tool.
Definition 1 (position impact factor α): from the parts of speech of the feature words in a comment sentence it can be observed that feature words near the front or back of the sentence distinguish the sentence better than those in the middle. Moreover, most sentences have a subject-predicate-object composition, and subjects and objects tend to appear at the head and tail of the sentence, so this kind of feature word also shows better discriminative ability than the predicate. The half of the feature length Len of each preprocessed comment sentence is therefore taken as the measure: starting from the feature word at the middle position, whose default value l is 1, the factor extends toward both sides, where Loc denotes the position of the j-th feature word of the i-th row and Len denotes the feature-word length of each text comment sentence.
Definition 2 (part-of-speech feature factor β): by parsing the part-of-speech structure of a sentence and analyzing the compositional relations of Chinese sentences, the main part-of-speech hierarchy is defined, in order, as noun, verb, adjective, adverb, and all other parts of speech; the corresponding impact factors along this hierarchy are β = {5, 4, 3, 2, 1};
4) Multiply the obtained feature-word position impact factor α and part-of-speech feature factor β by the corresponding weight Wtf of the traditional TF-IDF algorithm, obtaining the weight vector matrix of the optimized TF-IDF algorithm. This comprises:
computing the position impact factor α and the part-of-speech feature factor β of each feature word;
building the feature-word weight vector matrix of the optimized TF-IDF algorithm.
Following Definitions 1 and 2, the two impact factors are introduced into the traditional TF-IDF algorithm: after the TF-IDF weight has been computed, it is combined with the position impact factor α and part-of-speech feature factor β of each feature word, giving the optimized weight Weight:
Weight = α * Wtf * β.
The TF-IDF algorithm optimization steps are:
1) build the preliminary text feature-word vector matrix from the preprocessed text data set;
2) obtain the feature-word vectors one by one and solve for the part-of-speech feature factor β with the WordFea_Factor function;
3) compute the feature-word vector length Len and the corresponding feature-word position Loc;
4) compute the position impact factor α of the corresponding feature word with the equation in Definition 1;
5) multiply the weight Wtf of the traditional TF-IDF algorithm by the α and β values, finally obtaining the weight Weight of the optimized TF-IDF algorithm.
Here WordLoc_Factor denotes the position-impact-factor calculation, which solves for α from the proportion of the sentence length at which the feature word appears, and WordFea_Factor denotes the part-of-speech-factor calculation, which assigns a value according to the part-of-speech rank of the feature word.
The beneficial effects of the present invention are:
Short-text comments are typically brief, thematically clear, and emotionally dense. Their sentences are usually built on a subject-predicate or subject-predicate-object structure, where subjects and objects are usually nouns, predicates are usually verbs, adjectives modify nouns, and adverbs modify verbs. The parts of speech of these phrases also influence, to a certain extent, the emotional expression of the sentence. Chinese parts of speech divide into content words and function words: nouns, adjectives, verbs, numerals, pronouns, and classifiers are content words, while adverbs, prepositions, interjections, modal particles, auxiliary words, and conjunctions are function words. The jieba segmentation tool provides a complete set of part-of-speech tagging rules, shown in Fig. 2, in which noun, verb, adjective, and adverb tags are encoded beginning with n, v, a, and d respectively.
The essence of the TF-IDF algorithm is to assign weights chiefly on the basis of term-frequency statistics. A Chinese comment sentence, however, usually takes a noun or noun phrase as its main body, with verbs, adjectives, and adverbs adding richness and complexity without changing the main structure. Meanwhile, the attribute features embedded in Chinese comment sentences mostly appear as nouns or noun phrases; in "the phone is pretty good, the appearance looks great, the system is smooth, and the running speed is very good", the words "phone, appearance, system, running speed" are all attribute feature words, which reflect, to a certain extent, the intrinsic features of the product itself. The words or phrases with sentiment tendency in a comment sentence express the emotion stated in the text, usually covering three orientations: positive, neutral, and negative. In the comment "the phone is genuine and runs very smoothly", the words "genuine, very smooth" are sentiment words, and such sentiment words are mostly adjectives or nouns. Early researchers defined sentiment words mainly as adjectives, later extending the research to nouns and verbs. Analysis of comment text shows that sentiment words are mainly adjectives, with a minority of nouns and verbs; if only a single part of speech were taken as the sentiment words, some words carrying emotion would be abandoned, the coverage of comment sentences containing emotional views and attitudes would be greatly reduced, and the accuracy of the sentiment feature extraction results would decrease accordingly.
Brief description of the drawings
Fig. 1 is the main flowchart of the short-text feature extraction based on the fusion of multiple feature factors provided by the present invention;
Fig. 2 shows the commonly used part-of-speech tags of the jieba segmentation tool provided by the present invention;
Fig. 3 shows the definition of the feature-word front/back position factor structure provided by the present invention;
Fig. 4 is the feature-word part-of-speech structure analysis diagram provided by the present invention;
Fig. 5 is the core flowchart of the optimized TF-IDF algorithm provided by the present invention.
Specific embodiment
The technical method of the application is described in detail below with reference to the accompanying drawings.
Addressing the deficiencies of the TF-IDF algorithm, the present invention introduces a feature-word position impact factor and a part-of-speech feature factor, improves the TF-IDF algorithm, and proposes a short-text feature extraction method based on the fusion of multiple feature factors, so as to alleviate problems such as the weight imbalance the TF-IDF algorithm exhibits when computing feature-word weights. For ease of understanding, the specific implementation is narrated in the form of assumptions.
The traditional TF-IDF algorithm is introduced first. The term frequency of a feature word in document dj is the ratio of its number of occurrences to the total number of words in dj, measuring the word's importance within that particular document:
TF(ti, dj) = f(i,j) / Σk f(k,j)
where f(i,j) is the number of times word ti occurs in document dj and the denominator is the total number of occurrences of all words in dj.
The inverse document frequency IDF measures the general importance of a word: the fewer documents contain the term, the larger the IDF, and the better the term's ability to discriminate between classes. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm:
IDF(ti) = log( |D| / |{j : ti ∈ dj}| )
where |D| is the total number of documents in the corpus and |{j : ti ∈ dj}| is the number of documents containing the word; if the word does not occur in the corpus, the denominator would be 0, so |{j : ti ∈ dj}| + 1 is generally used. Finally, TF-IDF is calculated as:
Wtf(ti, dj) = TF(ti, dj) × IDF(ti)
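A minimal sketch of the three formulas above, using the smoothed denominator |{j : ti ∈ dj}| + 1; the toy documents and words are illustrative only:

```python
import math

def tf(word, doc):
    # TF(ti, dj) = f(i,j) / sum_k f(k,j): occurrences over total words in dj
    return doc.count(word) / len(doc)

def idf(word, docs):
    # IDF(ti) = log(|D| / (|{j : ti in dj}| + 1)); the +1 guards against a
    # zero denominator when the word appears in no document
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (df + 1))

def w_tf(word, doc, docs):
    # Wtf(ti, dj) = TF(ti, dj) * IDF(ti)
    return tf(word, doc) * idf(word, docs)

docs = [["手机", "正品", "流畅"], ["外观", "好看"], ["系统", "流畅"]]
print(round(w_tf("正品", docs[0], docs), 4))
```

Note that with the +1 smoothing, a word contained in every document gets a non-positive IDF, which is consistent with the intuition that such a word has no discriminative power.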
Assume the text documents include C1, C2, and C3, and the feature words are denoted w1, w2, w3, w4, and w5; Table 1 shows the frequency with which the feature words occur in the different documents. The total number of text documents is specified as 50, the total word counts of C1, C2, and C3 are 30, 25, and 40 respectively, and the numbers of documents containing w1, w2, w3, w4, and w5 are 30, 13, 18, 25, and 40 respectively. First, TF, IDF, and Wtf are computed with the traditional TF-IDF algorithm; the results are shown in Table 2.
Table 1: occurrence counts of the feature words in the different documents
Table 2: feature-word weights computed by the TF-IDF algorithm
Assume there are n comment sentences; preprocessing these n sentences yields the preliminary text feature-word vector matrix, defined as F = {wi1, wi2, …, wik | 1 ≤ i ≤ n, k ∈ N+}. At the same time, the influence of feature-word front/back position and part-of-speech features is considered: the position impact factor is defined as α and the part-of-speech feature factor as β.
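The preprocessing that yields F can be sketched as follows. The patent segments with the jieba tool; to keep this example dependency-free, the comments below are assumed to be already segmented into token lists (as jieba.lcut would return), and the stop-word set is a small illustrative stand-in for a full Stopwords list:

```python
# Stand-in for a full Stopwords list (illustrative only)
STOPWORDS = {"的", "了", "是", "也"}

def build_feature_matrix(segmented_comments):
    """Drop stop words from each segmented comment, returning the
    preliminary feature-word matrix F = {w_i1, ..., w_ik | 1 <= i <= n}."""
    return [[w for w in tokens if w not in STOPWORDS]
            for tokens in segmented_comments]

comments = [
    ["手机", "是", "正品", "很", "流畅"],
    ["外观", "好看", "系统", "也", "流畅"],
]
F = build_feature_matrix(comments)
print(F)
```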
Definition 1 (position impact factor α): from the parts of speech of the feature words in a comment sentence it can be observed that feature words near the front or back of the sentence distinguish the sentence better than those in the middle. Moreover, most sentences have a subject-predicate-object composition, and subjects and objects tend to appear at the head and tail of the sentence, so this kind of feature word also shows better discriminative ability than the predicate. The half of the feature length Len of each preprocessed comment sentence is therefore taken as the measure: starting from the feature word at the middle position, whose default value l is 1, the factor extends toward both sides, as shown in Fig. 3, where Loc denotes the position of the j-th feature word of the i-th row and Len denotes the feature-word length of each text comment sentence.
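The exact expression for α is not reproduced in this text, so the following is only an assumed sketch consistent with Definition 1's description: the factor defaults to 1 at the middle position and grows as the feature word moves toward the head or tail of the sentence. Both the function name and the formula are illustrative assumptions:

```python
def word_loc_factor(loc, length):
    """Assumed form of the position impact factor alpha.

    loc is the 1-based position of the feature word; length is Len, the
    number of feature words in the sentence. The middle position defaults
    to alpha = 1; words nearer the ends receive a larger factor. This is
    NOT the patent's formula, which does not survive in this text.
    """
    if length <= 1:
        return 1.0
    mid = length / 2
    return 1.0 + abs(loc - mid) / mid  # 1 at the middle, up to 2 at the ends

factors = [round(word_loc_factor(p, 7), 3) for p in range(1, 8)]
print(factors)
```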
Definition 2 (part-of-speech feature factor β): by parsing the part-of-speech structure of a sentence and analyzing the compositional relations of Chinese sentences, as shown in Fig. 4, the main part-of-speech hierarchy is defined, in order, as noun (n), verb (v), adjective (a), adverb (d), and all other parts of speech. The corresponding impact factors along this hierarchy are β = {5, 4, 3, 2, 1}.
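Since jieba's noun, verb, adjective, and adverb tags begin with n, v, a, and d, Definition 2 can be realized as a simple prefix lookup; the function name mirrors the WordFea_Factor routine named later in the text:

```python
def word_fea_factor(flag):
    """Part-of-speech feature factor beta per Definition 2.

    flag is a jieba-style POS tag; tags beginning with n (noun), v (verb),
    a (adjective), and d (adverb) map to 5, 4, 3, 2, and all other parts
    of speech map to 1.
    """
    for prefix, beta in (("n", 5), ("v", 4), ("a", 3), ("d", 2)):
        if flag.startswith(prefix):
            return beta
    return 1

print([word_fea_factor(f) for f in ("n", "ns", "v", "ad", "d", "uj")])
```

Note that compound tags inherit the rank of their first letter (e.g. "ns", a place name, ranks as a noun), which matches the n/v/a/d prefix encoding the patent cites from Fig. 2.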
Through the above analysis and definitions, the two impact factors are introduced into the traditional TF-IDF algorithm: after the TF-IDF weight has been computed, it is combined with the position impact factor α and the part-of-speech feature factor β of each feature word, giving the optimized weight:
Weight = α * Wtf * β
Based on the above method, the application implements the optimized TF-IDF algorithm and proposes its core flow, shown in Fig. 5, whose main steps are:
build the preliminary text feature-word vector matrix from the preprocessed text data set;
obtain the feature-word vectors one by one and solve for the part-of-speech feature factor β with the WordFea_Factor function;
compute the feature-word vector length Len and the corresponding feature-word position Loc;
compute the position impact factor α of the corresponding feature word with the equation in Definition 1;
multiply the weight Wtf of the traditional TF-IDF algorithm by the α and β values, finally obtaining the weight Weight of the optimized TF-IDF algorithm.
Here WordLoc_Factor denotes the position-impact-factor calculation, which solves for α from the proportion of the sentence length at which the feature word appears, and WordFea_Factor denotes the part-of-speech-factor calculation, which assigns a value according to the part-of-speech rank of the feature word.
The algorithm is implemented in Python: Algorithm 1 is the pseudocode of the feature-word position-impact-factor calculation, Algorithm 2 the pseudocode of the part-of-speech-factor calculation, and Algorithm 3 the pseudocode of the optimized TF-IDF weight calculation. The calculation in Algorithm 3 relies on Algorithms 1 and 2.
WordLoc_Factor (feature-word position impact factor calculation)
The WordLoc_Factor algorithm describes the sub-process that calculates the feature-word position impact factor. It requires the preprocessed feature-word vectors, the length of each feature-word vector, and the position of the feature word within its vector; this position is computed in Algorithm 3.
Algorithm 1: WordLoc_Factor
Input: word, Len, Loc (feature word, feature-vector length, feature-word position)
Output: α (position impact factor)
Algorithm 2: WordFea_Factor
Input: word, flag (feature word, part-of-speech tag)
Output: β (part-of-speech feature factor)
Algorithm 3: Update_TF-IDF
Input: wordsMat (feature-word weight matrix)
Output: WeightMat (optimized feature-word weight matrix)
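Under the same assumptions as the earlier sketches, Algorithms 1-3 can be combined into one Update_TF-IDF pass. The α expression below is an assumed stand-in (the patent's exact position formula is not reproduced in this text); β follows Definition 2, and Wtf follows the TF-IDF formulas above, with each comment sentence treated as one document:

```python
import math

POS_BETA = {"n": 5, "v": 4, "a": 3, "d": 2}  # Definition 2; any other tag -> 1

def update_tf_idf(tagged_docs):
    """Sketch of Algorithm 3: Weight = alpha * Wtf * beta per feature word.

    tagged_docs holds sentences as lists of (word, pos_flag) pairs, e.g.
    as jieba.posseg would produce. The alpha expression is an assumed
    stand-in for the patent's (unreproduced) position formula.
    """
    docs = [[w for w, _ in sent] for sent in tagged_docs]
    n_docs = len(docs)
    weights = []
    for sent, words in zip(tagged_docs, docs):
        length = len(words)
        mid = length / 2 if length > 1 else 1.0
        row = []
        for loc, (word, flag) in enumerate(sent, start=1):
            tf = words.count(word) / length            # TF(ti, dj)
            df = sum(1 for d in docs if word in d)     # |{j : ti in dj}|
            w_tf = tf * math.log(n_docs / (df + 1))    # traditional TF-IDF
            alpha = 1.0 + abs(loc - mid) / mid         # position factor (assumed form)
            beta = POS_BETA.get(flag[:1], 1)           # POS feature factor
            row.append(alpha * w_tf * beta)
        weights.append(row)
    return weights

tagged = [
    [("手机", "n"), ("流畅", "a")],
    [("外观", "n"), ("好看", "a")],
    [("系统", "n"), ("流畅", "a")],
]
W = update_tf_idf(tagged)
```

As in the patent's flow, the outer routine drives per-word calls to the position and POS sub-calculations, so swapping in the patent's true Definition 1 formula only requires replacing the alpha line.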
Combining the above assumptions and definitions, computing the feature-word weights of the optimized TF-IDF algorithm requires assuming the part-of-speech composition of the text documents C1, C2, and C3, i.e. C1 = (n, v, n, a, d, …), C2 = (n, v, a, n, d, …), C3 = (n, d, v, a, n, …). The α value of each feature word is then computed with the formula in Definition 1; since only 5 feature words are cited, only the impact factors α and β of the first five feature words are computed. Table 3 shows the feature-word weights computed by the optimized TF-IDF algorithm.
Table 3: feature-word weights computed by the optimized TF-IDF algorithm
Preliminary inspection of the algorithm's hypothetical data, combined with the results in Table 3, shows an obvious gap between the result Wtf of the traditional TF-IDF algorithm and the result Weight of the optimized TF-IDF algorithm. Moreover, the optimized weights adjust with the positional and part-of-speech importance of the feature words.
The approach described above can remedy, to a certain extent, the deficiencies of the traditional TF-IDF algorithm. More importantly, the technical method provided by the present invention integrates factors that the traditional TF-IDF algorithm does not consider: it introduces the feature-word position impact factor and the part-of-speech feature factor and adjusts the weights computed by the TF-IDF algorithm, thereby improving the discriminative power of the feature words more accurately. The technical methods in this specification are described progressively, and the embodiments of the modules mentioned are closely related; the key technical methods mentioned in the claims are introduced in detail in this specification.
It should be pointed out that the invention is not limited to the technical methods described above and shown in the accompanying drawings. Those of ordinary skill in the art may make several modifications and supplements without departing from its engineering principles, and such modifications and supplements are to be regarded as within the protection scope of the present invention.
Claims (2)
1. A short-text feature extraction method based on the fusion of multiple feature factors, characterized by comprising the following steps:
1) segmenting the short-text comments with the jieba segmentation tool and removing stop words, thereby constructing a preliminary text feature-word vector matrix, including:
preprocessing, such as extracting and filtering the user's product comments;
segmenting the short-text comment information with the jieba segmentation tool, and removing stop words from the segmented text with a Stopwords list;
assuming there are n comment sentences, preprocessing these n sentences yields the preliminary text feature-word vector matrix, defined as F = {wi1, wi2, …, wik | 1 ≤ i ≤ n, k ∈ N+};
2) computing weights for the constructed feature-word vector matrix with the traditional TF-IDF algorithm, obtaining the weight vector matrix; that is, computing the TF value, the IDF value, and the corresponding weight Wtf of each feature word;
the term frequency of a feature word in document dj is the ratio of its number of occurrences to the total number of words in dj, measuring the word's importance within that particular document:
TF(ti, dj) = f(i,j) / Σk f(k,j)
where f(i,j) is the number of times word ti occurs in document dj and the denominator is the total number of occurrences of all words in dj; the IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm:
IDF(ti) = log( |D| / |{j : ti ∈ dj}| )
where |D| is the total number of documents in the corpus and |{j : ti ∈ dj}| is the number of documents containing the word; if the word does not occur in the corpus, the denominator would be 0, so |{j : ti ∈ dj}| + 1 is generally used; finally, TF-IDF is calculated as:
Wtf(ti, dj) = TF(ti, dj) × IDF(ti);
3) introducing the feature-word position impact factor α and the part-of-speech feature factor β, POS-tagging the preliminary text feature words one by one, computing the α and β values of each feature word, defining the meaning of the two factors, and solving for them; the preliminary text feature words F are POS-tagged with the jieba segmentation tool;
Definition 1 (position impact factor α): from the parts of speech of the feature words in a comment sentence it can be observed that feature words near the front or back of the sentence distinguish the sentence better than those in the middle; moreover, most sentences have a subject-predicate-object composition, and subjects and objects tend to appear at the head and tail of the sentence, so this kind of feature word also shows better discriminative ability than the predicate; the half of the feature length Len of each preprocessed comment sentence is therefore taken as the measure: starting from the feature word at the middle position, whose default value l is 1, the factor extends toward both sides, where Loc denotes the position of the j-th feature word of the i-th row and Len denotes the feature-word length of each text comment sentence;
Definition 2 (part-of-speech feature factor β): by parsing the part-of-speech structure of a sentence and analyzing the compositional relations of Chinese sentences, the main part-of-speech hierarchy is defined, in order, as noun, verb, adjective, adverb, and all other parts of speech; the corresponding impact factors along this hierarchy are β = {5, 4, 3, 2, 1};
4) multiplying the obtained feature-word position impact factor α and part-of-speech feature factor β by the corresponding weight Wtf of the traditional TF-IDF algorithm, obtaining the weight vector matrix of the optimized TF-IDF algorithm, including:
computing the position impact factor α and the part-of-speech feature factor β of each feature word;
building the feature-word weight vector matrix of the optimized TF-IDF algorithm;
following Definitions 1 and 2, the two impact factors are introduced into the traditional TF-IDF algorithm: after the TF-IDF weight has been computed, it is combined with the position impact factor α and part-of-speech feature factor β of each feature word, giving the optimized weight Weight:
Weight = α * Wtf * β.
2. The short text feature extraction method based on fusion of multiple feature factors according to claim 1, wherein the TF-IDF algorithm optimization steps are:
1) constructing a preliminary text feature term vector matrix from the preprocessed text data set;
2) obtaining the feature term vectors one by one, and solving the part-of-speech feature factor β value using the WordFea_Factor function;
3) calculating the feature word vector length Len and the corresponding feature word position Loc;
4) solving the position impact factor α value of the corresponding feature word by combining the equation of Definition 1;
5) multiplying the weight value W_tf solved by the traditional TF-IDF algorithm with the α and β values, finally obtaining the weight Weight of the optimized TF-IDF algorithm;
wherein WordLoc_Factor denotes the feature-word position impact factor calculation process, which solves α from the proportion of the entire sentence length accounted for by the position at which the feature word occurs, and WordFea_Factor denotes the part-of-speech feature factor calculation process, which performs assignment according to the part-of-speech grade of the feature word.
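Steps 1)-5) can be put together in a short end-to-end sketch. This is an illustrative reconstruction under the same assumptions noted above (an assumed α formula, since the filing gives it only as an image, and an assumed POS tag set), using a plain TF-IDF computation rather than any particular library:

```python
import math

POS_BETA = {"n": 5, "v": 4, "a": 3, "d": 2}  # Definition 2 hierarchy

def tf_idf(word, sentence, corpus):
    """Traditional TF-IDF weight W_tf of a word within one sentence."""
    tf = sentence.count(word) / len(sentence)
    df = sum(1 for s in corpus if word in s)
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

def optimized_weights(tagged_corpus):
    """Steps 1)-5): tagged_corpus is a list of sentences, each a list of
    (word, pos_tag) pairs; returns one optimized weight row per sentence."""
    words_only = [[w for w, _ in sent] for sent in tagged_corpus]  # step 1
    matrix = []
    for sent, words in zip(tagged_corpus, words_only):
        half = len(words) / 2.0                                    # step 3: Len / 2
        row = []
        for loc, (word, pos) in enumerate(sent):
            # step 4: assumed form of alpha (head/tail positions score higher)
            alpha = (1.0 + abs(loc - half) / half) if half else 1.0
            beta = POS_BETA.get(pos, 1)                            # step 2
            # step 5: Weight = alpha * W_tf * beta
            row.append(alpha * tf_idf(word, words, words_only) * beta)
        matrix.append(row)
    return matrix
```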
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910211517.8A CN109977206A (en) | 2019-03-20 | 2019-03-20 | A kind of short text feature extracting method blended based on multiple features factor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109977206A true CN109977206A (en) | 2019-07-05 |
Family
ID=67079596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910211517.8A Pending CN109977206A (en) | 2019-03-20 | 2019-03-20 | A kind of short text feature extracting method blended based on multiple features factor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977206A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103744953A (en) * | 2014-01-02 | 2014-04-23 | 中国科学院计算机网络信息中心 | Network hotspot mining method based on Chinese text emotion recognition |
CN105022805A (en) * | 2015-07-02 | 2015-11-04 | 四川大学 | Emotional analysis method based on SO-PMI (Semantic Orientation-Pointwise Mutual Information) commodity evaluation information |
CN108763477A (en) * | 2018-05-29 | 2018-11-06 | 厦门快商通信息技术有限公司 | A kind of short text classification method and system |
CN109284352A (en) * | 2018-09-30 | 2019-01-29 | 哈尔滨工业大学 | A kind of querying method of the assessment class document random length words and phrases based on inverted index |
Non-Patent Citations (4)
Title |
---|
Y. YANG: "Research and Realization of Internet Public Opinion Analysis Based on Improved TF - IDF Algorithm", 《2017 16TH INTERNATIONAL SYMPOSIUM ON DISTRIBUTED COMPUTING AND APPLICATIONS TO BUSINESS, ENGINEERING AND SCIENCE (DCABES)》 * |
Z. DE-YANG 等: "Research on micro-blog emotional tendency based on keyword extra", 《2018 37TH CHINESE CONTROL CONFERENCE (CCC)》 * |
WU Wei et al.: "Sentiment Analysis of Chinese Micro-blogs Based on Multi-features and Combined Classification", 《JOURNAL OF BEIJING INFORMATION SCIENCE AND TECHNOLOGY UNIVERSITY (NATURAL SCIENCE EDITION)》 * |
JU Chunhua et al.: "Research on a Cross-domain Sentiment Classification Model Based on Multi-feature Fusion", 《KNOWLEDGE MANAGEMENT FORUM》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472240A (en) * | 2019-07-26 | 2019-11-19 | 北京影谱科技股份有限公司 | Text feature and device based on TF-IDF |
CN110516249A (en) * | 2019-08-29 | 2019-11-29 | 新华三信息安全技术有限公司 | A kind of Sentiment orientation information obtaining method and device |
CN111046282A (en) * | 2019-12-06 | 2020-04-21 | 贝壳技术有限公司 | Text label setting method, device, medium and electronic equipment |
CN111046169A (en) * | 2019-12-24 | 2020-04-21 | 东软集团股份有限公司 | Method, device and equipment for extracting subject term and storage medium |
CN111046169B (en) * | 2019-12-24 | 2024-03-26 | 东软集团股份有限公司 | Method, device, equipment and storage medium for extracting subject term |
CN111444704A (en) * | 2020-03-27 | 2020-07-24 | 中南大学 | Network security keyword extraction method based on deep neural network |
CN111444704B (en) * | 2020-03-27 | 2023-09-19 | 中南大学 | Network safety keyword extraction method based on deep neural network |
CN111626040A (en) * | 2020-05-28 | 2020-09-04 | 数网金融有限公司 | Method for determining sentence similarity, related equipment and readable storage medium |
CN112380868A (en) * | 2020-12-10 | 2021-02-19 | 广东泰迪智能科技股份有限公司 | Petition-purpose multi-classification device based on event triples and method thereof |
CN112380868B (en) * | 2020-12-10 | 2024-02-13 | 广东泰迪智能科技股份有限公司 | Multi-classification device and method for interview destination based on event triplets |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977206A (en) | A kind of short text feature extracting method blended based on multiple features factor | |
CN111767741B (en) | Text emotion analysis method based on deep learning and TFIDF algorithm | |
Rychalska et al. | Samsung Poland NLP Team at SemEval-2016 Task 1: Necessity for diversity; combining recursive autoencoders, WordNet and ensemble methods to measure semantic similarity. | |
CN106383817A (en) | Paper title generation method capable of utilizing distributed semantic information | |
CN109960786A (en) | Chinese Measurement of word similarity based on convergence strategy | |
CN106503049A (en) | A kind of microblog emotional sorting technique for merging multiple affection resources based on SVM | |
CN102955772B (en) | A kind of similarity calculating method based on semanteme and device | |
CN105843897A (en) | Vertical domain-oriented intelligent question and answer system | |
CN104331394A (en) | Text classification method based on viewpoint | |
CN103226580A (en) | Interactive-text-oriented topic detection method | |
CN102122297A (en) | Semantic-based Chinese network text emotion extracting method | |
CN109002473A (en) | A kind of sentiment analysis method based on term vector and part of speech | |
CN108874896A (en) | A kind of humorous recognition methods based on neural network and humorous feature | |
CN110008465A (en) | The measure of sentence semantics distance | |
CN106446147A (en) | Emotion analysis method based on structuring features | |
Koptyra et al. | Clarin-emo: Training emotion recognition models using human annotation and chatgpt | |
Lee et al. | Who speaks like a style of vitamin: Towards syntax-aware dialogue summarization using multi-task learning | |
CN110324278A (en) | Account main body consistency detecting method, device and equipment | |
Qu et al. | Emotion Classification for Spanish with XLM-RoBERTa and TextCNN. | |
Wang et al. | YNU-HPCC at semeval-2018 task 2: Multi-ensemble Bi-GRU model with attention mechanism for multilingual emoji prediction | |
Rajendran et al. | Is something better than nothing? automatically predicting stance-based arguments using deep learning and small labelled dataset | |
CN111813927A (en) | Sentence similarity calculation method based on topic model and LSTM | |
CN115146031A (en) | Short text position detection method based on deep learning and assistant features | |
Tavan et al. | Identifying Ironic Content Spreaders on Twitter using Psychometrics, Contextual and Ironic Features with Gradient Boosting Classifier. | |
CN108959269B (en) | A kind of sentence auto ordering method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20190705 |