CN109977206A - Short-text feature extraction method based on multi-feature-factor fusion - Google Patents

Short-text feature extraction method based on multi-feature-factor fusion

Info

Publication number
CN109977206A
CN109977206A (application CN201910211517.8A)
Authority
CN
China
Prior art keywords
feature
factor
word
speech
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910211517.8A
Other languages
Chinese (zh)
Inventor
高岭
周俊鹏
马景超
何丹
王文涛
高全力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University filed Critical Northwest University
Priority to CN201910211517.8A priority Critical patent/CN109977206A/en
Publication of CN109977206A publication Critical patent/CN109977206A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A short-text feature extraction method based on the fusion of multiple feature factors. Short-text comments are segmented with the jieba segmentation tool and stop words are removed, producing a preliminary text feature-word vector matrix. Weights are then computed for the constructed feature-word vector matrix with the traditional TF-IDF algorithm, yielding a weight vector matrix. A feature-word position impact factor α and a part-of-speech feature factor β are introduced, each preliminary text feature word is POS-tagged, and the α and β values of every feature word are calculated. The obtained α and β values are multiplied with the corresponding traditional TF-IDF weights, finally giving the weight vector matrix of the optimized TF-IDF algorithm. The technical solution provided by the invention alleviates, to a certain extent, the weight imbalance of the traditional TF-IDF algorithm, thereby improving the accuracy of text feature extraction and providing effective support for sentiment classification tasks.

Description

Short-text feature extraction method based on multi-feature-factor fusion
Technical field
The present invention relates to the field of text mining, and in particular to a short-text feature extraction method based on the fusion of multiple feature factors.
Background technique
With the advance of the Web 3.0 era, Internet information permeates people's lives more and more. Large numbers of users post their opinions about events or products online, and under the effect of timeliness these comments can greatly influence people's thinking and behavior. At the same time, the comments carry users' various emotional attitudes, such as joy, anger, sorrow and happiness, or positive, neutral and negative sentiment. Based on these comments, other users learn through the network platform how groups of users view a certain event or product, so this information has enormous potential mining value. In addition, as the Internet develops at high speed, every major network platform generates tens of thousands of text comments each day, and the stream of such information never stops. Mining this text purely by hand would consume enormous human and material resources, which makes automated text-mining techniques particularly important.
The key to text mining is text feature extraction, and many feature extraction methods exist, such as the vector space model (VSM), term frequency-inverse document frequency (TF-IDF), mutual information (MI) and chi-square statistics (Chi-square); these methods have achieved good results in feature extraction research. Among them, TF-IDF is widely regarded as representative: the importance of a word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to its frequency in the corpus. Because the method measures importance purely by word frequency, some feature words occur frequently but meaninglessly and are thereby given relatively high weights, while some words with strong discriminative power are discarded because of their low frequency; this phenomenon is understood as weight imbalance. Moreover, the method considers neither the position of a word within the sentence nor its part of speech, even though words appearing early or late in a sentence carry a certain importance of their own, so some error is inevitably introduced into the feature extraction process.
Summary of the invention
The object of the present invention is to provide a short-text feature extraction method based on the fusion of multiple feature factors, which improves the accuracy of feature extraction.
To achieve the above object, the technical solution adopted by the present invention is:
A short-text feature extraction method based on multi-feature-factor fusion, comprising the following steps:
1) Segment the short-text comments with the jieba segmentation tool and remove stop words, thereby constructing the preliminary text feature-word vector matrix; this includes:
Pre-processing, such as extracting and filtering the users' product comments;
Segmenting the short-text comments with the jieba segmentation tool and removing stop words from the segmented text using a stop-word list;
Assuming there are n comment sentences, pre-processing these n sentences yields the preliminary text feature-word vector matrix, defined as F={wi1,wi2,…,wik|1≤i≤n,k∈N+};
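Step 1) can be sketched as follows. A plain whitespace tokenizer stands in for the jieba segmentation tool so the sketch is self-contained, and the stop-word list is illustrative, not the one used by the invention.

```python
# Sketch of step 1: segment each comment and remove stop words to build the
# preliminary feature-word matrix F = {w_i1, ..., w_ik}.  In the patent the
# segmentation is done with jieba; the whitespace tokenizer below is a
# stand-in so the example runs without that dependency.

STOPWORDS = {"的", "了", "是", "and", "the"}  # illustrative stop-word list

def tokenize(comment: str) -> list[str]:
    # Stand-in for jieba.lcut(comment); in real use: import jieba; jieba.lcut(...)
    return comment.split()

def build_feature_matrix(comments: list[str]) -> list[list[str]]:
    """Return one row of feature words per comment sentence, stop words removed."""
    matrix = []
    for comment in comments:
        row = [w for w in tokenize(comment) if w not in STOPWORDS]
        matrix.append(row)
    return matrix

comments = ["手机 是 正品", "系统 流畅 the 速度 好"]
F = build_feature_matrix(comments)
print(F)  # [['手机', '正品'], ['系统', '流畅', '速度', '好']]
```

With jieba installed, replacing `tokenize` by `jieba.lcut` gives the segmentation the patent describes.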
2) Apply the traditional TF-IDF algorithm to the constructed feature-word vector matrix to compute weights, obtaining the weight vector matrix; that is, calculate the TF value, the IDF value and the corresponding weight Wtf of each feature word;
The ratio of the number of occurrences of a feature word in document dj to the total number of words in that document serves as the term frequency, measuring the importance of the word within a specific document:
TF(ti,dj) = fi,j / Σk fk,j
where fi,j is the number of times word ti appears in document dj, and the denominator is the total number of occurrences of all words in dj. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the result:
IDF(ti) = log(|D| / |{j : ti ∈ dj}|)
where |D| is the total number of documents in the corpus and |{j:ti∈dj}| is the number of documents containing the word; if the word does not occur in the corpus this denominator would be 0, so |{j:ti∈dj}|+1 is generally used. Finally, TF-IDF is computed as:
Wtf(ti,dj)=TF(ti,dj)×IDF(ti);
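The TF, IDF and Wtf formulas above translate directly into code. This sketch uses the unsmoothed IDF of the displayed formula; the +1 smoothing mentioned in the text is noted in a comment.

```python
import math

# Direct transcription of the formulas above: TF is the within-document
# frequency ratio, IDF the logarithm of the total document count over the
# number of documents containing the word, and Wtf = TF × IDF.
# docs is a corpus given as a list of token lists.

def tf(word, doc):
    return doc.count(word) / len(doc)          # f_ij / Σk f_kj

def idf(word, docs):
    df = sum(1 for d in docs if word in d)     # |{j : ti ∈ dj}|
    # the text suggests df + 1 when a word may be absent from the corpus
    return math.log(len(docs) / df)

def w_tf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [["手机", "正品"], ["手机", "流畅"], ["系统", "流畅"]]
print(round(w_tf("手机", docs[0], docs), 4))   # 0.2027
```

Here "手机" has TF 1/2 in the first document and appears in 2 of 3 documents, so its weight is 0.5 × log(3/2) ≈ 0.2027.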
3) Introduce the feature-word position impact factor α and the part-of-speech feature factor β, POS-tag the preliminary text feature words one by one, and calculate the α and β value of each feature word; the two factors are defined and solved as follows. POS tagging of the preliminary feature words F is performed with the jieba segmentation tool;
Definition 1 (position impact factor α): from the parts of speech of the feature words in a comment sentence it can be observed that feature words near the beginning or end of a sentence distinguish the sentence better than feature words in the middle; moreover, most sentences follow a subject-predicate-object structure, with subject and object mostly appearing at the head and tail of the sentence, where their discriminative power also shows better than that of the predicate. Therefore the factor is measured against half of the feature length Len of each pre-processed comment sentence: starting from the feature word in the middle position and extending toward both ends, with the default value at the middle position l being 1, it is expressed as:
where Loc_i^j denotes the position of the j-th feature word in row i, and Len denotes the feature-word length of each text comment sentence;
Definition 2 (part-of-speech feature factor β): by parsing the POS structure of the feature words of a sentence and analyzing the constituent relations of Chinese sentences, the main POS hierarchy is defined, in order, as noun, verb, adjective, adverb and all other parts of speech; accordingly, the impact factor following this hierarchy is β={5,4,3,2,1};
4) Multiply the obtained feature-word position impact factor α and part-of-speech feature factor β with the corresponding weight Wtf computed by the traditional TF-IDF algorithm, obtaining the weight vector matrix of the optimized TF-IDF algorithm; this includes:
Calculating the position impact factor α and the part-of-speech feature factor β of each feature word;
Constructing the feature-word weight vector matrix of the optimized TF-IDF algorithm;
Combining the methods of Definition 1 and Definition 2, the two impact factors are introduced into the traditional TF-IDF algorithm: after the TF-IDF weight is computed, a comprehensive calculation with the position impact factor α and the part-of-speech feature factor β of each feature word yields the optimized weight Weight, expressed as:
Weight = α * Wtf * β.
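Definitions 1 and 2 and the combination Weight = α * Wtf * β can be sketched as below. The β grades follow the stated hierarchy (noun 5, verb 4, adjective 3, adverb 2, other 1); the exact α formula appears only as an image in the source, so the form used here (equal to 1 at the middle of the sentence and growing toward both ends, as Definition 1 describes) is a plausible reading, not the patent's verbatim equation.

```python
# Sketch of steps 3)-4): β from the POS hierarchy, α from word position,
# and the combined weight Weight = α · Wtf · β.

POS_GRADE = {"n": 5, "v": 4, "a": 3, "d": 2}   # jieba-style tag prefixes

def word_fea_factor(flag: str) -> int:
    """β: grade a word by the first letter of its POS tag (other tags get 1)."""
    return POS_GRADE.get(flag[:1], 1)

def word_loc_factor(loc: int, length: int) -> float:
    """α: 1 at the sentence middle, growing toward both ends.
    ASSUMPTION: the patent's α formula is shown only as an image; this is
    one plausible instantiation of Definition 1, not the verbatim equation."""
    if length <= 1:
        return 1.0
    mid = (length + 1) / 2                      # middle position (1-based)
    return 1 + abs(loc - mid) / (length / 2)

def weight(alpha: float, w_tf: float, beta: int) -> float:
    return alpha * w_tf * beta                  # Weight = α · Wtf · β

print(word_loc_factor(3, 5))  # 1.0  (middle word keeps α = 1)
print(weight(word_loc_factor(1, 5), 0.2, word_fea_factor("n")))
```

A sentence-initial noun (α = 1.8, β = 5) thus has its TF-IDF weight scaled by a factor of 9, while a mid-sentence adverb is scaled only by β = 2.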
The TF-IDF algorithm optimization steps are:
1) Construct the preliminary text feature-word vector matrix from the pre-processed text data set;
2) Fetch the feature-word vectors one by one and solve the part-of-speech feature factor β with the WordFea_Factor function;
3) Calculate the feature-word vector length Len and the position Loc of the corresponding feature word;
4) Solve the position impact factor α of the corresponding feature word with the equation of Definition 1;
5) Multiply the weight Wtf solved by the traditional TF-IDF algorithm with the α and β values, finally obtaining the weight Weight of the optimized TF-IDF algorithm;
where WordLoc_Factor denotes the position-impact-factor calculation, which solves α from the proportion of the whole sentence length taken up by the position at which the feature word appears, and WordFea_Factor denotes the part-of-speech-factor calculation, which assigns a value according to the POS grade of the feature word.
The beneficial effects of the present invention are:
Short-text comments are generally brief, clearly themed and emotionally dense; the body of a sentence is usually a subject-predicate or subject-predicate-object structure, subject and object are usually nouns, the predicate is usually a verb, adjectives modify nouns and adverbs modify verbs. The parts of speech of these phrases also influence, to a certain extent, the emotional expression of the sentence. Chinese parts of speech divide into content words and function words: nouns, adjectives, verbs, numerals, pronouns and classifiers belong to the content words, while adverbs, prepositions, interjections, modal particles, auxiliaries and conjunctions belong to the function words. The jieba segmentation tool provides a complete set of POS tagging rules, shown in Fig. 2, in which the noun, verb, adjective and adverb tags begin with n, v, a and d respectively.
In essence, the TF-IDF algorithm assigns weights mainly on the basis of word-frequency statistics. A Chinese comment sentence usually takes a noun or noun phrase as its main body, with verbs, adjectives and adverbs adding richness and complexity without changing the main structure. Meanwhile, the attribute features contained in Chinese comment sentences mostly appear as nouns or noun phrases; for example, in "the phone is quite good, the appearance looks great, the system is smooth, and the running speed is very good", the attribute feature words are "phone, appearance, system, running speed", and such attribute words reflect, to a certain extent, the intrinsic characteristics of the product itself. Words or phrases with sentiment tendency in a comment sentence embody the emotion stated in the text, usually covering three orientations: positive, neutral and negative. In the comment "the phone is genuine and runs very smoothly", the sentiment words are "genuine" and "very smooth", and such sentiment words are mostly adjectives or nouns. Early researchers defined sentiment words mainly as adjectives and later extended the study to nouns and verbs. Analysis of comment text shows that sentiment words are mainly adjectives, together with a minority of nouns and verbs; if sentiment words were restricted to a single part of speech, some sentiment-bearing words would be abandoned, the coverage of comment sentences containing emotional viewpoints and attitudes would drop sharply, and the accuracy of sentiment feature extraction would decrease accordingly.
Detailed description of the invention
Fig. 1 is the main flow chart of the short-text feature extraction based on multi-feature-factor fusion provided by the present invention;
Fig. 2 shows the common part-of-speech tags of the jieba segmentation tool used by the present invention;
Fig. 3 is the structural definition diagram of the feature-word front/rear position factor provided by the present invention;
Fig. 4 is the feature-word POS structure analysis diagram provided by the present invention;
Fig. 5 is the core flow chart of the optimized TF-IDF algorithm provided by the present invention.
Specific embodiment
The technical method of the application is described in detail below with reference to the accompanying drawings.
Aiming at the deficiencies of the TF-IDF algorithm, the present invention introduces a feature-word position impact factor and a part-of-speech feature factor to improve the TF-IDF algorithm, and proposes a short-text feature extraction method based on multi-feature-factor fusion, so as to alleviate problems such as weight imbalance in the TF-IDF weight calculation of feature words. For ease of understanding, the invention narrates the specific implementation in the form of assumptions.
First the traditional TF-IDF algorithm is introduced. Term frequency takes the ratio of the number of occurrences of a feature word in document dj to the total number of words in that document as the weight of the word, measuring its importance within a specific document:
TF(ti,dj) = fi,j / Σk fk,j
fi,j is the number of times word ti appears in document dj, and the denominator is the total number of occurrences of all words in dj.
Inverse document frequency IDF measures the general importance of a word: the fewer documents contain the term, the larger the IDF, and the better the term's class-discrimination ability. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the result:
IDF(ti) = log(|D| / |{j : ti ∈ dj}|)
|D| is the total number of documents in the corpus and |{j:ti∈dj}| is the number of documents containing the word; if the word does not occur in the corpus this denominator would be 0, so |{j:ti∈dj}|+1 is generally used. Finally, TF-IDF is computed as:
Wtf(ti,dj)=TF(ti,dj)×IDF(ti)
Assume the text documents are C1, C2 and C3 and the feature words are w1, w2, w3, w4 and w5; Table 1 shows the frequency with which the feature words appear in the different documents. The total number of text documents is set to 50; the feature-word totals of C1, C2 and C3 are 30, 25 and 40 respectively; and w1, w2, w3, w4 and w5 are contained in 30, 13, 18, 25 and 40 text documents respectively. First, TF, IDF and Wtf are computed with the traditional TF-IDF algorithm; the results are shown in Table 2.
Table 1: Frequency of the feature words in the different documents
Table 2: Feature-word weights computed by the TF-IDF algorithm
Assume there are n comment sentences; pre-processing these n sentences yields the preliminary text feature-word vector matrix, defined as F={wi1,wi2,…,wik|1≤i≤n,k∈N+}. At the same time, the influence of feature-word position and part of speech is considered: the position impact factor is defined as α and the part-of-speech feature factor as β.
Definition 1 (position impact factor α): from the parts of speech of the feature words in a comment sentence it can be observed that feature words near the beginning or end of a sentence distinguish the sentence better than feature words in the middle. Moreover, most sentences follow a subject-predicate-object structure, with subject and object mostly appearing at the head and tail of the sentence, where their discriminative power also shows better than that of the predicate. Therefore the factor is measured against half of the feature length Len of each pre-processed comment sentence: starting from the feature word in the middle position and extending toward both ends, with the default value at the middle position l being 1, as shown in Fig. 3, it is expressed as:
where Loc_i^j denotes the position of the j-th feature word in row i, and Len denotes the feature-word length of each text comment sentence.
Definition 2 (part-of-speech feature factor β): by parsing the POS structure of the feature words of a sentence and analyzing the constituent relations of Chinese sentences, as shown in Fig. 4, the main POS hierarchy is defined, in order, as noun (n), verb (v), adjective (a), adverb (d) and all other parts of speech. Accordingly, the impact factor following this hierarchy is β={5,4,3,2,1}.
Through the above analysis and definitions, the two impact factors are introduced into the traditional TF-IDF algorithm. After the TF-IDF weight is computed, a comprehensive calculation with the position impact factor α and the part-of-speech feature factor β of each feature word yields the optimized weight, expressed as:
Weight = α * Wtf * β
Based on the method described above, the application implements the optimized TF-IDF algorithm and proposes its core flow; as shown in Fig. 5, the main steps of the flow are:
Construct the preliminary text feature-word vector matrix from the pre-processed text data set;
Fetch the feature-word vectors one by one and solve the part-of-speech feature factor β with the WordFea_Factor function;
Calculate the feature-word vector length Len and the position Loc of the corresponding feature word;
Solve the position impact factor α of the corresponding feature word with the equation of Definition 1;
Multiply the weight Wtf solved by the traditional TF-IDF algorithm with the α and β values, finally obtaining the weight Weight of the optimized TF-IDF algorithm.
Here WordLoc_Factor denotes the position-impact-factor calculation, which solves α from the proportion of the whole sentence length taken up by the position at which the feature word appears, and WordFea_Factor denotes the part-of-speech-factor calculation, which assigns a value according to the POS grade of the feature word.
The algorithms are implemented in Python: Algorithm 1 is the pseudocode of the feature-word position-impact-factor calculation, Algorithm 2 is the pseudocode of the part-of-speech-factor calculation, and Algorithm 3 is the pseudocode of the optimized TF-IDF weight calculation. The calculation in Algorithm 3 relies on Algorithms 1 and 2.
WordLoc_Factor (calculation of the feature-word position impact factor)
The WordLoc_Factor algorithm describes the sub-process that calculates the feature-word position impact factor; it requires the pre-processed feature-word vectors, the length of each feature-word vector, and the position of the feature word within its vector, and this position is computed in Algorithm 3.
Algorithm 1: WordLoc_Factor
Input: word, Len, Loc (feature word, feature-vector length, feature-word position)
Output: α (position impact factor)
Algorithm 2: WordFea_Factor
Input: word, flag (feature word, POS tag)
Output: β (part-of-speech feature factor)
Algorithm 3: Update_TF-IDF
Input: wordsMat (feature-word weight matrix)
Output: WeightMat (optimized feature-word weight matrix)
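The three algorithms can be combined into one Update_TF-IDF sketch. The subroutine names mirror Algorithms 1 to 3; the α formula and the input layout (rows of (word, POS-tag) pairs) are assumptions, since the patent shows them only in figures.

```python
import math

# Sketch of Algorithm 3 (Update_TF-IDF), combining the per-word TF-IDF weight
# with α (Algorithm 1) and β (Algorithm 2).
# ASSUMPTIONS: the α formula and the (word, POS-tag) input layout are not
# given explicitly in the source; both are illustrative readings.

POS_GRADE = {"n": 5, "v": 4, "a": 3, "d": 2}

def WordFea_Factor(flag):
    """Algorithm 2: β from the first letter of the POS tag."""
    return POS_GRADE.get(flag[:1], 1)

def WordLoc_Factor(loc, length):
    """Algorithm 1: α = 1 at the sentence middle, growing toward both ends."""
    if length <= 1:
        return 1.0
    mid = (length + 1) / 2
    return 1 + abs(loc - mid) / (length / 2)

def Update_TF_IDF(rows):
    """rows: list of sentences, each a list of (word, pos_tag) pairs.
    Returns the optimized weight matrix, Weight = α · Wtf · β."""
    docs = [[w for w, _ in row] for row in rows]
    n = len(docs)
    weight_mat = []
    for row, doc in zip(rows, docs):
        length = len(doc)
        out = []
        for loc, (word, flag) in enumerate(row, start=1):
            tf = doc.count(word) / length
            df = sum(1 for d in docs if word in d)
            wtf = tf * math.log(n / df) if df else 0.0
            out.append(WordLoc_Factor(loc, length) * wtf * WordFea_Factor(flag))
        weight_mat.append(out)
    return weight_mat

rows = [[("手机", "n"), ("流畅", "a")], [("系统", "n"), ("好", "a")]]
print(Update_TF_IDF(rows))
```

In this two-sentence corpus each word occurs in one document, so every Wtf is tf × log(2); the first-position noun is then boosted by α = 1.5 and β = 5, while the adjective in the same position would only receive β = 3.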
Combining the above assumptions and definitions, the feature-word weights of the optimized TF-IDF algorithm are calculated. The POS structures of the text documents C1, C2 and C3 must also be assumed, i.e. C1=(n, v, n, a, d, …), C2=(n, v, a, n, d, …), C3=(n, d, v, a, n, …). The α value of each feature word is then calculated with the formula of Definition 1; since only 5 feature words are cited, only the impact factors α and β of the first five feature words are computed. Table 3 shows the feature-word weights calculated by the optimized TF-IDF algorithm.
Table 3: Feature-word weights calculated by the optimized TF-IDF algorithm
Preliminary inspection of the Table 3 results from this hypothetical data shows an obvious gap between the weights Wtf calculated by the traditional TF-IDF algorithm and the weights Weight calculated by the optimized TF-IDF algorithm; at the same time, the optimized weights adjust with the positional and POS importance of the feature words.
The approach described above alleviates, to a certain extent, the deficiencies of the traditional TF-IDF algorithm. More importantly, the technical method provided by the present invention integrates factors that the traditional TF-IDF algorithm does not consider: the feature-word position impact factor and the part-of-speech feature factor are introduced, the weights computed by the TF-IDF algorithm are adjusted accordingly, and the discriminative power of the feature words is thereby improved more accurately. In addition, the technical methods in this specification are described progressively, the embodiments of the mentioned modules are closely related, and the key technical methods mentioned in the claims are introduced in detail in this specification.
It should be pointed out that the invention is not limited to the technical methods described above and shown in the accompanying drawings. Those of ordinary skill in the art may make modifications and supplements without departing from its engineering principles, and such modifications and supplements shall be regarded as falling within the protection scope of the present invention.

Claims (2)

1. A short-text feature extraction method based on multi-feature-factor fusion, characterized by comprising the following steps:
1) segmenting the short-text comments with the jieba segmentation tool and removing stop words, thereby constructing the preliminary text feature-word vector matrix; including:
pre-processing, such as extracting and filtering the users' product comments;
segmenting the short-text comments with the jieba segmentation tool and removing stop words from the segmented text using a stop-word list;
assuming there are n comment sentences, pre-processing these n sentences yields the preliminary text feature-word vector matrix, defined as F={wi1,wi2,…,wik|1≤i≤n,k∈N+};
2) applying the traditional TF-IDF algorithm to the constructed feature-word vector matrix to compute weights, obtaining the weight vector matrix; that is, calculating the TF value, the IDF value and the corresponding weight Wtf of each feature word;
the ratio of the number of occurrences of a feature word in document dj to the total number of words in that document serves as the term frequency, measuring the importance of the word within a specific document:
TF(ti,dj) = fi,j / Σk fk,j
where fi,j is the number of times word ti appears in document dj and the denominator is the total number of occurrences of all words in dj; the IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the result:
IDF(ti) = log(|D| / |{j : ti ∈ dj}|)
where |D| is the total number of documents in the corpus and |{j:ti∈dj}| is the number of documents containing the word; if the word does not occur in the corpus this denominator would be 0, so |{j:ti∈dj}|+1 is generally used; finally, TF-IDF is computed as:
Wtf(ti,dj)=TF(ti,dj)×IDF(ti);
3) introducing the feature-word position impact factor α and the part-of-speech feature factor β, POS-tagging the preliminary text feature words one by one, calculating the α and β value of each feature word, defining the meaning of the two factors and solving them; POS tagging of the preliminary feature words F being performed with the jieba segmentation tool;
Definition 1 (position impact factor α): from the parts of speech of the feature words in a comment sentence it can be observed that feature words near the beginning or end of a sentence distinguish the sentence better than feature words in the middle; moreover, most sentences follow a subject-predicate-object structure, with subject and object mostly appearing at the head and tail of the sentence, where their discriminative power also shows better than that of the predicate; therefore the factor is measured against half of the feature length Len of each pre-processed comment sentence, starting from the feature word in the middle position and extending toward both ends, with the default value at the middle position l being 1, expressed as:
where Loc_i^j denotes the position of the j-th feature word in row i, and Len denotes the feature-word length of each text comment sentence;
Definition 2 (part-of-speech feature factor β): by parsing the POS structure of the feature words of a sentence and analyzing the constituent relations of Chinese sentences, the main POS hierarchy is defined, in order, as noun, verb, adjective, adverb and all other parts of speech; accordingly, the impact factor following this hierarchy is β={5,4,3,2,1};
4) multiplying the obtained feature-word position impact factor α and part-of-speech feature factor β with the corresponding weight Wtf of the traditional TF-IDF algorithm, obtaining the weight vector matrix of the optimized TF-IDF algorithm; including:
calculating the position impact factor α and the part-of-speech feature factor β of each feature word;
constructing the feature-word weight vector matrix of the optimized TF-IDF algorithm;
combining the methods of Definition 1 and Definition 2, the two impact factors are introduced into the traditional TF-IDF algorithm; after the TF-IDF weight is computed, a comprehensive calculation with the position impact factor α and the part-of-speech feature factor β of each feature word yields the optimized weight Weight, expressed as:
Weight = α * Wtf * β.
2. The short-text feature extraction method based on multi-feature-factor fusion according to claim 1, characterized in that the TF-IDF algorithm optimization steps are:
1) constructing the preliminary text feature-word vector matrix from the pre-processed text data set;
2) fetching the feature-word vectors one by one and solving the part-of-speech feature factor β with the WordFea_Factor function;
3) calculating the feature-word vector length Len and the position Loc of the corresponding feature word;
4) solving the position impact factor α of the corresponding feature word with the equation of Definition 1;
5) multiplying the weight Wtf solved by the traditional TF-IDF algorithm with the α and β values, finally obtaining the weight Weight of the optimized TF-IDF algorithm;
where WordLoc_Factor denotes the position-impact-factor calculation, which solves α from the proportion of the whole sentence length taken up by the position at which the feature word appears in the sentence, and WordFea_Factor denotes the part-of-speech-factor calculation, which assigns a value according to the POS grade of the feature word.
CN201910211517.8A 2019-03-20 2019-03-20 Short-text feature extraction method based on multi-feature-factor fusion Pending CN109977206A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910211517.8A CN109977206A (en) 2019-03-20 2019-03-20 A kind of short text feature extracting method blended based on multiple features factor


Publications (1)

Publication Number Publication Date
CN109977206A true CN109977206A (en) 2019-07-05

Family

ID=67079596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910211517.8A Pending CN109977206A (en) 2019-03-20 2019-03-20 A kind of short text feature extracting method blended based on multiple features factor

Country Status (1)

Country Link
CN (1) CN109977206A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744953A (en) * 2014-01-02 2014-04-23 中国科学院计算机网络信息中心 Network hotspot mining method based on Chinese text emotion recognition
CN105022805A (en) * 2015-07-02 2015-11-04 四川大学 Emotional analysis method based on SO-PMI (Semantic Orientation-Pointwise Mutual Information) commodity evaluation information
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system
CN109284352A (en) * 2018-09-30 2019-01-29 哈尔滨工业大学 A kind of querying method of the assessment class document random length words and phrases based on inverted index


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Y. YANG: "Research and Realization of Internet Public Opinion Analysis Based on Improved TF-IDF Algorithm", 2017 16th International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES) *
Z. DE-YANG et al.: "Research on micro-blog emotional tendency based on keyword extra", 2018 37th Chinese Control Conference (CCC) *
WU Wei et al.: "Sentiment analysis of Chinese micro-blogs based on multiple features and a combined classification method", Journal of Beijing Information Science and Technology University (Natural Science Edition) *
JU Chunhua et al.: "Research on a cross-domain sentiment classification model based on multi-feature fusion", Knowledge Management Forum *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472240A (en) * 2019-07-26 2019-11-19 北京影谱科技股份有限公司 Text feature and device based on TF-IDF
CN110516249A (en) * 2019-08-29 2019-11-29 新华三信息安全技术有限公司 A kind of Sentiment orientation information obtaining method and device
CN111046282A (en) * 2019-12-06 2020-04-21 贝壳技术有限公司 Text label setting method, device, medium and electronic equipment
CN111046169A (en) * 2019-12-24 2020-04-21 东软集团股份有限公司 Method, device and equipment for extracting subject term and storage medium
CN111046169B (en) * 2019-12-24 2024-03-26 东软集团股份有限公司 Method, device, equipment and storage medium for extracting subject term
CN111444704A (en) * 2020-03-27 2020-07-24 中南大学 Network security keyword extraction method based on deep neural network
CN111444704B (en) * 2020-03-27 2023-09-19 中南大学 Network safety keyword extraction method based on deep neural network
CN111626040A (en) * 2020-05-28 2020-09-04 数网金融有限公司 Method for determining sentence similarity, related equipment and readable storage medium
CN112380868A (en) * 2020-12-10 2021-02-19 广东泰迪智能科技股份有限公司 Petition-purpose multi-classification device based on event triples and method thereof
CN112380868B (en) * 2020-12-10 2024-02-13 广东泰迪智能科技股份有限公司 Multi-classification device and method for interview destination based on event triplets

Similar Documents

Publication Publication Date Title
CN109977206A (en) A kind of short text feature extracting method blended based on multiple features factor
CN111767741B (en) Text emotion analysis method based on deep learning and TFIDF algorithm
Rychalska et al. Samsung Poland NLP Team at SemEval-2016 Task 1: Necessity for diversity; combining recursive autoencoders, WordNet and ensemble methods to measure semantic similarity.
CN106383817A (en) Paper title generation method capable of utilizing distributed semantic information
CN109960786A (en) Chinese Measurement of word similarity based on convergence strategy
CN106503049A (en) A kind of microblog emotional sorting technique for merging multiple affection resources based on SVM
CN102955772B (en) A kind of similarity calculating method based on semanteme and device
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
CN104331394A (en) Text classification method based on viewpoint
CN103226580A (en) Interactive-text-oriented topic detection method
CN102122297A (en) Semantic-based Chinese network text emotion extracting method
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
CN108874896A (en) A kind of humorous recognition methods based on neural network and humorous feature
CN110008465A (en) The measure of sentence semantics distance
CN106446147A (en) Emotion analysis method based on structuring features
Koptyra et al. Clarin-emo: Training emotion recognition models using human annotation and chatgpt
Lee et al. Who speaks like a style of vitamin: Towards syntax-aware dialogue summarization using multi-task learning
CN110324278A (en) Account main body consistency detecting method, device and equipment
Qu et al. Emotion Classification for Spanish with XLM-RoBERTa and TextCNN.
Wang et al. YNU-HPCC at semeval-2018 task 2: Multi-ensemble Bi-GRU model with attention mechanism for multilingual emoji prediction
Rajendran et al. Is something better than nothing? automatically predicting stance-based arguments using deep learning and small labelled dataset
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN115146031A (en) Short text position detection method based on deep learning and assistant features
Tavan et al. Identifying Ironic Content Spreaders on Twitter using Psychometrics, Contextual and Ironic Features with Gradient Boosting Classifier.
CN108959269B (en) A kind of sentence auto ordering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190705