CN110348497A - Text representation method constructed based on WT-GloVe word vector - Google Patents

Text representation method constructed based on WT-GloVe word vector

Info

Publication number
CN110348497A
Authority
CN
China
Prior art keywords
word
occurs
occur
case
glove
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910573695.5A
Other languages
Chinese (zh)
Other versions
CN110348497B (en)
Inventor
姚全珠
古倩
费蓉
赵佳瑜
李莎莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN201910573695.5A priority Critical patent/CN110348497B/en
Publication of CN110348497A publication Critical patent/CN110348497A/en
Application granted granted Critical
Publication of CN110348497B publication Critical patent/CN110348497B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis

Abstract

The invention discloses a text representation method constructed based on WT-GloVe word vectors. First, the importance of each word is assessed by computing the word spacing of features specific to network text, and the word's contribution to classification is judged from its inter-class distribution; the two are combined into a feature weighting model of word spacing and inter-class distribution, called WDID-TFIDF. Then, unrelated words are filtered to address an inherent shortcoming of the GloVe model, improving the quality of word-vector training. Finally, according to the filtering result, the feature weight combining word spacing and inter-class distribution is selected and applied by dot product, yielding a weighted word-vector model that serves as the final text representation. The invention solves the problems of the prior art that traditional text representation methods are computationally complex or represent text information insufficiently.

Description

Text representation method constructed based on WT-GloVe word vector
Technical field
The invention belongs to the fields of natural language processing, data mining and text classification, and in particular relates to a text representation method constructed based on WT-GloVe word vectors.
Background art
The rapidly developing Internet industry has driven the emergence of social networks, the mobile Internet and related industries. The number of websites keeps growing worldwide, producing an explosive amount of information. Spam filtering for e-mail, question categorisation in question-answering systems, query-intent recognition in search engines, positive and negative sentiment judgment of product reviews on shopping sites, public-opinion analysis in government systems, new-topic discovery in social media and network public-opinion monitoring all require continually updated techniques for processing ultra-large text data sets. At the same time, higher demands are placed on the storage and processing capacity of computers. How to process massive text data efficiently, organise massive information, and help users find the content they need efficiently is a major current challenge. Text classification, as a key technology for processing information data, has become a research hotspot in academia and is widely applied in many fields. How to represent text information accurately and how to build a suitable classification model have become the two core problems of the classification task.
Traditional text representation is usually based on the vector space model or the TF-IDF model, which learn a large number of text features to provide a simple representation of the text. The rule is to assign relatively high weights to low-frequency words and relatively low weights to high-frequency words. According to information theory, this weighs the information conveyed by each word in the vocabulary; the weighting includes a logarithmic rescaling of each word's frequency in the document, and this logarithm linearises the roughly exponential frequency distribution of word types across the corpus.
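As a concrete illustration of this classical scheme (not part of the claimed method), a minimal Python sketch of TF-IDF with sub-linear, logarithmic term-frequency rescaling might look as follows; all names are illustrative:

import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (1.0 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights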
As the data scale grows, the dimensionality of text features can reach tens of thousands or more. The vector space model, one of the classical text representation methods, yields text vectors that are high-dimensional and sparse. Moreover, its feature representation is semantically atomic, so semantic relations between features cannot be measured. In text represented with the vector space model, the dimensionality equals the number of features.
Text classification has been a much-studied research hotspot since it was first proposed, and many scholars have carried out in-depth work on text representation, feature-space dimensionality, classifiers and other aspects. In summary, improvements to classification methods start from two directions: first, improvements based on traditional text classification techniques; second, improvements based on neural-network text classification methods.
Summary of the invention
The object of the present invention is to provide a text representation method constructed based on WT-GloVe word vectors, solving the problems of the prior art that traditional text representation methods are computationally complex or represent text information insufficiently.
The technical scheme adopted by the invention is a text representation method constructed based on WT-GloVe word vectors, implemented according to the following steps:
Step 1: assess the importance of each word by computing the word spacing of features specific to network text, judge the word's contribution to classification from the inter-class distribution of features, and combine the two into a feature weighting model of word spacing and inter-class distribution, called WDID-TFIDF;
Step 2: filter unrelated words to address an inherent shortcoming of the GloVe model, thereby improving the quality of word-vector training;
Step 3: according to the result of step 2, select the feature weights of word spacing and inter-class distribution obtained in step 1 and apply them by dot product, obtaining a weighted word-vector model that serves as the final text representation.
The present invention is further characterised in that:
Step 1 is specifically implemented according to the following steps:
Load the data set 20NewsGroups and import the required modules providing the GloVe model; set the training-data storage path and the encoding format. Define functions, introduce a general English stop-word list, tokenise the loaded data set, read the acquired text content into a file line by line, perform text preprocessing with the spacy module, and complete part-of-speech tagging to facilitate subsequent filtering; using the statistical model, compute term weights with WDID-TFIDF and generate the weight matrix.
Step 1 is specifically implemented according to the following steps:
Given the data set 20NewsGroups, first carry out data preprocessing including stop-word removal and stemming/morphological analysis; the results include position markers and word-frequency statistics. Next, read in line by line the tokenised data set D = {d_1, d_2, ..., d_n} with class set C = {c_1, c_2, ..., c_i, ..., c_k}, where d_i = {x_1, x_2, ..., x_j, ..., x_m} and x_j ∈ c_i. The inter-class discrimination ID of feature word x_j for class c_i is expressed as:
Take the maximum value of feature word x_j over all classes as its contribution to that class;
where the value of W(x_j | c_i) indicates the discriminating ability of feature word x_j for class c_i, the frequency with which x_j appears in c_i is TF(x_j | c_i), and the number of texts that do not belong to class c_i but contain x_j is
The resulting W_D(x_j | c) is normalised:
This yields the inter-class discrimination value of feature word x_j ∈ c_i. The word spacing WD of feature word x_j is calculated as:
Distance = L_j - F_j
where L_j is the position index of the last occurrence of feature word x_j in the text, F_j is the position index of its first occurrence, and count is the total number of tokens in the text after word segmentation;
For any feature word x_j ∈ c_i, the WDID-TFIDF term weight W(x_j), based on word spacing and inter-class contribution, is calculated as:
Read in all the data according to the input content, compute the weight of each term, and generate the weight matrix of the text representation of d_i.
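For illustration, step 1 can be sketched in Python as below. Because the filed formulas for the inter-class discrimination ID, its normalisation and the final weight W(x_j) are given as images and are not reproduced in this text, the concrete expressions used here (in-class frequency over out-of-class document frequency, occurrence span over token count, TF-IDF multiplied by their sum) are assumptions; only the overall structure follows the description:

import math
from collections import Counter

def wdid_tfidf(docs, labels):
    # docs: list of token lists; labels: class of each text; returns {term: weight} per text
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))
    classes = set(labels)
    tf_class = {c: Counter() for c in classes}   # term frequency inside each class
    df_out = {c: Counter() for c in classes}     # texts outside the class containing the term
    for d, y in zip(docs, labels):
        tf_class[y].update(d)
        for c in classes - {y}:
            df_out[c].update(set(d))

    def discrimination(term):
        # assumed ID: best per-class ratio, normalised over all classes
        scores = [tf_class[c][term] / (1.0 + df_out[c][term]) for c in classes]
        return max(scores) / (sum(scores) or 1.0)

    weights = []
    for d in docs:
        count = len(d) or 1
        first, last = {}, {}
        for pos, t in enumerate(d):
            first.setdefault(t, pos)
            last[t] = pos
        w = {}
        for t, f in Counter(d).items():
            wd = (last[t] - first[t]) / count          # word spacing
            idf = math.log(n_docs / df[t])
            w[t] = f * idf * (discrimination(t) + wd)  # assumed WDID-TFIDF combination
        weights.append(w)
    return weights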
Step 2 is specifically implemented according to the following steps:
Choose the word-vector settings corresponding to the content of the text set, including the word-vector dimension, the word-window size and the minimum word frequency. For each word w_i in the dictionary, compute the cos θ value with every other word w_m in the text; when cos θ is less than 0, add w_m to the set S(m). Select the top N words in S(m), and compute the ratio λ of the probabilities with which a word k in the context of the given window size co-occurs with the chosen target words w_i and w_m. Filter the unrelated or noise words from the generated matrix according to this co-occurrence probability ratio, obtain the new co-occurrence matrix M, and feed it into GloVe to obtain new word vectors.
Step 2 is specifically implemented according to the following steps:
In the GloVe loss function, each element X_ij of the co-occurrence matrix X denotes the number of times word j appears within the context window of target word i; X_i, the sum of the i-th row of X, is the total number of occurrences of all context words within the window of target word i; and P_ij = P(j | i) = X_ij / X_i is the probability that word j appears around word i. Let w_1 be ice and w_2 be steam:
When the word co-occurring with w_1 and w_2 in the corresponding contexts is gas, the probability that gas appears given ice is 6.6 × 10^-5, the probability that gas appears given steam is 7.8 × 10^-4, and the ratio of the probability of gas given ice to the probability of gas given steam is 8.5 × 10^-2;
When the co-occurring word is solid, the probability that solid appears given ice is 1.9 × 10^-4, the probability that solid appears given steam is 2.2 × 10^-5, and the ratio of the probability of solid given ice to the probability of solid given steam is 8.9;
When the co-occurring word is fashion, the probability that fashion appears given ice is 1.7 × 10^-5, the probability that fashion appears given steam is 1.8 × 10^-5, and the ratio of the probability of fashion given ice to the probability of fashion given steam is 0.96;
When the co-occurring word is water, the probability that water appears given ice is 3.0 × 10^-3, the probability that water appears given steam is 2.2 × 10^-3, and the ratio of the probability of water given ice to the probability of water given steam is 1.36;
When w_1 and w_2 are unrelated and the context word takes different values k, i.e. k = gas, solid, fashion or water: when k = gas, gas is clearly uncorrelated with ice and more related to steam, so the ratio P(k | ice) / P(k | steam) of the probability of gas given ice to the probability of gas given steam is much smaller than 1; when k = solid, solid is related to ice and unrelated to steam, so the ratio of the probability of solid given ice to the probability of solid given steam is much larger than 1; the more similar the introduced word k is to only one of w_1 and w_2, the farther the co-occurrence probability ratio lies from 1. When k = fashion, fashion is unrelated to both w_1 and w_2, and P(k | ice) / P(k | steam) is close to 1; when k = water, water is related to both ice and steam, and the ratio is likewise close to 1, i.e. the more semantically similar the words are within the context, the closer the co-occurrence probability ratio is to 1;
When w_i and w_m are semantically dissimilar and the context contains a word k, whether k is irrelevant information can be obtained from the co-occurrence probability ratio; for words w_i and w_m, when word k lies within the context of the given window size, the co-occurrence probability ratio is:
If words w_i and w_m are dissimilar, then for a given context word k:
(1) when the co-occurrence probability ratio λ ≈ 1, k is an unrelated word;
(2) when the co-occurrence probability ratio λ >> 1 or λ << 1, k is semantically similar to one of w_i and w_m;
Therefore, words w_m dissimilar to w_i are selected to filter the unrelated words; the word-vector similarity is computed as follows:
where the two vectors are those corresponding to w_i and w_m;
That is, the smaller the cosine value of two words, the more dissimilar their contexts and the farther apart their semantics, so the cosine value is used as the general criterion for selecting dissimilar words: N dissimilar words w_m are chosen at random from the set S(m) of all words whose cosine distance to w_i is less than 0 and are used to filter the unrelated words in the context of w_i, reducing the number of non-zero elements in the co-occurrence matrix; the unrelated words in the co-occurrence matrix are filtered out, and the new co-occurrence matrix M is obtained and fed into GloVe.
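A minimal sketch of this step-2 filtering is given below. The tolerance used to decide that λ ≈ 1 is an assumption, and the dissimilar words are taken as the first N of S(m) rather than sampled at random as in the description:

import numpy as np

def filter_cooccurrence(X, vectors, n_dissimilar=15, tol=0.5):
    # X: dense co-occurrence matrix, X[i, j] = count of word j in word i's window
    # vectors: preliminary word vectors (rows aligned with X) used for cosine similarity
    X = X.astype(float)
    P = X / (X.sum(axis=1, keepdims=True) + 1e-12)         # P[i, k] = P(k | i)
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    cos = unit @ unit.T                                     # cosine similarity matrix
    M = X.copy()
    for i in range(X.shape[0]):
        dissimilar = np.where(cos[i] < 0)[0]                # the set S(m)
        if dissimilar.size == 0:
            continue
        sample = dissimilar[:n_dissimilar]                  # first N dissimilar words w_m
        for k in np.nonzero(X[i])[0]:                       # context words of w_i
            ratios = P[i, k] / (P[sample, k] + 1e-12)       # co-occurrence probability ratios
            if np.all(np.abs(ratios - 1.0) <= tol):         # lambda ≈ 1 for every sampled w_m
                M[i, k] = 0.0                               # drop the unrelated word k
    return M                                                # new co-occurrence matrix fed to GloVe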
Step 3 is specifically implemented according to the following steps:
According to step 2, the word vector of feature word x_j in text d_i = {x_1, x_2, ..., x_m} is x_j = (v_1,j, v_2,j, ..., v_t,j), where v_i,j is the value of the word vector of x_j in the i-th feature dimension. Combined with the WDID-TFIDF value of each feature word computed in step 1, the text representation based on the WT-GloVe weighted word-vector model is:
x_j' = (v_1,j, v_2,j, ..., v_t,j) · W(x_j)
where t is the word-vector dimension of feature word x_j. This finally yields the text representation of the 20NewsGroups data set constructed from WT-GloVe word vectors.
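For illustration, the step-3 combination can be sketched as follows; whether the weighted word vectors are further reduced (for example averaged) into a single document vector is not fixed by the description, so only the per-word scaling is shown:

import numpy as np

def weighted_text_vectors(tokens, word_vectors, weights):
    # tokens: feature words of one text; word_vectors: {word: vector} from step 2;
    # weights: {word: WDID-TFIDF value} from step 1; returns one weighted row per word
    rows = [word_vectors[t] * weights.get(t, 0.0)
            for t in tokens if t in word_vectors]           # x_j' = v_j * W(x_j)
    return np.vstack(rows) if rows else np.zeros((0, 0))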
The invention has the following advantages. In the text representation method constructed from WT-GloVe word vectors, the importance of a feature word is assessed from its own word spacing, and its contribution to classification is judged from its inter-class distribution; the two are combined into a feature weighting model of word spacing and inter-class distribution. Weighting word vectors with the TF-IDF algorithm alone can improve classification, but it ignores how a feature word is distributed across the classes, even though this distribution reflects the word's ability to distinguish classes and its contribution to them. The present invention therefore computes an inter-class discrimination value from the distribution of each feature word over the classes. At the same time, word spacing is added as a further weighting scheme for feature terms: the distance between the first and last appearance of a word or phrase in a text is called its word spacing, and the larger this spacing, the wider the range over which the word is mentioned and the more important it is to the theme of the text, so the word-spacing value also represents the importance of a feature word to its text. Weighting the corpus with the feature weighting scheme based on word spacing and inter-class distribution, and combining it with the word vectors produced by the GloVe model after unrelated-word filtering, yields a new weighted word-vector model that captures both the importance and the semantics of the feature words and improves the final classification result.
Brief description of the drawings
Fig. 1 is the flow chart of the research process of the text representation method constructed from WT-GloVe word vectors according to the present invention;
Fig. 2 is the main experimental flow chart of the WT-GloVe-based weighted word-vector model;
Fig. 3 compares the accuracy of the five methods TF-IDF, Word2Vec, GloVe, Word2Vec_TFIDF and WT-GloVe on 9,100 samples;
Fig. 4 compares the accuracy of the five methods TF-IDF, Word2Vec, GloVe, Word2Vec_TFIDF and WT-GloVe on 12,283 samples.
Specific embodiment
The following describes the present invention in detail with reference to the accompanying drawings and specific embodiments.
The flow chart of the text representation method constructed based on WT-GloVe word vectors according to the present invention is shown in Fig. 1; the method is specifically implemented according to the following steps:
Step 1: assess the importance of each word by computing the word spacing of features specific to network text, judge the word's contribution to classification from the inter-class distribution of features, and combine the two into a feature weighting model of word spacing and inter-class distribution, called WDID-TFIDF. Step 1 is specifically implemented according to the following steps:
Load the data set 20NewsGroups and import the required modules providing the GloVe model; set the training-data storage path and the encoding format. Define functions, introduce a general English stop-word list, tokenise the loaded data set, read the acquired text content into a file line by line, perform text preprocessing with the spacy module, and complete part-of-speech tagging to facilitate subsequent filtering; using the statistical model, compute term weights with WDID-TFIDF and generate the weight matrix. This is specifically implemented according to the following steps:
Given the data set 20NewsGroups, first carry out data preprocessing including stop-word removal and stemming/morphological analysis; the results include position markers and word-frequency statistics. Next, read in line by line the tokenised data set D = {d_1, d_2, ..., d_n} with class set C = {c_1, c_2, ..., c_i, ..., c_k}, where d_i = {x_1, x_2, ..., x_j, ..., x_m} and x_j ∈ c_i. The inter-class discrimination ID of feature word x_j for class c_i is expressed as:
Take the maximum value of feature word x_j over all classes as its contribution to that class;
where the value of W(x_j | c_i) indicates the discriminating ability of feature word x_j for class c_i, the frequency with which x_j appears in c_i is TF(x_j | c_i), and the number of texts that do not belong to class c_i but contain x_j is
The resulting W_D(x_j | c) is normalised:
This yields the inter-class discrimination value of feature word x_j ∈ c_i. The word spacing WD of feature word x_j is calculated as:
Distance = L_j - F_j
where L_j is the position index of the last occurrence of feature word x_j in the text, F_j is the position index of its first occurrence, and count is the total number of tokens in the text after word segmentation;
For any feature word x_j ∈ c_i, the WDID-TFIDF term weight W(x_j), based on word spacing and inter-class contribution, is calculated as:
Read in all the data according to the input content, compute the weight of each term, and generate the weight matrix of the text representation of d_i.
Step 2: filter unrelated words to address an inherent shortcoming of the GloVe model, thereby improving the quality of word-vector training; as shown in Fig. 2, step 2 is specifically implemented according to the following steps:
Choose the word-vector settings corresponding to the content of the text set, including the word-vector dimension, the word-window size and the minimum word frequency. For each word w_i in the dictionary, compute the cos θ value with every other word w_m in the text; when cos θ is less than 0, add w_m to the set S(m). Select the top N words in S(m), and compute the ratio λ of the probabilities with which a word k in the context of the given window size co-occurs with the chosen target words w_i and w_m. Filter the unrelated or noise words from the generated matrix according to this co-occurrence probability ratio, obtain the new co-occurrence matrix M, and feed it into GloVe to obtain new word vectors. This is specifically implemented according to the following steps:
In the GloVe loss function, each element X_ij of the co-occurrence matrix X denotes the number of times word j appears within the context window of target word i; X_i, the sum of the i-th row of X, is the total number of occurrences of all context words within the window of target word i; and P_ij = P(j | i) = X_ij / X_i is the probability that word j appears around word i. Let w_1 be ice and w_2 be steam; Table 1 gives the co-occurrence probabilities of the two target words with different words k, extracted from Wikipedia.
Table 1. Co-occurrence word probabilities of target words w_1 and w_2 in their corresponding contexts
When the word co-occurring with w_1 and w_2 in the corresponding contexts is gas, the probability that gas appears given ice is 6.6 × 10^-5, the probability that gas appears given steam is 7.8 × 10^-4, and the ratio of the probability of gas given ice to the probability of gas given steam is 8.5 × 10^-2;
When the co-occurring word is solid, the probability that solid appears given ice is 1.9 × 10^-4, the probability that solid appears given steam is 2.2 × 10^-5, and the ratio of the probability of solid given ice to the probability of solid given steam is 8.9;
When the co-occurring word is fashion, the probability that fashion appears given ice is 1.7 × 10^-5, the probability that fashion appears given steam is 1.8 × 10^-5, and the ratio of the probability of fashion given ice to the probability of fashion given steam is 0.96;
When the co-occurring word is water, the probability that water appears given ice is 3.0 × 10^-3, the probability that water appears given steam is 2.2 × 10^-3, and the ratio of the probability of water given ice to the probability of water given steam is 1.36;
When w_1 and w_2 are unrelated and the context word takes different values k, i.e. k = gas, solid, fashion or water: when k = gas, gas is clearly uncorrelated with ice and more related to steam, so the ratio P(k | ice) / P(k | steam) of the probability of gas given ice to the probability of gas given steam is much smaller than 1; when k = solid, solid is related to ice and unrelated to steam, so the ratio of the probability of solid given ice to the probability of solid given steam is much larger than 1; the more similar the introduced word k is to only one of w_1 and w_2, the farther the co-occurrence probability ratio lies from 1. When k = fashion, fashion is unrelated to both w_1 and w_2, and P(k | ice) / P(k | steam) is close to 1; when k = water, water is related to both ice and steam, and the ratio is likewise close to 1, i.e. the more semantically similar the words are within the context, the closer the co-occurrence probability ratio is to 1;
When w_i and w_m are semantically dissimilar and the context contains a word k, whether k is irrelevant information can be obtained from the co-occurrence probability ratio; for words w_i and w_m, when word k lies within the context of the given window size, the co-occurrence probability ratio is:
If words w_i and w_m are dissimilar, then for a given context word k:
(1) when the co-occurrence probability ratio λ ≈ 1, k is an unrelated word;
(2) when the co-occurrence probability ratio λ >> 1 or λ << 1, k is semantically similar to one of w_i and w_m;
Therefore, words w_m dissimilar to w_i are selected to filter the unrelated words; the word-vector similarity is computed as follows:
where the two vectors are those corresponding to w_i and w_m;
That is, the smaller the cosine value of two words, the more dissimilar their contexts and the farther apart their semantics, so the cosine value is used as the general criterion for selecting dissimilar words: N dissimilar words w_m are chosen at random from the set S(m) of all words whose cosine distance to w_i is less than 0 and are used to filter the unrelated words in the context of w_i, reducing the number of non-zero elements in the co-occurrence matrix; the unrelated words in the co-occurrence matrix are filtered out, and the new co-occurrence matrix M is obtained and fed into GloVe;
During training of the GloVe model, the word-vector dimension features_num chosen for the text content is 300, the context window size context is 10, the minimum word-frequency min_count is 50, the co-occurrence probability ratio λ is set to 15, and the number N of unrelated words introduced is 15; finally, the corresponding word vector of each word is obtained.
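These embodiment settings can be collected in a configuration sketch such as the one below; the parameter names are illustrative and do not refer to the API of any particular GloVe implementation:

glove_params = {
    "features_num": 300,    # word-vector dimension
    "context": 10,          # context window size
    "min_count": 50,        # minimum word frequency kept in the vocabulary
    "cooccur_ratio": 15,    # lambda setting used when filtering the co-occurrence matrix
    "unrelated_words": 15,  # N, number of dissimilar words introduced per target word
}
# Assumed flow: build the co-occurrence matrix with context and min_count, filter it as in
# step 2 to obtain M, then train GloVe on M with features_num dimensions to get the vectors.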
Step 3: according to the result of step 2, select the feature weights of word spacing and inter-class distribution obtained in step 1 and apply them by dot product, obtaining a weighted word-vector model that serves as the final text representation. Step 3 is specifically implemented according to the following steps: according to step 2, the word vector of feature word x_j in text d_i = {x_1, x_2, ..., x_m} is x_j = (v_1,j, v_2,j, ..., v_t,j), where v_i,j is the value of the word vector of x_j in the i-th feature dimension. Combined with the WDID-TFIDF value of each feature word computed in step 1, the text representation based on the WT-GloVe weighted word-vector model is:
x_j' = (v_1,j, v_2,j, ..., v_t,j) · W(x_j)
where t is the word-vector dimension of feature word x_j. This finally yields the text representation of the 20NewsGroups data set constructed from WT-GloVe word vectors.
As shown in Fig. 3, TF-IDF initially performs better than the other two algorithms. As the number of samples increases, the growth rate of TF-IDF slowly declines, and once the sample number exceeds a certain amount its accuracy begins to fall. For GloVe and WT-GloVe, the overall trend is that accuracy keeps increasing with the number of samples; below 500 samples their growth is lower than that of traditional TF-IDF, while beyond 500 samples it improves markedly. At that point the accuracy of GloVe reaches 79.02% and that of WT-GloVe reaches 82.86%. Between 2,000 and 6,400 samples, from 3,200 onwards the accuracy of the feature weighting scheme WDID-TFIDF, improved from traditional TF-IDF, decreases slightly by 0.88% just as TF-IDF does, which makes the growth of WT-GloVe lower than that of GloVe; judged by its overall trend, however, the accuracy of WT-GloVe remains comparatively good. Comparing the accuracy of Word2Vec, Word2Vec_TFIDF and WT-GloVe, the growth of Word2Vec and Word2Vec_TFIDF is better than that of WT-GloVe at the very beginning, with accuracies of 73.73% and 74.25% respectively, because GloVe can only play out its superior grasp of semantics on large data sets. After 1,500 samples the growth of WT-GloVe immediately exceeds Word2Vec_TFIDF and Word2Vec, with an accuracy of 89.72%. Taking the figure as a whole, although the accuracy of WT-GloVe is below the other algorithms on small sample numbers, its later performance is better overall than TF-IDF, Word2Vec, Word2Vec_TFIDF and GloVe. The accuracy of the five methods is further compared on 12,283 samples: TF-IDF, GloVe and WT-GloVe on the one hand, and Word2Vec, Word2Vec_TFIDF and WT-GloVe on the other.
As shown in Fig. 4, observing GloVe and WT-GloVe, at 2,000 samples the accuracy of both is lower than traditional TF-IDF, at 80.32% and 82.58% respectively; as the number of samples increases, the accuracy of both also increases. The growth of WT-GloVe drops noticeably after 2,000 samples, because beyond a certain data size WDID-TFIDF behaves like TF-IDF and its accuracy falls by 0.64%, while the growth of GloVe is unaffected during this period. As the sample number keeps increasing, the growth of both GloVe and WT-GloVe slows down while their accuracy continues to rise, reaching 84.54% and 86.29% respectively at 9,000 samples. Comparing the accuracy of Word2Vec, Word2Vec_TFIDF and WT-GloVe, as the sample number increases WT-GloVe starts from an accuracy of 74.64% and remains above Word2Vec and Word2Vec_TFIDF throughout.

Claims (6)

1. A text representation method constructed based on WT-GloVe word vectors, characterised in that it is specifically implemented according to the following steps:
Step 1: assess the importance of each word by computing the word spacing of features specific to network text, judge the word's contribution to classification from the inter-class distribution of features, and combine the two into a feature weighting model of word spacing and inter-class distribution, called WDID-TFIDF;
Step 2: filter unrelated words to address an inherent shortcoming of the GloVe model, thereby improving the quality of word-vector training;
Step 3: according to the result of step 2, select the feature weights of word spacing and inter-class distribution obtained in step 1 and apply them by dot product, obtaining a weighted word-vector model that serves as the final text representation.
2. The text representation method constructed based on WT-GloVe word vectors according to claim 1, characterised in that step 1 is specifically implemented according to the following steps:
Load the data set 20NewsGroups and import the required modules providing the GloVe model; set the training-data storage path and the encoding format; define functions, introduce a general English stop-word list, tokenise the loaded data set, read the acquired text content into a file line by line, perform text preprocessing with the spacy module, and complete part-of-speech tagging to facilitate subsequent filtering; using the statistical model, compute term weights with WDID-TFIDF and generate the weight matrix.
3. The text representation method constructed based on WT-GloVe word vectors according to claim 2, characterised in that step 1 is specifically implemented according to the following steps:
Given the data set 20NewsGroups, first carry out data preprocessing including stop-word removal and stemming/morphological analysis, the results including position markers and word-frequency statistics; next, read in line by line the tokenised data set D = {d_1, d_2, ..., d_n} with class set C = {c_1, c_2, ..., c_i, ..., c_k}, where d_i = {x_1, x_2, ..., x_j, ..., x_m} and x_j ∈ c_i; the inter-class discrimination ID of feature word x_j for class c_i is expressed as:
Take the maximum value of feature word x_j over all classes as its contribution to that class;
where the value of W(x_j | c_i) indicates the discriminating ability of feature word x_j for class c_i, the frequency with which x_j appears in c_i is TF(x_j | c_i), and the number of texts that do not belong to class c_i but contain x_j is
The resulting W_D(x_j | c) is normalised:
This yields the inter-class discrimination value of feature word x_j ∈ c_i; the word spacing WD of feature word x_j is calculated as:
Distance = L_j - F_j
where L_j is the position index of the last occurrence of feature word x_j in the text, F_j is the position index of its first occurrence, and count is the total number of tokens in the text after word segmentation;
for any feature word x_j ∈ c_i, the WDID-TFIDF term weight W(x_j), based on word spacing and inter-class contribution, is calculated as:
Read in all the data according to the input content, compute the weight of each term, and generate the weight matrix of the text representation of d_i.
4. The text representation method constructed based on WT-GloVe word vectors according to claim 3, characterised in that step 2 is specifically implemented according to the following steps:
Choose the word-vector settings corresponding to the content of the text set, including the word-vector dimension, the word-window size and the minimum word frequency; for each word w_i in the dictionary, compute the cos θ value with every other word w_m in the text, and when cos θ is less than 0 add w_m to the set S(m); select the top N words in S(m), and compute the ratio λ of the probabilities with which a word k in the context of the given window size co-occurs with the chosen target words w_i and w_m; filter the unrelated or noise words from the generated matrix according to this co-occurrence probability ratio, obtain the new co-occurrence matrix M, and feed it into GloVe to obtain new word vectors.
5. The text representation method constructed based on WT-GloVe word vectors according to claim 4, characterised in that step 2 is specifically implemented according to the following steps:
In the GloVe loss function, each element X_ij of the co-occurrence matrix X denotes the number of times word j appears within the context window of target word i; X_i, the sum of the i-th row of X, is the total number of occurrences of all context words within the window of target word i; and P_ij = P(j | i) = X_ij / X_i is the probability that word j appears around word i; let w_1 be ice and w_2 be steam:
When the word co-occurring with w_1 and w_2 in the corresponding contexts is gas, the probability that gas appears given ice is 6.6 × 10^-5, the probability that gas appears given steam is 7.8 × 10^-4, and the ratio of the probability of gas given ice to the probability of gas given steam is 8.5 × 10^-2;
When the co-occurring word is solid, the probability that solid appears given ice is 1.9 × 10^-4, the probability that solid appears given steam is 2.2 × 10^-5, and the ratio of the probability of solid given ice to the probability of solid given steam is 8.9;
When the co-occurring word is fashion, the probability that fashion appears given ice is 1.7 × 10^-5, the probability that fashion appears given steam is 1.8 × 10^-5, and the ratio of the probability of fashion given ice to the probability of fashion given steam is 0.96;
When the co-occurring word is water, the probability that water appears given ice is 3.0 × 10^-3, the probability that water appears given steam is 2.2 × 10^-3, and the ratio of the probability of water given ice to the probability of water given steam is 1.36;
When w_1 and w_2 are unrelated and the context word takes different values k, i.e. k = gas, solid, fashion or water: when k = gas, gas is clearly uncorrelated with ice and more related to steam, so the ratio P(k | ice) / P(k | steam) is much smaller than 1; when k = solid, solid is related to ice and unrelated to steam, so the ratio is much larger than 1; the more similar the introduced word k is to only one of w_1 and w_2, the farther the co-occurrence probability ratio lies from 1; when k = fashion, fashion is unrelated to both w_1 and w_2 and P(k | ice) / P(k | steam) is close to 1; when k = water, water is related to both ice and steam and the ratio is likewise close to 1, i.e. the more semantically similar the words are within the context, the closer the co-occurrence probability ratio is to 1;
When w_i and w_m are semantically dissimilar and the context contains a word k, whether k is irrelevant information can be obtained from the co-occurrence probability ratio; for words w_i and w_m, when word k lies within the context of the given window size, the co-occurrence probability ratio is:
If words w_i and w_m are dissimilar, then for a given context word k:
(1) when the co-occurrence probability ratio λ ≈ 1, k is an unrelated word;
(2) when the co-occurrence probability ratio λ >> 1 or λ << 1, k is semantically similar to one of w_i and w_m;
Therefore, words w_m dissimilar to w_i are selected to filter the unrelated words, and the word-vector similarity is computed as follows:
where the two vectors are those corresponding to w_i and w_m;
That is, the smaller the cosine value of two words, the more dissimilar their contexts and the farther apart their semantics, so the cosine value is used as the general criterion for selecting dissimilar words: N dissimilar words w_m are chosen at random from the set S(m) of all words whose cosine distance to w_i is less than 0 and are used to filter the unrelated words in the context of w_i, reducing the number of non-zero elements in the co-occurrence matrix; the unrelated words in the co-occurrence matrix are filtered out, and the new co-occurrence matrix M is obtained and fed into GloVe.
6. The text representation method constructed based on WT-GloVe word vectors according to claim 5, characterised in that step 3 is specifically implemented according to the following steps:
According to step 2, the word vector of feature word x_j in text d_i = {x_1, x_2, ..., x_m} is x_j = (v_1,j, v_2,j, ..., v_t,j), where v_i,j is the value of the word vector of x_j in the i-th feature dimension; combined with the WDID-TFIDF value of each feature word computed in step 1, the text representation based on the WT-GloVe weighted word-vector model is:
x_j' = (v_1,j, v_2,j, ..., v_t,j) · W(x_j)
where t is the word-vector dimension of feature word x_j; this finally yields the text representation of the 20NewsGroups data set constructed from WT-GloVe word vectors.
CN201910573695.5A 2019-06-28 2019-06-28 Text representation method constructed based on WT-GloVe word vector Active CN110348497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910573695.5A CN110348497B (en) 2019-06-28 2019-06-28 Text representation method constructed based on WT-GloVe word vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910573695.5A CN110348497B (en) 2019-06-28 2019-06-28 Text representation method constructed based on WT-GloVe word vector

Publications (2)

Publication Number Publication Date
CN110348497A true CN110348497A (en) 2019-10-18
CN110348497B CN110348497B (en) 2021-09-10

Family

ID=68176994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910573695.5A Active CN110348497B (en) 2019-06-28 2019-06-28 Text representation method constructed based on WT-GloVe word vector

Country Status (1)

Country Link
CN (1) CN110348497B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143510A (en) * 2019-12-10 2020-05-12 广东电网有限责任公司 Searching method based on latent semantic analysis model
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706780A (en) * 2009-09-03 2010-05-12 北京交通大学 Image semantic retrieving method based on visual attention model
CN101944099A (en) * 2010-06-24 2011-01-12 西北工业大学 Method for automatically classifying text documents by utilizing body
US20130253906A1 (en) * 2012-03-26 2013-09-26 Verizon Patent And Licensing Inc. Environment sensitive predictive text entry
CN103336806A (en) * 2013-06-24 2013-10-02 北京工业大学 Method for sequencing keywords based on entropy difference between word-spacing-appearing internal mode and external mode
CN106156772A (en) * 2015-03-25 2016-11-23 佳能株式会社 For determining the method and apparatus of word spacing and for the method and system of participle
CN107577668A (en) * 2017-09-15 2018-01-12 电子科技大学 Social media non-standard word correcting method based on semanteme
CN109189925A (en) * 2018-08-16 2019-01-11 华南师范大学 Term vector model based on mutual information and based on the file classification method of CNN
CN109271517A (en) * 2018-09-29 2019-01-25 东北大学 IG TF-IDF Text eigenvector generates and file classification method
CN109933670A (en) * 2019-03-19 2019-06-25 中南大学 A kind of file classification method calculating semantic distance based on combinatorial matrix

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706780A (en) * 2009-09-03 2010-05-12 北京交通大学 Image semantic retrieving method based on visual attention model
CN101944099A (en) * 2010-06-24 2011-01-12 西北工业大学 Method for automatically classifying text documents by utilizing body
US20130253906A1 (en) * 2012-03-26 2013-09-26 Verizon Patent And Licensing Inc. Environment sensitive predictive text entry
CN103336806A (en) * 2013-06-24 2013-10-02 北京工业大学 Method for sequencing keywords based on entropy difference between word-spacing-appearing internal mode and external mode
CN106156772A (en) * 2015-03-25 2016-11-23 佳能株式会社 For determining the method and apparatus of word spacing and for the method and system of participle
CN107577668A (en) * 2017-09-15 2018-01-12 电子科技大学 Social media non-standard word correcting method based on semanteme
CN109189925A (en) * 2018-08-16 2019-01-11 华南师范大学 Term vector model based on mutual information and based on the file classification method of CNN
CN109271517A (en) * 2018-09-29 2019-01-25 东北大学 IG TF-IDF Text eigenvector generates and file classification method
CN109933670A (en) * 2019-03-19 2019-06-25 中南大学 A kind of file classification method calculating semantic distance based on combinatorial matrix

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CASPER HANSEN et al.: "Contextually Propagated Term Weights for Document Representation", 《ARXIV》 *
JEFFREY PENNINGTON et al.: "GloVe: Global Vectors for Word Representation", 《PROCEEDINGS OF THE 2014 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING》 *
MATT J. KUSNER et al.: "From Word Embeddings To Document Distances", 《PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON MACHINE LEARNING》 *
张栋 et al.: "Semi-supervised question classification method based on answer assistance", 《计算机工程与科学》 *
李峰 et al.: "Research on multi-feature sentence similarity computation incorporating word vectors", 《计算机科学与探索》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143510A (en) * 2019-12-10 2020-05-12 广东电网有限责任公司 Searching method based on latent semantic analysis model
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification

Also Published As

Publication number Publication date
CN110348497B (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN107608999A (en) A kind of Question Classification method suitable for automatically request-answering system
CN107315797A (en) A kind of Internet news is obtained and text emotion forecasting system
CN109670014B (en) Paper author name disambiguation method based on rule matching and machine learning
CN108763214B (en) Automatic construction method of emotion dictionary for commodity comments
CN103116637A (en) Text sentiment classification method facing Chinese Web comments
Probierz et al. Rapid detection of fake news based on machine learning methods
CN106682089A (en) RNNs-based method for automatic safety checking of short message
Yüksel et al. Turkish tweet classification with transformer encoder
KR20160149050A (en) Apparatus and method for selecting a pure play company by using text mining
CN109614490A (en) Money article proneness analysis method based on LSTM
Kurniawan et al. Indonesian twitter sentiment analysis using Word2Vec
CN115329085A (en) Social robot classification method and system
CN110348497A (en) A kind of document representation method based on the building of WT-GloVe term vector
CN114722198A (en) Method, system and related device for determining product classification code
Jayakody et al. Sentiment analysis on product reviews on twitter using Machine Learning Approaches
Anjum et al. Exploring humor in natural language processing: a comprehensive review of JOKER tasks at CLEF symposium 2023
CN111078874B (en) Foreign Chinese difficulty assessment method based on decision tree classification of random subspace
Kusum et al. Sentiment analysis using global vector and long short-term memory
Kavitha et al. A review on machine learning techniques for text classification
CN114896398A (en) Text classification system and method based on feature selection
CN113761123A (en) Keyword acquisition method and device, computing equipment and storage medium
Baria et al. Theoretical evaluation of machine and deep learning for detecting fake news
Suhasini et al. A Hybrid TF-IDF and N-Grams Based Feature Extraction Approach for Accurate Detection of Fake News on Twitter Data
Handayani et al. Sentiment Analysis of Bank BNI User Comments Using the Support Vector Machine Method
Intani et al. Automating Public Complaint Classification Through JakLapor Channel: A Case Study of Jakarta, Indonesia

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant