CN110348497A - A document representation method based on WT-GloVe word-vector construction - Google Patents
A document representation method based on WT-GloVe word-vector construction
- Publication number: CN110348497A
- Application number: CN201910573695.5A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F18/24—Classification techniques
- G06F40/20—Natural language analysis
Abstract
The invention discloses a document representation method based on WT-GloVe word-vector construction. First, the importance of each feature word of a network text is assessed by computing its word spacing, and its contribution to classification is judged from its inter-class distribution; the two are combined into a feature weighting model of word spacing and inter-class distribution, called WDID-TFIDF. Then, to address a shortcoming of the GloVe model itself, unrelated words are filtered out, improving the quality of word-vector training. Finally, according to the filtering results, the word-spacing and inter-class-distribution feature weights are selected and multiplied with the word vectors, yielding a weighted word-vector model that serves as the final document representation. The invention solves the problems of traditional document representation methods in the prior art, which are computationally complex or represent text information insufficiently.
Description
Technical field
The invention belongs to the fields of natural language processing, data mining and text classification, and in particular relates to a document representation method based on WT-GloVe word-vector construction.
Background technique
The rapidly developing Internet industry has driven the emergence of social networks, the mobile Internet and related industries, and the continually growing number of websites worldwide generates an explosive volume of information. Spam filtering for e-mail, question categorization in question-answering systems, query identification in search engines, sentiment polarity judgment of product reviews on shopping websites, mass opinion analysis in government systems, new-topic discovery in social media, and network public-opinion monitoring all demand continually updated techniques for processing ultra-large text datasets. At the same time, higher standards are imposed on computer storage and processing capacity. How to efficiently process the potential knowledge in massive text data, organize massive information, and help users find the content they need is a major current challenge. Text classification, a key technology of information processing, has become a research hotspot in academia and is widely applied in many fields. How to represent text information accurately and how to construct a suitable classification model have become the two core problems of the classification task.
Traditional text representation is usually based on the vector space model or on the TF-IDF model, which provide a simple representation of text by learning a large number of text features: the model assigns relatively high weights to low-frequency words and relatively low weights to high-frequency words. In information-theoretic terms, this weighs the information each word in the vocabulary conveys; the weighting includes a logarithmic re-scaling of each word's frequency in the document, and this logarithm linearizes the exponential distribution of word types across the corpus.
As data scale expands, the dimensionality of text features can reach tens of thousands or more. The vector space model, one of the classical text representation methods, yields text vectors that are high-dimensional and sparse. Moreover, the semantics of its features are atomic, so semantic relations between features cannot be measured. In text represented by the vector space model, the dimensionality equals the number of features.
Text classification has been a research hotspot since it was first proposed, and many scholars have conducted in-depth studies of text representation, feature-space dimensionality, classifiers and other aspects. In summary, improvements to classification methods start from two directions: first, improvements based on traditional text classification techniques; second, improvements based on neural-network text classification methods.
Summary of the invention
The object of the present invention is to provide a document representation method based on WT-GloVe word-vector construction, solving the problems of traditional document representation methods in the prior art, which are computationally complex or represent text information insufficiently.
The technical scheme adopted by the invention is a document representation method based on WT-GloVe word-vector construction, implemented according to the following steps:
Step 1: assess the importance of each feature word of the network text by computing its word spacing, and judge its contribution to classification from its inter-class distribution; combine the two into a feature weighting model of word spacing and inter-class distribution, called WDID-TFIDF;
Step 2: filter out unrelated words to address a shortcoming of the GloVe model itself, thereby improving word-vector training quality;
Step 3: according to the results of step 2, select the word-spacing and inter-class-distribution feature weights from step 1 and multiply them with the word vectors, obtaining a weighted word-vector model that is the final document representation.
The invention is further characterized as follows.
Step 1 is implemented as follows:
Load the 20NewsGroups dataset, import the modules required for the GloVe model, and set the training-data storage path and encoding format. Define functions, introduce a general English stop-word list, and tokenize the loaded dataset; read the acquired text content into a file line by line, perform text preprocessing with the spacy toolkit, and complete part-of-speech tagging to facilitate subsequent filtering. Using the WDID-TFIDF statistical model, compute the word weights and generate the weight matrix.
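The preprocessing just described can be sketched as follows. This is a minimal stand-in: it uses a tiny hand-rolled stop-word list and a regex tokenizer instead of the spacy pipeline and full English stop-word list named above, and it records the first/last position and frequency of each word for the later word-spacing computation.

```python
# Minimal preprocessing sketch: tokenize, drop stop words, and record each
# remaining word's first/last position and frequency.
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}  # tiny stand-in list

def preprocess(text):
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stats = {}
    for pos, tok in enumerate(tokens):
        if tok not in stats:
            stats[tok] = {"first": pos, "last": pos, "freq": 0}
        stats[tok]["last"] = pos
        stats[tok]["freq"] += 1
    return tokens, stats

tokens, stats = preprocess("The GloVe model improves the GloVe vectors of a corpus.")
print(stats["glove"])  # first/last positions and frequency of 'glove'
```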
Step 1 in detail:
Given the dataset 20NewsGroups, first carry out data preprocessing including stop-word removal and stemming analysis; the results include position markers and word-frequency statistics. Next, read in line by line the tokenized dataset D = {d_1, d_2, ..., d_n} with category set C = {c_1, c_2, ..., c_i, ..., c_k}, where d_i = {x_1, x_2, ..., x_j, ..., x_m} and x_j ∈ c_i. The inter-class discrimination ID of feature word x_j for category c_i is expressed as follows:
Take the maximum value of feature word x_j over all the categories as its contribution to that category.
Here, the value W(x_j|c_i) indicates the separating capacity of feature word x_j for category c_i; the frequency of x_j appearing in c_i is TF(x_j|c_i), and the formula also uses the number of texts that do not belong to category c_i but contain x_j.
The resulting W_D(x_j|c) is normalized:
This gives the inter-class discrimination value of feature word x_j ∈ c_i. The word spacing WD of feature word x_j is calculated as:
Distance = L_j - F_j
where L_j is the position index of the last occurrence of feature word x_j in the text, F_j is the position index of its first occurrence, and count is the total number of tokens in the text after word segmentation.
For any feature word x_j ∈ c_i, the WDID-TFIDF word weight W(x_j), based on word spacing and inter-class contribution, is calculated accordingly.
All data are read from the input, each word weight is computed, and the weight matrix of the text representation of d_i is generated.
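A sketch of the WDID-TFIDF weighting under stated assumptions: only the word-spacing formula Distance = L_j - F_j (normalized by the token count) appears explicitly in the text, while the patent's exact expressions for the inter-class discrimination ID and the combined weight W(x_j) are not reproduced in this translation, so the `inter_class_id` and `wdid_weight` forms below are plausible reconstructions, not the patent's actual formulas.

```python
# Hypothetical reconstruction of the WDID-TFIDF weight. Only the word
# spacing Distance = L_j - F_j is stated explicitly in the text; the ID
# term and the final product are labeled assumptions for illustration.

def word_distance(stats, word, count):
    # normalized word spacing: (last - first occurrence) / token count
    s = stats[word]
    return (s["last"] - s["first"]) / max(count, 1)

def inter_class_id(word, docs_by_class):
    # assumed form: term frequency inside a class, penalized by how many
    # documents outside the class contain the word; take the max over classes
    all_docs = [d for docs in docs_by_class.values() for d in docs]
    best = 0.0
    for c, docs in docs_by_class.items():
        tf = sum(d.count(word) for d in docs)
        outside = sum(1 for d in all_docs if word in d) - sum(1 for d in docs if word in d)
        best = max(best, tf / (1 + outside))
    return best

def wdid_weight(stats, word, count, docs_by_class, tfidf):
    # combined weight: word spacing x inter-class discrimination x TF-IDF
    return word_distance(stats, word, count) * inter_class_id(word, docs_by_class) * tfidf

stats = {"glove": {"first": 0, "last": 3, "freq": 2}}
docs_by_class = {"c1": [["glove", "model"]], "c2": [["corpus"]]}
w = wdid_weight(stats, "glove", 6, docs_by_class, tfidf=1.2)
```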
Step 2 is implemented as follows:
Set the word-vector parameters for the text collection, including the word-vector dimensionality, context window size, and minimum word frequency. For each word w_i in the dictionary, compute the cosine value cos θ with every other word w_m in the text; when cos θ is less than 0, add w_m to the set S(m). Select the top N words in S(m), and compute the co-occurrence probability ratio λ between a word k in the context window of the given size and the selected target words w_i, w_m. Filter the unrelated or noise words out of the generated matrix according to the co-occurrence probability ratio, obtain a new co-occurrence matrix M, and input it into GloVe to obtain new word vectors.
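The filtering loop just described can be sketched as follows, with invented toy vectors and counts: words with negative cosine similarity to the target form the candidate set S(m), and context words whose co-occurrence probability ratio against a dissimilar word is close to 1 are zeroed out of the matrix. The tolerance `tol` is an illustrative parameter, not one named in the text.

```python
# Sketch of unrelated-word filtering via cosine similarity and the
# co-occurrence probability ratio. Vectors and counts are invented.
import math

def cosine(u, v):
    # cosine similarity of two vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dissimilar_words(target, vectors, n):
    # S(m): words whose cosine similarity with the target is below 0; keep n
    cands = [w for w, v in vectors.items() if w != target and cosine(vectors[target], v) < 0]
    return cands[:n]

def filter_cooccurrence(X, target, vectors, n, tol=0.25):
    # zero out context words k of `target` whose co-occurrence probability
    # ratio against a dissimilar word w_m is close to 1 (unrelated words)
    for wm in dissimilar_words(target, vectors, n):
        tot_t = sum(X[target].values())
        tot_m = max(sum(X[wm].values()), 1)
        for k in list(X[target]):
            p_t = X[target][k] / tot_t
            p_m = X[wm].get(k, 0) / tot_m
            if p_m > 0 and abs(p_t / p_m - 1) < tol:
                X[target][k] = 0
    return X

vectors = {"ice": [1.0, 0.0], "hot": [-1.0, 0.1]}          # toy word vectors
X = {"ice": {"fashion": 2, "solid": 90},
     "hot": {"fashion": 2, "solid": 3, "heat": 87}}        # toy co-occurrence counts
filtered = filter_cooccurrence(X, "ice", vectors, n=15)
```

Here "fashion" co-occurs with the dissimilar pair at nearly equal rates (ratio ≈ 1), so it is filtered from the row of "ice", while "solid" survives.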
Step 2 in detail:
In the GloVe loss function, each element X_ij of the co-occurrence matrix X denotes the number of times word j appears in the context window of target word i; X_i = Σ_k X_ik is the sum of the i-th row of X, i.e. the total number of occurrences of all context words in the window of target word i; and P_ij = P(j|i) = X_ij / X_i is the probability that word j appears around word i. Let w_1 be ice and w_2 be steam:
When the word co-occurring with w_1, w_2 in the corresponding contexts is gas, the probability of gas appearing given ice is 6.6 × 10⁻⁵, the probability of gas given steam is 7.8 × 10⁻⁴, and the ratio of the probability of gas given ice to the probability of gas given steam is 8.5 × 10⁻².
When the co-occurring word is solid, the probability of solid given ice is 1.9 × 10⁻⁴, the probability of solid given steam is 2.2 × 10⁻⁵, and the ratio of the two is 8.9.
When the co-occurring word is fashion, the probability of fashion given ice is 1.7 × 10⁻⁵, the probability of fashion given steam is 1.8 × 10⁻⁵, and the ratio is 0.96.
When the co-occurring word is water, the probability of water given ice is 3.0 × 10⁻³, the probability of water given steam is 2.2 × 10⁻³, and the ratio is 1.36.
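The probability ratios above can be reproduced mechanically from any co-occurrence matrix; the sketch below uses invented counts (not the Wikipedia statistics quoted in the text) to show how P(j|i) = X_ij / X_i and the ratio behave for related and unrelated context words.

```python
# Toy co-occurrence matrix; vocabulary and counts are invented for illustration.
vocab = ["ice", "steam", "solid", "gas", "water", "fashion"]
idx = {w: i for i, w in enumerate(vocab)}
# X[i][j]: times word j appears in the context window of target word i
X = [
    [0, 2, 90, 3, 60, 2],    # ice
    [2, 0, 3, 80, 50, 2],    # steam
    [90, 3, 0, 1, 5, 1],     # solid
    [3, 80, 1, 0, 5, 1],     # gas
    [60, 50, 5, 5, 0, 2],    # water
    [2, 2, 1, 1, 2, 0],      # fashion
]

def P(j, i):
    # P(j | i) = X_ij / X_i
    row = X[idx[i]]
    return row[idx[j]] / sum(row)

def ratio(k, w1, w2):
    # co-occurrence probability ratio P(k | w1) / P(k | w2)
    return P(k, w1) / P(k, w2)

print(ratio("solid", "ice", "steam"))    # >> 1: solid is related to ice only
print(ratio("gas", "ice", "steam"))      # << 1: gas is related to steam only
print(ratio("fashion", "ice", "steam"))  # ~ 1: fashion is unrelated to both
```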
Now consider w_1, w_2 with different context words k, i.e. k = gas, solid, fashion, water. For k = gas, gas is clearly unrelated to ice and more related to steam, so the ratio P(k|ice)/P(k|steam) of the probability of gas given ice to the probability of gas given steam is much smaller than 1. For k = solid, solid is related to ice and unrelated to steam, so the ratio of the probability of solid given ice to the probability of solid given steam is much larger than 1. In general, the lower the semantic similarity between an introduced word k and one of w_1, w_2, the farther the co-occurrence probability ratio is from 1. But for k = fashion, which is unrelated to both w_1 and w_2, P(k|ice)/P(k|steam) is close to 1; and for k = water, which is related to both ice and steam, the ratio is also close to 1. That is, the more semantically similar the words in a context are, the closer their co-occurrence probability ratio is to 1.
When w_i and w_m are semantically dissimilar and the context is known to contain a word k, the co-occurrence probability ratio reveals whether k is irrelevant information. For words w_i and w_m whose context of the given window size contains word k, the co-occurrence probability ratio λ is defined as follows:
If words w_i and w_m are dissimilar, then for a given context word k:
(1) when the co-occurrence probability ratio λ ≈ 1, k is an unrelated word;
(2) when λ >> 1 or λ << 1, k is semantically similar to one of w_i, w_m.
Therefore, words w_m dissimilar to word w_i are selected to filter unrelated words; the word-vector similarity is computed from the word vectors corresponding to w_i and w_m.
That is, the smaller the cosine value of two words, the more dissimilar their contexts and the farther apart their semantics. Hence dissimilar words are selected by cosine value: from the set S(m) of all words whose cosine distance to w_i is less than 0, N dissimilar words w_m are chosen at random to filter the unrelated words in the context of w_i. Filtering the unrelated words out of the co-occurrence matrix reduces the number of its nonzero elements and yields a new co-occurrence matrix M, which is input into GloVe.
Step 3 in detail:
From step 2, the word vector of feature word x_j in text d_i = {x_1, x_2, ..., x_m} is x_j = (v_{1,j}, v_{2,j}, ..., v_{t,j}), where v_{i,j} is the value of the word vector of x_j in the i-th feature dimension. Combined with the WDID-TFIDF value of each feature word computed in step 1, the text representation based on the WT-GloVe weighted word-vector model is:
x_j' = (v_{1,j}, v_{2,j}, ..., v_{t,j}) · W(x_j)
where t is the word-vector dimensionality of feature word x_j. This finally yields the text representation of the 20NewsGroups dataset constructed from WT-GloVe word vectors.
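The scalar weighting of step 3 amounts to scaling each word vector by its WDID-TFIDF weight W(x_j); stacking the scaled vectors row by row gives the weight matrix of the text. The vectors and weights below are toy values.

```python
# Scaling each word vector by its WDID-TFIDF weight and stacking the rows.
def weight_vector(vec, w):
    # x' = (v_1, ..., v_t) * W(x): elementwise scaling by the scalar weight
    return [v * w for v in vec]

def document_matrix(word_vectors, weights):
    # one weighted row per feature word of the document
    return [weight_vector(word_vectors[x], weights[x]) for x in word_vectors]

wv = {"glove": [0.1, -0.2, 0.3], "model": [0.0, 0.5, -0.1]}  # toy word vectors
W = {"glove": 2.0, "model": 0.5}                             # toy WDID-TFIDF weights
doc = document_matrix(wv, W)
```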
The invention has the following advantages. In this document representation method based on WT-GloVe word-vector construction, the importance of each feature word is assessed by computing its word spacing, its contribution to classification is judged from its inter-class distribution, and the two are combined into a feature weighting model of word spacing and inter-class distribution. Weighting word vectors with the TF-IDF algorithm alone can improve classification, but it ignores how feature words are distributed across the classes, even though the distribution of a feature word over the categories reflects its ability and contribution to distinguishing classes. The present invention computes an inter-class discrimination value for each feature word from its distribution over the categories; at the same time, word spacing is added as a further weighting scheme for feature terms. The distance between the first and last appearance of a word or phrase in a text is called its word spacing: the larger the spacing, the wider the range over which the word is mentioned, and the more important it is to the theme of the text, so the word-spacing value also represents the importance of a feature word to its text. The corpus is weighted with this scheme of word spacing and inter-class distribution, and combined with the GloVe model after unrelated-word filtering to generate word-vector representations. Capturing both the importance and the semantics of the feature words, this constitutes a new weighted word-vector model that improves the final classification performance.
Brief description of the drawings
Fig. 1 is a flow chart of the research process of the document representation method based on WT-GloVe word-vector construction according to the present invention;
Fig. 2 is the main flow chart of the WT-GloVe weighted word-vector model experiment;
Fig. 3 compares the accuracy of the five methods TF-IDF, Word2Vec, GloVe, Word2Vec_TFIDF and WT-GloVe on 9100 samples;
Fig. 4 compares the accuracy of the five methods TF-IDF, Word2Vec, GloVe, Word2Vec_TFIDF and WT-GloVe on 12283 samples.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The document representation method based on WT-GloVe word-vector construction of the present invention, whose flow chart is shown in Fig. 1, is implemented according to the following steps.
Step 1: assess the importance of each feature word of the network text by computing its word spacing, and judge its contribution to classification from its inter-class distribution; combine the two into the WDID-TFIDF feature weighting model of word spacing and inter-class distribution. Step 1 is implemented as follows:
Load the 20NewsGroups dataset, import the modules required for the GloVe model, and set the training-data storage path and encoding format. Define functions, introduce a general English stop-word list, and tokenize the loaded dataset; read the acquired text content into a file line by line, perform text preprocessing with the spacy toolkit, and complete part-of-speech tagging for subsequent filtering. Using the WDID-TFIDF statistical model, compute the word weights and generate the weight matrix. In detail:
Given the dataset 20NewsGroups, first carry out data preprocessing including stop-word removal and stemming analysis; the results include position markers and word-frequency statistics. Next, read in line by line the tokenized dataset D = {d_1, d_2, ..., d_n} with category set C = {c_1, c_2, ..., c_i, ..., c_k}, where d_i = {x_1, x_2, ..., x_j, ..., x_m} and x_j ∈ c_i. The inter-class discrimination ID of feature word x_j for category c_i is expressed as follows:
Take the maximum value of feature word x_j over all the categories as its contribution to that category.
Here, the value W(x_j|c_i) indicates the separating capacity of feature word x_j for category c_i; the frequency of x_j appearing in c_i is TF(x_j|c_i), and the formula also uses the number of texts that do not belong to category c_i but contain x_j.
The resulting W_D(x_j|c) is normalized:
This gives the inter-class discrimination value of feature word x_j ∈ c_i. The word spacing WD of feature word x_j is calculated as:
Distance = L_j - F_j
where L_j is the position index of the last occurrence of feature word x_j in the text, F_j is the position index of its first occurrence, and count is the total number of tokens in the text after word segmentation.
For any feature word x_j ∈ c_i, the WDID-TFIDF word weight W(x_j), based on word spacing and inter-class contribution, is calculated accordingly.
All data are read from the input, each word weight is computed, and the weight matrix of the text representation of d_i is generated.
Step 2: filter unrelated words to address a shortcoming of the GloVe model itself, improving word-vector training quality, as shown in Fig. 2. It is implemented as follows:
Set the word-vector parameters for the text collection, including the word-vector dimensionality, context window size, and minimum word frequency. For each word w_i in the dictionary, compute the cosine value cos θ with every other word w_m in the text; when cos θ is less than 0, add w_m to the set S(m). Select the top N words in S(m), and compute the co-occurrence probability ratio λ between a word k in the context window of the given size and the selected target words w_i, w_m. Filter the unrelated or noise words out of the generated matrix according to the co-occurrence probability ratio, obtain a new co-occurrence matrix M, and input it into GloVe to obtain new word vectors. In detail:
In the GloVe loss function, each element X_ij of the co-occurrence matrix X denotes the number of times word j appears in the context window of target word i; X_i = Σ_k X_ik is the sum of the i-th row of X, i.e. the total number of occurrences of all context words in the window of target word i; and P_ij = P(j|i) = X_ij / X_i is the probability that word j appears around word i. Let w_1 be ice and w_2 be steam:
Table 1 gives the co-occurrence probabilities of the two target words with different words k, extracted from Wikipedia.
Table 1: Co-occurrence probabilities of target words w_1 = ice and w_2 = steam with context word k

k:                    solid        gas          water        fashion
P(k|ice):             1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
P(k|steam):           2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
P(k|ice)/P(k|steam):  8.9          8.5 × 10⁻²   1.36         0.96
When the word co-occurring with w_1, w_2 in the corresponding contexts is gas, the probability of gas appearing given ice is 6.6 × 10⁻⁵, the probability of gas given steam is 7.8 × 10⁻⁴, and the ratio of the probability of gas given ice to the probability of gas given steam is 8.5 × 10⁻².
When the co-occurring word is solid, the probability of solid given ice is 1.9 × 10⁻⁴, the probability of solid given steam is 2.2 × 10⁻⁵, and the ratio of the two is 8.9.
When the co-occurring word is fashion, the probability of fashion given ice is 1.7 × 10⁻⁵, the probability of fashion given steam is 1.8 × 10⁻⁵, and the ratio is 0.96.
When the co-occurring word is water, the probability of water given ice is 3.0 × 10⁻³, the probability of water given steam is 2.2 × 10⁻³, and the ratio is 1.36.
Now consider w_1, w_2 with different context words k, i.e. k = gas, solid, fashion, water. For k = gas, gas is clearly unrelated to ice and more related to steam, so the ratio P(k|ice)/P(k|steam) of the probability of gas given ice to the probability of gas given steam is much smaller than 1. For k = solid, solid is related to ice and unrelated to steam, so the ratio of the probability of solid given ice to the probability of solid given steam is much larger than 1. In general, the lower the semantic similarity between an introduced word k and one of w_1, w_2, the farther the co-occurrence probability ratio is from 1. But for k = fashion, which is unrelated to both w_1 and w_2, P(k|ice)/P(k|steam) is close to 1; and for k = water, which is related to both ice and steam, the ratio is also close to 1. That is, the more semantically similar the words in a context are, the closer their co-occurrence probability ratio is to 1.
When w_i and w_m are semantically dissimilar and the context is known to contain a word k, the co-occurrence probability ratio reveals whether k is irrelevant information. For words w_i and w_m whose context of the given window size contains word k, the co-occurrence probability ratio λ is defined as follows:
If words w_i and w_m are dissimilar, then for a given context word k:
(1) when the co-occurrence probability ratio λ ≈ 1, k is an unrelated word;
(2) when λ >> 1 or λ << 1, k is semantically similar to one of w_i, w_m.
Therefore, words w_m dissimilar to word w_i are selected to filter unrelated words; the word-vector similarity is computed from the word vectors corresponding to w_i and w_m.
That is, the smaller the cosine value of two words, the more dissimilar their contexts and the farther apart their semantics. Hence dissimilar words are selected by cosine value: from the set S(m) of all words whose cosine distance to w_i is less than 0, N dissimilar words w_m are chosen at random to filter the unrelated words in the context of w_i. Filtering the unrelated words out of the co-occurrence matrix reduces the number of its nonzero elements and yields a new co-occurrence matrix M, which is input into GloVe.
During GloVe model training, the word-vector dimensionality features_num of the text content is set to 300, the context window size context to 10, and the minimum word frequency min_count to 50; the co-occurrence probability ratio λ is set to 15 and the number of introduced unrelated words N to 15. Finally, the corresponding word vector of each word is obtained.
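Collected as one configuration fragment, the hyper-parameters quoted above look as follows; the dict keys other than features_num, context and min_count are illustrative names, not the patent's.

```python
# Hyper-parameters of the WT-GloVe training run as stated in the text.
glove_config = {
    "features_num": 300,   # word-vector dimensionality
    "context": 10,         # context window size
    "min_count": 50,       # minimum word frequency kept in the vocabulary
    "lambda_ratio": 15,    # co-occurrence probability ratio setting (illustrative key)
    "unrelated_N": 15,     # number of introduced unrelated words (illustrative key)
}
```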
Step 3: according to the results of step 2, select the word-spacing and inter-class-distribution feature weights from step 1 and multiply them with the word vectors, obtaining the weighted word-vector model that is the final document representation. In detail:
From step 2, the word vector of feature word x_j in text d_i = {x_1, x_2, ..., x_m} is x_j = (v_{1,j}, v_{2,j}, ..., v_{t,j}), where v_{i,j} is the value of the word vector of x_j in the i-th feature dimension. Combined with the WDID-TFIDF value of each feature word computed in step 1, the text representation based on the WT-GloVe weighted word-vector model is:
x_j' = (v_{1,j}, v_{2,j}, ..., v_{t,j}) · W(x_j)
where t is the word-vector dimensionality of feature word x_j. This finally yields the text representation of the 20NewsGroups dataset constructed from WT-GloVe word vectors.
As shown in Fig. 3, the accuracy of TF-IDF initially exceeds that of the other two algorithms. As the sample count increases, the growth of TF-IDF slowly declines, and beyond a certain sample count its accuracy begins to fall. For GloVe and WT-GloVe, accuracy rises overall with the number of samples: below 500 samples both are lower than traditional TF-IDF, while after 500 both improve markedly. At that point the accuracy of GloVe reaches 79.02% and that of WT-GloVe reaches 82.86%. Between 2000 and 6400 samples, from 3200 onward the accuracy of WDID-TFIDF, the weighting scheme improved from the traditional one, drops slightly by 0.88% just as TF-IDF does, causing the growth of WT-GloVe to fall below that of GloVe; viewed over the whole trend, however, the accuracy of WT-GloVe remains comparatively good. Comparing Word2Vec, Word2Vec_TFIDF and WT-GloVe, the early growth of Word2Vec and Word2Vec_TFIDF is better than that of WT-GloVe, with accuracies of 73.73% and 74.25% respectively, because GloVe can exert its superiority in capturing semantics only on large datasets. After 1500 samples the growth of WT-GloVe immediately exceeds Word2Vec_TFIDF and Word2Vec, with an accuracy of 89.72%. Taken as a whole, although the accuracy of WT-GloVe is below that of the other algorithms on small sample counts, it later outperforms TF-IDF, Word2Vec, Word2Vec_TFIDF and GloVe overall. Fig. 4 compares the accuracy of the five methods on 12283 samples: TF-IDF, GloVe and WT-GloVe, and Word2Vec, Word2Vec_TFIDF and WT-GloVe.
As shown in Fig. 4, observing GloVe and WT-GloVe, at 2000 samples the accuracy of both is lower than traditional TF-IDF, at 80.32% and 82.58% respectively; as the samples increase, the accuracy of both rises. The growth of WT-GloVe slows markedly after 2000 samples, because beyond a certain dataset size WDID-TFIDF, like TF-IDF, drops in accuracy by 0.64%, while the growth of GloVe is unaffected during this period. As the sample count increases further, the growth of both GloVe and WT-GloVe slows while accuracy keeps rising, reaching 84.54% and 86.29% respectively at 9000 samples. Comparing Word2Vec, Word2Vec_TFIDF and WT-GloVe as the sample count increases, the initial accuracy of WT-GloVe is 74.64%, and it stays above Word2Vec and Word2Vec_TFIDF throughout.
Claims (6)
1. A document representation method based on WT-GloVe word-vector construction, characterized in that it is implemented according to the following steps:
step 1: assess the importance of each feature word of the network text by computing its word spacing, judge its contribution to classification from its inter-class distribution, and combine the two into the WDID-TFIDF feature weighting model of word spacing and inter-class distribution;
step 2: filter unrelated words to address a shortcoming of the GloVe model itself, improving word-vector training quality;
step 3: according to the results of step 2, select the word-spacing and inter-class-distribution feature weights from step 1 and multiply them with the word vectors, obtaining a weighted word-vector model that is the final document representation.
2. The document representation method based on WT-GloVe word-vector construction according to claim 1, characterized in that step 1 is implemented as follows:
load the 20NewsGroups dataset, import the modules required for the GloVe model, and set the training-data storage path and encoding format; define functions, introduce a general English stop-word list, tokenize the loaded dataset, read the acquired text content into a file line by line, perform text preprocessing with the spacy toolkit, and complete part-of-speech tagging for subsequent filtering; using the WDID-TFIDF statistical model, compute the word weights and generate the weight matrix.
3. The text representation method based on WT-GloVe word vector construction according to claim 2, characterized in that Step 1 is implemented according to the following steps:
Given the data set 20NewsGroups, first perform data preprocessing, including stop-word removal and stemming and morphological analysis; the results include position markers and word-frequency statistics. Second, read in, line by line, the tokenized data set D = {d_1, d_2, ..., d_n} with class set C = {c_1, c_2, ..., c_i, ..., c_k}, where d_i = {x_1, x_2, ..., x_j, ..., x_m} and x_j ∈ c_i. The inter-class discrimination ID of feature word x_j for class c_i is expressed as:
Take the maximum value of feature word x_j over all classes as its contribution to that class;
where the value W(x_j|c_i) denotes the separating capacity of feature word x_j for class c_i, the frequency of x_j in c_i is TF(x_j|c_i), and the number of texts that do not belong to class c_i but contain x_j is
Normalize the resulting W_D(x_j|c):
This yields the inter-class discrimination value of feature word x_j ∈ c_i. The word distance WD of feature word x_j is computed as:
Distance = L_j − F_j
where L_j is the index of the last occurrence of feature word x_j in the text, F_j is the index of its first occurrence, and Count is the total number of tokens in the segmented text;
For any feature word x_j ∈ c_i, the WDID-TFIDF term weight W(x_j), based on word distance and inter-class contribution, is computed as:
Read in the whole data set according to the input content, compute each term weight, and generate the weight matrix of the text representation of d_i.
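The word-distance part of this claim (Distance = L_j − F_j, with Count the total token count) can be sketched as follows. Dividing by Count is one plausible normalization consistent with the claim's definitions, since the exact WD formula image is not reproduced in this text:

```python
def word_distance(tokens):
    """Word-spacing (WD) sketch: for each token, Distance = L_j - F_j,
    i.e. last-occurrence index minus first-occurrence index, here
    normalized by Count, the total token count of the segmented text."""
    first, last = {}, {}
    for idx, tok in enumerate(tokens):
        first.setdefault(tok, idx)  # F_j: index of first occurrence
        last[tok] = idx             # L_j: index of last occurrence
    count = len(tokens)
    return {tok: (last[tok] - first[tok]) / count for tok in first}

tokens = "the cat sat on the mat near the door".split()
wd = word_distance(tokens)
# "the" occurs at positions 0, 4, 7 -> Distance = 7 - 0 over 9 tokens
```

Words spread across the whole text get a large WD, while words appearing only once get 0, matching the claim's idea that spread-out feature words matter more.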
4. The text representation method based on WT-GloVe word vector construction according to claim 3, characterized in that Step 2 is implemented according to the following steps:
Choose the word vector settings for the corpus, including word vector dimension, word window size, and minimum word-frequency count; for each word w_i in the dictionary, compute the cos θ value with every other word w_m in the text, and add w_m to the set S(m) when cos θ is less than 0; select the top N words in S(m), and compute the co-occurrence probability ratio λ between a word k in the context of the given window size and the selected target words w_i, w_m; filter the irrelevant or noise words from the generated matrix according to the co-occurrence probability ratio, obtain the new co-occurrence matrix M, and input it into GloVe to obtain new word vectors.
5. The text representation method based on WT-GloVe word vector construction according to claim 4, characterized in that Step 2 is implemented according to the following steps:
In the GloVe loss function, each element X_ij of the co-occurrence matrix X denotes the number of times word j appears within the context window of target word i; X_i is the sum of the i-th row of X, i.e. the total number of times any context word appears within the window of target word i; and P_ij = P(j|i) = X_ij / X_i denotes the probability that word j appears around word i. Let w_1 be ice and w_2 be steam:
When the word co-occurring with w_1 and w_2 in the corresponding contexts is gas, the probability that gas appears given that ice appears is 6.6 × 10⁻⁵, the probability that gas appears given that steam appears is 7.8 × 10⁻⁴, and the ratio of the former to the latter is 8.5 × 10⁻²;
When the co-occurring word is solid, the probability that solid appears given ice is 1.9 × 10⁻⁴, the probability that solid appears given steam is 2.2 × 10⁻⁵, and the ratio of the two probabilities is 8.9;
When the co-occurring word is fashion, the probability that fashion appears given ice is 1.7 × 10⁻⁵, the probability that fashion appears given steam is 1.8 × 10⁻⁵, and the ratio is 0.96;
When the co-occurring word is water, the probability that water appears given ice is 3.0 × 10⁻³, the probability that water appears given steam is 2.2 × 10⁻³, and the ratio is 1.36;
When w_1 and w_2 are unrelated and the context word k takes the different values k = gas, solid, fashion, water: for k = gas, gas is clearly unrelated to ice and more related to steam, so the ratio P(k|ice)/P(k|steam) of the probability of gas given ice to the probability of gas given steam is much smaller than 1; for k = solid, solid is related to ice and unrelated to steam, so the ratio of the probability of solid given ice to the probability of solid given steam is much larger than 1; the more semantically similar the introduced word k is to only one of w_1 and w_2, the farther the co-occurrence probability ratio lies from 1. But for k = fashion, fashion is unrelated to both w_1 and w_2, and P(k|ice)/P(k|steam) is close to 1; for k = water, water is related to both ice and steam, and the ratio is likewise close to 1. That is, the more similar the semantic contributions of k to the two contexts, the closer the co-occurrence probability ratio is to 1.
When w_i and w_m are semantically dissimilar and the context is known to contain a word k, whether k is irrelevant can be determined from the co-occurrence probability ratio. For words w_i and w_m whose contexts of the given window size contain the word k, the co-occurrence probability ratio is:
λ = P(k|w_i) / P(k|w_m)
If the words w_i and w_m are dissimilar, then for a given context word k:
(1) when the co-occurrence probability ratio λ ≈ 1, k is an irrelevant word;
(2) when the co-occurrence probability ratio λ >> 1 or λ << 1, k is semantically similar to one of w_i or w_m;
Therefore, a w_m dissimilar to w_i is chosen to filter irrelevant words. The word vector similarity is computed as:
cos θ = (v_i · v_m) / (‖v_i‖ ‖v_m‖)
where v_i and v_m are the word vectors corresponding to w_i and w_m;
That is, the smaller the cosine value of two words, the more dissimilar their contexts and the farther apart their semantics; hence dissimilar words are selected by the cosine value as a general criterion: N dissimilar words w_m are randomly selected from the set S(m) of all words whose cosine distance to w_i is less than 0 and used to filter the irrelevant words in the context of w_i, reducing the number of non-zero elements in the co-occurrence matrix; the irrelevant words in the co-occurrence matrix are filtered out, and the new co-occurrence matrix M is obtained and input into GloVe.
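The filtering rule of claims 4–5 can be sketched as below. The co-occurrence counts are invented toy values echoing the ice/steam illustration, and the tolerance for "λ close to 1" is an assumed threshold, not one specified by the patent:

```python
import math

def cosine(u, v):
    """cos(theta) between two word vectors: (v_i . v_m) / (|v_i| |v_m|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def cooccur_ratio(X, wi, wm, k):
    """lambda = P(k | wi) / P(k | wm) from a row-indexed count dict X."""
    p_ki = X[wi][k] / sum(X[wi].values())
    p_km = X[wm][k] / sum(X[wm].values())
    return p_ki / p_km

def is_irrelevant(X, wi, wm, k, tol=0.25):
    """k counts as irrelevant when lambda ~ 1, i.e. |log(lambda)| is small.
    The tolerance 0.25 is an assumption for illustration."""
    return abs(math.log(cooccur_ratio(X, wi, wm, k))) < tol

# Invented toy counts mimicking the ice/steam example of the claim.
X = {
    "ice":   {"solid": 90, "gas": 5,  "fashion": 10, "water": 95},
    "steam": {"solid": 5,  "gas": 90, "fashion": 10, "water": 95},
}
```

With these counts, λ for solid is 18 (related to ice only), λ for gas is about 0.056 (related to steam only), and λ for fashion and water is 1, so only the latter two would be filtered as irrelevant.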
6. The text representation method based on WT-GloVe word vector construction according to claim 5, characterized in that Step 3 is implemented according to the following steps:
From Step 2, the word vector of feature word x_j in the text d_i = {x_1, x_2, ..., x_m} is x_j = (v_{1,j}, v_{2,j}, ..., v_{t,j}), where v_{i,j} is the value of the word vector of x_j in the i-th feature dimension. Combining this with the WDID-TFIDF value of each feature word computed in Step 1, the text representation based on the WT-GloVe word vector weighting model is:
x_j' = (v_{1,j}, v_{2,j}, ..., v_{t,j}) · W(x_j)
where t is the word vector dimension of feature word x_j. This finally yields the text representation of the 20NewsGroups data set built from WT-GloVe word vectors.
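The weighting step of claim 6 is a scalar-vector product per feature word. A minimal sketch follows; the 3-dimensional vectors and WDID-TFIDF weights are invented toy values, and mean-pooling the weighted vectors into a document vector is one common aggregation, not necessarily the one used in the patent:

```python
def weight_word_vector(vec, w):
    """x_j' = (v_1j, ..., v_tj) * W(x_j): scale each dimension by the term weight."""
    return [v * w for v in vec]

def document_vector(word_vecs, weights):
    """Aggregate the weighted word vectors of a text into one document
    vector by mean pooling (an assumed aggregation for illustration)."""
    weighted = [weight_word_vector(v, weights[word])
                for word, v in word_vecs.items()]
    dims = len(next(iter(word_vecs.values())))
    n = len(weighted)
    return [sum(vec[d] for vec in weighted) / n for d in range(dims)]

# Toy vectors and WDID-TFIDF weights (illustrative only).
word_vecs = {"glove": [1.0, 0.0, 2.0], "vector": [0.0, 2.0, 2.0]}
weights = {"glove": 0.5, "vector": 1.5}
doc = document_vector(word_vecs, weights)
```

High-weight feature words thus dominate the document vector, which is the intent of combining word distance and inter-class distribution into the weight W(x_j).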
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910573695.5A CN110348497B (en) | 2019-06-28 | 2019-06-28 | Text representation method constructed based on WT-GloVe word vector |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110348497A true CN110348497A (en) | 2019-10-18 |
CN110348497B CN110348497B (en) | 2021-09-10 |
Family
ID=68176994
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111143510A (en) * | 2019-12-10 | 2020-05-12 | 广东电网有限责任公司 | Searching method based on latent semantic analysis model |
CN113486176A (en) * | 2021-07-08 | 2021-10-08 | 桂林电子科技大学 | News classification method based on secondary feature amplification |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101706780A (en) * | 2009-09-03 | 2010-05-12 | 北京交通大学 | Image semantic retrieving method based on visual attention model |
CN101944099A (en) * | 2010-06-24 | 2011-01-12 | 西北工业大学 | Method for automatically classifying text documents by utilizing body |
US20130253906A1 (en) * | 2012-03-26 | 2013-09-26 | Verizon Patent And Licensing Inc. | Environment sensitive predictive text entry |
CN103336806A (en) * | 2013-06-24 | 2013-10-02 | 北京工业大学 | Method for sequencing keywords based on entropy difference between word-spacing-appearing internal mode and external mode |
CN106156772A (en) * | 2015-03-25 | 2016-11-23 | 佳能株式会社 | For determining the method and apparatus of word spacing and for the method and system of participle |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
CN109189925A (en) * | 2018-08-16 | 2019-01-11 | 华南师范大学 | Term vector model based on mutual information and based on the file classification method of CNN |
CN109271517A (en) * | 2018-09-29 | 2019-01-25 | 东北大学 | IG TF-IDF Text eigenvector generates and file classification method |
CN109933670A (en) * | 2019-03-19 | 2019-06-25 | 中南大学 | A kind of file classification method calculating semantic distance based on combinatorial matrix |
Non-Patent Citations (5)
Title |
---|
CASPER HANSEN et al.: "Contextually Propagated Term Weights for Document Representation", arXiv *
JEFFREY PENNINGTON et al.: "GloVe: Global Vectors for Word Representation", Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing *
MATT J. KUSNER et al.: "From Word Embeddings To Document Distances", Proceedings of the International Conference on Machine Learning *
ZHANG Dong et al.: "Semi-supervised question classification method based on answer assistance", Computer Engineering and Science *
LI Feng et al.: "Research on multi-feature sentence similarity calculation fusing word vectors", Journal of Frontiers of Computer Science and Technology *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||