CN110348497A - A document representation method based on WT-GloVe word-vector construction - Google Patents
A document representation method based on WT-GloVe word-vector construction
- Publication number: CN110348497A
- Application number: CN201910573695.5A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F18/24—Classification techniques
- G06F40/20—Natural language analysis
Abstract
The invention discloses a document representation method based on WT-GloVe word-vector construction. First, the importance of each feature word of a network text is assessed by computing its word spacing, and its contribution to classification is judged from its inter-class distribution; the two are combined into a feature weighting model of word spacing and inter-class distribution, called WDID-TFIDF. Then, to address a shortcoming of the GloVe model itself, unrelated words are filtered out, improving the quality of word-vector training. Finally, according to the filtering results, the word-spacing and inter-class-distribution feature weights are selected and multiplied with the word vectors, yielding a weighted word-vector model that serves as the final document representation. The invention solves the problems of traditional document representation methods in the prior art, which are computationally complex or represent text information insufficiently.
Description
Technical field
The invention belongs to the fields of natural language processing, data mining and text classification, and in particular relates to a document representation method based on WT-GloVe word-vector construction.
Background technique
The rapidly developing Internet industry has driven the emergence of social networks, the mobile Internet and related industries, and the continually growing number of websites worldwide generates an explosive volume of information. Spam filtering for e-mail, question categorization in question-answering systems, query identification in search engines, sentiment polarity judgment of product reviews on shopping websites, mass opinion analysis in government systems, new-topic discovery in social media, and network public-opinion monitoring all demand continually updated techniques for processing ultra-large text datasets. At the same time, higher standards are imposed on computer storage and processing capacity. How to efficiently process the potential knowledge in massive text data, organize massive information, and help users find the content they need is a major current challenge. Text classification, a key technology of information processing, has become a research hotspot in academia and is widely applied in many fields. How to represent text information accurately and how to construct a suitable classification model have become the two core problems of the classification task.
Traditional text representation is usually based on the vector space model or on the TF-IDF model, which provide a simple representation of text by learning a large number of text features: the model assigns relatively high weights to low-frequency words and relatively low weights to high-frequency words. In information-theoretic terms, this weighs the information each word in the vocabulary conveys; the weighting includes a logarithmic re-scaling of each word's frequency in the document, and this logarithm linearizes the exponential distribution of word types across the corpus.
As data scale expands, the dimensionality of text features can reach tens of thousands or more. The vector space model, one of the classical text representation methods, yields text vectors that are high-dimensional and sparse. Moreover, the semantics of its features are atomic, so semantic relations between features cannot be measured. In text represented by the vector space model, the dimensionality equals the number of features.
Text classification has been a research hotspot since it was first proposed, and many scholars have conducted in-depth studies of text representation, feature-space dimensionality, classifiers and other aspects. In summary, improvements to classification methods start from two directions: first, improvements based on traditional text classification techniques; second, improvements based on neural-network text classification methods.
Summary of the invention
The object of the present invention is to provide a document representation method based on WT-GloVe word-vector construction, solving the problems of traditional document representation methods in the prior art, which are computationally complex or represent text information insufficiently.
The technical scheme adopted by the invention is a document representation method based on WT-GloVe word-vector construction, implemented according to the following steps:
Step 1: assess the importance of each feature word of the network text by computing its word spacing, and judge its contribution to classification from its inter-class distribution; combine the two into a feature weighting model of word spacing and inter-class distribution, called WDID-TFIDF;
Step 2: filter out unrelated words to address a shortcoming of the GloVe model itself, thereby improving word-vector training quality;
Step 3: according to the results of step 2, select the word-spacing and inter-class-distribution feature weights from step 1 and multiply them with the word vectors, obtaining a weighted word-vector model that is the final document representation.
The invention is further characterized as follows.
Step 1 is implemented as follows:
Load the 20NewsGroups dataset, import the modules required for the GloVe model, and set the training-data storage path and encoding format. Define functions, introduce a general English stop-word list, and tokenize the loaded dataset; read the acquired text content into a file line by line, perform text preprocessing with the spacy toolkit, and complete part-of-speech tagging to facilitate subsequent filtering. Using the WDID-TFIDF statistical model, compute the word weights and generate the weight matrix.
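The preprocessing just described can be sketched as follows. This is a minimal stand-in: it uses a tiny hand-rolled stop-word list and a regex tokenizer instead of the spacy pipeline and full English stop-word list named above, and it records the first/last position and frequency of each word for the later word-spacing computation.

```python
# Minimal preprocessing sketch: tokenize, drop stop words, and record each
# remaining word's first/last position and frequency.
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}  # tiny stand-in list

def preprocess(text):
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stats = {}
    for pos, tok in enumerate(tokens):
        if tok not in stats:
            stats[tok] = {"first": pos, "last": pos, "freq": 0}
        stats[tok]["last"] = pos
        stats[tok]["freq"] += 1
    return tokens, stats

tokens, stats = preprocess("The GloVe model improves the GloVe vectors of a corpus.")
print(stats["glove"])  # first/last positions and frequency of 'glove'
```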
Step 1 in detail:
Given the dataset 20NewsGroups, first carry out data preprocessing including stop-word removal and stemming analysis; the results include position markers and word-frequency statistics. Next, read in line by line the tokenized dataset D = {d_1, d_2, ..., d_n} with category set C = {c_1, c_2, ..., c_i, ..., c_k}, where d_i = {x_1, x_2, ..., x_j, ..., x_m} and x_j ∈ c_i. The inter-class discrimination ID of feature word x_j for category c_i is expressed as follows:
Take the maximum value of feature word x_j over all the categories as its contribution to that category.
Here, the value W(x_j|c_i) indicates the separating capacity of feature word x_j for category c_i; the frequency of x_j appearing in c_i is TF(x_j|c_i), and the formula also uses the number of texts that do not belong to category c_i but contain x_j.
The resulting W_D(x_j|c) is normalized:
This gives the inter-class discrimination value of feature word x_j ∈ c_i. The word spacing WD of feature word x_j is calculated as:
Distance = L_j - F_j
where L_j is the position index of the last occurrence of feature word x_j in the text, F_j is the position index of its first occurrence, and count is the total number of tokens in the text after word segmentation.
For any feature word x_j ∈ c_i, the WDID-TFIDF word weight W(x_j), based on word spacing and inter-class contribution, is calculated accordingly.
All data are read from the input, each word weight is computed, and the weight matrix of the text representation of d_i is generated.
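A sketch of the WDID-TFIDF weighting under stated assumptions: only the word-spacing formula Distance = L_j - F_j (normalized by the token count) appears explicitly in the text, while the patent's exact expressions for the inter-class discrimination ID and the combined weight W(x_j) are not reproduced in this translation, so the `inter_class_id` and `wdid_weight` forms below are plausible reconstructions, not the patent's actual formulas.

```python
# Hypothetical reconstruction of the WDID-TFIDF weight. Only the word
# spacing Distance = L_j - F_j is stated explicitly in the text; the ID
# term and the final product are labeled assumptions for illustration.

def word_distance(stats, word, count):
    # normalized word spacing: (last - first occurrence) / token count
    s = stats[word]
    return (s["last"] - s["first"]) / max(count, 1)

def inter_class_id(word, docs_by_class):
    # assumed form: term frequency inside a class, penalized by how many
    # documents outside the class contain the word; take the max over classes
    all_docs = [d for docs in docs_by_class.values() for d in docs]
    best = 0.0
    for c, docs in docs_by_class.items():
        tf = sum(d.count(word) for d in docs)
        outside = sum(1 for d in all_docs if word in d) - sum(1 for d in docs if word in d)
        best = max(best, tf / (1 + outside))
    return best

def wdid_weight(stats, word, count, docs_by_class, tfidf):
    # combined weight: word spacing x inter-class discrimination x TF-IDF
    return word_distance(stats, word, count) * inter_class_id(word, docs_by_class) * tfidf

stats = {"glove": {"first": 0, "last": 3, "freq": 2}}
docs_by_class = {"c1": [["glove", "model"]], "c2": [["corpus"]]}
w = wdid_weight(stats, "glove", 6, docs_by_class, tfidf=1.2)
```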
Step 2 is implemented as follows:
Set the word-vector parameters for the text collection, including the word-vector dimensionality, context window size, and minimum word frequency. For each word w_i in the dictionary, compute the cosine value cos θ with every other word w_m in the text; when cos θ is less than 0, add w_m to the set S(m). Select the top N words in S(m), and compute the co-occurrence probability ratio λ between a word k in the context window of the given size and the selected target words w_i, w_m. Filter the unrelated or noise words out of the generated matrix according to the co-occurrence probability ratio, obtain a new co-occurrence matrix M, and input it into GloVe to obtain new word vectors.
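The filtering loop just described can be sketched as follows, with invented toy vectors and counts: words with negative cosine similarity to the target form the candidate set S(m), and context words whose co-occurrence probability ratio against a dissimilar word is close to 1 are zeroed out of the matrix. The tolerance `tol` is an illustrative parameter, not one named in the text.

```python
# Sketch of unrelated-word filtering via cosine similarity and the
# co-occurrence probability ratio. Vectors and counts are invented.
import math

def cosine(u, v):
    # cosine similarity of two vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dissimilar_words(target, vectors, n):
    # S(m): words whose cosine similarity with the target is below 0; keep n
    cands = [w for w, v in vectors.items() if w != target and cosine(vectors[target], v) < 0]
    return cands[:n]

def filter_cooccurrence(X, target, vectors, n, tol=0.25):
    # zero out context words k of `target` whose co-occurrence probability
    # ratio against a dissimilar word w_m is close to 1 (unrelated words)
    for wm in dissimilar_words(target, vectors, n):
        tot_t = sum(X[target].values())
        tot_m = max(sum(X[wm].values()), 1)
        for k in list(X[target]):
            p_t = X[target][k] / tot_t
            p_m = X[wm].get(k, 0) / tot_m
            if p_m > 0 and abs(p_t / p_m - 1) < tol:
                X[target][k] = 0
    return X

vectors = {"ice": [1.0, 0.0], "hot": [-1.0, 0.1]}          # toy word vectors
X = {"ice": {"fashion": 2, "solid": 90},
     "hot": {"fashion": 2, "solid": 3, "heat": 87}}        # toy co-occurrence counts
filtered = filter_cooccurrence(X, "ice", vectors, n=15)
```

Here "fashion" co-occurs with the dissimilar pair at nearly equal rates (ratio ≈ 1), so it is filtered from the row of "ice", while "solid" survives.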
Step 2 in detail:
In the GloVe loss function, each element X_ij of the co-occurrence matrix X denotes the number of times word j appears in the context window of target word i; X_i = Σ_k X_ik is the sum of the i-th row of X, i.e. the total number of occurrences of all context words in the window of target word i; and P_ij = P(j|i) = X_ij / X_i is the probability that word j appears around word i. Let w_1 be ice and w_2 be steam:
When the word co-occurring with w_1, w_2 in the corresponding contexts is gas, the probability of gas appearing given ice is 6.6 × 10⁻⁵, the probability of gas given steam is 7.8 × 10⁻⁴, and the ratio of the probability of gas given ice to the probability of gas given steam is 8.5 × 10⁻².
When the co-occurring word is solid, the probability of solid given ice is 1.9 × 10⁻⁴, the probability of solid given steam is 2.2 × 10⁻⁵, and the ratio of the two is 8.9.
When the co-occurring word is fashion, the probability of fashion given ice is 1.7 × 10⁻⁵, the probability of fashion given steam is 1.8 × 10⁻⁵, and the ratio is 0.96.
When the co-occurring word is water, the probability of water given ice is 3.0 × 10⁻³, the probability of water given steam is 2.2 × 10⁻³, and the ratio is 1.36.
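The probability ratios above can be reproduced mechanically from any co-occurrence matrix; the sketch below uses invented counts (not the Wikipedia statistics quoted in the text) to show how P(j|i) = X_ij / X_i and the ratio behave for related and unrelated context words.

```python
# Toy co-occurrence matrix; vocabulary and counts are invented for illustration.
vocab = ["ice", "steam", "solid", "gas", "water", "fashion"]
idx = {w: i for i, w in enumerate(vocab)}
# X[i][j]: times word j appears in the context window of target word i
X = [
    [0, 2, 90, 3, 60, 2],    # ice
    [2, 0, 3, 80, 50, 2],    # steam
    [90, 3, 0, 1, 5, 1],     # solid
    [3, 80, 1, 0, 5, 1],     # gas
    [60, 50, 5, 5, 0, 2],    # water
    [2, 2, 1, 1, 2, 0],      # fashion
]

def P(j, i):
    # P(j | i) = X_ij / X_i
    row = X[idx[i]]
    return row[idx[j]] / sum(row)

def ratio(k, w1, w2):
    # co-occurrence probability ratio P(k | w1) / P(k | w2)
    return P(k, w1) / P(k, w2)

print(ratio("solid", "ice", "steam"))    # >> 1: solid is related to ice only
print(ratio("gas", "ice", "steam"))      # << 1: gas is related to steam only
print(ratio("fashion", "ice", "steam"))  # ~ 1: fashion is unrelated to both
```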
Now consider w_1, w_2 with different context words k, i.e. k = gas, solid, fashion, water. For k = gas, gas is clearly unrelated to ice and more related to steam, so the ratio P(k|ice)/P(k|steam) of the probability of gas given ice to the probability of gas given steam is much smaller than 1. For k = solid, solid is related to ice and unrelated to steam, so the ratio of the probability of solid given ice to the probability of solid given steam is much larger than 1. In general, the lower the semantic similarity between an introduced word k and one of w_1, w_2, the farther the co-occurrence probability ratio is from 1. But for k = fashion, which is unrelated to both w_1 and w_2, P(k|ice)/P(k|steam) is close to 1; and for k = water, which is related to both ice and steam, the ratio is also close to 1. That is, the more semantically similar the words in a context are, the closer their co-occurrence probability ratio is to 1.
When w_i and w_m are semantically dissimilar and the context is known to contain a word k, the co-occurrence probability ratio reveals whether k is irrelevant information. For words w_i and w_m whose context of the given window size contains word k, the co-occurrence probability ratio λ is defined as follows:
If words w_i and w_m are dissimilar, then for a given context word k:
(1) when the co-occurrence probability ratio λ ≈ 1, k is an unrelated word;
(2) when λ >> 1 or λ << 1, k is semantically similar to one of w_i, w_m.
Therefore, words w_m dissimilar to word w_i are selected to filter unrelated words; the word-vector similarity is computed from the word vectors corresponding to w_i and w_m.
That is, the smaller the cosine value of two words, the more dissimilar their contexts and the farther apart their semantics. Hence dissimilar words are selected by cosine value: from the set S(m) of all words whose cosine distance to w_i is less than 0, N dissimilar words w_m are chosen at random to filter the unrelated words in the context of w_i. Filtering the unrelated words out of the co-occurrence matrix reduces the number of its nonzero elements and yields a new co-occurrence matrix M, which is input into GloVe.
Step 3 in detail:
From step 2, the word vector of feature word x_j in text d_i = {x_1, x_2, ..., x_m} is x_j = (v_{1,j}, v_{2,j}, ..., v_{t,j}), where v_{i,j} is the value of the word vector of x_j in the i-th feature dimension. Combined with the WDID-TFIDF value of each feature word computed in step 1, the text representation based on the WT-GloVe weighted word-vector model is:
x_j' = (v_{1,j}, v_{2,j}, ..., v_{t,j}) · W(x_j)
where t is the word-vector dimensionality of feature word x_j. This finally yields the text representation of the 20NewsGroups dataset constructed from WT-GloVe word vectors.
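The scalar weighting of step 3 amounts to scaling each word vector by its WDID-TFIDF weight W(x_j); stacking the scaled vectors row by row gives the weight matrix of the text. The vectors and weights below are toy values.

```python
# Scaling each word vector by its WDID-TFIDF weight and stacking the rows.
def weight_vector(vec, w):
    # x' = (v_1, ..., v_t) * W(x): elementwise scaling by the scalar weight
    return [v * w for v in vec]

def document_matrix(word_vectors, weights):
    # one weighted row per feature word of the document
    return [weight_vector(word_vectors[x], weights[x]) for x in word_vectors]

wv = {"glove": [0.1, -0.2, 0.3], "model": [0.0, 0.5, -0.1]}  # toy word vectors
W = {"glove": 2.0, "model": 0.5}                             # toy WDID-TFIDF weights
doc = document_matrix(wv, W)
```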
The invention has the following advantages. In this document representation method based on WT-GloVe word-vector construction, the importance of each feature word is assessed by computing its word spacing, its contribution to classification is judged from its inter-class distribution, and the two are combined into a feature weighting model of word spacing and inter-class distribution. Weighting word vectors with the TF-IDF algorithm alone can improve classification, but it ignores how feature words are distributed across the classes, even though the distribution of a feature word over the categories reflects its ability and contribution to distinguishing classes. The present invention computes an inter-class discrimination value for each feature word from its distribution over the categories; at the same time, word spacing is added as a further weighting scheme for feature terms. The distance between the first and last appearance of a word or phrase in a text is called its word spacing: the larger the spacing, the wider the range over which the word is mentioned, and the more important it is to the theme of the text, so the word-spacing value also represents the importance of a feature word to its text. The corpus is weighted with this scheme of word spacing and inter-class distribution, and combined with the GloVe model after unrelated-word filtering to generate word-vector representations. Capturing both the importance and the semantics of the feature words, this constitutes a new weighted word-vector model that improves the final classification performance.
Brief description of the drawings
Fig. 1 is a flow chart of the research process of the document representation method based on WT-GloVe word-vector construction according to the present invention;
Fig. 2 is the main flow chart of the WT-GloVe weighted word-vector model experiment;
Fig. 3 compares the accuracy of the five methods TF-IDF, Word2Vec, GloVe, Word2Vec_TFIDF and WT-GloVe on 9100 samples;
Fig. 4 compares the accuracy of the five methods TF-IDF, Word2Vec, GloVe, Word2Vec_TFIDF and WT-GloVe on 12283 samples.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The document representation method based on WT-GloVe word-vector construction of the present invention, whose flow chart is shown in Fig. 1, is implemented according to the following steps.
Step 1: assess the importance of each feature word of the network text by computing its word spacing, and judge its contribution to classification from its inter-class distribution; combine the two into the WDID-TFIDF feature weighting model of word spacing and inter-class distribution. Step 1 is implemented as follows:
Load the 20NewsGroups dataset, import the modules required for the GloVe model, and set the training-data storage path and encoding format. Define functions, introduce a general English stop-word list, and tokenize the loaded dataset; read the acquired text content into a file line by line, perform text preprocessing with the spacy toolkit, and complete part-of-speech tagging for subsequent filtering. Using the WDID-TFIDF statistical model, compute the word weights and generate the weight matrix. In detail:
Given the dataset 20NewsGroups, first carry out data preprocessing including stop-word removal and stemming analysis; the results include position markers and word-frequency statistics. Next, read in line by line the tokenized dataset D = {d_1, d_2, ..., d_n} with category set C = {c_1, c_2, ..., c_i, ..., c_k}, where d_i = {x_1, x_2, ..., x_j, ..., x_m} and x_j ∈ c_i. The inter-class discrimination ID of feature word x_j for category c_i is expressed as follows:
Take the maximum value of feature word x_j over all the categories as its contribution to that category.
Here, the value W(x_j|c_i) indicates the separating capacity of feature word x_j for category c_i; the frequency of x_j appearing in c_i is TF(x_j|c_i), and the formula also uses the number of texts that do not belong to category c_i but contain x_j.
The resulting W_D(x_j|c) is normalized:
This gives the inter-class discrimination value of feature word x_j ∈ c_i. The word spacing WD of feature word x_j is calculated as:
Distance = L_j - F_j
where L_j is the position index of the last occurrence of feature word x_j in the text, F_j is the position index of its first occurrence, and count is the total number of tokens in the text after word segmentation.
For any feature word x_j ∈ c_i, the WDID-TFIDF word weight W(x_j), based on word spacing and inter-class contribution, is calculated accordingly.
All data are read from the input, each word weight is computed, and the weight matrix of the text representation of d_i is generated.
Step 2: filter unrelated words to address a shortcoming of the GloVe model itself, improving word-vector training quality, as shown in Fig. 2. It is implemented as follows:
Set the word-vector parameters for the text collection, including the word-vector dimensionality, context window size, and minimum word frequency. For each word w_i in the dictionary, compute the cosine value cos θ with every other word w_m in the text; when cos θ is less than 0, add w_m to the set S(m). Select the top N words in S(m), and compute the co-occurrence probability ratio λ between a word k in the context window of the given size and the selected target words w_i, w_m. Filter the unrelated or noise words out of the generated matrix according to the co-occurrence probability ratio, obtain a new co-occurrence matrix M, and input it into GloVe to obtain new word vectors. In detail:
In the GloVe loss function, each element X_ij of the co-occurrence matrix X denotes the number of times word j appears in the context window of target word i; X_i = Σ_k X_ik is the sum of the i-th row of X, i.e. the total number of occurrences of all context words in the window of target word i; and P_ij = P(j|i) = X_ij / X_i is the probability that word j appears around word i. Let w_1 be ice and w_2 be steam:
Table 1 gives the co-occurrence probabilities of the two target words with different words k, extracted from Wikipedia.
Table 1: Co-occurrence probabilities of target words w_1 = ice and w_2 = steam with context word k

k:                    solid        gas          water        fashion
P(k|ice):             1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
P(k|steam):           2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
P(k|ice)/P(k|steam):  8.9          8.5 × 10⁻²   1.36         0.96
When the word co-occurring with w_1, w_2 in the corresponding contexts is gas, the probability of gas appearing given ice is 6.6 × 10⁻⁵, the probability of gas given steam is 7.8 × 10⁻⁴, and the ratio of the probability of gas given ice to the probability of gas given steam is 8.5 × 10⁻².
When the co-occurring word is solid, the probability of solid given ice is 1.9 × 10⁻⁴, the probability of solid given steam is 2.2 × 10⁻⁵, and the ratio of the two is 8.9.
When the co-occurring word is fashion, the probability of fashion given ice is 1.7 × 10⁻⁵, the probability of fashion given steam is 1.8 × 10⁻⁵, and the ratio is 0.96.
When the co-occurring word is water, the probability of water given ice is 3.0 × 10⁻³, the probability of water given steam is 2.2 × 10⁻³, and the ratio is 1.36.
Now consider w_1, w_2 with different context words k, i.e. k = gas, solid, fashion, water. For k = gas, gas is clearly unrelated to ice and more related to steam, so the ratio P(k|ice)/P(k|steam) of the probability of gas given ice to the probability of gas given steam is much smaller than 1. For k = solid, solid is related to ice and unrelated to steam, so the ratio of the probability of solid given ice to the probability of solid given steam is much larger than 1. In general, the lower the semantic similarity between an introduced word k and one of w_1, w_2, the farther the co-occurrence probability ratio is from 1. But for k = fashion, which is unrelated to both w_1 and w_2, P(k|ice)/P(k|steam) is close to 1; and for k = water, which is related to both ice and steam, the ratio is also close to 1. That is, the more semantically similar the words in a context are, the closer their co-occurrence probability ratio is to 1.
When w_i and w_m are semantically dissimilar and the context is known to contain a word k, the co-occurrence probability ratio reveals whether k is irrelevant information. For words w_i and w_m whose context of the given window size contains word k, the co-occurrence probability ratio λ is defined as follows:
If words w_i and w_m are dissimilar, then for a given context word k:
(1) when the co-occurrence probability ratio λ ≈ 1, k is an unrelated word;
(2) when λ >> 1 or λ << 1, k is semantically similar to one of w_i, w_m.
Therefore, words w_m dissimilar to word w_i are selected to filter unrelated words; the word-vector similarity is computed from the word vectors corresponding to w_i and w_m.
That is, the smaller the cosine value of two words, the more dissimilar their contexts and the farther apart their semantics. Hence dissimilar words are selected by cosine value: from the set S(m) of all words whose cosine distance to w_i is less than 0, N dissimilar words w_m are chosen at random to filter the unrelated words in the context of w_i. Filtering the unrelated words out of the co-occurrence matrix reduces the number of its nonzero elements and yields a new co-occurrence matrix M, which is input into GloVe.
During GloVe model training, the word-vector dimensionality features_num of the text content is set to 300, the context window size context to 10, and the minimum word frequency min_count to 50; the co-occurrence probability ratio λ is set to 15 and the number of introduced unrelated words N to 15. Finally, the corresponding word vector of each word is obtained.
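Collected as one configuration fragment, the hyper-parameters quoted above look as follows; the dict keys other than features_num, context and min_count are illustrative names, not the patent's.

```python
# Hyper-parameters of the WT-GloVe training run as stated in the text.
glove_config = {
    "features_num": 300,   # word-vector dimensionality
    "context": 10,         # context window size
    "min_count": 50,       # minimum word frequency kept in the vocabulary
    "lambda_ratio": 15,    # co-occurrence probability ratio setting (illustrative key)
    "unrelated_N": 15,     # number of introduced unrelated words (illustrative key)
}
```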
Step 3: according to the results of step 2, select the word-spacing and inter-class-distribution feature weights from step 1 and multiply them with the word vectors, obtaining the weighted word-vector model that is the final document representation. In detail:
From step 2, the word vector of feature word x_j in text d_i = {x_1, x_2, ..., x_m} is x_j = (v_{1,j}, v_{2,j}, ..., v_{t,j}), where v_{i,j} is the value of the word vector of x_j in the i-th feature dimension. Combined with the WDID-TFIDF value of each feature word computed in step 1, the text representation based on the WT-GloVe weighted word-vector model is:
x_j' = (v_{1,j}, v_{2,j}, ..., v_{t,j}) · W(x_j)
where t is the word-vector dimensionality of feature word x_j. This finally yields the text representation of the 20NewsGroups dataset constructed from WT-GloVe word vectors.
As shown in Fig. 3, the accuracy of TF-IDF initially exceeds that of the other two algorithms. As the sample count increases, the growth of TF-IDF slowly declines, and beyond a certain sample count its accuracy begins to fall. For GloVe and WT-GloVe, accuracy rises overall with the number of samples: below 500 samples both are lower than traditional TF-IDF, while after 500 both improve markedly. At that point the accuracy of GloVe reaches 79.02% and that of WT-GloVe reaches 82.86%. Between 2000 and 6400 samples, from 3200 onward the accuracy of WDID-TFIDF, the weighting scheme improved from the traditional one, drops slightly by 0.88% just as TF-IDF does, causing the growth of WT-GloVe to fall below that of GloVe; viewed over the whole trend, however, the accuracy of WT-GloVe remains comparatively good. Comparing Word2Vec, Word2Vec_TFIDF and WT-GloVe, the early growth of Word2Vec and Word2Vec_TFIDF is better than that of WT-GloVe, with accuracies of 73.73% and 74.25% respectively, because GloVe can exert its superiority in capturing semantics only on large datasets. After 1500 samples the growth of WT-GloVe immediately exceeds Word2Vec_TFIDF and Word2Vec, with an accuracy of 89.72%. Taken as a whole, although the accuracy of WT-GloVe is below that of the other algorithms on small sample counts, it later outperforms TF-IDF, Word2Vec, Word2Vec_TFIDF and GloVe overall. Fig. 4 compares the accuracy of the five methods on 12283 samples: TF-IDF, GloVe and WT-GloVe, and Word2Vec, Word2Vec_TFIDF and WT-GloVe.
As shown in Fig. 4, observing GloVe and WT-GloVe, at 2000 samples the accuracy of both is lower than traditional TF-IDF, at 80.32% and 82.58% respectively; as the samples increase, the accuracy of both rises. The growth of WT-GloVe slows markedly after 2000 samples, because beyond a certain dataset size WDID-TFIDF, like TF-IDF, drops in accuracy by 0.64%, while the growth of GloVe is unaffected during this period. As the sample count increases further, the growth of both GloVe and WT-GloVe slows while accuracy keeps rising, reaching 84.54% and 86.29% respectively at 9000 samples. Comparing Word2Vec, Word2Vec_TFIDF and WT-GloVe as the sample count increases, the initial accuracy of WT-GloVe is 74.64%, and it stays above Word2Vec and Word2Vec_TFIDF throughout.
Claims (6)
1. A document representation method based on WT-GloVe word-vector construction, characterized in that it is implemented according to the following steps:
step 1: assess the importance of each feature word of the network text by computing its word spacing, judge its contribution to classification from its inter-class distribution, and combine the two into the WDID-TFIDF feature weighting model of word spacing and inter-class distribution;
step 2: filter unrelated words to address a shortcoming of the GloVe model itself, improving word-vector training quality;
step 3: according to the results of step 2, select the word-spacing and inter-class-distribution feature weights from step 1 and multiply them with the word vectors, obtaining a weighted word-vector model that is the final document representation.
2. The document representation method based on WT-GloVe word-vector construction according to claim 1, characterized in that step 1 is implemented as follows:
load the 20NewsGroups dataset, import the modules required for the GloVe model, and set the training-data storage path and encoding format; define functions, introduce a general English stop-word list, tokenize the loaded dataset, read the acquired text content into a file line by line, perform text preprocessing with the spacy toolkit, and complete part-of-speech tagging for subsequent filtering; using the WDID-TFIDF statistical model, compute the word weights and generate the weight matrix.
3. The text representation method based on WT-GloVe word vector construction according to claim 2, characterized in that Step 1 is implemented according to the following steps:
Given the data set 20NewsGroups, first perform data preprocessing, including stop-word removal and stemming and morphological analysis; the results include position markers and word-frequency statistics. Second, read in, line by line, the tokenized data set D = {d_1, d_2, ..., d_n} with class set C = {c_1, c_2, ..., c_i, ..., c_k}, where d_i = {x_1, x_2, ..., x_j, ..., x_m} and x_j ∈ c_i. The inter-class discrimination ID of feature word x_j for class c_i is expressed as:
Take the maximum value of feature word x_j over all classes as its contribution to that class;
where the value W(x_j|c_i) denotes the separating capacity of feature word x_j for class c_i, the frequency of x_j in c_i is TF(x_j|c_i), and the number of texts that do not belong to class c_i but contain x_j is
Normalize the resulting W_D(x_j|c):
This yields the inter-class discrimination value of feature word x_j ∈ c_i. The word distance WD of feature word x_j is computed as:
Distance = L_j − F_j
where L_j is the index of the last occurrence of feature word x_j in the text, F_j is the index of its first occurrence, and Count is the total number of tokens in the segmented text;
For any feature word x_j ∈ c_i, the WDID-TFIDF term weight W(x_j), based on word distance and inter-class contribution, is computed as:
Read in the whole data set according to the input content, compute each term weight, and generate the weight matrix of the text representation of d_i.
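The word-distance part of this claim (Distance = L_j − F_j, with Count the total token count) can be sketched as follows. Dividing by Count is one plausible normalization consistent with the claim's definitions, since the exact WD formula image is not reproduced in this text:

```python
def word_distance(tokens):
    """Word-spacing (WD) sketch: for each token, Distance = L_j - F_j,
    i.e. last-occurrence index minus first-occurrence index, here
    normalized by Count, the total token count of the segmented text."""
    first, last = {}, {}
    for idx, tok in enumerate(tokens):
        first.setdefault(tok, idx)  # F_j: index of first occurrence
        last[tok] = idx             # L_j: index of last occurrence
    count = len(tokens)
    return {tok: (last[tok] - first[tok]) / count for tok in first}

tokens = "the cat sat on the mat near the door".split()
wd = word_distance(tokens)
# "the" occurs at positions 0, 4, 7 -> Distance = 7 - 0 over 9 tokens
```

Words spread across the whole text get a large WD, while words appearing only once get 0, matching the claim's idea that spread-out feature words matter more.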
4. The text representation method based on WT-GloVe word vector construction according to claim 3, characterized in that Step 2 is implemented according to the following steps:
Choose the word vector settings for the corpus, including word vector dimension, word window size, and minimum word-frequency count; for each word w_i in the dictionary, compute the cos θ value with every other word w_m in the text, and add w_m to the set S(m) when cos θ is less than 0; select the top N words in S(m), and compute the co-occurrence probability ratio λ between a word k in the context of the given window size and the selected target words w_i, w_m; filter the irrelevant or noise words from the generated matrix according to the co-occurrence probability ratio, obtain the new co-occurrence matrix M, and input it into GloVe to obtain new word vectors.
5. The text representation method based on WT-GloVe word vector construction according to claim 4, characterized in that Step 2 is implemented according to the following steps:
In the GloVe loss function, each element X_ij of the co-occurrence matrix X denotes the number of times word j appears within the context window of target word i; X_i is the sum of the i-th row of X, i.e. the total number of times any context word appears within the window of target word i; and P_ij = P(j|i) = X_ij / X_i denotes the probability that word j appears around word i. Let w_1 be ice and w_2 be steam:
When the word co-occurring with w_1 and w_2 in the corresponding contexts is gas, the probability that gas appears given that ice appears is 6.6 × 10⁻⁵, the probability that gas appears given that steam appears is 7.8 × 10⁻⁴, and the ratio of the former to the latter is 8.5 × 10⁻²;
When the co-occurring word is solid, the probability that solid appears given ice is 1.9 × 10⁻⁴, the probability that solid appears given steam is 2.2 × 10⁻⁵, and the ratio of the two probabilities is 8.9;
When the co-occurring word is fashion, the probability that fashion appears given ice is 1.7 × 10⁻⁵, the probability that fashion appears given steam is 1.8 × 10⁻⁵, and the ratio is 0.96;
When the co-occurring word is water, the probability that water appears given ice is 3.0 × 10⁻³, the probability that water appears given steam is 2.2 × 10⁻³, and the ratio is 1.36;
When w_1 and w_2 are unrelated and the context word k takes the different values k = gas, solid, fashion, water: for k = gas, gas is clearly unrelated to ice and more related to steam, so the ratio P(k|ice)/P(k|steam) of the probability of gas given ice to the probability of gas given steam is much smaller than 1; for k = solid, solid is related to ice and unrelated to steam, so the ratio of the probability of solid given ice to the probability of solid given steam is much larger than 1; the more semantically similar the introduced word k is to only one of w_1 and w_2, the farther the co-occurrence probability ratio lies from 1. But for k = fashion, fashion is unrelated to both w_1 and w_2, and P(k|ice)/P(k|steam) is close to 1; for k = water, water is related to both ice and steam, and the ratio is likewise close to 1. That is, the more similar the semantic contributions of k to the two contexts, the closer the co-occurrence probability ratio is to 1.
When w_i and w_m are semantically dissimilar and the context is known to contain a word k, whether k is irrelevant can be determined from the co-occurrence probability ratio. For words w_i and w_m whose contexts of the given window size contain the word k, the co-occurrence probability ratio is:
λ = P(k|w_i) / P(k|w_m)
If the words w_i and w_m are dissimilar, then for a given context word k:
(1) when the co-occurrence probability ratio λ ≈ 1, k is an irrelevant word;
(2) when the co-occurrence probability ratio λ >> 1 or λ << 1, k is semantically similar to one of w_i or w_m;
Therefore, a w_m dissimilar to w_i is chosen to filter irrelevant words. The word vector similarity is computed as:
cos θ = (v_i · v_m) / (‖v_i‖ ‖v_m‖)
where v_i and v_m are the word vectors corresponding to w_i and w_m;
That is, the smaller the cosine value of two words, the more dissimilar their contexts and the farther apart their semantics; hence dissimilar words are selected by the cosine value as a general criterion: N dissimilar words w_m are randomly selected from the set S(m) of all words whose cosine distance to w_i is less than 0 and used to filter the irrelevant words in the context of w_i, reducing the number of non-zero elements in the co-occurrence matrix; the irrelevant words in the co-occurrence matrix are filtered out, and the new co-occurrence matrix M is obtained and input into GloVe.
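The filtering rule of claims 4–5 can be sketched as below. The co-occurrence counts are invented toy values echoing the ice/steam illustration, and the tolerance for "λ close to 1" is an assumed threshold, not one specified by the patent:

```python
import math

def cosine(u, v):
    """cos(theta) between two word vectors: (v_i . v_m) / (|v_i| |v_m|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def cooccur_ratio(X, wi, wm, k):
    """lambda = P(k | wi) / P(k | wm) from a row-indexed count dict X."""
    p_ki = X[wi][k] / sum(X[wi].values())
    p_km = X[wm][k] / sum(X[wm].values())
    return p_ki / p_km

def is_irrelevant(X, wi, wm, k, tol=0.25):
    """k counts as irrelevant when lambda ~ 1, i.e. |log(lambda)| is small.
    The tolerance 0.25 is an assumption for illustration."""
    return abs(math.log(cooccur_ratio(X, wi, wm, k))) < tol

# Invented toy counts mimicking the ice/steam example of the claim.
X = {
    "ice":   {"solid": 90, "gas": 5,  "fashion": 10, "water": 95},
    "steam": {"solid": 5,  "gas": 90, "fashion": 10, "water": 95},
}
```

With these counts, λ for solid is 18 (related to ice only), λ for gas is about 0.056 (related to steam only), and λ for fashion and water is 1, so only the latter two would be filtered as irrelevant.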
6. The text representation method based on WT-GloVe word vector construction according to claim 5, characterized in that Step 3 is implemented according to the following steps:
From Step 2, the word vector of feature word x_j in the text d_i = {x_1, x_2, ..., x_m} is x_j = (v_{1,j}, v_{2,j}, ..., v_{t,j}), where v_{i,j} is the value of the word vector of x_j in the i-th feature dimension. Combining this with the WDID-TFIDF value of each feature word computed in Step 1, the text representation based on the WT-GloVe word vector weighting model is:
x_j' = (v_{1,j}, v_{2,j}, ..., v_{t,j}) · W(x_j)
where t is the word vector dimension of feature word x_j. This finally yields the text representation of the 20NewsGroups data set built from WT-GloVe word vectors.
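The weighting step of claim 6 is a scalar-vector product per feature word. A minimal sketch follows; the 3-dimensional vectors and WDID-TFIDF weights are invented toy values, and mean-pooling the weighted vectors into a document vector is one common aggregation, not necessarily the one used in the patent:

```python
def weight_word_vector(vec, w):
    """x_j' = (v_1j, ..., v_tj) * W(x_j): scale each dimension by the term weight."""
    return [v * w for v in vec]

def document_vector(word_vecs, weights):
    """Aggregate the weighted word vectors of a text into one document
    vector by mean pooling (an assumed aggregation for illustration)."""
    weighted = [weight_word_vector(v, weights[word])
                for word, v in word_vecs.items()]
    dims = len(next(iter(word_vecs.values())))
    n = len(weighted)
    return [sum(vec[d] for vec in weighted) / n for d in range(dims)]

# Toy vectors and WDID-TFIDF weights (illustrative only).
word_vecs = {"glove": [1.0, 0.0, 2.0], "vector": [0.0, 2.0, 2.0]}
weights = {"glove": 0.5, "vector": 1.5}
doc = document_vector(word_vecs, weights)
```

High-weight feature words thus dominate the document vector, which is the intent of combining word distance and inter-class distribution into the weight W(x_j).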
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910573695.5A CN110348497B (en) | 2019-06-28 | 2019-06-28 | Text representation method constructed based on WT-GloVe word vector |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110348497A true CN110348497A (en) | 2019-10-18 |
CN110348497B CN110348497B (en) | 2021-09-10 |
Family
ID=68176994
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111143510A (en) * | 2019-12-10 | 2020-05-12 | 广东电网有限责任公司 | Searching method based on latent semantic analysis model |
CN113486176A (en) * | 2021-07-08 | 2021-10-08 | 桂林电子科技大学 | News classification method based on secondary feature amplification |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101706780A (en) * | 2009-09-03 | 2010-05-12 | 北京交通大学 | Image semantic retrieving method based on visual attention model |
CN101944099A (en) * | 2010-06-24 | 2011-01-12 | 西北工业大学 | Method for automatically classifying text documents by utilizing body |
US20130253906A1 (en) * | 2012-03-26 | 2013-09-26 | Verizon Patent And Licensing Inc. | Environment sensitive predictive text entry |
CN103336806A (en) * | 2013-06-24 | 2013-10-02 | 北京工业大学 | Method for sequencing keywords based on entropy difference between word-spacing-appearing internal mode and external mode |
CN106156772A (en) * | 2015-03-25 | 2016-11-23 | 佳能株式会社 | For determining the method and apparatus of word spacing and for the method and system of participle |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
CN109189925A (en) * | 2018-08-16 | 2019-01-11 | 华南师范大学 | Term vector model based on mutual information and based on the file classification method of CNN |
CN109271517A (en) * | 2018-09-29 | 2019-01-25 | 东北大学 | IG TF-IDF Text eigenvector generates and file classification method |
CN109933670A (en) * | 2019-03-19 | 2019-06-25 | 中南大学 | A kind of file classification method calculating semantic distance based on combinatorial matrix |
Non-Patent Citations (5)
Title |
---|
CASPER HANSEN et al.: "Contextually Propagated Term Weights for Document Representation", arXiv *
JEFFREY PENNINGTON et al.: "GloVe: Global Vectors for Word Representation", Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing *
MATT J. KUSNER et al.: "From Word Embeddings To Document Distances", Proceedings of the International Conference on Machine Learning *
ZHANG Dong et al.: "Semi-supervised question classification method based on answer assistance", Computer Engineering and Science *
LI Feng et al.: "Research on multi-feature sentence similarity calculation fusing word vectors", Journal of Frontiers of Computer Science and Technology *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||