CN106294684A - Text classification method based on word vectors, and terminal device - Google Patents
- Publication number: CN106294684A
- Application number: CN201610639589.9A
- Authority
- CN
- China
- Prior art keywords
- text
- word vector
- to-be-classified
- training sample
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the communications field and discloses a text classification method based on word vectors, and a terminal device. In the embodiments of the present invention, a continuous bag-of-words model (CBOW) is used to compute, after word segmentation of the texts, a word-vector matrix containing every word vector; based on this matrix, the word vectors contained in each training sample of known text type are looked up and the feature vector of the training samples of each text type is computed; finally, the type of a text to be classified is determined from its feature vector and the feature vectors of the training samples of the various text types. In this way, the vectorization of each word takes into account its relationship to the several words before and after it, so the features of the whole text carry semantic information; training on the samples is efficient and takes little time; and determining the type of a text to be classified requires little computation, uses a simple and fast calculation, and achieves high precision.
Description
Technical field
The present invention relates to the field of information processing, and in particular to a text classification method based on word vectors, and a terminal device.
Background art
Text classification means analyzing a set of training texts that have been classified in advance by experts, deriving a classification model from the analysis, and using that model to classify other texts. It is mainly applied in information retrieval, machine translation, automatic summarization, information filtering, and so on. Many methods exist for training on text, such as the two training methods of the continuous bag-of-words model (CBOW): hierarchical softmax and negative sampling.
In the course of realizing the present invention, the inventors found that the CBOW training method based on hierarchical softmax is more favourable to rare words and can classify such texts faster, while the CBOW training method based on the negative-sampling algorithm is favourable for texts of common words and for low-dimensional vectors. For CBOW with either algorithm, the window size chosen during training is usually about 5. However, both methods require a large amount of computation when determining the type of a text, which hinders a fast implementation.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a text classification method based on word vectors, and a terminal device, so that the vectorization of each word takes into account its relationship to the several words before and after it, giving the features of the whole text semantic character; so that training on the samples is efficient and takes little time; and so that determining the type of a text to be classified requires little computation, a simple and fast calculation, and achieves high precision.
To solve the above technical problem, the embodiments of the present invention provide a text classification method based on word vectors, comprising:
inputting the data obtained after word segmentation of L texts into a continuous bag-of-words model (CBOW), and computing the word-vector matrix W_mn that contains every word vector w_mn;
after word segmentation of the training samples of M known text types, looking up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
computing, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type;
determining the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
where M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, …, K, and K is the number of text types.
The embodiments of the present invention also provide a terminal device, comprising:
a word-vector computing module, for inputting the data obtained after word segmentation of L texts into a continuous bag-of-words model (CBOW) and computing the word-vector matrix W_mn that contains every word vector w_mn;
a lookup module, for looking up, after word segmentation of the training samples of M known text types, the word vectors w_mn contained in each training sample in the word-vector matrix W_mn;
a training-sample feature-vector computing module, for computing the feature vector T_k of the training samples of each text type from the word vectors w_mn contained in each training sample;
a to-be-classified text type determining module, for determining the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
where M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, …, K, and K is the number of text types.
Compared with the prior art, the embodiments of the present invention use the continuous bag-of-words model CBOW to compute the word-vector matrix containing every word vector obtained after word segmentation of the texts, look up in this matrix the word vectors contained in each training sample of known text type, compute the feature vector of the training samples of each text type, and finally determine the type of a text to be classified from its feature vector and the feature vectors of the training samples of the various text types. In this way, the vectorization of each word takes into account its relationship to the several words before and after it, so the features of the whole text carry semantic information; training on the samples is efficient and takes little time; and determining the type of a text to be classified requires little computation, a simple and fast calculation, and achieves high precision.
In addition, the feature vector T_k of the training samples of each text type is computed by adding and averaging. With this add-and-average calculation, computing the feature vector of the training samples of each text type involves few operations and a simple, fast procedure.
In addition, the feature vector of a text to be classified is computed as follows: after word segmentation of the text to be classified, each word vector w_mn contained in the text is looked up in the word-vector matrix W_mn; from these word vectors, the feature vector D of the text is computed by adding and averaging. By looking up in W_mn the word vectors contained in the text to be classified, the existing word-vector matrix W_mn is fully reused, so each word vector is found quickly and accurately, which indirectly improves the efficiency of the whole classification process.
In addition, determining the type of the text to be classified from its feature vector and the feature vectors T_k of the training samples specifically includes: computing the cosine similarity between the feature vector of the text to be classified and the feature vector T_k of the training samples of each text type; the type of the text to be classified is the text type of the training samples whose cosine similarity is closest to 1. Determining the type by computing cosine similarities involves few operations, a simple calculation, and high precision.
Brief description of the drawings
Fig. 1 is a flowchart of a text classification method based on word vectors according to the first embodiment of the present invention;
Fig. 2 is a structural diagram of the CBOW model according to the first embodiment of the present invention;
Fig. 3 is a network-structure diagram of the CBOW model according to the first embodiment of the present invention;
Fig. 4 is a diagram of the CBOW model looking up a word according to the first embodiment of the present invention;
Fig. 5 is a diagram of the mapping constructed by the negative-sampling algorithm according to the first embodiment of the present invention;
Fig. 6 is a structural diagram of a terminal device according to the third embodiment of the present invention;
Fig. 7 is a structural diagram of a terminal device according to the fourth embodiment of the present invention.
Detailed description of the invention
To make the purpose, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are explained in detail below with reference to the accompanying drawings. Those skilled in the art will understand, however, that many technical details are given in the embodiments only to help the reader understand; even without these technical details, and with various changes and modifications based on the following embodiments, the technical solution claimed in the present application can still be realized.
The first embodiment of the present invention relates to a text classification method based on word vectors. The specific flow is shown in Fig. 1.
In step 101, the data obtained after word segmentation of L texts are input into the CBOW model, and the word-vector matrix is computed.
Specifically, the L texts are segmented into words and the resulting data are used as the input of CBOW, which computes the word vectors of these L texts and the matrix containing every word vector. A word vector is denoted w_mn and the word-vector matrix W_mn, where m is the number of words and n is the dimension of the word vectors. The concrete form of W_mn is:

W_mn = [ w_11  w_12  …  w_1n
         w_21  w_22  …  w_2n
         …
         w_m1  w_m2  …  w_mn ]

where the matrix holds the vectors of the m words, and every row of W_mn is the vector representation of one word.
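As a minimal sketch of the data structure (not of CBOW training itself), the matrix W_mn can be held as a mapping from each of the m distinct words to its n-dimensional vector. The random values below are stand-ins for what CBOW training would produce; all names are illustrative:

```python
import random

def build_word_vector_matrix(words, n, seed=0):
    """Stand-in for CBOW output: one n-dimensional row per distinct word."""
    rng = random.Random(seed)
    # W_mn: m rows (one per distinct word) by n columns (vector dimensions)
    return {w: [rng.uniform(-0.5, 0.5) for _ in range(n)]
            for w in dict.fromkeys(words)}  # dict.fromkeys dedups, keeps order

corpus_words = ["text", "classification", "with", "word", "vectors", "text"]
W = build_word_vector_matrix(corpus_words, n=4)
print(len(W), len(W["text"]))  # 5 4  (m = 5 distinct words, n = 4 dimensions)
```

In a real system the vectors would of course come from training CBOW on the segmented corpus rather than from a random initializer.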
Note that the CBOW model used in the embodiments of the present invention is CBOW based on the negative-sampling algorithm. Conventional CBOW comes in two types, CBOW based on hierarchical softmax and CBOW based on the negative-sampling algorithm, which are introduced below in turn.
The CBOW model based on hierarchical softmax comprises an input layer, a projection layer and an output layer. It predicts the current word w_t from its known context w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}, as shown in Fig. 2. Suppose a sample (Context(w), w) is such that Context(w) consists of the c words before and after w; the input layer then contains the word vectors of the 2c words of Context(w): v(Context(w)_1), v(Context(w)_2), …, v(Context(w)_{2c}) ∈ R^m, where m here denotes the length of a word vector. The projection layer sums the 2c input vectors: x_w = Σ_{i=1}^{2c} v(Context(w)_i); the structure is shown in Fig. 3. The output layer corresponds to a binary tree: a Huffman tree whose leaf nodes are the words occurring in the corpus and whose weights are the occurrence counts of those words. This Huffman tree has N = |D| leaf nodes, one for each word of the dictionary D, and N − 1 non-leaf nodes (the shaded nodes in the figure).
Hierarchical softmax is a key technique for improving performance in word-vector training. For a leaf node of the Huffman tree, assumed to correspond to the word w in the dictionary D, denote:
1) p^w: the path from the root node to the leaf node of w;
2) l^w: the number of nodes contained in path p^w;
3) p^w_1, p^w_2, …, p^w_{l^w}: the nodes of path p^w, where p^w_1 is the root node and p^w_{l^w} is the node corresponding to the word w;
4) d^w_2, d^w_3, …, d^w_{l^w} ∈ {0, 1}: the Huffman code of the word w, consisting of l^w − 1 bits, where d^w_j is the code bit of the j-th node on path p^w (the root node has no code);
5) θ^w_1, θ^w_2, …, θ^w_{l^w−1} ∈ R^m: the vectors of the non-leaf nodes on path p^w, where θ^w_j is the vector of the j-th non-leaf node.
Take the word w = "football" as an example to illustrate how, in the network structure of Fig. 3, the vector x_w ∈ R^m and the Huffman tree define the function p(w | Context(w)); the process is shown in Fig. 4. The five nodes strung together by the four dashed edges constitute the path p^w, of length l^w = 5; p^w_1, …, p^w_5 are the five nodes of the path, p^w_1 being the root node. The code bits d^w_2, d^w_3, d^w_4, d^w_5 are 1, 0, 0, 1, i.e. the Huffman code of "football" is 1001; θ^w_1, …, θ^w_4 are the vectors of the four non-leaf nodes on the path. From the root node to the leaf node "football", four branches are traversed (one per dashed edge), and each branch can be regarded as a binary classification. From the viewpoint of binary classification, each non-leaf node must assign a class to its children: one is the positive class (label 1) and the other the negative class (label 0). Apart from the root node, every node of the tree carries a Huffman code bit of 0 or 1, so a natural convention is to define nodes coded 0 as the positive class and nodes coded 1 as the negative class (the opposite convention could equally be adopted); here we set Label(p^w_j) = 1 − d^w_j, j = 2, 3, …, l^w. By logistic regression, the probability of a node being classified into the positive class is σ(x_w^T θ) = 1 / (1 + e^{−x_w^T θ}), and the probability of being classified into the negative class is 1 − σ(x_w^T θ), where θ is a parameter to be determined; here the vectors θ^w_j of the non-leaf nodes play the role of the parameter θ.
For the four binary classifications experienced from the root node to the leaf node "football", the probability of each classification result is:
1) 1st: p(d^w_2 | x_w, θ^w_1) = 1 − σ(x_w^T θ^w_1);
2) 2nd: p(d^w_3 | x_w, θ^w_2) = σ(x_w^T θ^w_2);
3) 3rd: p(d^w_4 | x_w, θ^w_3) = σ(x_w^T θ^w_3);
4) 4th: p(d^w_5 | x_w, θ^w_4) = 1 − σ(x_w^T θ^w_4).
Then p(football | Context(football)) = Π_{j=2}^{5} p(d^w_j | x_w, θ^w_{j−1}). So far, through the example of w = "football": for any word w in the dictionary D, there exists in the Huffman tree a path p^w from the root node to the node corresponding to w (and this path is unique). Path p^w has l^w − 1 branches; regarding each branch as a binary classification, each classification produces a probability, and multiplying these probabilities gives p(w | Context(w)). Written as a single expression:

p(w | Context(w)) = Π_{j=2}^{l^w} [σ(x_w^T θ^w_{j−1})]^{1 − d^w_j} · [1 − σ(x_w^T θ^w_{j−1})]^{d^w_j}.   (1)
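The per-branch probabilities and their product along the Huffman path can be sketched in a few lines of stdlib Python. The code bits 1, 0, 0, 1 follow the "football" example; the vectors x_w and θ are made-up, untrained values (chosen so that x_w^T θ = 0, making every factor exactly 0.5):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def p_word_given_context(x_w, code, thetas):
    """Product over the path: sigma(x^T theta) when the bit is 0, 1 - sigma when it is 1."""
    p = 1.0
    for d, theta in zip(code, thetas):
        s = sigmoid(dot(x_w, theta))
        p *= (1.0 - s) if d == 1 else s
    return p

x_w = [0.1, -0.2, 0.3]           # sum of context word vectors (illustrative)
thetas = [[0.2, 0.1, 0.0]] * 4   # made-up non-leaf node vectors; x_w . theta = 0
p = p_word_given_context(x_w, [1, 0, 0, 1], thetas)  # Huffman code 1001
print(round(p, 4))  # 0.0625, i.e. 0.5 ** 4
```

With trained vectors the four factors would of course differ, and the product would be the model's probability of "football" given its context.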
Suppose instead that Y_w = (y_{w,1}, y_{w,2}, …, y_{w,N})^T is a vector of length N whose components do not represent probabilities. If we want the component y_{w,i} to represent the probability that, when the context is Context(w), the next word is exactly the i-th word of the dictionary D, a softmax normalization is needed: p(w | Context(w)) = e^{y_{w,i_w}} / Σ_{i=1}^{N} e^{y_{w,i}}, where i_w is the index of the word w in the dictionary D.
Substituting formula (1) into the log-likelihood function L = Σ_{w∈C} log p(w | Context(w)) gives:

L = Σ_{w∈C} Σ_{j=2}^{l^w} { (1 − d^w_j) · log σ(x_w^T θ^w_{j−1}) + d^w_j · log[1 − σ(x_w^T θ^w_{j−1})] }.

Denote the summand l(w, j) = (1 − d^w_j) · log σ(x_w^T θ^w_{j−1}) + d^w_j · log[1 − σ(x_w^T θ^w_{j−1})]; L is the objective function of the CBOW model, and the word vectors are optimized by maximizing it with stochastic gradient ascent. Stochastic gradient ascent proceeds as follows: each time a sample (Context(w), w) is taken, all relevant parameters of the objective function are updated once. Observing the objective function l(w, j), its parameters include the vectors x_w and θ^w_{j−1}, so the gradients of l(w, j) with respect to these vectors are needed.
The gradient of l(w, j) with respect to θ^w_{j−1} is:
∂l(w, j)/∂θ^w_{j−1} = [1 − d^w_j − σ(x_w^T θ^w_{j−1})] · x_w.
The update formula for θ^w_{j−1} can then be written as:
θ^w_{j−1} := θ^w_{j−1} + η · [1 − d^w_j − σ(x_w^T θ^w_{j−1})] · x_w, where η is the learning rate.
Similarly, the gradient of l(w, j) with respect to x_w is:
∂l(w, j)/∂x_w = [1 − d^w_j − σ(x_w^T θ^w_{j−1})] · θ^w_{j−1}.
Here x_w is the cumulative sum of the word vectors of the words in Context(w), and the final goal is the word vector of every word in the dictionary D, so each context word vector is updated as:
v(ũ) := v(ũ) + η · Σ_{j=2}^{l^w} ∂l(w, j)/∂x_w, for every ũ ∈ Context(w).
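A single stochastic-gradient-ascent step of the update formulas above can be sketched as follows. The learning rate, code bits and vectors are illustrative values, not trained ones:

```python
import math

def sgd_step_hs(x_w, code, thetas, context_vecs, eta=0.025):
    """One CBOW + hierarchical-softmax update of the node vectors theta and the context word vectors."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    e = [0.0] * len(x_w)                  # accumulates eta * sum_j dl(w,j)/dx_w
    for d, theta in zip(code, thetas):
        s = sigmoid(sum(a * b for a, b in zip(x_w, theta)))
        g = eta * (1 - d - s)             # eta * (1 - d_j - sigma(x_w^T theta))
        for i in range(len(x_w)):
            e[i] += g * theta[i]          # gradient w.r.t. x_w, scaled by eta
            theta[i] += g * x_w[i]        # theta update
    for v in context_vecs:                # every word in Context(w) gets the same correction
        for i in range(len(v)):
            v[i] += e[i]

ctx = [[0.1, 0.0], [0.0, 0.1]]            # two context word vectors (toy values)
x = [a + b for a, b in zip(*ctx)]         # projection layer: their sum
sgd_step_hs(x, [1, 0], [[0.3, -0.1], [0.2, 0.2]], ctx)
print(ctx[0])
```

Both context vectors receive the identical correction e, mirroring the fact that x_w is their sum.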
In the CBOW model based on the negative-sampling algorithm, the context Context(w) of the word w is known and w has to be predicted; therefore, for a given Context(w), the word w is a positive sample and all other words are negative samples. Suppose a non-empty set of negative samples NEG(w) ≠ ∅ has been selected for w. For any word ũ in D, define the label L^w(ũ) = 1 if ũ = w and L^w(ũ) = 0 otherwise, i.e. the label of the positive sample is 1 and the label of a negative sample is 0.
For a given positive sample (Context(w), w), we want to maximize

g(w) = Π_{u ∈ {w} ∪ NEG(w)} p(u | Context(w)),   (2)

where p(u | Context(w)) = σ(x_w^T θ^u) if L^w(u) = 1, and 1 − σ(x_w^T θ^u) if L^w(u) = 0, which can also be written as the single expression

p(u | Context(w)) = [σ(x_w^T θ^u)]^{L^w(u)} · [1 − σ(x_w^T θ^u)]^{1 − L^w(u)}.   (3)

Here x_w still denotes the sum of the word vectors of the words in Context(w), and θ^u ∈ R^m is the vector corresponding to the word u. Substituting formula (3) into formula (2) gives

g(w) = σ(x_w^T θ^w) · Π_{u ∈ NEG(w)} [1 − σ(x_w^T θ^u)],

where σ(x_w^T θ^w) is the probability of predicting the centre word w when the context is Context(w), and σ(x_w^T θ^u) is the probability of predicting the centre word u. Maximizing g(w) therefore maximizes σ(x_w^T θ^w) while minimizing every σ(x_w^T θ^u), u ∈ NEG(w); that is, it increases the probability of the positive sample while decreasing the probabilities of the negative samples. For a given corpus C, the function G = Π_{w∈C} g(w) serves as the global optimization target; to simplify computation, the logarithm of G is taken, so the final objective function is:

L = log G = Σ_{w∈C} Σ_{u ∈ {w} ∪ NEG(w)} { L^w(u) · log σ(x_w^T θ^u) + [1 − L^w(u)] · log[1 − σ(x_w^T θ^u)] }.
Denote the summand of the above formula l(w, u) = L^w(u) · log σ(x_w^T θ^u) + [1 − L^w(u)] · log[1 − σ(x_w^T θ^u)]. Its parameters are likewise optimized with stochastic gradient ascent.
The gradient of l(w, u) with respect to θ^u is:
∂l(w, u)/∂θ^u = [L^w(u) − σ(x_w^T θ^u)] · x_w.
The update formula for θ^u can then be written as:
θ^u := θ^u + η · [L^w(u) − σ(x_w^T θ^u)] · x_w.
The gradient of l(w, u) with respect to x_w is:
∂l(w, u)/∂x_w = [L^w(u) − σ(x_w^T θ^u)] · θ^u.
Then, using ∂l(w, u)/∂x_w, the update formula for each context word vector is:
v(ũ) := v(ũ) + η · Σ_{u ∈ {w} ∪ NEG(w)} ∂l(w, u)/∂x_w, for every ũ ∈ Context(w).
Finally, the negative-sampling algorithm used in the embodiments of the present invention is briefly introduced. The words of the dictionary D occur in the corpus C with varying frequencies; for high-frequency words, the probability of being chosen as a negative sample should be larger, and conversely for low-frequency words it should be smaller. This is essentially a weighted sampling problem, and the concrete algorithm can be described as follows.
Suppose each word w in the dictionary D corresponds to a line segment l(w) of length

len(w) = counter(w) / Σ_{u∈D} counter(u),

where counter(·) denotes the number of times a word occurs in the corpus C (the sum in the denominator normalizes the lengths). Now join these segments end to end to form a unit segment of length 1. If points are thrown at random onto this unit segment, the longer segments (corresponding to high-frequency words) are hit with higher probability.
Denote l_0 = 0 and l_j = Σ_{i=1}^{j} len(w_i), j = 1, …, N, where w_j is the j-th word of the dictionary D; the l_j then serve as subdivision points of a non-equidistant partition of [0, 1], whose N subintervals are I_i = (l_{i−1}, l_i], i = 1, 2, …, N. Further introduce an equidistant partition of [0, 1] with subdivision points m_0, m_1, …, m_M, where M >> N, as shown in Fig. 5. Projecting the interior points m_1, …, m_{M−1} onto the non-equidistant partition (the dashed lines in Fig. 5) establishes the mapping

Table(i) = w_k, where m_i ∈ I_k, i = 1, 2, …, M − 1.

With this mapping, sampling proceeds as follows: each time, generate a random integer r in [1, M − 1] and take Table(r) as the sample; if, while sampling negatives for w_i, the word w_i itself is drawn, it is simply skipped.
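The weighted-sampling table described above can be sketched as follows. The corpus counts are made up, and note that word2vec implementations typically also raise the counts to the 3/4 power, which is omitted here to match the text:

```python
import random

def build_table(counts, M=1000):
    """Map M equidistant points on [0,1] onto segments proportional to word frequency."""
    total = float(sum(counts.values()))
    words = list(counts)
    table = []
    j = 0
    acc = counts[words[0]] / total           # l_1, right end of the first segment
    for i in range(1, M):                     # interior points m_1 .. m_{M-1}
        while i / M > acc:                    # advance to the segment I_k containing m_i
            j += 1
            acc += counts[words[j]] / total
        table.append(words[j])
    return table

table = build_table({"the": 80, "cat": 15, "rare": 5}, M=1000)
rng = random.Random(1)
sample = table[rng.randint(0, len(table) - 1)]      # one draw: Table(r)
print(table.count("the") > table.count("rare"))     # True: high-frequency words dominate
```

Drawing a uniform random index into the table is then equivalent to sampling words in proportion to their segment lengths.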
In step 102, the word vectors contained in the training samples are looked up in W_mn.
Specifically, the word vectors contained in each of the M training samples of known text type are looked up in the word-vector matrix W_mn. First the training samples of the M known text types are segmented into words (M ≤ L here, to prevent the segmentation result of a training sample from being absent from the matrix W_mn obtained in step 101); then the word vectors w_mn contained in each training sample are looked up in W_mn, yielding the word vectors w_mn of each training sample of every known text type.
In step 103, the feature vector of the training samples of each text type is computed.
Specifically, from the word vectors w_mn of each training sample obtained in step 102, the feature vector T_k of the training samples of each text type is computed by adding and averaging, where k = 1, 2, …, K and K is the number of text types. Suppose there are an entertainment class, a technology class and a finance class, with feature vectors T_1, T_2, T_3 respectively; then:

T_1 = [avg(w_11 + w_21 + … + w_e1), avg(w_12 + w_22 + … + w_e2), …, avg(w_1n + w_2n + … + w_en)] = [W_ent1, W_ent2, …, W_entn]
T_2 = [avg(w_11 + w_31 + … + w_i1), avg(w_12 + w_32 + … + w_i2), …, avg(w_1n + w_3n + … + w_in)] = [W_tech1, W_tech2, …, W_techn]
T_3 = [avg(w_11 + … + w_f1 + … + w_i1), avg(w_12 + … + w_f2 + … + w_i2), …, avg(w_1n + … + w_fn + … + w_in)] = [W_fina1, W_fina2, …, W_finan]

where e denotes the e-th word, i the i-th word and f the f-th word, and
W_ent1 = avg(w_11 + w_21 + … + w_e1), W_ent2 = avg(w_12 + w_22 + … + w_e2), …, W_entn = avg(w_1n + w_2n + … + w_en),
W_tech1 = avg(w_11 + w_31 + … + w_i1), W_tech2 = avg(w_12 + w_32 + … + w_i2), …, W_techn = avg(w_1n + w_3n + … + w_in),
W_fina1 = avg(w_11 + … + w_f1 + … + w_i1), W_fina2 = avg(w_12 + … + w_f2 + … + w_i2), …, W_finan = avg(w_1n + … + w_fn + … + w_in).
If there are other classes, the feature vector of each is obtained in the same way.
If there being other classifications, each type of characteristic vector in like manner can be obtained.
It should be noted that T1∈[w1,w2,…,we], T2∈[w1,w3,…,wi], T3∈[w1,…,wf,…,wi], its
In, each apoplexy due to endogenous wind w1,w2,…wmFor the set of word, each different text type are formed by different or that part is identical word combination,
And characteristic vector T1、T2、T3In, the element in each avg (), with WmnThe order of middle m word is unrelated, it is not required that continuously, only
Relevant with the word of the composition in current text type, therefore, when seeking each class text type, from matrix WmnMiddle lookup is correlated with
Word also calculates accordingly.
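The add-and-average feature vector of one class can be sketched directly; the word vectors below are toy values standing in for rows of W_mn:

```python
def class_feature_vector(word_vectors):
    """Element-wise average of the word vectors belonging to one text type."""
    n = len(word_vectors[0])          # dimension of the word vectors
    m = len(word_vectors)             # number of words in this class
    return [sum(v[i] for v in word_vectors) / m for i in range(n)]

entertainment = [[1.0, 2.0], [3.0, 4.0]]   # w_1, w_2 of the entertainment class (toy)
T1 = class_feature_vector(entertainment)
print(T1)  # [2.0, 3.0]
```

The same function computes T_2, T_3, … by passing in each class's own word vectors.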
In step 104, the word vectors contained in the text to be classified are looked up in W_mn.
Specifically, the text to be classified is first segmented into words; then each word vector w_mn contained in this text is looked up in the word-vector matrix W_mn.
In step 105, the feature vector of the text to be classified is computed.
Specifically, from the word vectors w_mn obtained in step 104, the feature vector of the text to be classified is computed by adding and averaging.
Suppose that after word segmentation the current text to be classified consists of the word set (w_1, w_2, …, w_l), i.e. a sample to be classified made up of l words. Looking up the vectors of the corresponding words (w_1, w_2, …, w_l) in the word-vector matrix W_mn, the feature vector of the current text to be classified is:

D_1 = [avg(w_11 + w_21 + … + w_l1), avg(w_12 + w_22 + … + w_l2), …, avg(w_1n + w_2n + … + w_ln)] = [d_11, d_12, …, d_1n],

where the subscript n of d is the dimension of the word vectors and the subscript 1 denotes the current (first) text; when there are several texts, this index may be any natural number.
In step 106, the cosine similarity values are computed.
Specifically, the cosine similarity between the feature vector D of the text to be classified and the feature vector T_k of the training samples of each text type is computed as

cos θ_k = (D · T_k) / (|D| · |T_k|).

The cosine similarity between the current text to be classified and the feature vector of the entertainment training samples is cos θ_1 = (D · T_1) / (|D| · |T_1|); for the technology training samples it is cos θ_2 = (D · T_2) / (|D| · |T_2|); for the finance training samples it is cos θ_3 = (D · T_3) / (|D| · |T_3|). In the same way, the cosine similarity between the sample to be classified and the feature vector of the training samples of any other text type can be obtained.
In step 107, the type of the text to be classified is determined.
Specifically, the type of the text to be classified is determined from the cosine similarity values obtained in step 106: the type of the sample to be classified is the text type of the training samples whose cosine similarity value is closest to 1.
It is worth noting that several texts to be classified form a set of texts to be classified, in which x denotes the text type of a text to be classified and k denotes the k-th text of that type, x ∈ [1, 2, 3, …, K], k ∈ [1, 2, 3, …, M]. To classify any other text in the set, only steps 104 to 107 need to be performed.
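Steps 104 through 107 reduce to averaging the looked-up vectors of the text to classify and picking the class whose feature vector gives the cosine similarity closest to 1. A stdlib sketch with toy vectors and illustrative class names:

```python
import math

def cosine(a, b):
    """cos(theta) = (a . b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify(doc_vec, class_vecs):
    """Return the text type whose training-sample feature vector is most similar."""
    return max(class_vecs, key=lambda k: cosine(doc_vec, class_vecs[k]))

classes = {"entertainment": [1.0, 0.0], "technology": [0.0, 1.0]}  # toy T_k vectors
D = [0.9, 0.1]   # averaged word vectors of the text to classify
print(classify(D, classes))  # entertainment
```

Since cosine similarity is bounded by 1, maximizing it is the same as picking the value closest to 1, as the text states.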
It can be seen that, in the present embodiment, the continuous bag-of-words model CBOW computes the word-vector matrix containing every word vector obtained after word segmentation of the texts; the word vectors contained in each training sample of known text type are looked up in this matrix; the feature vector of the training samples of each text type is computed; and finally the type of a text to be classified is determined from its feature vector and the feature vectors of the training samples of the various text types. In this way, the vectorization of each word takes into account its relationship to the several words before and after it, so the features of the whole text carry semantic information, and training on the samples is efficient and takes little time; and when the type of a text to be classified is determined from the cosine similarity between its feature vector and the feature vectors T_k of the training samples of the various text types, the calculation is simple and the precision is high.
The second embodiment of the present invention relates to a text classification method based on word vectors. The second embodiment is a further improvement on the first, the main improvement being: in the second embodiment, an optimization is given for quickly looking up the required word vectors in the word-vector matrix W_mn in steps 102 and 104. The method is as follows: an index is built in advance for each word vector w_mn in the matrix W_mn, and the word vectors w_mn contained in each training sample or in the text to be classified are then looked up in W_mn through this index. The specific flow is as shown in Fig. 1.
The present embodiment not only achieves the technical effect of the first embodiment; by building an index in advance for each word vector w_mn in the matrix W_mn, every required word vector w_mn can be looked up in W_mn more conveniently and quickly, which not only raises the lookup efficiency but also indirectly raises the efficiency of the whole text classification.
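The pre-built index of the second embodiment can be as simple as a hash map from word to row number of W_mn, built once after step 101 so that later lookups are constant-time. A sketch with made-up values:

```python
def build_index(words):
    """word -> row index in W_mn, built once so every later lookup is O(1)."""
    return {w: i for i, w in enumerate(words)}

W_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy W_mn, one row per word
index = build_index(["sports", "stock", "movie"])

def lookup(word):
    """Fetch the word vector through the index; None if the word is unknown."""
    return W_rows[index[word]] if word in index else None

print(lookup("stock"))  # [0.3, 0.4]
```

Both the lookup module (step 102) and the to-be-classified text lookup (step 104) can share the same index.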
The third embodiment of the present invention relates to a terminal device, comprising: a word-vector computing module 10, a lookup module 11, a training-sample feature-vector computing module 12 and a to-be-classified text type determining module 13, where the to-be-classified text type determining module 13 in turn specifically includes a word-vector acquisition submodule 131, a feature-vector computing submodule 132, a cosine-similarity computing submodule 133 and a determining submodule 134, as shown in Fig. 6.
The word-vector computing module 10 inputs the data obtained after word segmentation of L texts into the continuous bag-of-words model CBOW and computes the word-vector matrix W_mn containing every word vector w_mn.
The lookup module 11 looks up, after word segmentation of the training samples of M known text types, the word vectors w_mn contained in each training sample in the word-vector matrix W_mn.
The training-sample feature-vector computing module 12 computes the feature vector T_k of the training samples of each text type from the word vectors w_mn contained in each training sample.
The to-be-classified text type determining module 13 determines the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types. Within it:
the word-vector acquisition submodule 131 looks up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified;
the feature-vector computing submodule 132 computes the feature vector of the text to be classified from these word vectors w_mn by adding and averaging;
the cosine-similarity computing submodule 133 computes the cosine similarity between the feature vector of the text to be classified and the feature vector T_k of the training samples of each text type;
the determining submodule 134 determines the type of the text to be classified from these cosine similarity values: the text type of the training samples whose cosine similarity value is closest to 1 is the type of the text to be classified.
As can be seen, this embodiment is the system embodiment corresponding to the first embodiment, and the two can be implemented in cooperation. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment also apply to the first embodiment.
It should be noted that each module involved in this embodiment is a logical module. In practical applications, a logical unit may be one physical unit, part of one physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present invention, this embodiment introduces only the units closely related to solving the technical problem addressed by the present invention; this does not mean that no other units exist in this embodiment.
The fourth embodiment of the present invention relates to a terminal device. The fourth embodiment is a further improvement on the third embodiment, the main improvement being that, in the fourth embodiment, an index-building module 14 is additionally included after the word-vector calculation module 10, as shown in Figure 7.
The index-building module 14 is configured to build a respective index for each word vector w_mn in the word-vector matrix W_mn, so that the lookup module 11 and the word-vector acquisition submodule 131 can, according to the index, look up the word vectors w_mn contained in each training sample or in the text to be classified in the word-vector matrix W_mn more conveniently and quickly.
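A minimal sketch of what module 14's index could look like: a one-time mapping from each word to its row in W_mn, so that later lookups are O(1) dictionary hits rather than scans of the matrix. The patent does not specify an index structure, so the dict-based design and all names below are illustrative assumptions.

```python
import numpy as np

vocab = ["stock", "market", "goal", "match"]
W = np.random.rand(len(vocab), 8)   # stand-in for the CBOW-trained W_mn (m x n)

# Module 14: build the index once, mapping each word to its row in W_mn.
index = {word: row for row, word in enumerate(vocab)}

def lookup(tokens):
    """Modules 11/131: fetch the word vectors a segmented text contains,
    skipping out-of-vocabulary words."""
    return np.array([W[index[t]] for t in tokens if t in index])

vecs = lookup(["goal", "match", "unknown"])
print(vecs.shape)  # (2, 8): "unknown" is not in the index and is skipped
```

Building the index once up front trades a small amount of memory for constant-time retrieval on every subsequent training-sample or classification lookup.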
Since the second embodiment corresponds to this embodiment, the two can be implemented in cooperation. The relevant technical details mentioned in the second embodiment remain valid in this embodiment, and the technical effects achievable in the second embodiment can likewise be achieved here; to reduce repetition, they are not repeated. Correspondingly, the relevant technical details mentioned in this embodiment also apply to the second embodiment.
Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes instructions for causing a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those skilled in the art will understand that the above embodiments are specific embodiments for realizing the present invention, and that in practical applications various changes in form and detail may be made to them without departing from the spirit and scope of the present invention.
Claims (10)
1. A text classification method based on word vectors, characterized by comprising:
inputting word-segmented data of L texts into a continuous bag-of-words model (CBOW), and computing a word-vector matrix W_mn containing each word vector w_mn;
after word segmentation of training samples of M known text types, looking up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
computing, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type; and
determining the type of a text to be classified according to the feature vector of the text to be classified and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
2. The text classification method based on word vectors according to claim 1, characterized in that the feature vector T_k of the training samples of each text type is computed by summing the word vectors and taking the average.
3. The text classification method based on word vectors according to claim 2, characterized in that the feature vector of the text to be classified is computed as follows:
after word segmentation of the text to be classified, looking up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified; and
computing the feature vector of the text to be classified from its word vectors w_mn by summing them and taking the average.
4. The text classification method based on word vectors according to claim 1, characterized in that determining the type of the text to be classified according to its feature vector and the feature vectors T_k of the training samples specifically includes:
computing the cosine similarity between the feature vector of the text to be classified and the feature vectors T_k of the training samples of the various text types;
wherein the type of the text to be classified is the text type of the training sample whose cosine similarity is closest to 1.
5. The text classification method based on word vectors according to claim 1 or 3, characterized in that looking up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample specifically includes:
pre-building an index of the word-vector matrix W_mn; and
looking up, according to the index, the word vectors w_mn contained in each training sample in the word-vector matrix W_mn.
6. The text classification method based on word vectors according to claim 1, characterized in that the CBOW is a CBOW based on the negative-sampling algorithm.
7. A terminal device, characterized by comprising:
a word-vector calculation module, configured to input word-segmented data of L texts into a continuous bag-of-words model (CBOW) and compute a word-vector matrix W_mn containing each word vector w_mn;
a lookup module, configured to look up, after word segmentation of training samples of M known text types, the word vectors w_mn contained in each training sample in the word-vector matrix W_mn;
a training-sample feature-vector calculation module, configured to compute, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type; and
a to-be-classified text type determination module, configured to determine the type of a text to be classified according to the feature vector of the text to be classified and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
8. The terminal device according to claim 7, characterized in that the training-sample feature-vector calculation module computes the feature vector T_k of the training samples of each text type by summing the word vectors and taking the average.
9. The terminal device according to claim 8, characterized in that the to-be-classified text type determination module includes:
a word-vector acquisition submodule, configured to look up, after word segmentation of the text to be classified, each word vector w_mn contained in the text to be classified in the word-vector matrix W_mn; and
a feature-vector calculation submodule, configured to compute the feature vector of the text to be classified from its word vectors w_mn by summing them and taking the average.
10. The terminal device according to claim 7, characterized in that the to-be-classified text type determination module includes:
a cosine-similarity calculation submodule, configured to compute the cosine similarity between the feature vector of the text to be classified and the feature vectors T_k of the training samples of the various text types; and
a determination submodule, configured to take the text type of the training sample whose cosine similarity is closest to 1 as the type of the text to be classified.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610639589.9A CN106294684A (en) | 2016-08-06 | 2016-08-06 | The file classification method of term vector and terminal unit |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106294684A true CN106294684A (en) | 2017-01-04 |
Family
ID=57665678
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610639589.9A Pending CN106294684A (en) | 2016-08-06 | 2016-08-06 | The file classification method of term vector and terminal unit |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294684A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933183A (en) * | 2015-07-03 | 2015-09-23 | 重庆邮电大学 | Inquiring term rewriting method merging term vector model and naive Bayes |
CN105335352A (en) * | 2015-11-30 | 2016-02-17 | 武汉大学 | Entity identification method based on Weibo emotion |
US20160110343A1 (en) * | 2014-10-21 | 2016-04-21 | At&T Intellectual Property I, L.P. | Unsupervised topic modeling for short texts |
CN105824904A (en) * | 2016-03-15 | 2016-08-03 | 浙江大学 | Chinese herbal medicine plant picture capturing method based on professional term vector of traditional Chinese medicine and pharmacy field |
Non-Patent Citations (1)
Title |
---|
Jiang Dapeng (江大鹏): "基于词向量的短文本分类方法研究" (Research on short-text classification methods based on word vectors), China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库 信息科技辑》), Information Science and Technology * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107153642A (en) * | 2017-05-16 | 2017-09-12 | 华北电力大学 | A kind of analysis method based on neural network recognization text comments Sentiment orientation |
CN108959329B (en) * | 2017-05-27 | 2023-05-16 | 腾讯科技(北京)有限公司 | Text classification method, device, medium and equipment |
CN108959329A (en) * | 2017-05-27 | 2018-12-07 | 腾讯科技(北京)有限公司 | A kind of file classification method, device, medium and equipment |
CN107544957A (en) * | 2017-07-05 | 2018-01-05 | 华北电力大学 | A kind of Sentiment orientation analysis method of business product target word |
CN107577708A (en) * | 2017-07-31 | 2018-01-12 | 北京北信源软件股份有限公司 | Class base construction method and system based on SparkMLlib document classifications |
CN109388706A (en) * | 2017-08-10 | 2019-02-26 | 华东师范大学 | A kind of problem fine grit classification method, system and device |
CN107491444A (en) * | 2017-08-18 | 2017-12-19 | 南京大学 | Parallelization word alignment method based on bilingual word embedded technology |
WO2019056692A1 (en) * | 2017-09-25 | 2019-03-28 | 平安科技(深圳)有限公司 | News sentence clustering method based on semantic similarity, device, and storage medium |
CN109615153A (en) * | 2017-09-26 | 2019-04-12 | 阿里巴巴集团控股有限公司 | Businessman's methods of risk assessment, device, equipment and storage medium |
CN109615153B (en) * | 2017-09-26 | 2023-06-16 | 阿里巴巴集团控股有限公司 | Merchant risk assessment method, device, equipment and storage medium |
CN107908716A (en) * | 2017-11-10 | 2018-04-13 | 国网山东省电力公司电力科学研究院 | 95598 work order text mining method and apparatus of word-based vector model |
CN110858219A (en) * | 2018-08-17 | 2020-03-03 | 菜鸟智能物流控股有限公司 | Logistics object information processing method and device and computer system |
CN109284377A (en) * | 2018-09-13 | 2019-01-29 | 云南电网有限责任公司 | A kind of file classification method and device based on vector space |
CN109800422A (en) * | 2018-12-20 | 2019-05-24 | 北京明略软件系统有限公司 | Method, system, terminal and the storage medium that a kind of pair of tables of data is classified |
CN109637607A (en) * | 2018-12-24 | 2019-04-16 | 广州天鹏计算机科技有限公司 | Medical data classifying method, device, computer equipment and storage medium |
CN109670190B (en) * | 2018-12-25 | 2023-05-16 | 北京百度网讯科技有限公司 | Translation model construction method and device |
CN109670190A (en) * | 2018-12-25 | 2019-04-23 | 北京百度网讯科技有限公司 | Translation model construction method and device |
CN109947945A (en) * | 2019-03-19 | 2019-06-28 | 合肥工业大学 | Word-based vector sum integrates the textstream classification method of SVM |
CN111353282A (en) * | 2020-03-09 | 2020-06-30 | 腾讯科技(深圳)有限公司 | Model training method, text rewriting method, device and storage medium |
CN111353282B (en) * | 2020-03-09 | 2023-08-22 | 腾讯科技(深圳)有限公司 | Model training, text rewriting method, device and storage medium |
CN113111174A (en) * | 2020-04-28 | 2021-07-13 | 北京明亿科技有限公司 | Group identification method, device, equipment and medium based on deep learning model |
CN111709251A (en) * | 2020-06-12 | 2020-09-25 | 哈尔滨工程大学 | Formal concept similarity rapid measurement method with general semantics and domain semantics |
CN112257419A (en) * | 2020-11-06 | 2021-01-22 | 开普云信息科技股份有限公司 | Intelligent retrieval method and device for calculating patent document similarity based on word frequency and semantics, electronic equipment and storage medium thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106294684A (en) | The file classification method of term vector and terminal unit | |
CN106326346A (en) | Text classification method and terminal device | |
CN110020438B (en) | Sequence identification based enterprise or organization Chinese name entity disambiguation method and device | |
CN104199857B (en) | A kind of tax document hierarchy classification method based on multi-tag classification | |
CN106980683A (en) | Blog text snippet generation method based on deep learning | |
CN109739978A (en) | A kind of Text Clustering Method, text cluster device and terminal device | |
US11775594B2 (en) | Method for disambiguating between authors with same name on basis of network representation and semantic representation | |
CN105893609A (en) | Mobile APP recommendation method based on weighted mixing | |
CN107679082A (en) | Question and answer searching method, device and electronic equipment | |
Ju et al. | An efficient method for document categorization based on word2vec and latent semantic analysis | |
CN105740236A (en) | Writing feature and sequence feature combined Chinese sentiment new word recognition method and system | |
CN105893362A (en) | A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points | |
CN104484380A (en) | Personalized search method and personalized search device | |
CN107491434A (en) | Text snippet automatic generation method and device based on semantic dependency | |
CN110413993A (en) | A kind of semantic classification method, system and medium based on sparse weight neural network | |
Motwani et al. | A study on initial centroids selection for partitional clustering algorithms | |
Rooshenas et al. | Discriminative structure learning of arithmetic circuits | |
Cao et al. | Stacked residual recurrent neural network with word weight for text classification | |
CN111061876B (en) | Event public opinion data analysis method and device | |
Stemle et al. | Using language learner data for metaphor detection | |
CN109241298A (en) | Semantic data stores dispatching method | |
CN106802787A (en) | MapReduce optimization methods based on GPU sequences | |
Hwang et al. | Recent deep learning methods for tabular data | |
CN115879450B (en) | Gradual text generation method, system, computer equipment and storage medium | |
CN107329951A (en) | Build name entity mark resources bank method, device, storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170104 |