CN106294684A - Text classification method using word vectors, and terminal device - Google Patents

Text classification method using word vectors, and terminal device

Info

Publication number
CN106294684A
CN106294684A (application CN201610639589.9A)
Authority
CN
China
Prior art keywords
text
word vector
to-be-classified
training sample
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610639589.9A
Other languages
Chinese (zh)
Inventor
周诚
赵世亭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Gaoxin Computer Systems Co Ltd
Original Assignee
Shanghai Gaoxin Computer Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Gaoxin Computer Systems Co Ltd
Priority to CN201610639589.9A
Publication of CN106294684A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the communications field and discloses a text classification method using word vectors, and a terminal device. In the embodiments of the present invention, the continuous bag-of-words model CBOW is used to compute, from the word-segmented texts, a word-vector matrix containing each word vector; the word vectors contained in each training sample of known text type are looked up in this matrix, and the feature vector of the training samples of each text type is computed; finally, the type of a text to be classified is determined from its feature vector and the feature vectors of the training samples of the various text types. In this way, when words are vectorized, the connections between the current word and the several words before and after it are taken into account, so the overall text features carry semantic information; training on the training samples is efficient and takes little time; and when determining the type of a text to be classified, the amount of computation is small, the calculation is simple and fast, and the precision is high.

Description

Text classification method using word vectors, and terminal device
Technical field
The present invention relates to the field of information processing, and in particular to a text classification method using word vectors and a terminal device.
Background art
Text classification means analyzing a set of training texts that have been classified in advance by experts, deriving a classification model from them, and classifying other texts with the derived model; it is mainly used in information retrieval, machine translation, automatic summarization, information filtering, and so on. There are many methods for training on text, such as the two training methods of the continuous bag-of-words model CBOW: the hierarchical classifier Hierarchical Softmax and negative sampling.
In the course of realizing the present invention, the inventors found that the CBOW training method based on Hierarchical Softmax is more favorable to rare words and can classify such texts faster, while the CBOW training method based on the negative sampling algorithm is favorable to classifying texts of common words and to low-dimensional vectors; with either algorithm, the window size chosen for CBOW training is usually about 5. However, both methods require a large amount of computation when determining the text type, which is unfavorable to fast implementation.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a text classification method using word vectors, and a terminal device, such that when words are vectorized the connections between the current word and the several words before and after it are taken into account, giving the overall text features semantic character; training on the training samples is efficient and takes little time; and when determining the type of a text to be classified, the amount of computation is small, the calculation is simple and fast, and the precision is high.
To solve the above technical problem, the embodiments of the present invention provide a text classification method using word vectors, comprising:
inputting the word-segmented data of L texts into the continuous bag-of-words model CBOW, and computing the word-vector matrix W_mn containing each word vector w_mn;
after word-segmenting the training samples of M known text types, looking up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
computing, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type;
determining the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
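As an illustrative sketch of the training step (assuming the gensim library and hypothetical segmented texts; the patent itself names no implementation), CBOW with negative sampling and window size 5 yields the word-vector matrix W_mn:

```python
# Sketch of computing the word-vector matrix W_mn from L segmented texts.
# gensim is an assumption; the patent does not name an implementation.
from gensim.models import Word2Vec

# Each text is already word-segmented into a list of tokens (hypothetical data).
segmented_texts = [
    ["股票", "市场", "上涨"],   # hypothetical finance text
    ["球队", "比赛", "获胜"],   # hypothetical sports/entertainment text
]

# sg=0 selects CBOW; negative=5 enables the negative sampling algorithm;
# window=5 matches the window size mentioned in the background section.
model = Word2Vec(sentences=segmented_texts, vector_size=100,
                 window=5, sg=0, negative=5, min_count=1)

# model.wv.vectors is the m x n word-vector matrix W_mn:
# one row per word (m words), n dimensions per row.
W_mn = model.wv.vectors
```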
The embodiments of the present invention additionally provide a terminal device, comprising:
a word-vector computing module, used to input the word-segmented data of L texts into the continuous bag-of-words model CBOW and compute the word-vector matrix W_mn containing each word vector w_mn;
a lookup module, used to word-segment the training samples of M known text types and then look up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
a training-sample feature-vector computing module, used to compute, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type;
a to-be-classified text type determining module, used to determine the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
Compared with the prior art, the embodiments of the present invention use the continuous bag-of-words model CBOW to compute, from the word-segmented texts, the word-vector matrix containing each word vector; the word vectors contained in each training sample of known text type are looked up in this matrix, and the feature vector of the training samples of each text type is computed; finally the type of a text to be classified is determined from its feature vector and the feature vectors of the training samples of the various text types. In this way, when words are vectorized, the connections between the current word and the several words before and after it are taken into account, so the overall text features carry semantic information; training on the training samples is efficient and takes little time; and when determining the type of a text to be classified, the amount of computation is small, the calculation is simple and fast, and the precision is high.
In addition, the feature vector T_k of the training samples of each text type is computed by summing and averaging.
With the summing-and-averaging calculation, computing the feature vector of the training samples of each text type involves little computation and the calculation process is simple and fast.
In addition, the feature vector of the text to be classified is computed as follows: after word-segmenting the text to be classified, each word vector w_mn it contains is looked up in the word-vector matrix W_mn; the feature vector D of the text to be classified is then computed from these word vectors w_mn by summing and averaging.
By looking up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified, the existing word-vector matrix W_mn is fully reused, so each word vector is found quickly and accurately, which indirectly improves the efficiency of the whole text classification process.
In addition, determining the type of the text to be classified from its feature vector and the feature vectors T_k of the training samples specifically includes: computing the cosine similarity between the feature vector of the text to be classified and the feature vector T_k of the training samples of each text type; the type of the text to be classified is the text type of the training samples whose cosine similarity is closest to 1.
Determining the type of the text to be classified by computing cosine similarity values involves little computation, the calculation is simple, and the precision in determining the type is high.
Brief description of the drawings
Fig. 1 is a flow chart of a text classification method using word vectors according to the first embodiment of the present invention;
Fig. 2 is a structural diagram of the CBOW model according to the first embodiment of the present invention;
Fig. 3 is a network structure diagram of the CBOW model according to the first embodiment of the present invention;
Fig. 4 is an illustration of looking up a word with the CBOW model according to the first embodiment of the present invention;
Fig. 5 is a diagram of the mapping set up by the negative sampling algorithm according to the first embodiment of the present invention;
Fig. 6 is a structural diagram of a terminal device according to the third embodiment of the present invention;
Fig. 7 is a structural diagram of a terminal device according to the fourth embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the embodiments of the present invention are explained in detail below with reference to the accompanying drawings. Those skilled in the art will understand, however, that in each embodiment many technical details are given to help the reader understand the application better; even without these technical details, and with various changes and modifications based on the following embodiments, the technical solution claimed in this application can still be realized.
The first embodiment of the present invention relates to a text classification method using word vectors. The specific flow is shown in Fig. 1.
In step 101, the word-segmented data of L texts are input into CBOW and the word-vector matrix is computed.
Specifically, the L texts are word-segmented and the resulting data serve as the input of CBOW, from which the word vectors of these L texts and the word-vector matrix containing each word vector are computed. A word vector is denoted w_mn and the word-vector matrix W_mn, where m is the number of words and n is the dimension of the word vectors. The concrete form of the word-vector matrix W_mn is:
$$W_{mn} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{bmatrix}$$
where the rows taken together represent the vector set of the m words, and each row of W_mn is the vector representation of one word.
In addition, the CBOW described in the embodiments of the present invention is CBOW based on the negative sampling algorithm. Conventional CBOW comes in two types, one based on Hierarchical Softmax and one based on the negative sampling algorithm; the two are introduced in turn below.
The CBOW model based on Hierarchical Softmax comprises an input layer, a projection layer, and an output layer. It predicts the current word w_t on the premise that its context w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} is known, as shown in Fig. 2. Assume a sample (Context(w), w) in which Context(w) is made up of the c words before and the c words after w; then the input layer contains the word vectors of these 2c words, v(Context(w)_1), v(Context(w)_2), ..., v(Context(w)_{2c}) ∈ R^m, where m here denotes the length of the word vectors. The projection layer sums the 2c input vectors, i.e. $x_w = \sum_{i=1}^{2c} v(Context(w)_i)$; its structure is shown in Fig. 3. The output layer corresponds to a binary tree: a Huffman tree whose leaf nodes are the words occurring in the corpus and whose weights are the number of times each word occurs in the corpus. This Huffman tree has N = |D| leaf nodes in total, corresponding to the words in the dictionary D, and N - 1 non-leaf nodes (the nodes marked black in the figure).
Hierarchical Softmax is a key technique for improving performance in word vectorization. For a leaf node in the Huffman tree, assume it corresponds to the word w in dictionary D, and denote:
1) p^w: the path from the root node to the leaf node corresponding to w;
2) l^w: the number of nodes contained in the path p^w;
3) p^w_1, p^w_2, ..., p^w_{l^w}: the l^w nodes of the path p^w, where p^w_1 is the root node and p^w_{l^w} is the node corresponding to the word w;
4) d^w_2, d^w_3, ..., d^w_{l^w} ∈ {0, 1}: the Huffman code of the word w, made up of l^w - 1 bits, d^w_j being the code of the j-th node of the path p^w (the root node has no code);
5) θ^w_1, θ^w_2, ..., θ^w_{l^w - 1} ∈ R^m: the vectors of the non-leaf nodes of the path p^w, θ^w_j being the vector of the j-th non-leaf node.
Take the word w = "football" as an example to illustrate how, under the network structure shown in Fig. 3, the vector x_w ∈ R^m and the Huffman tree are used to define the function p(w | Context(w)); the details are shown in Fig. 4. The 5 nodes strung together by the four dashed edges constitute the path p^w, whose length is l^w = 5; p^w_1, ..., p^w_5 are the 5 nodes on this path, with p^w_1 the root node; d^w_2, d^w_3, d^w_4, d^w_5 are 1, 0, 0, 1 respectively, i.e. the Huffman code of "football" is 1001; and θ^w_1, ..., θ^w_4 denote the vectors of the 4 non-leaf nodes on the path. Going from the root node to the leaf node "football" passes through 4 branches (each dashed edge corresponding to one branch), and each branch can be regarded as one binary classification. From the viewpoint of binary classification, each non-leaf node must designate, for its two child nodes, which is the positive class (label 1) and which is the negative class (label 0). Apart from the root node, every node in the tree corresponds to a Huffman code of 0 or 1, so a natural convention is to define the nodes coded 0 as the positive class and the nodes coded 1 as the negative class (the opposite convention is equally possible); that is, Label(p^w_j) = 1 - d^w_j, j = 2, 3, ..., l^w. According to logistic regression, the probability that a node is classified into the positive class is $\sigma(x_w^T \theta) = 1/(1 + e^{-x_w^T \theta})$, and the probability that it is classified into the negative class is $1 - \sigma(x_w^T \theta)$, where θ is an undetermined parameter; the vectors θ^w_j of the non-leaf nodes play exactly the role of this parameter θ.
For the 4 binary classifications experienced on the way from the root node to the leaf node "football", the probability of each classification result can be written out as:
1) the 1st: $p(d^w_2 \mid x_w, \theta^w_1) = 1 - \sigma(x_w^T \theta^w_1)$;
2) the 2nd: $p(d^w_3 \mid x_w, \theta^w_2) = \sigma(x_w^T \theta^w_2)$;
3) the 3rd: $p(d^w_4 \mid x_w, \theta^w_3) = \sigma(x_w^T \theta^w_3)$;
4) the 4th: $p(d^w_5 \mid x_w, \theta^w_4) = 1 - \sigma(x_w^T \theta^w_4)$.
Then $p(\text{football} \mid Context(\text{football})) = \prod_{j=2}^{5} p(d^w_j \mid x_w, \theta^w_{j-1})$. So far, the example w = "football" shows that for any word w in dictionary D, there exists in the Huffman tree a path p^w from the root node to the node corresponding to w (and this path is unique). The path p^w has l^w - 1 branches; regarding each branch as a binary classification, every classification produces a probability, and multiplying these probabilities together gives p(w | Context(w)):
$$p(w \mid Context(w)) = \prod_{j=2}^{l^w} p(d_j^w \mid x_w, \theta_{j-1}^w) \qquad (1)$$
where
$$p(d_j^w \mid x_w, \theta_{j-1}^w) = \begin{cases} \sigma(x_w^T \theta_{j-1}^w), & d_j^w = 0 \\ 1 - \sigma(x_w^T \theta_{j-1}^w), & d_j^w = 1 \end{cases}$$
or, written as a single overall expression:
$$p(d_j^w \mid x_w, \theta_{j-1}^w) = \left[\sigma(x_w^T \theta_{j-1}^w)\right]^{1-d_j^w} \left[1 - \sigma(x_w^T \theta_{j-1}^w)\right]^{d_j^w}$$
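A minimal numpy sketch of formula (1), assuming the Huffman code bits and the non-leaf-node vectors along the path are already known (all values here are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def path_probability(x_w, codes, thetas):
    """Formula (1): p(w | Context(w)) as a product of binary
    classifications along the Huffman path of w.
    x_w    -- summed context vector (projection-layer output)
    codes  -- Huffman code bits d_2..d_{l^w} of w, e.g. [1, 0, 0, 1]
    thetas -- vectors theta_1..theta_{l^w - 1} of the non-leaf nodes"""
    p = 1.0
    for d, theta in zip(codes, thetas):
        s = sigmoid(x_w @ theta)
        p *= s if d == 0 else (1.0 - s)  # code 0 = positive class
    return p

# "football" example: code 1001 means four binary decisions.
rng = np.random.default_rng(0)
x_w = rng.normal(size=100)
thetas = rng.normal(size=(4, 100))
print(path_probability(x_w, [1, 0, 0, 1], thetas))
```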
Assume Y_w = (y_{w,1}, y_{w,2}, ..., y_{w,N})^T is a vector of length N whose components do not by themselves represent probabilities. If we want the component y_{w,i} of Y_w to represent the probability that, when the context is Context(w), the next word is exactly the i-th word of dictionary D, a softmax normalization is needed, giving
$$p(w \mid Context(w)) = \frac{e^{y_{w,i_w}}}{\sum_{i=1}^{N} e^{y_{w,i}}}$$
where i_w denotes the index of the word w in dictionary D.
Substituting formula (1) into the log-likelihood function $\mathcal{L} = \sum_{w \in C} \log p(w \mid Context(w))$ gives:
$$\mathcal{L} = \sum_{w \in C} \sum_{j=2}^{l^w} \left\{ (1 - d_j^w) \log \sigma(x_w^T \theta_{j-1}^w) + d_j^w \log\left[1 - \sigma(x_w^T \theta_{j-1}^w)\right] \right\}$$
Denote the summand
$$l(w, j) = (1 - d_j^w) \log \sigma(x_w^T \theta_{j-1}^w) + d_j^w \log\left[1 - \sigma(x_w^T \theta_{j-1}^w)\right]$$
as the objective function of the CBOW model; the word vectors are optimized with the stochastic gradient ascent method. The approach of stochastic gradient ascent is as follows: each time a sample (Context(w), w) is taken, all the relevant parameters in the objective function are updated once. Observing the objective function l(w, j), its parameters include the vectors x_w and θ^w_{j-1}, so the gradients of l(w, j) with respect to these vectors are given below.
The gradient of l(w, j) with respect to θ^w_{j-1} is:
$$\frac{\partial l(w, j)}{\partial \theta_{j-1}^w} = \left[1 - d_j^w - \sigma(x_w^T \theta_{j-1}^w)\right] x_w$$
Then the update formula of θ^w_{j-1} can be written as
$$\theta_{j-1}^w := \theta_{j-1}^w + \eta \left[1 - d_j^w - \sigma(x_w^T \theta_{j-1}^w)\right] x_w$$
where η denotes the learning rate. In the same way, the gradient of l(w, j) with respect to x_w is:
$$\frac{\partial l(w, j)}{\partial x_w} = \left[1 - d_j^w - \sigma(x_w^T \theta_{j-1}^w)\right] \theta_{j-1}^w$$
Here x_w denotes the sum of the word vectors of the words in Context(w), while the final purpose is to obtain the word vector of each word in dictionary D; the gradient is therefore distributed back onto each context word:
$$v(u) := v(u) + \eta \sum_{j=2}^{l^w} \frac{\partial l(w, j)}{\partial x_w}, \qquad u \in Context(w)$$
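The following numpy sketch illustrates one such stochastic-gradient-ascent update under the notation above (the learning rate 0.025 is a hypothetical choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_update(context_vecs, codes, thetas, eta=0.025):
    """One stochastic-gradient-ascent step for a sample (Context(w), w)
    under hierarchical softmax; a sketch of the update formulas above.
    Updates thetas and the context word vectors in place."""
    x_w = context_vecs.sum(axis=0)           # projection layer: sum of 2c vectors
    e = np.zeros_like(x_w)                   # accumulates the gradient w.r.t. x_w
    for j, (d, theta) in enumerate(zip(codes, thetas)):
        g = 1.0 - d - sigmoid(x_w @ theta)   # common factor of both gradients
        e += g * theta                       # grad of l(w, j) w.r.t. x_w
        thetas[j] += eta * g * x_w           # theta update
    context_vecs += eta * e                  # propagate to every context word
```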
In the CBOW model based on the negative sampling algorithm, the context Context(w) of the word w is known and w is to be predicted; therefore, for a given Context(w), the word w is a positive sample and other words are negative samples. Assume a set of negative samples NEG(w) ≠ ∅ about w has been selected; for $\tilde{w} \in D$, define the label $L^w(\tilde{w}) = 1$ if $\tilde{w} = w$ and $L^w(\tilde{w}) = 0$ otherwise, i.e. the label of a positive sample is 1 and the label of a negative sample is 0.
For a given positive sample (Context(w), w), we wish to maximize
$$g(w) = \prod_{u \in \{w\} \cup NEG(w)} p(u \mid Context(w)) \qquad (2)$$
where
$$p(u \mid Context(w)) = \begin{cases} \sigma(x_w^T \theta^u), & L^w(u) = 1 \\ 1 - \sigma(x_w^T \theta^u), & L^w(u) = 0 \end{cases}$$
which can also be written as a single overall expression:
$$p(u \mid Context(w)) = \left[\sigma(x_w^T \theta^u)\right]^{L^w(u)} \left[1 - \sigma(x_w^T \theta^u)\right]^{1 - L^w(u)} \qquad (3)$$
Here x_w still denotes the sum of the word vectors of the words in Context(w), and θ^u ∈ R^m denotes the vector corresponding to the word u. Substituting formula (3) into formula (2) gives:
$$g(w) = \sigma(x_w^T \theta^w) \prod_{u \in NEG(w)} \left[1 - \sigma(x_w^T \theta^u)\right]$$
Here $\sigma(x_w^T \theta^w)$ represents the probability that, when the context is Context(w), the predicted center word is w, while $\sigma(x_w^T \theta^u)$ represents the probability that the predicted center word is u. Maximizing g(w) therefore means maximizing $\sigma(x_w^T \theta^w)$ while minimizing all the $\sigma(x_w^T \theta^u)$, u ∈ NEG(w), i.e. increasing the probability of the positive sample while reducing the probabilities of the negative samples. For a given corpus C, the function $G = \prod_{w \in C} g(w)$ serves as the target of global optimization; to simplify the computation, take the logarithm of G, so the final objective function is:
$$\mathcal{L} = \log G = \sum_{w \in C} \sum_{u \in \{w\} \cup NEG(w)} \left\{ L^w(u) \log \sigma(x_w^T \theta^u) + \left[1 - L^w(u)\right] \log\left[1 - \sigma(x_w^T \theta^u)\right] \right\}$$
Denote the summand of the above formula
$$l(w, u) = L^w(u) \log \sigma(x_w^T \theta^u) + \left[1 - L^w(u)\right] \log\left[1 - \sigma(x_w^T \theta^u)\right]$$
Its parameters are likewise optimized with the stochastic gradient ascent method. The gradient of l(w, u) with respect to θ^u is:
$$\frac{\partial l(w, u)}{\partial \theta^u} = \left[L^w(u) - \sigma(x_w^T \theta^u)\right] x_w$$
Then the update formula of θ^u can be written as
$$\theta^u := \theta^u + \eta \left[L^w(u) - \sigma(x_w^T \theta^u)\right] x_w$$
In the same way, the gradient of l(w, u) with respect to x_w is:
$$\frac{\partial l(w, u)}{\partial x_w} = \left[L^w(u) - \sigma(x_w^T \theta^u)\right] \theta^u$$
Then, using this gradient, the update formula of the word vector of each context word can be obtained:
$$v(\tilde{w}) := v(\tilde{w}) + \eta \sum_{u \in \{w\} \cup NEG(w)} \frac{\partial l(w, u)}{\partial x_w}, \qquad \tilde{w} \in Context(w)$$
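A corresponding numpy sketch of one negative-sampling update under the same notation (theta here is a hypothetical matrix holding one vector θ^u per dictionary word):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_update(context_vecs, theta, w_idx, neg_idx, eta=0.025):
    """One gradient-ascent step for a positive sample (Context(w), w)
    with negative sample indices neg_idx; a sketch of the updates above.
    theta has one row per dictionary word."""
    x_w = context_vecs.sum(axis=0)
    e = np.zeros_like(x_w)
    for u, label in [(w_idx, 1)] + [(u, 0) for u in neg_idx]:
        g = label - sigmoid(x_w @ theta[u])  # L^w(u) - sigma(x_w^T theta^u)
        e += g * theta[u]                    # grad of l(w, u) w.r.t. x_w
        theta[u] += eta * g * x_w            # theta^u update
    context_vecs += eta * e                  # update each context word vector
```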
The negative sampling algorithm used in the embodiments of the present invention is now briefly introduced. The words in dictionary D occur in corpus C with varying frequency; for high-frequency words, the probability of being chosen as a negative sample should be larger, and conversely, for low-frequency words, the chosen probability should be smaller. This is essentially a weighted sampling problem, and the specific algorithm process can be described as follows:
Assume each word w in dictionary D corresponds to a line segment l(w) of length
$$len(w) = \frac{counter(w)}{\sum_{u \in D} counter(u)}$$
where counter(·) denotes the number of times a word occurs in corpus C (the summation in the denominator serves as normalization). Now connect these line segments end to end into a unit segment of length 1. If a point is chosen at random on this unit segment, the longer a segment (corresponding to a high-frequency word), the greater the probability of it being hit.
Denote $l_0 = 0$ and $l_j = \sum_{k=1}^{j} len(w_k)$, j = 1, 2, ..., N, where w_j denotes the j-th word in dictionary D; then with $\{l_j\}_{j=0}^{N}$ as partition nodes, a non-equidistant partition of the interval [0, 1] is obtained, with the N cells $I_i = (l_{i-1}, l_i]$, i = 1, 2, ..., N. Further introduce an equidistant partition of the interval [0, 1] with partition nodes $m_j = j/M$, j = 0, 1, ..., M, where M >> N, as shown in Fig. 5. Projecting the interior equidistant nodes $m_1, m_2, \ldots, m_{M-1}$ onto the non-equidistant partition, as shown by the dashed lines in Fig. 5, the mapping between them and the cells $I_k$ is:
Table(i) = w_k, where m_i ∈ I_k, i = 1, 2, ..., M - 1
With this mapping, the sampling process is: each time, generate a random integer r between [1, M - 1]; Table(r) is then a sample. If during sampling the word w_i happens to draw itself, it is skipped.
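A numpy sketch of this weighted-sampling table (M and the word counts here are hypothetical):

```python
import numpy as np

def build_table(counts, M=100_000_000):
    """Weighted-sampling table: word j owns a stretch of [0, 1] of length
    len(w_j) = counter(w_j) / sum counter(u), discretized into M
    equidistant cells; a sketch of the scheme described above."""
    probs = counts / counts.sum()
    bounds = np.cumsum(probs)             # l_1 .. l_N, non-equidistant nodes
    grid = np.arange(1, M) / M            # equidistant nodes m_1 .. m_{M-1}
    return np.searchsorted(bounds, grid)  # Table(i) = index k with m_i in I_k

counts = np.array([50, 5, 1], dtype=float)  # hypothetical word frequencies
table = build_table(counts, M=1000)
r = np.random.randint(0, len(table))        # random position in the table
print(table[r])                             # sampled (negative) word index
```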
In step 102, the word vectors contained in the training samples are looked up in W_mn.
Specifically, the word vectors contained in each of the M training samples of known text type are looked up in the word-vector matrix W_mn. First the M training samples of known text type are word-segmented (M ≤ L here is to prevent the segmented training samples from not being found in the word-vector matrix W_mn obtained in step 101); then the word vectors w_mn contained in each training sample are looked up in the word-vector matrix W_mn, so that the word vectors w_mn of each training sample of every text type become known.
In step 103, the feature vector of the training samples of each text type is computed.
Specifically, from the word vectors w_mn of each training sample obtained in step 102, the feature vector T_k of the training samples of each text type is computed by summing and averaging, where k = 1, 2, ..., K and K is the number of text types. Assume there are an entertainment class, a technology class, and a finance class, with feature vectors T_1, T_2, T_3 respectively; then:
$$T_1 = \left[ avg(w_{11} + w_{21} + \cdots + w_{e1}) \;\; avg(w_{12} + w_{22} + \cdots + w_{e2}) \;\; \cdots \;\; avg(w_{1n} + w_{2n} + \cdots + w_{en}) \right] = \left[ W_{ent1} \; W_{ent2} \; \cdots \; W_{entn} \right]$$
$$T_2 = \left[ avg(w_{11} + w_{31} + \cdots + w_{i1}) \;\; avg(w_{12} + w_{32} + \cdots + w_{i2}) \;\; \cdots \;\; avg(w_{1n} + w_{3n} + \cdots + w_{in}) \right] = \left[ W_{tech1} \; W_{tech2} \; \cdots \; W_{techn} \right]$$
$$T_3 = \left[ avg(w_{11} + \cdots + w_{f1} + \cdots + w_{i1}) \;\; avg(w_{12} + \cdots + w_{f2} + \cdots + w_{i2}) \;\; \cdots \;\; avg(w_{1n} + \cdots + w_{fn} + \cdots + w_{in}) \right] = \left[ W_{fina1} \; W_{fina2} \; \cdots \; W_{finan} \right]$$
where e denotes the e-th word, i the i-th word, f the f-th word, and
$$W_{ent1} = avg(w_{11} + w_{21} + \cdots + w_{e1}), \quad W_{ent2} = avg(w_{12} + w_{22} + \cdots + w_{e2}), \quad \ldots, \quad W_{entn} = avg(w_{1n} + w_{2n} + \cdots + w_{en}),$$
$$W_{tech1} = avg(w_{11} + w_{31} + \cdots + w_{i1}), \quad W_{tech2} = avg(w_{12} + w_{32} + \cdots + w_{i2}), \quad \ldots, \quad W_{techn} = avg(w_{1n} + w_{3n} + \cdots + w_{in}),$$
$$W_{fina1} = avg(w_{11} + \cdots + w_{f1} + \cdots + w_{i1}), \quad W_{fina2} = avg(w_{12} + \cdots + w_{f2} + \cdots + w_{i2}), \quad \ldots, \quad W_{finan} = avg(w_{1n} + \cdots + w_{fn} + \cdots + w_{in}).$$
If there are other categories, the feature vector of each type can be obtained in the same way.
It should be noted that T_1 is built from [w_1, w_2, ..., w_e], T_2 from [w_1, w_3, ..., w_i], and T_3 from [w_1, ..., w_f, ..., w_i], where w_1, w_2, ..., w_m are sets of words: each class, i.e. each different text type, is formed from a different (or partly identical) combination of words. In the feature vectors T_1, T_2, T_3, the elements inside each avg(·) are unrelated to the order of the m words in W_mn and need not be consecutive; they depend only on the words making up the current text type. Therefore, when computing each text type, the relevant words are looked up in the matrix W_mn and the computation is done accordingly.
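A minimal sketch of this averaging step, assuming a word-to-row index for W_mn such as the one described in the second embodiment (class_words and word_index are hypothetical helpers):

```python
import numpy as np

def class_feature_vector(class_words, W_mn, word_index):
    """T_k: element-wise average of the word vectors of all words occurring
    in the training samples of one text type. word_index maps a word to its
    row in W_mn and is an assumed helper."""
    rows = [word_index[w] for w in class_words if w in word_index]
    return W_mn[rows].mean(axis=0)

# Hypothetical usage with an entertainment-class vocabulary:
# T_ent = class_feature_vector({"电影", "明星", "演唱会"}, W_mn, word_index)
```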
In step 104, the word vectors contained in the text to be classified are looked up in W_mn.
Specifically, the text to be classified is first word-segmented, and then each word vector w_mn contained in this text is looked up in the word-vector matrix W_mn.
In step 105, the feature vector of the text to be classified is computed.
Specifically, from each word vector w_mn contained in the text to be classified obtained in step 104, the feature vector of the text to be classified is computed by summing and averaging.
Assume that after word segmentation the current text k to be classified has the word set (w_1, w_2, ..., w_l), i.e. the sample to be classified is made up of these l words. Looking up the vectors of the corresponding words (w_1, w_2, ..., w_l) in the word-vector matrix W_mn gives $(w_{11}, \ldots, w_{1n}), (w_{21}, \ldots, w_{2n}), \ldots, (w_{l1}, \ldots, w_{ln})$; the feature vector of the current text to be classified is then:
$$D_k = \left[ avg(w_{11} + w_{21} + \cdots + w_{l1}) \;\; avg(w_{12} + w_{22} + \cdots + w_{l2}) \;\; \cdots \;\; avg(w_{1n} + w_{2n} + \cdots + w_{ln}) \right] = \left[ d_{11} \; d_{12} \; \cdots \; d_{1n} \right]$$
where the second subscript of d (up to n) denotes the dimension of the word vector, and the first subscript 1 denotes the current, first text; when there are several texts, this subscript can be any natural number.
In step 106, the cosine similarity values are computed.
Specifically, the cosine similarity between the feature vector $D_x^k$ of the text to be classified and the feature vector T_k of the training samples of each text type is computed as
$$\cos(T_k, D_x^k) = \frac{T_k \cdot D_x^k}{\|T_k\| \, \|D_x^k\|}$$
The cosine similarity between the current text to be classified and the feature vector of the training samples of the entertainment type is:
$$\cos(T_1, D_x^k) = \frac{W_{ent1} d_{11} + W_{ent2} d_{12} + \cdots + W_{entn} d_{1n}}{\sqrt{W_{ent1}^2 + W_{ent2}^2 + \cdots + W_{entn}^2} \times \sqrt{d_{11}^2 + d_{12}^2 + \cdots + d_{1n}^2}}$$
The cosine similarity between the current text to be classified and the feature vector of the training samples of the technology type is:
$$\cos(T_2, D_x^k) = \frac{W_{tech1} d_{11} + W_{tech2} d_{12} + \cdots + W_{techn} d_{1n}}{\sqrt{W_{tech1}^2 + W_{tech2}^2 + \cdots + W_{techn}^2} \times \sqrt{d_{11}^2 + d_{12}^2 + \cdots + d_{1n}^2}}$$
The cosine similarity between the current text to be classified and the feature vector of the training samples of the finance type is:
$$\cos(T_3, D_x^k) = \frac{W_{fina1} d_{11} + W_{fina2} d_{12} + \cdots + W_{finan} d_{1n}}{\sqrt{W_{fina1}^2 + W_{fina2}^2 + \cdots + W_{finan}^2} \times \sqrt{d_{11}^2 + d_{12}^2 + \cdots + d_{1n}^2}}$$
In the same way, the cosine similarity between the sample to be classified and the feature vectors of the training samples of the other text types can be obtained.
In step 107, the type of the text to be classified is determined.
Specifically, the type of the text to be classified is determined from the cosine similarity values obtained in step 106: the type of the sample to be classified is the text type of the training samples whose cosine similarity value is closest to 1.
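Steps 104 to 107 can be sketched together as follows (the class feature vectors and the word index are assumed to come from the earlier steps):

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(text_words, W_mn, word_index, class_vectors):
    """Average the looked-up word vectors of the text to be classified,
    then pick the type whose training-sample feature vector T_k has
    cosine similarity closest to 1; a sketch of steps 104-107."""
    rows = [word_index[w] for w in text_words if w in word_index]
    D = W_mn[rows].mean(axis=0)                        # feature vector D
    sims = {k: cosine(T_k, D) for k, T_k in class_vectors.items()}
    return max(sims, key=sims.get)                     # closest to 1

# Hypothetical usage:
# class_vectors = {"entertainment": T1, "technology": T2, "finance": T3}
# print(classify(segmented_text, W_mn, word_index, class_vectors))
```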
It is worth explaining that when there are several texts to be classified, they constitute a set of texts to be classified $\{D_x^k\}$, where x denotes the text type of the text to be classified and k denotes the k-th text of that type, x ∈ [1, 2, 3, ..., K], k ∈ [1, 2, 3, ..., M]. When any other text in the set of texts to be classified needs to be classified, only steps 104 to 107 need to be performed.
It can be seen that in the present embodiment, the continuous bag-of-words model CBOW (here based on the negative sampling algorithm) is used to compute, from the word-segmented texts, the word-vector matrix containing each word vector; the word vectors contained in each training sample of known text type are looked up in this matrix, and the feature vector of the training samples of each text type is computed; finally the type of a text to be classified is determined from its feature vector and the feature vectors of the training samples of the various text types. In this way, when words are vectorized, the connections between the current word and the several words before and after it are taken into account, so the overall text features carry semantic information, and training on the training samples is efficient and takes little time; when the type of a text to be classified is determined from the cosine similarity between its feature vector and the feature vectors T_k of the training samples of the various text types, the calculation is simple and the precision is high.
The second embodiment of the present invention relates to a text classification method using word vectors. The second embodiment makes a further improvement on the basis of the first embodiment, the main improvement being: in the second embodiment, an optimization is given for quickly looking up the required word vectors in the word-vector matrix W_mn in steps 102 and 104. The method is: build in advance an index for each word vector w_mn in the word-vector matrix W_mn, and then look up the word vectors w_mn contained in each training sample or in the text to be classified in W_mn according to this index. The specific flow is shown in Fig. 1.
The present embodiment not only achieves the technical effect of the first embodiment; by building in advance an index for each word vector w_mn in the word-vector matrix W_mn, each required word vector w_mn can be looked up in the matrix more conveniently and quickly, which not only raises the lookup efficiency but also indirectly improves the efficiency of the whole text classification.
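A minimal sketch of such an index, assuming a vocabulary list ordered like the rows of W_mn (with gensim, this list would be model.wv.index_to_key):

```python
import numpy as np

# Hash map from word to row number in W_mn gives O(1) word-vector lookup.
vocabulary = ["股票", "市场", "球队"]   # hypothetical word list
W_mn = np.eye(3)                        # hypothetical 3 x 3 word-vector matrix
word_index = {word: row for row, word in enumerate(vocabulary)}

def lookup(word):
    """Return the word vector w_mn of `word`, or None if the word
    did not occur in the L training texts."""
    row = word_index.get(word)
    return None if row is None else W_mn[row]

print(lookup("市场"))
```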
The third embodiment of the present invention relates to a terminal device, comprising: a word-vector computing module 10, a lookup module 11, a training-sample feature-vector computing module 12, and a to-be-classified text type determining module 13, where the to-be-classified text type determining module 13 in turn specifically includes: a word-vector acquisition submodule 131, a feature-vector computing submodule 132, a cosine-similarity computing submodule 133, and a determining submodule 134, as shown in Fig. 6.
The word-vector computing module 10 is used to input the word-segmented data of L texts into the continuous bag-of-words model CBOW and compute the word-vector matrix W_mn containing each word vector w_mn.
The lookup module 11 is used to word-segment the training samples of M known text types and then look up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample.
The training-sample feature-vector computing module 12 is used to compute, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type.
The to-be-classified text type determining module 13 is used to determine the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types. Within it:
The word-vector acquisition submodule 131 is used to look up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified.
The feature-vector computing submodule 132 is used to compute the feature vector of the text to be classified from each word vector w_mn it contains, by summing and averaging.
The cosine-similarity computing submodule 133 is used to compute the cosine similarity between the feature vector of the text to be classified and the feature vectors T_k of the training samples of the various text types.
The determining submodule 134 is used to determine the type of the text to be classified from these cosine similarity values: the text type of the training samples whose cosine similarity value is closest to 1 is the type of the text to be classified.
It can be seen that the present embodiment is the system embodiment corresponding to the first embodiment and can be implemented in cooperation with it. The relevant technical details mentioned in the first embodiment remain valid here and, to reduce repetition, are not repeated; correspondingly, the relevant technical details mentioned in the present embodiment also apply to the first embodiment.
It should be noted that each module involved in the present embodiment is a logical module. In practical applications, a logical unit can be one physical unit, part of one physical unit, or a combination of several physical units. In addition, to highlight the innovative part of the present invention, units less closely related to solving the technical problem posed by the present invention are not introduced in the present embodiment, but this does not mean that no other units exist in the present embodiment.
The fourth embodiment of the present invention relates to a terminal device. The fourth embodiment makes a further improvement on the basis of the third embodiment, the main improvement being: in the fourth embodiment, after the word-vector computing module 10, an index building module 14 is also included, as shown in Fig. 7.
The index building module 14 is used to build an index for each word vector w_mn in the word-vector matrix W_mn, so that the lookup module 11 and the word-vector acquisition submodule 131 can, according to this index, look up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample or in the text to be classified more conveniently and quickly.
Since the second embodiment corresponds to the present embodiment, the present embodiment can be implemented in cooperation with the second embodiment. The relevant technical details mentioned in the second embodiment remain valid in the present embodiment, and the technical effects achievable in the second embodiment can likewise be achieved here; to reduce repetition, they are not repeated. Correspondingly, the relevant technical details mentioned in the present embodiment also apply to the second embodiment.
Those skilled in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions to make a device (which may be a single-chip microcomputer, a chip, etc.) or a processor execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media capable of storing program code.
Those skilled in the art should understand that the above embodiments are specific embodiments for realizing the present invention and that, in practical applications, various changes in form and detail can be made to them without departing from the spirit and scope of the present invention.

Claims (10)

1. A text classification method using word vectors, characterized by comprising:
inputting the word-segmented data of L texts into the continuous bag-of-words model CBOW, and computing the word-vector matrix W_mn containing each word vector w_mn;
after word-segmenting the training samples of M known text types, looking up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
computing, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type;
determining the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
2. The text classification method using word vectors according to claim 1, characterized in that the feature vector T_k of the training samples of each text type is computed by summing and averaging.
3. The text classification method using word vectors according to claim 2, characterized in that the feature vector of the text to be classified is computed as follows:
after word-segmenting the text to be classified, looking up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified;
computing the feature vector of the text to be classified from its word vectors w_mn by summing and averaging.
4. The text classification method using word vectors according to claim 1, characterized in that determining the type of the text to be classified from its feature vector and the feature vectors T_k of the training samples specifically includes:
computing the cosine similarity between the feature vector of the text to be classified and the feature vector T_k of the training samples of each text type;
the type of the text to be classified being the text type of the training samples whose cosine similarity is closest to 1.
5. The text classification method using word vectors according to claim 1 or 3, characterized in that looking up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample specifically includes:
building in advance an index of the word-vector matrix W_mn;
looking up in the word-vector matrix W_mn, according to the index, the word vectors w_mn contained in each training sample.
6. The text classification method using word vectors according to claim 1, characterized in that the CBOW is CBOW based on the negative sampling algorithm.
7. A terminal device, characterized by comprising:
a word-vector computing module, used to input the word-segmented data of L texts into the continuous bag-of-words model CBOW and compute the word-vector matrix W_mn containing each word vector w_mn;
a lookup module, used to word-segment the training samples of M known text types and then look up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
a training-sample feature-vector computing module, used to compute, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type;
a to-be-classified text type determining module, used to determine the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
8. The terminal device according to claim 7, characterized in that the training-sample feature-vector computing module computes the feature vector T_k of the training samples of each text type by summing and averaging.
9. The terminal device according to claim 8, characterized in that the to-be-classified text type determining module includes:
a word-vector acquisition submodule, used to word-segment the text to be classified and then look up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified;
a feature-vector computing submodule, used to compute the feature vector of the text to be classified from its word vectors w_mn by summing and averaging.
10. The terminal device according to claim 7, characterized in that the to-be-classified text type determining module includes:
a cosine-similarity computing submodule, used to compute the cosine similarity between the feature vector of the text to be classified and the feature vectors T_k of the training samples of the various text types;
a determining submodule, used to take the text type of the training samples whose cosine similarity is closest to 1 as the type of the text to be classified.
CN201610639589.9A 2016-08-06 2016-08-06 Text classification method using word vectors, and terminal device Pending CN106294684A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610639589.9A CN106294684A (en) 2016-08-06 2016-08-06 Text classification method using word vectors, and terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610639589.9A CN106294684A (en) 2016-08-06 2016-08-06 Text classification method using word vectors, and terminal device

Publications (1)

Publication Number Publication Date
CN106294684A true CN106294684A (en) 2017-01-04

Family

ID=57665678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610639589.9A Pending CN106294684A (en) 2016-08-06 2016-08-06 Text classification method using word vectors, and terminal device

Country Status (1)

Country Link
CN (1) CN106294684A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160110343A1 (en) * 2014-10-21 2016-04-21 At&T Intellectual Property I, L.P. Unsupervised topic modeling for short texts
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN105335352A (en) * 2015-11-30 2016-02-17 武汉大学 Entity identification method based on Weibo emotion
CN105824904A (en) * 2016-03-15 2016-08-03 浙江大学 Chinese herbal medicine plant picture capturing method based on professional term vector of traditional Chinese medicine and pharmacy field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Dapeng, "Research on short text classification methods based on word vectors", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153642A (en) * 2017-05-16 2017-09-12 华北电力大学 A kind of analysis method based on neural network recognization text comments Sentiment orientation
CN108959329B (en) * 2017-05-27 2023-05-16 腾讯科技(北京)有限公司 Text classification method, device, medium and equipment
CN108959329A (en) * 2017-05-27 2018-12-07 腾讯科技(北京)有限公司 A kind of file classification method, device, medium and equipment
CN107544957A (en) * 2017-07-05 2018-01-05 华北电力大学 A kind of Sentiment orientation analysis method of business product target word
CN107577708A (en) * 2017-07-31 2018-01-12 北京北信源软件股份有限公司 Class base construction method and system based on SparkMLlib document classifications
CN109388706A (en) * 2017-08-10 2019-02-26 华东师范大学 A kind of problem fine grit classification method, system and device
CN107491444A (en) * 2017-08-18 2017-12-19 南京大学 Parallelization word alignment method based on bilingual word embedded technology
WO2019056692A1 (en) * 2017-09-25 2019-03-28 平安科技(深圳)有限公司 News sentence clustering method based on semantic similarity, device, and storage medium
CN109615153A (en) * 2017-09-26 2019-04-12 阿里巴巴集团控股有限公司 Businessman's methods of risk assessment, device, equipment and storage medium
CN109615153B (en) * 2017-09-26 2023-06-16 阿里巴巴集团控股有限公司 Merchant risk assessment method, device, equipment and storage medium
CN107908716A (en) * 2017-11-10 2018-04-13 国网山东省电力公司电力科学研究院 95598 work order text mining method and apparatus of word-based vector model
CN110858219A (en) * 2018-08-17 2020-03-03 菜鸟智能物流控股有限公司 Logistics object information processing method and device and computer system
CN109284377A (en) * 2018-09-13 2019-01-29 云南电网有限责任公司 A kind of file classification method and device based on vector space
CN109800422A (en) * 2018-12-20 2019-05-24 北京明略软件系统有限公司 Method, system, terminal and the storage medium that a kind of pair of tables of data is classified
CN109637607A (en) * 2018-12-24 2019-04-16 广州天鹏计算机科技有限公司 Medical data classifying method, device, computer equipment and storage medium
CN109670190B (en) * 2018-12-25 2023-05-16 北京百度网讯科技有限公司 Translation model construction method and device
CN109670190A (en) * 2018-12-25 2019-04-23 北京百度网讯科技有限公司 Translation model construction method and device
CN109947945A (en) * 2019-03-19 2019-06-28 合肥工业大学 Word-based vector sum integrates the textstream classification method of SVM
CN111353282A (en) * 2020-03-09 2020-06-30 腾讯科技(深圳)有限公司 Model training method, text rewriting method, device and storage medium
CN111353282B (en) * 2020-03-09 2023-08-22 腾讯科技(深圳)有限公司 Model training, text rewriting method, device and storage medium
CN113111174A (en) * 2020-04-28 2021-07-13 北京明亿科技有限公司 Group identification method, device, equipment and medium based on deep learning model
CN111709251A (en) * 2020-06-12 2020-09-25 哈尔滨工程大学 Formal concept similarity rapid measurement method with general semantics and domain semantics
CN112257419A (en) * 2020-11-06 2021-01-22 开普云信息科技股份有限公司 Intelligent retrieval method and device for calculating patent document similarity based on word frequency and semantics, electronic equipment and storage medium thereof

Similar Documents

Publication Publication Date Title
CN106294684A (en) The file classification method of term vector and terminal unit
CN106326346A (en) Text classification method and terminal device
CN110020438B (en) Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN106980683A (en) Blog text snippet generation method based on deep learning
CN109739978A (en) A kind of Text Clustering Method, text cluster device and terminal device
US11775594B2 (en) Method for disambiguating between authors with same name on basis of network representation and semantic representation
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN107679082A (en) Question and answer searching method, device and electronic equipment
Ju et al. An efficient method for document categorization based on word2vec and latent semantic analysis
CN105740236A (en) Writing feature and sequence feature combined Chinese sentiment new word recognition method and system
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN104484380A (en) Personalized search method and personalized search device
CN107491434A (en) Text snippet automatic generation method and device based on semantic dependency
CN110413993A (en) A kind of semantic classification method, system and medium based on sparse weight neural network
Motwani et al. A study on initial centroids selection for partitional clustering algorithms
Rooshenas et al. Discriminative structure learning of arithmetic circuits
Cao et al. Stacked residual recurrent neural network with word weight for text classification
CN111061876B (en) Event public opinion data analysis method and device
Stemle et al. Using language learner data for metaphor detection
CN109241298A (en) Semantic data stores dispatching method
CN106802787A (en) MapReduce optimization methods based on GPU sequences
Hwang et al. Recent deep learning methods for tabular data
CN115879450B (en) Gradual text generation method, system, computer equipment and storage medium
CN107329951A (en) Build name entity mark resources bank method, device, storage medium and computer equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170104

RJ01 Rejection of invention patent application after publication