CN106294684A - Text classification method using word vectors, and terminal device - Google Patents

Text classification method using word vectors, and terminal device

Info

Publication number
CN106294684A
CN106294684A (application CN201610639589.9A)
Authority
CN
China
Prior art keywords
text
word vector
to-be-classified
training sample
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610639589.9A
Other languages
Chinese (zh)
Inventor
周诚
赵世亭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Gaoxin Computer Systems Co Ltd
Original Assignee
Shanghai Gaoxin Computer Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Gaoxin Computer Systems Co Ltd
Priority to CN201610639589.9A
Publication of CN106294684A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the communications field and discloses a text classification method using word vectors, and a terminal device. In the embodiments of the present invention, the continuous bag-of-words model CBOW is used to compute, from the word-segmented texts, a word-vector matrix containing each word vector; the word vectors contained in each training sample of known text type are looked up in this matrix, and the feature vector of the training samples of each text type is computed; finally, the type of a text to be classified is determined from its feature vector and the feature vectors of the training samples of the various text types. In this way, when words are vectorized, the connections between the current word and the several words before and after it are taken into account, so the overall text features carry semantic information; training on the training samples is efficient and takes little time; and when determining the type of a text to be classified, the amount of computation is small, the calculation is simple and fast, and the precision is high.

Description

Text classification method using word vectors, and terminal device
Technical field
The present invention relates to the field of information processing, and in particular to a text classification method using word vectors and a terminal device.
Background art
Text classification means analyzing a set of training texts that have been classified in advance by experts, deriving a classification model from them, and classifying other texts with the derived model; it is mainly used in information retrieval, machine translation, automatic summarization, information filtering, and so on. There are many methods for training on text, such as the two training methods of the continuous bag-of-words model CBOW: the hierarchical classifier Hierarchical Softmax and negative sampling.
In the course of realizing the present invention, the inventors found that the CBOW training method based on Hierarchical Softmax is more favorable to rare words and can classify such texts faster, while the CBOW training method based on the negative sampling algorithm is favorable to classifying texts of common words and to low-dimensional vectors; with either algorithm, the window size chosen for CBOW training is usually about 5. However, both methods require a large amount of computation when determining the text type, which is unfavorable to fast implementation.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a text classification method using word vectors, and a terminal device, such that when words are vectorized the connections between the current word and the several words before and after it are taken into account, giving the overall text features semantic character; training on the training samples is efficient and takes little time; and when determining the type of a text to be classified, the amount of computation is small, the calculation is simple and fast, and the precision is high.
To solve the above technical problem, the embodiments of the present invention provide a text classification method using word vectors, comprising:
inputting the word-segmented data of L texts into the continuous bag-of-words model CBOW, and computing the word-vector matrix W_mn containing each word vector w_mn;
after word-segmenting the training samples of M known text types, looking up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
computing, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type;
determining the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
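As an illustrative sketch of the training step (assuming the gensim library and hypothetical segmented texts; the patent itself names no implementation), CBOW with negative sampling and window size 5 yields the word-vector matrix W_mn:

```python
# Sketch of computing the word-vector matrix W_mn from L segmented texts.
# gensim is an assumption; the patent does not name an implementation.
from gensim.models import Word2Vec

# Each text is already word-segmented into a list of tokens (hypothetical data).
segmented_texts = [
    ["股票", "市场", "上涨"],   # hypothetical finance text
    ["球队", "比赛", "获胜"],   # hypothetical sports/entertainment text
]

# sg=0 selects CBOW; negative=5 enables the negative sampling algorithm;
# window=5 matches the window size mentioned in the background section.
model = Word2Vec(sentences=segmented_texts, vector_size=100,
                 window=5, sg=0, negative=5, min_count=1)

# model.wv.vectors is the m x n word-vector matrix W_mn:
# one row per word (m words), n dimensions per row.
W_mn = model.wv.vectors
```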
The embodiments of the present invention additionally provide a terminal device, comprising:
a word-vector computing module, used to input the word-segmented data of L texts into the continuous bag-of-words model CBOW and compute the word-vector matrix W_mn containing each word vector w_mn;
a lookup module, used to word-segment the training samples of M known text types and then look up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
a training-sample feature-vector computing module, used to compute, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type;
a to-be-classified text type determining module, used to determine the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
Compared with the prior art, the embodiments of the present invention use the continuous bag-of-words model CBOW to compute, from the word-segmented texts, the word-vector matrix containing each word vector; the word vectors contained in each training sample of known text type are looked up in this matrix, and the feature vector of the training samples of each text type is computed; finally the type of a text to be classified is determined from its feature vector and the feature vectors of the training samples of the various text types. In this way, when words are vectorized, the connections between the current word and the several words before and after it are taken into account, so the overall text features carry semantic information; training on the training samples is efficient and takes little time; and when determining the type of a text to be classified, the amount of computation is small, the calculation is simple and fast, and the precision is high.
In addition, the feature vector T_k of the training samples of each text type is computed by summing and averaging.
With the summing-and-averaging calculation, computing the feature vector of the training samples of each text type involves little computation and the calculation process is simple and fast.
In addition, the feature vector of the text to be classified is computed as follows: after word-segmenting the text to be classified, each word vector w_mn it contains is looked up in the word-vector matrix W_mn; the feature vector D of the text to be classified is then computed from these word vectors w_mn by summing and averaging.
By looking up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified, the existing word-vector matrix W_mn is fully reused, so each word vector is found quickly and accurately, which indirectly improves the efficiency of the whole text classification process.
In addition, determining the type of the text to be classified from its feature vector and the feature vectors T_k of the training samples specifically includes: computing the cosine similarity between the feature vector of the text to be classified and the feature vector T_k of the training samples of each text type; the type of the text to be classified is the text type of the training samples whose cosine similarity is closest to 1.
Determining the type of the text to be classified by computing cosine similarity values involves little computation, the calculation is simple, and the precision in determining the type is high.
Brief description of the drawings
Fig. 1 is a flow chart of a text classification method using word vectors according to the first embodiment of the present invention;
Fig. 2 is a structural diagram of the CBOW model according to the first embodiment of the present invention;
Fig. 3 is a network structure diagram of the CBOW model according to the first embodiment of the present invention;
Fig. 4 is an illustration of looking up a word with the CBOW model according to the first embodiment of the present invention;
Fig. 5 is a diagram of the mapping set up by the negative sampling algorithm according to the first embodiment of the present invention;
Fig. 6 is a structural diagram of a terminal device according to the third embodiment of the present invention;
Fig. 7 is a structural diagram of a terminal device according to the fourth embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the embodiments of the present invention are explained in detail below with reference to the accompanying drawings. Those skilled in the art will understand, however, that in each embodiment many technical details are given to help the reader understand the application better; even without these technical details, and with various changes and modifications based on the following embodiments, the technical solution claimed in this application can still be realized.
The first embodiment of the present invention relates to a text classification method using word vectors. The specific flow is shown in Fig. 1.
In step 101, the word-segmented data of L texts are input into CBOW and the word-vector matrix is computed.
Specifically, the L texts are word-segmented and the resulting data serve as the input of CBOW, from which the word vectors of these L texts and the word-vector matrix containing each word vector are computed. A word vector is denoted w_mn and the word-vector matrix W_mn, where m is the number of words and n is the dimension of the word vectors. The concrete form of the word-vector matrix W_mn is:
$$W_{mn} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{bmatrix}$$
where the rows taken together represent the vector set of the m words, and each row of W_mn is the vector representation of one word.
In addition, the CBOW described in the embodiments of the present invention is CBOW based on the negative sampling algorithm. Conventional CBOW comes in two types, one based on Hierarchical Softmax and one based on the negative sampling algorithm; the two are introduced in turn below.
The CBOW model based on Hierarchical Softmax comprises an input layer, a projection layer, and an output layer. It predicts the current word w_t on the premise that its context w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} is known, as shown in Fig. 2. Assume a sample (Context(w), w) in which Context(w) is made up of the c words before and the c words after w; then the input layer contains the word vectors of these 2c words, v(Context(w)_1), v(Context(w)_2), ..., v(Context(w)_{2c}) ∈ R^m, where m here denotes the length of the word vectors. The projection layer sums the 2c input vectors, i.e. $x_w = \sum_{i=1}^{2c} v(Context(w)_i)$; its structure is shown in Fig. 3. The output layer corresponds to a binary tree: a Huffman tree whose leaf nodes are the words occurring in the corpus and whose weights are the number of times each word occurs in the corpus. This Huffman tree has N = |D| leaf nodes in total, corresponding to the words in the dictionary D, and N - 1 non-leaf nodes (the nodes marked black in the figure).
Hierarchical Softmax is a key technique for improving performance in word vectorization. For a leaf node in the Huffman tree, assume it corresponds to the word w in dictionary D, and denote:
1) p^w: the path from the root node to the leaf node corresponding to w;
2) l^w: the number of nodes contained in the path p^w;
3) p^w_1, p^w_2, ..., p^w_{l^w}: the l^w nodes of the path p^w, where p^w_1 is the root node and p^w_{l^w} is the node corresponding to the word w;
4) d^w_2, d^w_3, ..., d^w_{l^w} ∈ {0, 1}: the Huffman code of the word w, made up of l^w - 1 bits, d^w_j being the code of the j-th node of the path p^w (the root node has no code);
5) θ^w_1, θ^w_2, ..., θ^w_{l^w - 1} ∈ R^m: the vectors of the non-leaf nodes of the path p^w, θ^w_j being the vector of the j-th non-leaf node.
Take the word w = "football" as an example to illustrate how, under the network structure shown in Fig. 3, the vector x_w ∈ R^m and the Huffman tree are used to define the function p(w | Context(w)); the details are shown in Fig. 4. The 5 nodes strung together by the four dashed edges constitute the path p^w, whose length is l^w = 5; p^w_1, ..., p^w_5 are the 5 nodes on this path, with p^w_1 the root node; d^w_2, d^w_3, d^w_4, d^w_5 are 1, 0, 0, 1 respectively, i.e. the Huffman code of "football" is 1001; and θ^w_1, ..., θ^w_4 denote the vectors of the 4 non-leaf nodes on the path. Going from the root node to the leaf node "football" passes through 4 branches (each dashed edge corresponding to one branch), and each branch can be regarded as one binary classification. From the viewpoint of binary classification, each non-leaf node must designate, for its two child nodes, which is the positive class (label 1) and which is the negative class (label 0). Apart from the root node, every node in the tree corresponds to a Huffman code of 0 or 1, so a natural convention is to define the nodes coded 0 as the positive class and the nodes coded 1 as the negative class (the opposite convention is equally possible); that is, Label(p^w_j) = 1 - d^w_j, j = 2, 3, ..., l^w. According to logistic regression, the probability that a node is classified into the positive class is $\sigma(x_w^T \theta) = 1/(1 + e^{-x_w^T \theta})$, and the probability that it is classified into the negative class is $1 - \sigma(x_w^T \theta)$, where θ is an undetermined parameter; the vectors θ^w_j of the non-leaf nodes play exactly the role of this parameter θ.
For the 4 binary classifications experienced on the way from the root node to the leaf node "football", the probability of each classification result can be written out as:
1) the 1st: $p(d^w_2 \mid x_w, \theta^w_1) = 1 - \sigma(x_w^T \theta^w_1)$;
2) the 2nd: $p(d^w_3 \mid x_w, \theta^w_2) = \sigma(x_w^T \theta^w_2)$;
3) the 3rd: $p(d^w_4 \mid x_w, \theta^w_3) = \sigma(x_w^T \theta^w_3)$;
4) the 4th: $p(d^w_5 \mid x_w, \theta^w_4) = 1 - \sigma(x_w^T \theta^w_4)$.
Then $p(\text{football} \mid Context(\text{football})) = \prod_{j=2}^{5} p(d^w_j \mid x_w, \theta^w_{j-1})$. So far, the example w = "football" shows that for any word w in dictionary D, there exists in the Huffman tree a path p^w from the root node to the node corresponding to w (and this path is unique). The path p^w has l^w - 1 branches; regarding each branch as a binary classification, every classification produces a probability, and multiplying these probabilities together gives p(w | Context(w)):
$$p(w \mid Context(w)) = \prod_{j=2}^{l^w} p(d_j^w \mid x_w, \theta_{j-1}^w) \qquad (1)$$
where
$$p(d_j^w \mid x_w, \theta_{j-1}^w) = \begin{cases} \sigma(x_w^T \theta_{j-1}^w), & d_j^w = 0 \\ 1 - \sigma(x_w^T \theta_{j-1}^w), & d_j^w = 1 \end{cases}$$
or, written as a single overall expression:
$$p(d_j^w \mid x_w, \theta_{j-1}^w) = \left[\sigma(x_w^T \theta_{j-1}^w)\right]^{1-d_j^w} \left[1 - \sigma(x_w^T \theta_{j-1}^w)\right]^{d_j^w}$$
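A minimal numpy sketch of formula (1), assuming the Huffman code bits and the non-leaf-node vectors along the path are already known (all values here are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def path_probability(x_w, codes, thetas):
    """Formula (1): p(w | Context(w)) as a product of binary
    classifications along the Huffman path of w.
    x_w    -- summed context vector (projection-layer output)
    codes  -- Huffman code bits d_2..d_{l^w} of w, e.g. [1, 0, 0, 1]
    thetas -- vectors theta_1..theta_{l^w - 1} of the non-leaf nodes"""
    p = 1.0
    for d, theta in zip(codes, thetas):
        s = sigmoid(x_w @ theta)
        p *= s if d == 0 else (1.0 - s)  # code 0 = positive class
    return p

# "football" example: code 1001 means four binary decisions.
rng = np.random.default_rng(0)
x_w = rng.normal(size=100)
thetas = rng.normal(size=(4, 100))
print(path_probability(x_w, [1, 0, 0, 1], thetas))
```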
Assume Y_w = (y_{w,1}, y_{w,2}, ..., y_{w,N})^T is a vector of length N whose components do not by themselves represent probabilities. If we want the component y_{w,i} of Y_w to represent the probability that, when the context is Context(w), the next word is exactly the i-th word of dictionary D, a softmax normalization is needed, giving
$$p(w \mid Context(w)) = \frac{e^{y_{w,i_w}}}{\sum_{i=1}^{N} e^{y_{w,i}}}$$
where i_w denotes the index of the word w in dictionary D.
Substituting formula (1) into the log-likelihood function $\mathcal{L} = \sum_{w \in C} \log p(w \mid Context(w))$ gives:
$$\mathcal{L} = \sum_{w \in C} \sum_{j=2}^{l^w} \left\{ (1 - d_j^w) \log \sigma(x_w^T \theta_{j-1}^w) + d_j^w \log\left[1 - \sigma(x_w^T \theta_{j-1}^w)\right] \right\}$$
Denote the summand
$$l(w, j) = (1 - d_j^w) \log \sigma(x_w^T \theta_{j-1}^w) + d_j^w \log\left[1 - \sigma(x_w^T \theta_{j-1}^w)\right]$$
as the objective function of the CBOW model; the word vectors are optimized with the stochastic gradient ascent method. The approach of stochastic gradient ascent is as follows: each time a sample (Context(w), w) is taken, all the relevant parameters in the objective function are updated once. Observing the objective function l(w, j), its parameters include the vectors x_w and θ^w_{j-1}, so the gradients of l(w, j) with respect to these vectors are given below.
The gradient of l(w, j) with respect to θ^w_{j-1} is:
$$\frac{\partial l(w, j)}{\partial \theta_{j-1}^w} = \left[1 - d_j^w - \sigma(x_w^T \theta_{j-1}^w)\right] x_w$$
Then the update formula of θ^w_{j-1} can be written as
$$\theta_{j-1}^w := \theta_{j-1}^w + \eta \left[1 - d_j^w - \sigma(x_w^T \theta_{j-1}^w)\right] x_w$$
where η denotes the learning rate. In the same way, the gradient of l(w, j) with respect to x_w is:
$$\frac{\partial l(w, j)}{\partial x_w} = \left[1 - d_j^w - \sigma(x_w^T \theta_{j-1}^w)\right] \theta_{j-1}^w$$
Here x_w denotes the sum of the word vectors of the words in Context(w), while the final purpose is to obtain the word vector of each word in dictionary D; the gradient is therefore distributed back onto each context word:
$$v(u) := v(u) + \eta \sum_{j=2}^{l^w} \frac{\partial l(w, j)}{\partial x_w}, \qquad u \in Context(w)$$
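The following numpy sketch illustrates one such stochastic-gradient-ascent update under the notation above (the learning rate 0.025 is a hypothetical choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_update(context_vecs, codes, thetas, eta=0.025):
    """One stochastic-gradient-ascent step for a sample (Context(w), w)
    under hierarchical softmax; a sketch of the update formulas above.
    Updates thetas and the context word vectors in place."""
    x_w = context_vecs.sum(axis=0)           # projection layer: sum of 2c vectors
    e = np.zeros_like(x_w)                   # accumulates the gradient w.r.t. x_w
    for j, (d, theta) in enumerate(zip(codes, thetas)):
        g = 1.0 - d - sigmoid(x_w @ theta)   # common factor of both gradients
        e += g * theta                       # grad of l(w, j) w.r.t. x_w
        thetas[j] += eta * g * x_w           # theta update
    context_vecs += eta * e                  # propagate to every context word
```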
In the CBOW model based on the negative sampling algorithm, the context Context(w) of the word w is known and w is to be predicted; therefore, for a given Context(w), the word w is a positive sample and other words are negative samples. Assume a set of negative samples NEG(w) ≠ ∅ about w has been selected; for $\tilde{w} \in D$, define the label $L^w(\tilde{w}) = 1$ if $\tilde{w} = w$ and $L^w(\tilde{w}) = 0$ otherwise, i.e. the label of a positive sample is 1 and the label of a negative sample is 0.
For a given positive sample (Context(w), w), we wish to maximize
$$g(w) = \prod_{u \in \{w\} \cup NEG(w)} p(u \mid Context(w)) \qquad (2)$$
where
$$p(u \mid Context(w)) = \begin{cases} \sigma(x_w^T \theta^u), & L^w(u) = 1 \\ 1 - \sigma(x_w^T \theta^u), & L^w(u) = 0 \end{cases}$$
which can also be written as a single overall expression:
$$p(u \mid Context(w)) = \left[\sigma(x_w^T \theta^u)\right]^{L^w(u)} \left[1 - \sigma(x_w^T \theta^u)\right]^{1 - L^w(u)} \qquad (3)$$
Here x_w still denotes the sum of the word vectors of the words in Context(w), and θ^u ∈ R^m denotes the vector corresponding to the word u. Substituting formula (3) into formula (2) gives:
$$g(w) = \sigma(x_w^T \theta^w) \prod_{u \in NEG(w)} \left[1 - \sigma(x_w^T \theta^u)\right]$$
Here $\sigma(x_w^T \theta^w)$ represents the probability that, when the context is Context(w), the predicted center word is w, while $\sigma(x_w^T \theta^u)$ represents the probability that the predicted center word is u. Maximizing g(w) therefore means maximizing $\sigma(x_w^T \theta^w)$ while minimizing all the $\sigma(x_w^T \theta^u)$, u ∈ NEG(w), i.e. increasing the probability of the positive sample while reducing the probabilities of the negative samples. For a given corpus C, the function $G = \prod_{w \in C} g(w)$ serves as the target of global optimization; to simplify the computation, take the logarithm of G, so the final objective function is:
$$\mathcal{L} = \log G = \sum_{w \in C} \sum_{u \in \{w\} \cup NEG(w)} \left\{ L^w(u) \log \sigma(x_w^T \theta^u) + \left[1 - L^w(u)\right] \log\left[1 - \sigma(x_w^T \theta^u)\right] \right\}$$
Denote the summand of the above formula
$$l(w, u) = L^w(u) \log \sigma(x_w^T \theta^u) + \left[1 - L^w(u)\right] \log\left[1 - \sigma(x_w^T \theta^u)\right]$$
Its parameters are likewise optimized with the stochastic gradient ascent method. The gradient of l(w, u) with respect to θ^u is:
$$\frac{\partial l(w, u)}{\partial \theta^u} = \left[L^w(u) - \sigma(x_w^T \theta^u)\right] x_w$$
Then the update formula of θ^u can be written as
$$\theta^u := \theta^u + \eta \left[L^w(u) - \sigma(x_w^T \theta^u)\right] x_w$$
In the same way, the gradient of l(w, u) with respect to x_w is:
$$\frac{\partial l(w, u)}{\partial x_w} = \left[L^w(u) - \sigma(x_w^T \theta^u)\right] \theta^u$$
Then, using this gradient, the update formula of the word vector of each context word can be obtained:
$$v(\tilde{w}) := v(\tilde{w}) + \eta \sum_{u \in \{w\} \cup NEG(w)} \frac{\partial l(w, u)}{\partial x_w}, \qquad \tilde{w} \in Context(w)$$
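A corresponding numpy sketch of one negative-sampling update under the same notation (theta here is a hypothetical matrix holding one vector θ^u per dictionary word):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_update(context_vecs, theta, w_idx, neg_idx, eta=0.025):
    """One gradient-ascent step for a positive sample (Context(w), w)
    with negative sample indices neg_idx; a sketch of the updates above.
    theta has one row per dictionary word."""
    x_w = context_vecs.sum(axis=0)
    e = np.zeros_like(x_w)
    for u, label in [(w_idx, 1)] + [(u, 0) for u in neg_idx]:
        g = label - sigmoid(x_w @ theta[u])  # L^w(u) - sigma(x_w^T theta^u)
        e += g * theta[u]                    # grad of l(w, u) w.r.t. x_w
        theta[u] += eta * g * x_w            # theta^u update
    context_vecs += eta * e                  # update each context word vector
```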
The negative sampling algorithm used in the embodiments of the present invention is now briefly introduced. The words in dictionary D occur in corpus C with varying frequency; for high-frequency words, the probability of being chosen as a negative sample should be larger, and conversely, for low-frequency words, the chosen probability should be smaller. This is essentially a weighted sampling problem, and the specific algorithm process can be described as follows:
Assume each word w in dictionary D corresponds to a line segment l(w) of length
$$len(w) = \frac{counter(w)}{\sum_{u \in D} counter(u)}$$
where counter(·) denotes the number of times a word occurs in corpus C (the summation in the denominator serves as normalization). Now connect these line segments end to end into a unit segment of length 1. If a point is chosen at random on this unit segment, the longer a segment (corresponding to a high-frequency word), the greater the probability of it being hit.
Denote $l_0 = 0$ and $l_j = \sum_{k=1}^{j} len(w_k)$, j = 1, 2, ..., N, where w_j denotes the j-th word in dictionary D; then with $\{l_j\}_{j=0}^{N}$ as partition nodes, a non-equidistant partition of the interval [0, 1] is obtained, with the N cells $I_i = (l_{i-1}, l_i]$, i = 1, 2, ..., N. Further introduce an equidistant partition of the interval [0, 1] with partition nodes $m_j = j/M$, j = 0, 1, ..., M, where M >> N, as shown in Fig. 5. Projecting the interior equidistant nodes $m_1, m_2, \ldots, m_{M-1}$ onto the non-equidistant partition, as shown by the dashed lines in Fig. 5, the mapping between them and the cells $I_k$ is:
Table(i) = w_k, where m_i ∈ I_k, i = 1, 2, ..., M - 1
With this mapping, the sampling process is: each time, generate a random integer r between [1, M - 1]; Table(r) is then a sample. If during sampling the word w_i happens to draw itself, it is skipped.
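A numpy sketch of this weighted-sampling table (M and the word counts here are hypothetical):

```python
import numpy as np

def build_table(counts, M=100_000_000):
    """Weighted-sampling table: word j owns a stretch of [0, 1] of length
    len(w_j) = counter(w_j) / sum counter(u), discretized into M
    equidistant cells; a sketch of the scheme described above."""
    probs = counts / counts.sum()
    bounds = np.cumsum(probs)             # l_1 .. l_N, non-equidistant nodes
    grid = np.arange(1, M) / M            # equidistant nodes m_1 .. m_{M-1}
    return np.searchsorted(bounds, grid)  # Table(i) = index k with m_i in I_k

counts = np.array([50, 5, 1], dtype=float)  # hypothetical word frequencies
table = build_table(counts, M=1000)
r = np.random.randint(0, len(table))        # random position in the table
print(table[r])                             # sampled (negative) word index
```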
In step 102, the word vectors contained in the training samples are looked up in W_mn.
Specifically, the word vectors contained in each of the M training samples of known text type are looked up in the word-vector matrix W_mn. First the M training samples of known text type are word-segmented (M ≤ L here is to prevent the segmented training samples from not being found in the word-vector matrix W_mn obtained in step 101); then the word vectors w_mn contained in each training sample are looked up in the word-vector matrix W_mn, so that the word vectors w_mn of each training sample of every text type become known.
In step 103, the feature vector of the training samples of each text type is computed.
Specifically, from the word vectors w_mn of each training sample obtained in step 102, the feature vector T_k of the training samples of each text type is computed by summing and averaging, where k = 1, 2, ..., K and K is the number of text types. Assume there are an entertainment class, a technology class, and a finance class, with feature vectors T_1, T_2, T_3 respectively; then:
$$T_1 = \left[ avg(w_{11} + w_{21} + \cdots + w_{e1}) \;\; avg(w_{12} + w_{22} + \cdots + w_{e2}) \;\; \cdots \;\; avg(w_{1n} + w_{2n} + \cdots + w_{en}) \right] = \left[ W_{ent1} \; W_{ent2} \; \cdots \; W_{entn} \right]$$
$$T_2 = \left[ avg(w_{11} + w_{31} + \cdots + w_{i1}) \;\; avg(w_{12} + w_{32} + \cdots + w_{i2}) \;\; \cdots \;\; avg(w_{1n} + w_{3n} + \cdots + w_{in}) \right] = \left[ W_{tech1} \; W_{tech2} \; \cdots \; W_{techn} \right]$$
$$T_3 = \left[ avg(w_{11} + \cdots + w_{f1} + \cdots + w_{i1}) \;\; avg(w_{12} + \cdots + w_{f2} + \cdots + w_{i2}) \;\; \cdots \;\; avg(w_{1n} + \cdots + w_{fn} + \cdots + w_{in}) \right] = \left[ W_{fina1} \; W_{fina2} \; \cdots \; W_{finan} \right]$$
where e denotes the e-th word, i the i-th word, f the f-th word, and
$$W_{ent1} = avg(w_{11} + w_{21} + \cdots + w_{e1}), \quad W_{ent2} = avg(w_{12} + w_{22} + \cdots + w_{e2}), \quad \ldots, \quad W_{entn} = avg(w_{1n} + w_{2n} + \cdots + w_{en}),$$
$$W_{tech1} = avg(w_{11} + w_{31} + \cdots + w_{i1}), \quad W_{tech2} = avg(w_{12} + w_{32} + \cdots + w_{i2}), \quad \ldots, \quad W_{techn} = avg(w_{1n} + w_{3n} + \cdots + w_{in}),$$
$$W_{fina1} = avg(w_{11} + \cdots + w_{f1} + \cdots + w_{i1}), \quad W_{fina2} = avg(w_{12} + \cdots + w_{f2} + \cdots + w_{i2}), \quad \ldots, \quad W_{finan} = avg(w_{1n} + \cdots + w_{fn} + \cdots + w_{in}).$$
If there are other categories, the feature vector of each type can be obtained in the same way.
It should be noted that T_1 is built from [w_1, w_2, ..., w_e], T_2 from [w_1, w_3, ..., w_i], and T_3 from [w_1, ..., w_f, ..., w_i], where w_1, w_2, ..., w_m are sets of words: each class, i.e. each different text type, is formed from a different (or partly identical) combination of words. In the feature vectors T_1, T_2, T_3, the elements inside each avg(·) are unrelated to the order of the m words in W_mn and need not be consecutive; they depend only on the words making up the current text type. Therefore, when computing each text type, the relevant words are looked up in the matrix W_mn and the computation is done accordingly.
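A minimal sketch of this averaging step, assuming a word-to-row index for W_mn such as the one described in the second embodiment (class_words and word_index are hypothetical helpers):

```python
import numpy as np

def class_feature_vector(class_words, W_mn, word_index):
    """T_k: element-wise average of the word vectors of all words occurring
    in the training samples of one text type. word_index maps a word to its
    row in W_mn and is an assumed helper."""
    rows = [word_index[w] for w in class_words if w in word_index]
    return W_mn[rows].mean(axis=0)

# Hypothetical usage with an entertainment-class vocabulary:
# T_ent = class_feature_vector({"电影", "明星", "演唱会"}, W_mn, word_index)
```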
In step 104, the word vectors contained in the text to be classified are looked up in W_mn.
Specifically, the text to be classified is first word-segmented, and then each word vector w_mn contained in this text is looked up in the word-vector matrix W_mn.
In step 105, the feature vector of the text to be classified is computed.
Specifically, from each word vector w_mn contained in the text to be classified obtained in step 104, the feature vector of the text to be classified is computed by summing and averaging.
Assume that after word segmentation the current text k to be classified has the word set (w_1, w_2, ..., w_l), i.e. the sample to be classified is made up of these l words. Looking up the vectors of the corresponding words (w_1, w_2, ..., w_l) in the word-vector matrix W_mn gives $(w_{11}, \ldots, w_{1n}), (w_{21}, \ldots, w_{2n}), \ldots, (w_{l1}, \ldots, w_{ln})$; the feature vector of the current text to be classified is then:
$$D_k = \left[ avg(w_{11} + w_{21} + \cdots + w_{l1}) \;\; avg(w_{12} + w_{22} + \cdots + w_{l2}) \;\; \cdots \;\; avg(w_{1n} + w_{2n} + \cdots + w_{ln}) \right] = \left[ d_{11} \; d_{12} \; \cdots \; d_{1n} \right]$$
where the second subscript of d (up to n) denotes the dimension of the word vector, and the first subscript 1 denotes the current, first text; when there are several texts, this subscript can be any natural number.
In step 106, the cosine similarity values are computed.
Specifically, the cosine similarity between the feature vector $D_x^k$ of the text to be classified and the feature vector T_k of the training samples of each text type is computed as
$$\cos(T_k, D_x^k) = \frac{T_k \cdot D_x^k}{\|T_k\| \, \|D_x^k\|}$$
The cosine similarity between the current text to be classified and the feature vector of the training samples of the entertainment type is:
$$\cos(T_1, D_x^k) = \frac{W_{ent1} d_{11} + W_{ent2} d_{12} + \cdots + W_{entn} d_{1n}}{\sqrt{W_{ent1}^2 + W_{ent2}^2 + \cdots + W_{entn}^2} \times \sqrt{d_{11}^2 + d_{12}^2 + \cdots + d_{1n}^2}}$$
The cosine similarity between the current text to be classified and the feature vector of the training samples of the technology type is:
$$\cos(T_2, D_x^k) = \frac{W_{tech1} d_{11} + W_{tech2} d_{12} + \cdots + W_{techn} d_{1n}}{\sqrt{W_{tech1}^2 + W_{tech2}^2 + \cdots + W_{techn}^2} \times \sqrt{d_{11}^2 + d_{12}^2 + \cdots + d_{1n}^2}}$$
The cosine similarity between the current text to be classified and the feature vector of the training samples of the finance type is:
$$\cos(T_3, D_x^k) = \frac{W_{fina1} d_{11} + W_{fina2} d_{12} + \cdots + W_{finan} d_{1n}}{\sqrt{W_{fina1}^2 + W_{fina2}^2 + \cdots + W_{finan}^2} \times \sqrt{d_{11}^2 + d_{12}^2 + \cdots + d_{1n}^2}}$$
In the same way, the cosine similarity between the sample to be classified and the feature vectors of the training samples of the other text types can be obtained.
In step 107, the type of the text to be classified is determined.
Specifically, the type of the text to be classified is determined from the cosine similarity values obtained in step 106: the type of the sample to be classified is the text type of the training samples whose cosine similarity value is closest to 1.
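Steps 104 to 107 can be sketched together as follows (the class feature vectors and the word index are assumed to come from the earlier steps):

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(text_words, W_mn, word_index, class_vectors):
    """Average the looked-up word vectors of the text to be classified,
    then pick the type whose training-sample feature vector T_k has
    cosine similarity closest to 1; a sketch of steps 104-107."""
    rows = [word_index[w] for w in text_words if w in word_index]
    D = W_mn[rows].mean(axis=0)                        # feature vector D
    sims = {k: cosine(T_k, D) for k, T_k in class_vectors.items()}
    return max(sims, key=sims.get)                     # closest to 1

# Hypothetical usage:
# class_vectors = {"entertainment": T1, "technology": T2, "finance": T3}
# print(classify(segmented_text, W_mn, word_index, class_vectors))
```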
It is worth explaining that when there are several texts to be classified, they constitute a set of texts to be classified $\{D_x^k\}$, where x denotes the text type of the text to be classified and k denotes the k-th text of that type, x ∈ [1, 2, 3, ..., K], k ∈ [1, 2, 3, ..., M]. When any other text in the set of texts to be classified needs to be classified, only steps 104 to 107 need to be performed.
It can be seen that in the present embodiment, the continuous bag-of-words model CBOW (here based on the negative sampling algorithm) is used to compute, from the word-segmented texts, the word-vector matrix containing each word vector; the word vectors contained in each training sample of known text type are looked up in this matrix, and the feature vector of the training samples of each text type is computed; finally the type of a text to be classified is determined from its feature vector and the feature vectors of the training samples of the various text types. In this way, when words are vectorized, the connections between the current word and the several words before and after it are taken into account, so the overall text features carry semantic information, and training on the training samples is efficient and takes little time; when the type of a text to be classified is determined from the cosine similarity between its feature vector and the feature vectors T_k of the training samples of the various text types, the calculation is simple and the precision is high.
The second embodiment of the present invention relates to a text classification method using word vectors. The second embodiment makes a further improvement on the basis of the first embodiment, the main improvement being: in the second embodiment, an optimization is given for quickly looking up the required word vectors in the word-vector matrix W_mn in steps 102 and 104. The method is: build in advance an index for each word vector w_mn in the word-vector matrix W_mn, and then look up the word vectors w_mn contained in each training sample or in the text to be classified in W_mn according to this index. The specific flow is shown in Fig. 1.
The present embodiment not only achieves the technical effect of the first embodiment; by building in advance an index for each word vector w_mn in the word-vector matrix W_mn, each required word vector w_mn can be looked up in the matrix more conveniently and quickly, which not only raises the lookup efficiency but also indirectly improves the efficiency of the whole text classification.
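A minimal sketch of such an index, assuming a vocabulary list ordered like the rows of W_mn (with gensim, this list would be model.wv.index_to_key):

```python
import numpy as np

# Hash map from word to row number in W_mn gives O(1) word-vector lookup.
vocabulary = ["股票", "市场", "球队"]   # hypothetical word list
W_mn = np.eye(3)                        # hypothetical 3 x 3 word-vector matrix
word_index = {word: row for row, word in enumerate(vocabulary)}

def lookup(word):
    """Return the word vector w_mn of `word`, or None if the word
    did not occur in the L training texts."""
    row = word_index.get(word)
    return None if row is None else W_mn[row]

print(lookup("市场"))
```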
The third embodiment of the present invention relates to a terminal device, comprising: a word-vector computing module 10, a lookup module 11, a training-sample feature-vector computing module 12, and a to-be-classified text type determining module 13, where the to-be-classified text type determining module 13 in turn specifically includes: a word-vector acquisition submodule 131, a feature-vector computing submodule 132, a cosine-similarity computing submodule 133, and a determining submodule 134, as shown in Fig. 6.
The word-vector computing module 10 is used to input the word-segmented data of L texts into the continuous bag-of-words model CBOW and compute the word-vector matrix W_mn containing each word vector w_mn.
The lookup module 11 is used to word-segment the training samples of M known text types and then look up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample.
The training-sample feature-vector computing module 12 is used to compute, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type.
The to-be-classified text type determining module 13 is used to determine the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types. Within it:
The word-vector acquisition submodule 131 is used to look up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified.
The feature-vector computing submodule 132 is used to compute the feature vector of the text to be classified from each word vector w_mn it contains, by summing and averaging.
The cosine-similarity computing submodule 133 is used to compute the cosine similarity between the feature vector of the text to be classified and the feature vectors T_k of the training samples of the various text types.
The determining submodule 134 is used to determine the type of the text to be classified from these cosine similarity values: the text type of the training samples whose cosine similarity value is closest to 1 is the type of the text to be classified.
It can be seen that the present embodiment is the system embodiment corresponding to the first embodiment and can be implemented in cooperation with it. The relevant technical details mentioned in the first embodiment remain valid here and, to reduce repetition, are not repeated; correspondingly, the relevant technical details mentioned in the present embodiment also apply to the first embodiment.
It should be noted that each module involved in the present embodiment is a logical module. In practical applications, a logical unit can be one physical unit, part of one physical unit, or a combination of several physical units. In addition, to highlight the innovative part of the present invention, units less closely related to solving the technical problem posed by the present invention are not introduced in the present embodiment, but this does not mean that no other units exist in the present embodiment.
The fourth embodiment of the present invention relates to a terminal device. The fourth embodiment makes a further improvement on the basis of the third embodiment, the main improvement being: in the fourth embodiment, after the word-vector computing module 10, an index building module 14 is also included, as shown in Fig. 7.
The index building module 14 is used to build an index for each word vector w_mn in the word-vector matrix W_mn, so that the lookup module 11 and the word-vector acquisition submodule 131 can, according to this index, look up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample or in the text to be classified more conveniently and quickly.
Since the second embodiment corresponds to the present embodiment, the present embodiment can be implemented in cooperation with the second embodiment. The relevant technical details mentioned in the second embodiment remain valid in the present embodiment, and the technical effects achievable in the second embodiment can likewise be achieved here; to reduce repetition, they are not repeated. Correspondingly, the relevant technical details mentioned in the present embodiment also apply to the second embodiment.
Those skilled in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions to make a device (which may be a single-chip microcomputer, a chip, etc.) or a processor execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media capable of storing program code.
Those skilled in the art should understand that the above embodiments are specific embodiments for realizing the present invention and that, in practical applications, various changes in form and detail can be made to them without departing from the spirit and scope of the present invention.

Claims (10)

1. A text classification method using word vectors, characterized by comprising:
inputting the word-segmented data of L texts into the continuous bag-of-words model CBOW, and computing the word-vector matrix W_mn containing each word vector w_mn;
after word-segmenting the training samples of M known text types, looking up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
computing, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type;
determining the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
2. The text classification method using word vectors according to claim 1, characterized in that the feature vector T_k of the training samples of each text type is computed by summing and averaging.
3. The text classification method using word vectors according to claim 2, characterized in that the feature vector of the text to be classified is computed as follows:
after word-segmenting the text to be classified, looking up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified;
computing the feature vector of the text to be classified from its word vectors w_mn by summing and averaging.
4. The text classification method using word vectors according to claim 1, characterized in that determining the type of the text to be classified from its feature vector and the feature vectors T_k of the training samples specifically includes:
computing the cosine similarity between the feature vector of the text to be classified and the feature vector T_k of the training samples of each text type;
the type of the text to be classified being the text type of the training samples whose cosine similarity is closest to 1.
5. The text classification method using word vectors according to claim 1 or 3, characterized in that looking up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample specifically includes:
building in advance an index of the word-vector matrix W_mn;
looking up in the word-vector matrix W_mn, according to the index, the word vectors w_mn contained in each training sample.
6. The text classification method using word vectors according to claim 1, characterized in that the CBOW is CBOW based on the negative sampling algorithm.
7. A terminal device, characterized by comprising:
a word-vector computing module, used to input the word-segmented data of L texts into the continuous bag-of-words model CBOW and compute the word-vector matrix W_mn containing each word vector w_mn;
a lookup module, used to word-segment the training samples of M known text types and then look up in the word-vector matrix W_mn the word vectors w_mn contained in each training sample;
a training-sample feature-vector computing module, used to compute, from the word vectors w_mn contained in each training sample, the feature vector T_k of the training samples of each text type;
a to-be-classified text type determining module, used to determine the type of a text to be classified from its feature vector and the feature vectors T_k of the training samples of the various text types;
wherein M ≤ L, m is the number of words, n is the dimension of the word vectors, k = 1, 2, ..., K, and K is the number of text types.
8. The terminal device according to claim 7, characterized in that the training-sample feature-vector computing module computes the feature vector T_k of the training samples of each text type by summing and averaging.
9. The terminal device according to claim 8, characterized in that the to-be-classified text type determining module includes:
a word-vector acquisition submodule, used to word-segment the text to be classified and then look up in the word-vector matrix W_mn each word vector w_mn contained in the text to be classified;
a feature-vector computing submodule, used to compute the feature vector of the text to be classified from its word vectors w_mn by summing and averaging.
10. The terminal device according to claim 7, characterized in that the to-be-classified text type determining module includes:
a cosine-similarity computing submodule, used to compute the cosine similarity between the feature vector of the text to be classified and the feature vectors T_k of the training samples of the various text types;
a determining submodule, used to take the text type of the training samples whose cosine similarity is closest to 1 as the type of the text to be classified.
CN201610639589.9A 2016-08-06 2016-08-06 Text classification method using word vectors, and terminal device Pending CN106294684A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610639589.9A CN106294684A (en) 2016-08-06 2016-08-06 Text classification method using word vectors, and terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610639589.9A CN106294684A (en) 2016-08-06 2016-08-06 Text classification method using word vectors, and terminal device

Publications (1)

Publication Number Publication Date
CN106294684A true CN106294684A (en) 2017-01-04

Family

ID=57665678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610639589.9A Pending CN106294684A (en) 2016-08-06 2016-08-06 Text classification method using word vectors, and terminal device

Country Status (1)

Country Link
CN (1) CN106294684A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160110343A1 (en) * 2014-10-21 2016-04-21 At&T Intellectual Property I, L.P. Unsupervised topic modeling for short texts
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN105335352A (en) * 2015-11-30 2016-02-17 武汉大学 Entity identification method based on Weibo emotion
CN105824904A (en) * 2016-03-15 2016-08-03 浙江大学 Chinese herbal medicine plant picture capturing method based on professional term vector of traditional Chinese medicine and pharmacy field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Dapeng, "Research on short text classification methods based on word vectors", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153642A (en) * 2017-05-16 2017-09-12 华北电力大学 A kind of analysis method based on neural network recognization text comments Sentiment orientation
CN108959329B (en) * 2017-05-27 2023-05-16 腾讯科技(北京)有限公司 Text classification method, device, medium and equipment
CN108959329A (en) * 2017-05-27 2018-12-07 腾讯科技(北京)有限公司 A kind of file classification method, device, medium and equipment
CN107544957A (en) * 2017-07-05 2018-01-05 华北电力大学 A kind of Sentiment orientation analysis method of business product target word
CN107577708A (en) * 2017-07-31 2018-01-12 北京北信源软件股份有限公司 Class base construction method and system based on SparkMLlib document classifications
CN109388706A (en) * 2017-08-10 2019-02-26 华东师范大学 A kind of problem fine grit classification method, system and device
CN107491444A (en) * 2017-08-18 2017-12-19 南京大学 Parallelization word alignment method based on bilingual word embedded technology
WO2019056692A1 (en) * 2017-09-25 2019-03-28 平安科技(深圳)有限公司 News sentence clustering method based on semantic similarity, device, and storage medium
CN109615153A (en) * 2017-09-26 2019-04-12 阿里巴巴集团控股有限公司 Businessman's methods of risk assessment, device, equipment and storage medium
CN109615153B (en) * 2017-09-26 2023-06-16 阿里巴巴集团控股有限公司 Merchant risk assessment method, device, equipment and storage medium
CN107908716A (en) * 2017-11-10 2018-04-13 国网山东省电力公司电力科学研究院 95598 work order text mining method and apparatus of word-based vector model
CN110858219A (en) * 2018-08-17 2020-03-03 菜鸟智能物流控股有限公司 Logistics object information processing method and device and computer system
CN109284377A (en) * 2018-09-13 2019-01-29 云南电网有限责任公司 A kind of file classification method and device based on vector space
CN109800422A (en) * 2018-12-20 2019-05-24 北京明略软件系统有限公司 Method, system, terminal and the storage medium that a kind of pair of tables of data is classified
CN109637607A (en) * 2018-12-24 2019-04-16 广州天鹏计算机科技有限公司 Medical data classifying method, device, computer equipment and storage medium
CN109670190B (en) * 2018-12-25 2023-05-16 北京百度网讯科技有限公司 Translation model construction method and device
CN109670190A (en) * 2018-12-25 2019-04-23 北京百度网讯科技有限公司 Translation model construction method and device
CN109947945A (en) * 2019-03-19 2019-06-28 合肥工业大学 Word-based vector sum integrates the textstream classification method of SVM
CN111353282A (en) * 2020-03-09 2020-06-30 腾讯科技(深圳)有限公司 Model training method, text rewriting method, device and storage medium
CN111353282B (en) * 2020-03-09 2023-08-22 腾讯科技(深圳)有限公司 Model training, text rewriting method, device and storage medium
CN113111174A (en) * 2020-04-28 2021-07-13 北京明亿科技有限公司 Group identification method, device, equipment and medium based on deep learning model
CN111709251A (en) * 2020-06-12 2020-09-25 哈尔滨工程大学 Formal concept similarity rapid measurement method with general semantics and domain semantics
CN112257419A (en) * 2020-11-06 2021-01-22 开普云信息科技股份有限公司 Intelligent retrieval method and device for calculating patent document similarity based on word frequency and semantics, electronic equipment and storage medium thereof

Similar Documents

Publication Publication Date Title
CN106294684A (en) The file classification method of term vector and terminal unit
CN106326346A (en) Text classification method and terminal device
CN110020438B (en) Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN106980683A (en) Blog text snippet generation method based on deep learning
CN109739978A (en) A kind of Text Clustering Method, text cluster device and terminal device
US11775594B2 (en) Method for disambiguating between authors with same name on basis of network representation and semantic representation
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN107679082A (en) Question and answer searching method, device and electronic equipment
Ju et al. An efficient method for document categorization based on word2vec and latent semantic analysis
CN105740236A (en) Writing feature and sequence feature combined Chinese sentiment new word recognition method and system
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN104484380A (en) Personalized search method and personalized search device
CN107491434A (en) Text snippet automatic generation method and device based on semantic dependency
CN110413993A (en) A kind of semantic classification method, system and medium based on sparse weight neural network
Motwani et al. A study on initial centroids selection for partitional clustering algorithms
Rooshenas et al. Discriminative structure learning of arithmetic circuits
Cao et al. Stacked residual recurrent neural network with word weight for text classification
CN111061876B (en) Event public opinion data analysis method and device
Stemle et al. Using language learner data for metaphor detection
CN109241298A (en) Semantic data stores dispatching method
CN106802787A (en) MapReduce optimization methods based on GPU sequences
Hwang et al. Recent deep learning methods for tabular data
CN115879450B (en) Gradual text generation method, system, computer equipment and storage medium
CN107329951A (en) Build name entity mark resources bank method, device, storage medium and computer equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170104

RJ01 Rejection of invention patent application after publication