CN106126734A - Document classification method and device - Google Patents

Document classification method and device

Info

Publication number
CN106126734A
CN106126734A · Application CN201610519971.6A
Authority
CN
China
Prior art keywords
document
subtree
sorted
vector
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610519971.6A
Other languages
Chinese (zh)
Other versions
CN106126734B (en)
Inventor
丁希晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201610519971.6A priority Critical patent/CN106126734B/en
Publication of CN106126734A publication Critical patent/CN106126734A/en
Application granted granted Critical
Publication of CN106126734B publication Critical patent/CN106126734B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G06F16/35: Clustering; Classification
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a document classification method and device. The method includes: converting each word segment in the documents to be classified into a vector by training a deep neural network language model; generating sets of similar word segments by clustering the vectors; converting the documents to be classified into a feature-frequency inverse-document matrix according to the set of features; converting the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified; and dynamically cutting the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents. The present invention takes the contextual information of word segments in a specific context into account during classification, so each class of documents scores high in semantic understandability and semantic distinguishability. Furthermore, cutting the hierarchical clustering tree at different heights based on preset termination conditions avoids large differences in the number of documents per class, making the classification of documents more reasonable.

Description

Document classification method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a document classification method and a document classification device.
Background
On the Internet, the explosive growth of information has made managing and using information inconvenient. Web data mining, which uncovers the information and structures of potential value hidden behind web data, has developed rapidly and been widely applied in recent years. Document clustering is one of the most important tools in web data mining applications. Document clustering methods in the prior art mainly include K-means, hierarchical clustering, and the like.
However, prior-art document clustering methods still have the following problems. When classifying documents, they do not consider the contextual information of the words in a document under a specific context, so the resulting document classes rank low in semantic understandability and semantic distinguishability and are difficult to interpret. In addition, when cutting the clustering tree (dendrogram), prior-art methods can only cut at a single height and also require the number of document classes to be specified manually in advance, so the numbers of documents contained in different classes differ greatly and are extremely unbalanced, and dynamic, reasonable classification of documents cannot be achieved.
It can thus be seen that when classifying documents, prior-art document clustering methods generally suffer from low semantic understandability, low semantic distinguishability, and unreasonable classification.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a document classification method and device, so as to solve the problems that prior-art document clustering methods generally suffer from low semantic understandability, low semantic distinguishability, and unreasonable classification.
To solve the above problems, according to one aspect of the present invention, a document classification method is disclosed, including:
converting each word segment in the documents to be classified into a vector by training a deep neural network language model;
generating sets of similar word segments by clustering the vectors, where each set of similar word segments represents one feature;
converting the documents to be classified into a feature-frequency inverse-document matrix according to the set of features;
converting the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified;
dynamically cutting the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents.
According to another aspect of the present invention, a document classification device is also disclosed, including:
a first conversion module, configured to convert each word segment in the documents to be classified into a vector by training a deep neural network language model;
a clustering module, configured to generate sets of similar word segments by clustering the vectors, where each set of similar word segments represents one feature;
a second conversion module, configured to convert the documents to be classified into a feature-frequency inverse-document matrix according to the set of features;
a third conversion module, configured to convert the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified;
a cutting module, configured to dynamically cut the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents.
Compared with the prior art, the embodiments of the present invention have the following advantages:
The embodiments of the present invention featurize document word segments by means of a deep neural network model and cluster the vectors of similar word segments, and then perform the subsequent classification on the basis of the features obtained by clustering. The contextual information of word segments in a specific context is thereby taken into account during classification, so that the semantic understandability and semantic distinguishability of each class of documents are high. In addition, the embodiments of the present invention cut the hierarchical clustering tree at different heights based on preset termination conditions, which avoids large differences in the number of documents per class; the classification can adapt dynamically to the number of documents contained in each sub-class, making the classification of documents more reasonable.
Furthermore, the embodiments of the present invention use a deep model that considers the order of context words, which significantly improves the expressiveness of the features. By combining the features of the deep neural network language model with named entities, clustering yields feature sets composed of similar phrases under a specific context, which overcomes the prior art's failure to account for the relations between the objects described by document sub-classes. Featurizing both similar word segments and Chinese named entities makes the documents within each class close in context and in semantics, giving a good classification effect. Moreover, the feature-frequency inverse-document matrix is generated based on the set of features, so each column of the matrix is one feature, i.e. a set of similar phrases, which ties the classification result to the actual context of the word segments. Further, each element of the feature-frequency inverse-document matrix is the weight of a document under a certain feature, so every final class contains documents under similar contexts, making the classification more reasonable and easier for people to read and understand.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of an embodiment of a document classification method of the present invention;
Fig. 2 is a flow chart of the steps of another embodiment of a document classification method of the present invention;
Fig. 3 is a flow chart of the steps of an embodiment of a dynamic cutting method for a hierarchical clustering tree of the present invention;
Fig. 4 is a flow chart of the steps of yet another embodiment of a document classification method of the present invention;
Fig. 5 is a structural block diagram of an embodiment of a document classification device of the present invention;
Fig. 6 is a structural block diagram of another embodiment of a document classification device of the present invention.
Detailed description of the invention
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
One of the core ideas of the embodiments of the present invention is as follows: document word segments are featurized by means of a deep neural network model, the vectors of similar word segments are clustered, and the subsequent classification is performed on the basis of the features obtained by clustering, so that the contextual information of word segments in a specific context is taken into account during classification and the semantic understandability and semantic distinguishability of each class of documents are high. In addition, the hierarchical clustering tree is cut at different heights based on preset termination conditions, which avoids large differences in the number of documents per class and allows the classification to adapt dynamically to the number of documents contained in each sub-class, making the classification of documents more reasonable.
Referring to Fig. 1, a flow chart of the steps of an embodiment of a document classification method of the present invention is shown. The method may specifically include the following steps:
Step 101: convert each word segment in the documents to be classified into a vector by training a deep neural network language model.
Here, a deep neural network language model (for example word2vec) may be trained on the corpus, describing each word segment in the multiple documents to be classified (for example doc1, doc2, doc3, and so on) as a one-dimensional word vector, so as to obtain a dictionary composed of word vectors.
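A minimal sketch of this step follows, assuming the gensim library (4.x API) as the word2vec implementation; the embodiment names word2vec but no particular toolkit, and the document contents and the dimension 200 are placeholders.

```python
from gensim.models import Word2Vec

# Documents to be classified, already word-segmented (see step 101a below):
# each document is a list of word segments.
corpus = [
    ["电影", "上映", "票房"],            # doc1 (placeholder content)
    ["影片", "首映", "票房", "口碑"],    # doc2
    ["球队", "比赛", "夺冠"],            # doc3
]

# Train the deep neural network language model on the corpus;
# vector_size is the length d of the one-dimensional word vectors.
model = Word2Vec(sentences=corpus, vector_size=200, window=5,
                 min_count=1, workers=4)

w = model.wv["电影"]                  # the vector w_i for one word segment
vocabulary = model.wv.index_to_key   # the dictionary of word vectors
```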
Step 103: generate sets of similar word segments by clustering the vectors, where each set of similar word segments includes multiple vectors representing the same feature.
Here, all word vectors in the dictionary may be clustered to obtain sets of similar phrases. Since the phrases included in each set are similar, each set can, for ease of understanding, be represented as one feature, yielding a set composed of multiple features.
Step 105: convert the documents to be classified into a feature-frequency inverse-document matrix according to the set of features.
Here, after the multiple documents have been converted into the set of features through steps 101 and 103, the multiple documents can be converted, based on the set of features, into a feature-frequency inverse-document matrix TFIDF-feature. TFIDF-feature resembles the traditional term-frequency inverse-document-frequency matrix TFIDF; the difference is that the TFIDF-feature of the embodiments of the present invention is formed based on the set of features, so each row or column of the matrix represents one feature, i.e. a set of similar phrases, rather than a single word as in the prior-art TFIDF matrix. Here, TFIDF is TF*IDF, where TF is the term frequency (Term Frequency), i.e. the frequency with which a term appears in document d, and IDF is the inverse document frequency (Inverse Document Frequency).
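A sketch of the matrix construction is given below. It assumes the tf * (1/df) weighting used in the worked example of step 405 later in this description (the classic log-scaled IDF would be an equally valid choice); docs and feature_sets stand for the segmented documents and the feature sets produced by the previous steps.

```python
import numpy as np

def tfidf_feature_matrix(docs, feature_sets):
    """docs: segmented documents; feature_sets: list of sets of similar
    word segments. Returns the m*n feature-frequency inverse-document
    matrix: element (x, y) is the weight of document x under feature y."""
    seg_to_feat = {w: j for j, ws in enumerate(feature_sets) for w in ws}
    tf = np.zeros((len(docs), len(feature_sets)))
    for i, doc in enumerate(docs):
        for seg in doc:
            j = seg_to_feat.get(seg)
            if j is not None:
                tf[i, j] += 1
    df = np.count_nonzero(tf, axis=0).astype(float)  # docs containing feature j
    df[df == 0] = 1.0                                # guard against empty features
    return tf / df                                   # element (i, j) = tf * (1 / df)
```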
Step 107: convert the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified in the matrix.
Here, the similarity between the vectors of any two documents doc in the feature-frequency inverse-document matrix TFIDF-feature may be computed, and, based on the result, the vectors of the two corresponding documents in the matrix are merged; the feature-frequency inverse-document matrix is thereby converted step by step into the hierarchical clustering tree.
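A sketch of this step using scipy's agglomerative linkage, which merges the two closest row vectors at each iteration exactly as described; average linkage is an assumption, since the embodiment does not fix how merged vectors are compared, and docs and feature_sets are the assumed inputs from the sketches above.

```python
from scipy.cluster.hierarchy import linkage

# X: the m*n TFIDF-feature matrix from the sketch above (rows = documents).
X = tfidf_feature_matrix(docs, feature_sets)

# The dissimilarity 1 - cosine similarity drives the pairwise merging;
# Z encodes the resulting hierarchical clustering tree (a binary tree).
Z = linkage(X, method="average", metric="cosine")
```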
Step 109: dynamically cut the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents.
Here, the generated hierarchical clustering tree may be dynamically cut at different heights based on preset cutting termination conditions, yielding multiple subtrees, i.e. multiple classes of documents, and thus achieving a reasonable distribution of the documents.
By means of the technical solution of the above embodiment of the present invention, document word segments are featurized by means of a deep neural network model, the vectors of similar word segments are clustered, and the subsequent classification is performed on the basis of the features obtained by clustering, so that the contextual information of word segments in a specific context is taken into account during classification and the semantic understandability and semantic distinguishability of each class of documents are high. In addition, the hierarchical clustering tree is cut at different heights based on preset termination conditions, which avoids large differences in the number of documents per class and allows the classification to adapt dynamically to the number of documents contained in each sub-class, making the classification of documents more reasonable.
Referring to Fig. 2, a flow chart of the steps of another embodiment of a document classification method of the present invention is shown. The method may specifically include the following steps:
Step 101a: perform word segmentation on the documents to be classified to obtain the set of word segments contained in each document to be classified.
Here, word segmentation may be performed on each of the multiple documents doc1, doc2, doc3, so that each document corresponds to one set of word segments, yielding multiple sets of word segments.
Step 101b: convert each word segment in the documents to be classified into a vector by training a deep neural network language model.
Here, word2vec may be trained on the corpus (i.e. the documents doc1, doc2, doc3 to be classified) to convert each word segment word1, word2, word3, ..., wordm of doc1, doc2, doc3 into a one-dimensional real vector wi (i = 1, 2, ..., m) of length d, where m is the total number of word segments in the corpus.
Here, the length d of the vector may be determined based on the total number of word segments in the corpus. Specifically, compared with the total number of word segments contained in the documents (which may reach tens of thousands), the word-segment vectors obtained by training the deep neural network model can map the high-dimensional space (tens of thousands of dimensions) to a low-dimensional vector of similar expressive power (for example 200 dimensions). The dimension is therefore related to the total number of word segments, and in practice the vector length can be set to a few hundred dimensions.
Step 103a: cluster the vectors, and take the vectors whose clustering result is below a preset difference value as one set of similar word segments, where the set includes multiple vectors under similar contexts and each set of similar word segments represents one feature.
Here, the multiple vectors wi may be clustered by a clustering method, and the clustering result compared with a preset difference value (for example 1.2); when the clustering result is below 1.2, the vectors in that clustering result are taken as one set of similar word segments. Multiple features, each a set of similar word segments, are thus obtained.
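A sketch of this step, under the assumption that "clustering result below a preset difference value" corresponds to cutting an agglomerative clustering of the word vectors at a distance threshold (the embodiment does not fix the clustering algorithm); model is the word2vec model from the earlier sketch and 1.2 is the example threshold above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

words = model.wv.index_to_key                     # all m word segments
word_vectors = np.stack([model.wv[w] for w in words])

# Agglomerative clustering of the word vectors under cosine distance.
Zw = linkage(word_vectors, method="average", metric="cosine")

# Every group whose internal dissimilarity stays below the preset
# difference value becomes one set of similar word segments (one feature).
labels = fcluster(Zw, t=1.2, criterion="distance")

groups = {}
for w, label in zip(words, labels):
    groups.setdefault(label, set()).add(w)
feature_sets = list(groups.values())              # F_1, ..., F_T
```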
Step 103b: replace the word segments in the documents to be classified that belong to different named-entity classes with different entity sets respectively, where each entity set represents one feature.
Here, the word segments of Chinese named-entity classes such as time, person name, organization, and geographic information in the documents to be classified may each be featurized, yielding multiple features of different named-entity classes.
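A sketch of this step; ner_tag is a hypothetical Chinese named-entity tagger returning (segment, label) pairs, since the embodiment assumes an NER component without naming one.

```python
# Named-entity classes that are each collapsed into a single feature.
NER_FEATURES = {"PERSON", "LOCATION", "ORGANIZATION", "TIME"}

def replace_entities(segments, ner_tag):
    """Replace every word segment of a named-entity class with the class
    label, so that e.g. all person names collapse into the feature PERSON."""
    return [label if label in NER_FEATURES else seg
            for seg, label in ner_tag(segments)]
```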
Step 105: convert the documents to be classified into a feature-frequency inverse-document matrix according to the set of features.
Here, the features of the multiple sets of similar word segments from the deep neural network language model and the features of the multiple named-entity classes may be merged into the feature set obtained by feature engineering.
After the feature set is obtained, the documents to be classified can be converted into a feature-frequency inverse-document matrix according to the feature set, where the matrix is a feature-frequency inverse-document matrix of order m*n, m is the number of documents to be classified, n is the number of features, and the element (x, y) in row x and column y of the matrix represents the weight of the document x to be classified under the feature y.
Of course, in different embodiments, m in the feature-frequency inverse-document matrix may instead be the number of features, with n being the number of documents to be classified.
Step 107a: compute the cosine similarity between the vectors of any two documents to be classified in the feature-frequency inverse-document matrix.
Step 107b: generate the hierarchical clustering tree by merging the vectors of the two documents to be classified with the largest cosine similarity.
Here, the similarity between the vectors of any two documents doc in the feature-frequency inverse-document matrix may be computed, and the vectors of the two most similar documents merged to generate a new vector. The similarity between any two of the new vector and the vectors of the remaining documents is then computed in the same way, the two most similar vectors are again merged, and so on, thereby converting the feature-frequency inverse-document matrix into a hierarchical clustering tree.
Step 109: dynamically cut the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents.
By means of the technical solution of the above embodiment of the present invention, a deep model that considers the order of context words is used, which significantly improves the expressiveness of the features; by combining the features of the deep neural network language model with named entities, clustering yields feature sets composed of similar phrases under a specific context, which overcomes the prior art's failure to account for the relations between the objects described by document sub-classes; featurizing both similar word segments and Chinese named entities makes the documents within each class close in context and in semantics, giving a good classification effect. In addition, the feature-frequency inverse-document matrix is generated based on the set of features, so each column of the matrix is one feature, i.e. a set of similar phrases, which ties the classification result to the actual context of the word segments; further, each element of the matrix is the weight of a document under a certain feature, so every final class contains documents under similar contexts, making the classification more reasonable and easier for people to read and understand.
In another embodiment, a specific implementation of step 109 of the above embodiments, i.e. dynamically cutting the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents, is described. Referring to Fig. 3, a flow chart of the steps of an embodiment of a dynamic cutting method for a hierarchical clustering tree of the present invention is shown. It may specifically include the following steps:
Step 301: starting from the root node, perform a binary cut on the hierarchical clustering tree to obtain two subtrees.
Here, since the hierarchical clustering tree was generated by pairwise merging of document vectors in the feature-frequency inverse-document matrix, it is a binary tree; performing a binary cut from its parent node therefore yields two subtrees.
Step 303: compute the height of each subtree and the number of documents it contains.
Here, the height of a subtree is the degree of dissimilarity between the documents it contains, and the documents contained in a subtree can be determined by counting the nodes the subtree contains.
Step 305: for each subtree, judge whether the height of the subtree satisfies a first preset termination condition.
Here, for each subtree it may be judged whether its height is less than or equal to a preset termination height; if so, the condition is satisfied, otherwise it is not.
Step 307: for each subtree, judge whether the number of documents contained in the subtree satisfies a second preset termination condition.
Here, for each subtree it is judged whether the number of documents it contains is less than or equal to a preset termination quantity; if so, the condition is satisfied, otherwise it is not.
Step 309: for each subtree, when it is judged that the height of the subtree satisfies the first preset termination condition or that the number of documents contained in the subtree satisfies the second preset termination condition, stop performing the binary cut of step 301 on that subtree.
Step 311: for each subtree, when it is judged that the height of the subtree does not satisfy the first preset termination condition and the number of documents contained in the subtree does not satisfy the second preset termination condition either, continue to perform steps 301 to 311 recursively on that subtree from its root node.
Here, when the binary cut has stopped for every subtree, the total number of subtrees is the number of document classes (i.e. the number of classes into which the documents to be classified are divided), and the number of documents contained in each class is the number of all nodes contained in the corresponding subtree once no subtree is cut any further.
By means of the technical solution of the above embodiment of the present invention, the dynamic tree-cutting strategy obtains subtrees by cutting at different heights; based on a traversal of the binary tree and the thresholds of the predetermined termination conditions, deep subtrees can be split effectively and the classification of documents is well balanced.
To better understand the above technical solution of the present invention, it is described in detail below with reference to a specific embodiment.
Referring to Fig. 4, a flow chart of the steps of yet another embodiment of a document classification method of the present invention is shown. It may specifically include the following steps:
Step 401: input the n documents Di (i = 1, 2, ..., n) to be clustered (for example doc1, doc2, doc3, ..., docn) and perform word-segmentation preprocessing to obtain the corpus.
Step 403a: perform feature engineering on the word segments of the n documents to be clustered to obtain the feature clusters of the word2vec deep neural network.
Specifically, a word2vec model is first trained to express each word segment in the corpus as a one-dimensional real vector wi (i = 1, 2, ..., m) of length d, where m is the number of word segments in the corpus. The vectors wi representing the word segments are then clustered to generate T sets of related phrases Fi{wj, ...} (i = 1, 2, ..., T); each generated set Fi represents one feature, and a set contains the vectors of several word segments {word1, word2, ...} under similar contexts.
Step 403b: perform feature engineering on the n documents to be clustered to obtain the feature clusters of NER entity recognition.
Specifically, named-entity recognition (NER) is used to uniformly replace the person-name phrases in the n documents to be clustered with the feature PERSON (for example the set containing {name1, name2, ...}), to replace the place words with LOCATION (for example the set containing {loc1, loc2, ...}), and so on.
Finally, the feature clusters of the deep network model and the feature clusters of the NER named entities are merged to obtain Fi (i = 1, 2, ..., T+2) as the feature set for analyzing the text.
Step 405: generate the document term-vector matrix according to the feature set.
Specifically, according to the generated feature set Fi (i = 1, 2, ..., T+2), the set of documents Di (i = 1, 2, ..., n) is converted into the inverse-feature-frequency TfIdf-feature matrix. This matrix differs from the traditional TfIdf matrix as follows: each column of the traditional TfIdf matrix represents only one word, whereas in the TfIdf matrix used in this embodiment each column represents one feature set, i.e. a set containing multiple words that are close under a specific context rather than a single word. Each real vector in the matrix represents the distribution of a document under a certain feature, and the values in a column represent the weights of the documents under that feature set.
For example, if the word segment "film" appears twice in document D1 and "film" appears in 10 documents, then the weight of the word segment "film" of document D1 in the TfIdf-feature matrix is 2 * 1/10 = 0.2.
Step 407: cluster the matrix of the feature set to generate the hierarchical clustering tree.
Here, the method of generating the hierarchical clustering tree can be described as follows: in the initial state, the N objects to be clustered are divided into N classes; in each iteration, the distance between classes is obtained by computing the cosine similarity between sub-classes, and the two closest classes are merged. Iteration continues until all N objects have merged into one class. The merging process constitutes a hierarchical clustering tree.
Specifically, the vector similarity between any two documents di and dj may be computed based on cosine similarity, for example Similarity(di, dj) = cosine(vi, vj), where vi and vj are the vector data of the rows corresponding to documents di and dj in the TfIdf-feature matrix; the hierarchical clustering tree is then generated according to the similarity distance (i.e. the dissimilarity 1 - Similarity(di, dj)).
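The per-pair computation of this step, written out directly from the formula Similarity(di, dj) = cosine(vi, vj); the tree is built on the distance 1 - Similarity.

```python
import numpy as np

def similarity(vi, vj):
    """Cosine similarity between two document rows of the TfIdf-feature matrix."""
    return float(np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj)))

def distance(vi, vj):
    return 1.0 - similarity(vi, vj)   # the dissimilarity used for merging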
Step 409: cut the hierarchical clustering tree using the dynamic cutting strategy.
For any tree Tk, denote the two subtrees of the binary tree as Tk1 and Tk2 and perform a binary cut on it. After cutting, compute the heights hk1 and hk2 of the two subtrees and judge whether each subtree satisfies either of the two termination conditions: hki is at most the preset termination height hmax, or Nki ≤ Nmin (i = 1, 2). If either of the two conditions is satisfied, the cutting traversal of that subtree terminates; if neither is reached, the subtree Tki continues to be cut, applying this strategy recursively. When the termination conditions are reached the recursion stops, and K* sub-classes are generated in total; the vector of subtree heights is (h1, h2, ..., hK*) and the numbers of documents contained in the sub-classes are {N1, N2, ..., NK*}, with every parameter satisfying hk ≤ hmax or Nk ≤ Nmin. Thus when the termination conditions are reached, i.e. when no subtree is cut any further, the set of documents to be clustered Di (i = 1, 2, ..., n) has been cut into K* sub-classes Ck (k = 1, 2, ..., K*), each sub-class Ck containing Nk documents, where the set of documents is all the nodes contained in the cut subtree.
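A sketch of this dynamic cut over the scipy tree from the step 407 sketch; h_max and n_min stand for the preset termination height and the quantity Nmin, and the concrete values passed at the bottom are placeholders.

```python
from scipy.cluster.hierarchy import to_tree

def dynamic_cut(node, h_max, n_min, classes):
    """Binary-cut each subtree recursively; a subtree whose height or
    document count reaches its threshold becomes one sub-class C_k."""
    # node.dist is the subtree height (the dissimilarity at its top merge),
    # node.count is the number of documents (leaves) it contains.
    if node.is_leaf() or node.dist <= h_max or node.count <= n_min:
        classes.append(node.pre_order())  # indices of the documents in C_k
        return
    dynamic_cut(node.get_left(), h_max, n_min, classes)
    dynamic_cut(node.get_right(), h_max, n_min, classes)

root = to_tree(Z)          # Z: the linkage output from the step 407 sketch
classes = []
dynamic_cut(root, h_max=0.6, n_min=50, classes=classes)
# classes now holds the K* sub-classes; len(classes) is the number of classes
```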
In addition, in another embodiment, after the cutting of all subtrees is completed, the cut subtrees may also be merged based on Chinese semantics, generating classified documents that are semantically close and easy to understand and identify.
In the prior art, when the hierarchical clustering tree is cut, either the number K of sub-classes is determined in advance or the tree is cut at a single height; the drawback of such a cutting strategy is that the sizes Nk (k = 1, 2, ..., K) of the K generated sub-classes differ greatly. The dynamic tree-cutting strategy of this embodiment instead obtains subtrees by cutting at different heights, based on a traversal of the binary tree and the thresholds of the predetermined termination conditions (the subtree height reaching the preset termination height, or the sub-class size reaching Nmin), which improves the cutting of the subtrees and makes the classification effect evident. Moreover, on the basis of the dynamic tree cutting, the embodiments of the present invention can semantically merge the newly generated sub-classes according to the named entities of the described objects, such as person PERSON and organization ORGANIZATION, increasing the intelligibility of the document clustering.
It should be noted that, for simplicity of description, the method embodiments are expressed as series of combined actions; however, those skilled in the art should know that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in this specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 5, a structural block diagram of an embodiment of a document classification device of the present invention is shown. It may specifically include the following modules:
a first conversion module 51, configured to convert each word segment in the documents to be classified into a vector by training a deep neural network language model;
a clustering module 52, configured to generate sets of similar word segments by clustering the vectors, where each set of similar word segments includes multiple vectors representing the same feature;
a second conversion module 53, configured to convert the documents to be classified into a feature-frequency inverse-document matrix according to the set of features;
a third conversion module 54, configured to convert the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified in the matrix;
a cutting module 55, configured to dynamically cut the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents.
By means of the above technical solution of the embodiments of the present invention, document word segments are featurized by means of a deep neural network model, the vectors of similar word segments are clustered, and the subsequent classification is performed on the basis of the features obtained by clustering, so that the contextual information of word segments in a specific context is taken into account during classification and the semantic understandability and semantic distinguishability of each class of documents are high. In addition, the hierarchical clustering tree is cut at different heights based on preset termination conditions, which avoids large differences in the number of documents per class and allows the classification to adapt dynamically to the number of documents contained in each sub-class, making the classification of documents more reasonable.
In another embodiment, referring to Fig. 6, a structural block diagram of another embodiment of a document classification device of the present invention is shown. It may further include the following modules:
a word segmentation module 50, configured to perform word segmentation on the documents to be classified to obtain the set of word segments contained in each document to be classified;
the first conversion module 51, identical to the first conversion module 51 of the embodiment shown in Fig. 5 and not repeated here;
the clustering module 52, configured to cluster the vectors and take the vectors whose clustering result is below a preset difference value as one set of similar word segments, where a set of similar word segments includes multiple vectors under similar contexts and each set of similar word segments represents one feature;
a replacement module 56, configured to replace the word segments in the documents to be classified that belong to different named-entity classes with different entity sets respectively, where each entity set represents one feature;
the second conversion module 53, identical to the second conversion module 53 of the embodiment shown in Fig. 5 and not repeated here;
where the feature-frequency inverse-document matrix is a feature-frequency inverse-document matrix of order m*n, m being the number of documents to be classified and n the number of features, and each element of the matrix represents the weight of the corresponding document to be classified under the corresponding feature.
The third conversion module 54 includes the following submodules:
a first computation submodule 54a, configured to compute the cosine similarity between the vectors of any two documents to be classified in the feature-frequency inverse-document matrix;
a merging submodule 54b, configured to generate the hierarchical clustering tree by merging the vectors of the two documents to be classified with the largest cosine similarity in the feature-frequency inverse-document matrix.
The cutting module 55 includes the following submodules:
a cutting submodule 55a, configured to perform, on the hierarchical clustering tree, a binary cut starting from the root node to obtain two subtrees;
a second computation submodule 55b, configured to compute the height of each subtree and the number of documents it contains;
a first judgment submodule 55c, configured to judge, for each subtree, whether the height of the subtree satisfies a first preset termination condition;
where the first judgment submodule 55c is configured to judge, for each subtree, whether the height of the subtree is less than or equal to a preset termination height, the condition being satisfied if so and not satisfied otherwise;
a second judgment submodule 55d, configured to judge, for each subtree, whether the number of documents contained in the subtree satisfies a second preset termination condition;
where the second judgment submodule 55d is configured to judge, for each subtree, whether the number of documents contained in the subtree is less than or equal to a preset termination quantity, the condition being satisfied if so and not satisfied otherwise;
a stopping submodule 55e, configured to stop, for each subtree, performing the binary cut on the subtree when it is judged that the height of the subtree satisfies the first preset termination condition or that the number of documents contained in the subtree satisfies the second preset termination condition;
the cutting submodule 55a being further configured to continue, for each subtree, performing the binary cut on the subtree from its root node when it is judged that the height of the subtree does not satisfy the first preset termination condition and the number of documents contained in the subtree does not satisfy the second preset termination condition either;
where, when the binary cut has stopped for every subtree, the total number of generated subtrees is the number of the classified document classes.
a determination module 57, configured to determine the length of the vector based on the total number of word segments of the documents to be classified.
Here, the number of document classes is the total number of subtrees generated when the binary cut has stopped for every subtree, and the number of documents contained in each class is the number of all nodes contained in the corresponding subtree.
By means of the technical solution of the above embodiment of the present invention, a deep model that considers the order of context words is used, which significantly improves the expressiveness of the features; by combining the features of the deep neural network language model with named entities, clustering yields feature sets composed of similar phrases under a specific context, which overcomes the prior art's failure to account for the relations between the objects described by document sub-classes; featurizing both similar word segments and Chinese named entities makes the documents within each class close in context and in semantics, giving a good classification effect. In addition, the feature-frequency inverse-document matrix is generated based on the set of features, so each column of the matrix is one feature, i.e. a set of similar phrases, which ties the classification result to the actual context of the word segments; further, each element of the matrix is the weight of a document under a certain feature, so every final class contains documents under similar contexts, making the classification more reasonable and easier for people to read and understand.
As for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively brief; for related parts, refer to the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments can be cross-referenced.
Those skilled in the art should appreciate that the embodiments of the present invention may be provided as a method, a device, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present invention are described with reference to flow charts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data-processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing terminal device produce means for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data-processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data-processing terminal device, so that a series of operational steps are executed on the computer or other programmable terminal device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the embodiments of the present invention have been described, those skilled in the art, once aware of the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present invention.
Finally, it should also be noted that in this document relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or also elements inherent to such a process, method, article, or terminal device. In the absence of further limitation, an element defined by the statement "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or terminal device that includes the element.
The document classification method and document classification device provided by the present invention have been introduced in detail above. Specific examples are used herein to set forth the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (22)

1. A document classification method, characterized by comprising:
converting each word segment in documents to be classified into a vector by training a deep neural network language model;
generating sets of similar word segments by clustering the vectors, wherein each set of similar word segments comprises multiple vectors representing the same feature;
converting the documents to be classified into a feature-frequency inverse-document matrix according to the set of the features;
converting the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified in the feature-frequency inverse-document matrix; and
dynamically cutting the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents.
2. The method according to claim 1, characterized in that, before the step of converting each word segment in the documents to be classified into a vector by training a deep neural network language model, the method further comprises:
performing word segmentation on the documents to be classified to obtain the set of word segments contained in each document to be classified.
3. The method according to claim 1, characterized in that the feature-frequency inverse-document matrix is a feature-frequency inverse-document matrix of order m*n, wherein m is the number of the documents to be classified and n is the number of the features.
4. The method according to claim 3, characterized in that each element of the feature-frequency inverse-document matrix represents the weight of the corresponding document to be classified under the corresponding feature.
5. The method according to claim 1, characterized in that the length of the vector is determined based on the total number of word segments of the documents to be classified.
6. The method according to claim 1, characterized in that the step of generating sets of similar word segments by clustering the vectors comprises:
clustering the vectors, and forming one set of similar word segments from the vectors whose clustering result is below a preset difference value, wherein the set of similar word segments comprises the vectors of multiple word segments under similar contexts.
7. The method according to claim 1, characterized in that, before the step of converting the documents to be classified into a feature-frequency inverse-document matrix according to the set of the features, the method further comprises:
replacing the word segments belonging to different named-entity classes in the documents to be classified with different entity sets respectively, wherein each entity set represents one feature.
8. The method according to claim 1, characterized in that the step of converting the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified in the matrix comprises:
computing the cosine similarity between the vectors of any two documents to be classified in the feature-frequency inverse-document matrix; and
generating the hierarchical clustering tree by merging the vectors of the two documents to be classified with the largest cosine similarity in the feature-frequency inverse-document matrix.
9. The method according to claim 1, characterized in that the step of dynamically cutting the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents comprises:
performing, on the hierarchical clustering tree, a binary cutting step starting from the root node to obtain two subtrees;
computing the height of each subtree and the number of documents contained therein;
for each subtree, judging whether the height of the subtree satisfies a first preset termination condition;
for each subtree, judging whether the number of documents contained in the subtree satisfies a second preset termination condition;
for each subtree, if it is judged that the height of the subtree satisfies the first preset termination condition or that the number of documents contained in the subtree satisfies the second preset termination condition, stopping performing the binary cutting step on the subtree;
for each subtree, if it is judged that the height of the subtree does not satisfy the first preset termination condition and the number of documents contained in the subtree does not satisfy the second preset termination condition either, continuing to perform the binary cutting step on the subtree starting from its root node;
wherein, when the binary cutting step has been stopped for every subtree, the total number of the generated subtrees is the number of the classified document classes.
10. The method according to claim 9, characterized in that the step of judging, for each subtree, whether the height of the subtree satisfies the first preset termination condition comprises:
for each subtree, judging whether the height of the subtree is less than or equal to a preset termination height; if so, the condition is satisfied, otherwise it is not;
and the step of judging, for each subtree, whether the number of documents contained in the subtree satisfies the second preset termination condition comprises:
for each subtree, judging whether the number of documents contained in the subtree is less than or equal to a preset termination quantity; if so, the condition is satisfied, otherwise it is not.
11. The method according to claim 9, characterized in that the number of documents contained in each class of classified documents is the number of all nodes contained in the corresponding subtree.
The sorter of 12. 1 kinds of documents, it is characterised in that including:
First modular converter, for changing each participle in document to be sorted by training deep neural network language model For vector;
Cluster module, for by described vector clusters is generated similar participle set, wherein, each similar participle set includes Represent multiple vectors of same characteristic features;
Second modular converter, for being converted to characteristic frequency against document square according to the set of described feature by described document to be sorted Battle array;
3rd modular converter, for by calculating the described characteristic frequency vector against the document to be sorted of any two in document matrix Between similarity, described characteristic frequency is converted to hierarchical clustering tree against document matrix;
Cutting module, for described hierarchical clustering tree dynamically being cut at differing heights based on default end condition, To classifying documents.
13. devices according to claim 12, it is characterised in that described device also includes:
Word-dividing mode, for described document to be sorted is made word segmentation processing, obtains the participle collection that each document to be sorted is comprised Close.
14. devices according to claim 12, it is characterised in that described characteristic frequency is against the spy that document matrix is m*n rank Levying frequency against document matrix, wherein, described m is the quantity of described document to be sorted, and described n is the quantity of described feature.
15. devices according to claim 14, it is characterised in that described characteristic frequency is against each element in document matrix Represent this document to be sorted weighted value under this feature.
16. devices according to claim 12, it is characterised in that described device also includes:
Determining module, the total number for participle based on described document to be sorted determines the length of described vector.
17. devices according to claim 12, it is characterised in that described cluster module, are used for described vector clusters, will Cluster result constitutes a similar participle set less than the vector presetting difference value, and wherein, described similar participle set includes phase Like the multiple vectors under linguistic context.
18. devices according to claim 12, it is characterised in that described device also includes:
Replacement module, for replacing with different realities respectively by the participle of subordinate difference name entity class in described document to be sorted Body set, wherein, each entity sets represents a feature.
19. devices according to claim 12, it is characterised in that described 3rd modular converter includes:
First calculating sub module, for calculating described characteristic frequency against between the vector of the document to be sorted of any two in document matrix Cosine similarity;
Polymerization submodule, for by described characteristic frequency against maximum two literary compositions to be sorted of cosine similarity described in document matrix The vector combination of shelves generates described hierarchical clustering tree.
20. The device according to claim 12, characterized in that the cutting module comprises:
a cutting submodule, configured to perform binary cutting on the hierarchical clustering tree starting from the root node to obtain two subtrees;
a second calculation submodule, configured to calculate, for each subtree, its height and the number of documents it contains;
a first judgment submodule, configured to judge, for each subtree, whether the height of the subtree meets a first preset termination condition;
a second judgment submodule, configured to judge, for each subtree, whether the number of documents contained in the subtree meets a second preset termination condition;
a stopping submodule, configured to, for each subtree, stop the binary cutting of the subtree when it is judged that the height of the subtree meets the first preset termination condition or that the number of documents contained in the subtree meets the second preset termination condition;
the cutting submodule being further configured to, for each subtree, continue the binary cutting from the root node of the subtree when it is judged that the height of the subtree does not meet the first preset termination condition and the number of documents contained in the subtree does not meet the second preset termination condition;
wherein, when the binary cutting has stopped for every subtree, the total number of subtrees generated is the number of classes of classified documents.
21. The device according to claim 20, characterized in that the first judgment submodule is configured to judge, for each subtree, whether the height of the subtree is less than or equal to a preset termination height, the first preset termination condition being met if so and not met otherwise;
and the second judgment submodule is configured to judge, for each subtree, whether the number of documents contained in the subtree is less than or equal to a preset termination quantity, the second preset termination condition being met if so and not met otherwise.
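Taken together, claims 20 and 21 cut the tree top-down until every subtree is short enough or small enough; a sketch assuming SciPy's tree representation, with illustrative thresholds and random placeholder data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree

    MAX_HEIGHT = 0.4  # preset termination height (illustrative)
    MAX_DOCS = 3      # preset termination quantity (illustrative)

    def binary_cut(node, classes):
        # Stop when either preset termination condition is met; the
        # subtree then forms one class of classified documents.
        if node.dist <= MAX_HEIGHT or node.get_count() <= MAX_DOCS:
            classes.append(node.pre_order())  # leaf ids = its documents
            return
        # Otherwise cut the subtree in two at its root and recurse.
        binary_cut(node.left, classes)
        binary_cut(node.right, classes)

    root = to_tree(linkage(np.random.rand(8, 20),
                           method="average", metric="cosine"))
    classes = []
    binary_cut(root, classes)
    print(len(classes), "classes:", classes)  # total subtrees = number of classes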
22. The device according to claim 20, characterized in that the number of documents contained in each class of classified documents is the number of all nodes contained in the corresponding subtree.
CN201610519971.6A 2016-07-04 2016-07-04 The classification method and device of document Active CN106126734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610519971.6A CN106126734B (en) 2016-07-04 2016-07-04 The classification method and device of document

Publications (2)

Publication Number Publication Date
CN106126734A 2016-11-16
CN106126734B CN106126734B (en) 2019-06-28

Family

ID=57469267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610519971.6A Active CN106126734B (en) 2016-07-04 2016-07-04 The classification method and device of document

Country Status (1)

Country Link
CN (1) CN106126734B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090112865A1 (en) * 2007-10-26 2009-04-30 Vee Erik N Hierarchical structure entropy measurement methods and systems
US20140012849A1 (en) * 2012-07-06 2014-01-09 Alexander Ulanov Multilabel classification by a hierarchy
CN103092931A (en) * 2012-12-31 2013-05-08 武汉传神信息技术有限公司 Multi-strategy combined document automatic classification method
CN103106262A (en) * 2013-01-28 2013-05-15 新浪网技术(中国)有限公司 Method and device of file classification and generation of support vector machine model
US20140324871A1 (en) * 2013-04-30 2014-10-30 Wal-Mart Stores, Inc. Decision-tree based quantitative and qualitative record classification
CN105630931A (en) * 2015-12-22 2016-06-01 浪潮软件集团有限公司 Document classification method and device

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934068A (en) * 2017-04-10 2017-07-07 江苏东方金钰智能机器人有限公司 The method that robot is based on the semantic understanding of environmental context
CN107391674A (en) * 2017-07-21 2017-11-24 北京神州泰岳软件股份有限公司 A kind of new class method for digging and device
CN107391674B (en) * 2017-07-21 2020-04-10 中科鼎富(北京)科技发展有限公司 New type mining method and device
CN108647996A (en) * 2018-04-11 2018-10-12 中山大学 A kind of personalized recommendation method and system based on Spark
CN108647996B (en) * 2018-04-11 2022-04-19 中山大学 Spark-based personalized recommendation method and system
CN109992673A (en) * 2019-04-10 2019-07-09 广东工业大学 A kind of knowledge mapping generation method, device, equipment and readable storage medium storing program for executing
WO2021000675A1 (en) * 2019-07-04 2021-01-07 平安科技(深圳)有限公司 Method and apparatus for machine reading comprehension of chinese text, and computer device
CN110427614A (en) * 2019-07-16 2019-11-08 深圳追一科技有限公司 Construction method, device, electronic equipment and the storage medium of paragraph level
CN110427614B (en) * 2019-07-16 2023-08-08 深圳追一科技有限公司 Construction method and device of paragraph level, electronic equipment and storage medium
CN110781272A (en) * 2019-09-10 2020-02-11 杭州云深科技有限公司 Text matching method and device and storage medium
CN110941958A (en) * 2019-11-15 2020-03-31 腾讯云计算(北京)有限责任公司 Text category labeling method and device, electronic equipment and storage medium
CN111177375A (en) * 2019-12-16 2020-05-19 医渡云(北京)技术有限公司 Electronic document classification method and device
CN111026920A (en) * 2019-12-17 2020-04-17 深圳云天励飞技术有限公司 File merging method and device, electronic equipment and storage medium
CN111552805A (en) * 2020-04-16 2020-08-18 重庆大学 Question and answer system question and sentence intention identification method
CN112487190B (en) * 2020-12-13 2022-04-19 天津大学 Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN112487190A (en) * 2020-12-13 2021-03-12 天津大学 Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN112487194A (en) * 2020-12-17 2021-03-12 平安消费金融有限公司 Document classification rule updating method, device, equipment and storage medium
CN112948633A (en) * 2021-04-01 2021-06-11 北京奇艺世纪科技有限公司 Content tag generation method and device and electronic equipment
CN112948633B (en) * 2021-04-01 2023-09-05 北京奇艺世纪科技有限公司 Content tag generation method and device and electronic equipment
CN113312903A (en) * 2021-05-27 2021-08-27 云南大学 Method and system for constructing word stock of 5G mobile service product
CN113312903B (en) * 2021-05-27 2022-04-19 云南大学 Method and system for constructing word stock of 5G mobile service product

Also Published As

Publication number Publication date
CN106126734B (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN106126734A (en) The sorting technique of document and device
Rathi et al. Sentiment analysis of tweets using machine learning approach
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
Ghosh et al. A tutorial review on Text Mining Algorithms
Hong et al. The feature selection method based on genetic algorithm for efficient of text clustering and text classification
CN108052593A Subject keyword extraction method based on subject term vectors and network structure
US20090037440A1 (en) Streaming Hierarchical Clustering
CN106570128A (en) Mining algorithm based on association rule analysis
Archambeau et al. Latent IBP compound Dirichlet allocation
Jantawan et al. A comparison of filter and wrapper approaches with data mining techniques for categorical variables selection
Mishra et al. Text document clustering on the basis of inter passage approach by using K-means
WO2022183923A1 (en) Phrase generation method and apparatus, and computer readable storage medium
CN106874469A News roundup generation method and system
Yun et al. An efficient approach for mining weighted approximate closed frequent patterns considering noise constraints
Olivas et al. An application of the FIS-CRM model to the FISS metasearcher: Using fuzzy synonymy and fuzzy generality for representing concepts in documents
Pham et al. An efficient method for mining top-K closed sequential patterns
CN109829054A File classification method and system
Altinel et al. A simple semantic kernel approach for SVM using higher-order paths
Wen et al. Ontology learning by clustering based on fuzzy formal concept analysis
Yang Algorithm study of new association rules and classification rules in data mining
Keyan et al. Multi-document and multi-lingual summarization using neural networks
Kabir et al. A novel approach to mining maximal frequent itemsets based on genetic algorithm
Eclarin et al. A novel feature hashing with efficient collision resolution for bag-of-words representation of text data
Rakib et al. Improving short text clustering by similarity matrix sparsification
Nürnberger et al. User modelling for interactive user-adaptive collection structuring

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant