CN106126734A - Document classification method and apparatus - Google Patents
- Publication number
- CN106126734A CN106126734A CN201610519971.6A CN201610519971A CN106126734A CN 106126734 A CN106126734 A CN 106126734A CN 201610519971 A CN201610519971 A CN 201610519971A CN 106126734 A CN106126734 A CN 106126734A
- Authority
- CN
- China
- Prior art keywords
- document
- subtree
- sorted
- vector
- documents
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present invention provide a document classification method and apparatus. The method includes: converting each word in the documents to be classified into a vector by training a deep neural network language model; clustering the vectors to generate sets of similar words; converting the documents to be classified into a feature-frequency inverse-document matrix according to the sets of features; converting the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified; and dynamically cutting the hierarchical clustering tree at different heights based on preset termination conditions to obtain the classified documents. Because the invention takes the contextual information of each word in its specific context into account during classification, each class of documents scores highly on semantic comprehensibility and semantic distinguishability. Further, cutting the hierarchical clustering tree at different heights based on preset termination conditions avoids large disparities in the number of documents per class, so the classification of the documents is more reasonable.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a document classification method and a document classification apparatus.
Background art
On the Internet, the explosive growth of information makes information difficult to manage and use. Because information and structure of potential value lie hidden behind web data, web data mining has developed rapidly and found wide application in recent years. Document clustering is one of the most important tools in web data mining; prior-art document clustering methods mainly include K-means, hierarchical clustering, and the like.
However, prior-art document clustering methods still suffer from the following problems. They do not consider the contextual information of a word within its specific context when classifying documents, so the resulting document classes score poorly on semantic comprehensibility and semantic distinguishability and are hard to interpret. In addition, when cutting the clustering tree (dendrogram), prior-art methods can only cut at a single height and usually require the number of document classes to be specified manually in advance, so the numbers of documents in different classes can differ greatly and be extremely unbalanced; dynamic, reasonable classification of the documents cannot be achieved.
It can thus be seen that prior-art document clustering methods generally suffer from low semantic comprehensibility, low semantic distinguishability, and unreasonable classification.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a document classification method and apparatus, so as to address the problems of prior-art document clustering methods: low semantic comprehensibility, low semantic distinguishability, and unreasonable classification.
To solve the above problems, according to one aspect of the present invention, a document classification method is disclosed, including:
converting each word in the documents to be classified into a vector by training a deep neural network language model;
clustering the vectors to generate sets of similar words, where each set of similar words represents one feature;
converting the documents to be classified into a feature-frequency inverse-document matrix according to the sets of features;
converting the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified; and
dynamically cutting the hierarchical clustering tree at different heights based on preset termination conditions to obtain the classified documents.
According to another aspect of the present invention, a document classification apparatus is also disclosed, including:
a first conversion module, configured to convert each word in the documents to be classified into a vector by training a deep neural network language model;
a clustering module, configured to cluster the vectors to generate sets of similar words, where each set of similar words represents one feature;
a second conversion module, configured to convert the documents to be classified into a feature-frequency inverse-document matrix according to the sets of features;
a third conversion module, configured to convert the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified; and
a cutting module, configured to dynamically cut the hierarchical clustering tree at different heights based on preset termination conditions to obtain the classified documents.
Compared with the prior art, the embodiments of the present invention have the following advantages.
The embodiments featurize document words by means of a deep neural network model and cluster the vectors of similar words, then perform the subsequent classification on the basis of the resulting features. The contextual information of each word in its specific context is therefore taken into account during classification, so each class of documents scores highly on semantic comprehensibility and semantic distinguishability. In addition, the embodiments cut the hierarchical clustering tree at different heights based on preset termination conditions, which avoids large disparities in the number of documents per class; the classification can adapt dynamically to the number of documents each subclass contains, so the classification of the documents is more reasonable.
Further, the embodiments use a deep model that considers contextual word-order information, which markedly improves the expressiveness of the features. By combining the features of the deep neural network language model with named-entity features, clustering yields feature sets of similar phrases within their specific contexts, unlike the prior art, which ignores the relations between the objects described by document subclasses. Because both similar words and Chinese named entities are featurized, the documents within each class are close in context and semantics, and the classification effect is good. Moreover, the feature-frequency inverse-document matrix is generated from the feature sets, so each column of the matrix corresponds to one feature, i.e. one set of similar phrases, and the classification results are tied to the actual contexts of the words. Further, each element of the feature-frequency inverse-document matrix is the weight of one document under one feature, so each final class contains documents from similar contexts, making the classification more reasonable and easier to understand.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of one embodiment of a document classification method of the present invention;
Fig. 2 is a flowchart of the steps of another embodiment of a document classification method of the present invention;
Fig. 3 is a flowchart of the steps of an embodiment of a dynamic cutting method for a hierarchical clustering tree of the present invention;
Fig. 4 is a flowchart of the steps of yet another embodiment of a document classification method of the present invention;
Fig. 5 is a structural block diagram of one embodiment of a document classification apparatus of the present invention;
Fig. 6 is a structural block diagram of another embodiment of a document classification apparatus of the present invention.
Detailed description of the invention
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
One of the core ideas of the embodiments of the present invention is the following: featurize document words by means of a deep neural network model and cluster the vectors of similar words, then perform the subsequent classification on the basis of the resulting features, so that the contextual information of each word in its specific context is taken into account during classification and each class of documents scores highly on semantic comprehensibility and semantic distinguishability; in addition, cut the hierarchical clustering tree at different heights based on preset termination conditions, which avoids large disparities in the number of documents per class and allows the classification to adapt dynamically to the number of documents each subclass contains, making the classification more reasonable.
Referring to Fig. 1, a flowchart of the steps of one embodiment of a document classification method of the present invention is shown. The method may specifically include the following steps.
Step 101: convert each word in the documents to be classified into a vector by training a deep neural network language model.
A deep neural network language model (for example word2vec) may be trained on a corpus so that each word in the documents to be classified (for example doc1, doc2, doc3, and so on) is described as a one-dimensional word vector, yielding a dictionary of word vectors.
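As a sketch of this step, the toy function below maps each word to a real-valued vector built from window co-occurrence counts. It is only an illustrative stand-in: the patent's word2vec model learns dense low-dimensional vectors with a neural network, but the result has the same shape, namely a dictionary mapping each word to one vector.

```python
from collections import defaultdict

def toy_word_vectors(sentences, window=2):
    """Map each word to a co-occurrence-count vector over the vocabulary.

    Stand-in for word2vec: a real deep language model learns dense
    low-dimensional vectors, but likewise yields one real-valued vector
    per word, forming a dictionary of word vectors.
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = defaultdict(lambda: [0.0] * len(vocab))
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][index[sent[j]]] += 1.0
    return dict(vectors)

# Two tiny "documents", already segmented into words.
docs = [["the", "movie", "was", "good"], ["the", "film", "was", "good"]]
vectors = toy_word_vectors(docs)
# "movie" and "film" share the same contexts, so their vectors coincide.
```

Words with similar contexts end up with similar vectors, which is exactly the property the next step's clustering relies on.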
Step 103: cluster the vectors to generate sets of similar words, where each set of similar words includes multiple vectors representing the same feature.
All the word vectors in the dictionary may be clustered to obtain sets of similar phrases. Because the phrases contained in each set are similar, each set can, for ease of understanding, be represented as one feature, yielding a collection of features.
Step 105: convert the documents to be classified into a feature-frequency inverse-document matrix according to the sets of features.
Through steps 101 and 103 above, the documents have been converted into sets of features, so they can now be converted, on the basis of those sets, into a feature-frequency inverse-document matrix TFIDF-feature. TFIDF-feature resembles the traditional term-frequency inverse-document-frequency matrix TFIDF; the difference is that TFIDF-feature is formed from the sets of features, so each row or column of the matrix represents one feature, i.e. a set of similar phrases, rather than a single word as in the prior-art TFIDF matrix. Here TFIDF is TF*IDF, where TF (term frequency) is the frequency with which a term occurs in document d, and IDF is the inverse document frequency.
Step 107: convert the feature-frequency inverse-document matrix into a hierarchical clustering tree by computing the similarity between the vectors of any two documents to be classified in the matrix.
The similarity between the vectors of any two documents doc in the feature-frequency inverse-document matrix TFIDF-feature may be computed and, based on the results, the vectors of the corresponding pairs of documents merged, thereby converting the matrix into a tree.
Step 109: dynamically cut the hierarchical clustering tree at different heights based on preset termination conditions to obtain the classified documents.
The generated hierarchical clustering tree may be dynamically cut at different heights based on preset cutting termination conditions, yielding multiple subtrees, i.e. multiple classes of documents, and thereby achieving a reasonable distribution of the documents.
By means of the technical solution of the above embodiment, document words are featurized by a deep neural network model, the vectors of similar words are clustered, and the subsequent classification is performed on the basis of the resulting features; the contextual information of each word in its specific context is thus taken into account during classification, so each class of documents scores highly on semantic comprehensibility and semantic distinguishability. In addition, cutting the hierarchical clustering tree at different heights based on preset termination conditions avoids large disparities in the number of documents per class and allows the classification to adapt dynamically to the number of documents each subclass contains, making the classification more reasonable.
Referring to Fig. 2, a flowchart of the steps of another embodiment of a document classification method of the present invention is shown. The method may specifically include the following steps.
Step 101a: apply word segmentation to the documents to be classified, obtaining the word set contained in each document.
Word segmentation may be applied to each of the documents doc1, doc2, doc3, so that each document corresponds to one word set, yielding multiple word sets.
Step 101b: convert each word in the documents to be classified into a vector by training a deep neural network language model.
Word2vec may be trained on the corpus (i.e. the documents to be classified doc1, doc2, doc3) to convert each word word1, word2, word3, ..., wordm of doc1, doc2, doc3 into a one-dimensional real-valued vector wi (i = 1, 2, ..., m) of length d, where m is the total number of words in the corpus.
The vector length d can be determined from the total number of words in the corpus. Specifically, whereas the documents may contain tens of thousands of distinct words, the word vectors obtained by training the deep neural network model convert that high-dimensional space (tens of thousands of dimensions) into expressive low-dimensional vectors (for example 200 dimensions). The choice of dimensionality is thus related to the total word count, and in practice the vector length can be set to a few hundred dimensions.
Step 103a: cluster the vectors, and take each cluster whose difference is below a preset difference threshold as one set of similar words, where each set includes multiple vectors from similar contexts and each set of similar words represents one feature.
The vectors wi may be clustered by a clustering method and each cluster compared with a preset difference threshold (for example 1.2); when a cluster's difference is below 1.2, the vectors in that cluster are taken as one set of similar words. Multiple similar-word-set features are thereby obtained.
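This thresholded grouping can be sketched as a greedy single-pass scheme; the patent fixes only the threshold comparison, not the clustering algorithm itself, so both the greedy strategy and the threshold value below (0.2 on a 1-minus-cosine distance, rather than the patent's example value 1.2, whose scale is unspecified) are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similar_word_sets(vectors, max_distance=0.2):
    """Greedily group word vectors: a word joins the first group whose
    founding vector is within max_distance (distance = 1 - cosine);
    otherwise it founds a new group. Each group is one similar-word-set
    feature."""
    groups = []  # pairs of (founding vector, member words)
    for word, vec in vectors.items():
        for rep, members in groups:
            if 1.0 - cosine(rep, vec) < max_distance:
                members.append(word)
                break
        else:
            groups.append((vec, [word]))
    return [members for _, members in groups]

features = similar_word_sets(
    {"movie": [1.0, 0.0], "film": [1.0, 0.05], "bank": [0.0, 1.0]})
# → [["movie", "film"], ["bank"]]
```

Words whose vectors point in nearly the same direction fall into one set; each resulting set then acts as a single feature in the matrix built in step 105.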
Step 103b: replace the words belonging to the different named-entity classes in the documents to be classified with their respective entity sets, where each entity set represents one feature.
The words of the Chinese named-entity classes in the documents to be classified, such as times, person names, organizations, and geographic terms, may each be featurized separately, yielding one feature per named-entity class.
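The entity featurization amounts to a substitution pass over the tokens. A real system would run a Chinese named-entity recognizer; the tiny lexicon below is a hypothetical stand-in used only to show the replacement:

```python
def replace_entities(tokens, lexicon):
    """Replace every token found in the entity lexicon with its class
    label, so that all person names collapse into one PERSON feature,
    all place names into one LOCATION feature, and so on."""
    return [lexicon.get(tok, tok) for tok in tokens]

# Hypothetical lexicon; in practice the labels come from an NER model.
lexicon = {"name1": "PERSON", "name2": "PERSON", "loc1": "LOCATION"}
tagged = replace_entities(["name1", "visited", "loc1", "with", "name2"], lexicon)
# → ["PERSON", "visited", "LOCATION", "with", "PERSON"]
```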
Step 105: convert the documents to be classified into a feature-frequency inverse-document matrix according to the sets of features.
The similar-word-set features of the deep neural network language model and the features of the different named-entity classes may be merged into the feature set produced by the feature engineering.
Once the feature set is obtained, the documents to be classified can be converted into a feature-frequency inverse-document matrix of order m*n, where m is the number of documents to be classified, n is the number of features, and the element (x, y) in row x and column y of the matrix is the weight of document x under feature y.
Of course, in other embodiments m may instead be the number of features and n the number of documents to be classified.
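A sketch of building the m*n matrix over feature sets rather than single words. The tf * log(N/df) weighting below is the textbook TFIDF form and is an assumption on my part; the patent's own worked example later uses the simpler tf * 1/df:

```python
import math

def tfidf_feature_matrix(docs, feature_sets):
    """m x n matrix: row x is a document, column y is a feature (a set
    of similar words); entry (x, y) is the weight of document x under
    feature y."""
    n_docs = len(docs)
    # Document frequency of a feature: documents containing any of its words.
    df = [sum(1 for d in docs if set(d) & fs) for fs in feature_sets]
    matrix = []
    for doc in docs:
        row = []
        for fs, dfy in zip(feature_sets, df):
            tf = sum(1 for w in doc if w in fs)  # feature frequency in doc
            row.append(tf * math.log(n_docs / dfy) if dfy else 0.0)
        matrix.append(row)
    return matrix

docs = [["film", "good"], ["movie", "bad"], ["bad", "day"]]
feature_sets = [{"film", "movie"}, {"good"}, {"bad"}]
mat = tfidf_feature_matrix(docs, feature_sets)  # 3 documents x 3 features
```

Note that "film" and "movie" contribute to the same column, which is the whole point of clustering similar words into one feature first.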
Step 107a: compute the cosine similarity between the vectors of any two documents to be classified in the feature-frequency inverse-document matrix.
Step 107b: generate the hierarchical clustering tree by merging the vectors of the two documents with the greatest cosine similarity.
The similarity between the vectors of any two documents in the matrix may be computed and the vectors of the two most similar documents merged into a new vector; the similarity between the new vector and each remaining vector is then computed, the two most similar vectors are again merged, and so on, thereby converting the feature-frequency inverse-document matrix into a hierarchical clustering tree.
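The merge loop just described can be sketched as a plain agglomerative procedure; averaging the vectors on merge is an assumption, since the patent says only that the two most similar vectors are combined:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hierarchical_tree(rows):
    """Repeatedly merge the two clusters whose vectors have the highest
    cosine similarity until one cluster remains; the recorded merges
    form the binary hierarchical clustering tree, with 1 - similarity
    as the merge height."""
    clusters = [([i], list(v)) for i, v in enumerate(rows)]
    merges = []
    while len(clusters) > 1:
        sim, i, j = max(
            ((cosine(clusters[i][1], clusters[j][1]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0])
        ids_j, vec_j = clusters.pop(j)  # pop larger index first
        ids_i, vec_i = clusters.pop(i)
        merges.append((ids_i, ids_j, 1.0 - sim))  # height of this merge
        clusters.append((ids_i + ids_j,
                         [(a + b) / 2 for a, b in zip(vec_i, vec_j)]))
    return merges

tree = hierarchical_tree([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
# The first merge joins documents 0 and 1, which are nearly parallel.
```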
Step 109: dynamically cut the hierarchical clustering tree at different heights based on preset termination conditions to obtain the classified documents.
By means of the technical solution of the above embodiment, the invention uses a deep model that considers contextual word-order information, which markedly improves the expressiveness of the features. By combining the features of the deep neural network language model with named-entity features, clustering yields feature sets of similar phrases within their specific contexts, unlike the prior art, which ignores the relations between the objects described by document subclasses. Because both similar words and Chinese named entities are featurized, the documents within each class are close in context and semantics, and the classification effect is good. Moreover, the feature-frequency inverse-document matrix is generated from the feature sets, so each column of the matrix corresponds to one feature, i.e. one set of similar phrases, and the classification results are tied to the actual contexts of the words. Further, each element of the matrix is the weight of one document under one feature, so each final class contains documents from similar contexts, making the classification more reasonable and easier to understand.
In another embodiment, step 109 above, in which the hierarchical clustering tree is dynamically cut at different heights based on preset termination conditions to obtain the classified documents, may be implemented as follows. Referring to Fig. 3, a flowchart of the steps of an embodiment of a dynamic cutting method for a hierarchical clustering tree of the present invention is shown. The method may specifically include the following steps.
Step 301: starting from the root node, bisect the hierarchical clustering tree, obtaining two subtrees.
Because the hierarchical clustering tree is formed by pairwise merging of the document vectors in the feature-frequency inverse-document matrix, it is a binary tree; bisecting it from the parent node therefore yields two subtrees.
Step 303: compute the height of each subtree and the number of documents it contains.
The height of a subtree is the degree of dissimilarity between the documents it contains, and the documents it contains can be determined by counting the nodes it comprises.
Step 305: for each subtree, judge whether its height meets a first preset termination condition.
For each subtree, it can be judged whether its height is less than or equal to a preset termination height; if so, the condition is met, otherwise it is not.
Step 307: for each subtree, judge whether the number of documents it contains meets a second preset termination condition.
For each subtree, it is judged whether the number of documents it contains is less than or equal to a preset termination count; if so, the condition is met, otherwise it is not.
Step 309: for each subtree, when its height meets the first preset termination condition or the number of documents it contains meets the second preset termination condition, stop applying the bisection of step 301 to that subtree.
Step 311: for each subtree, when its height does not meet the first preset termination condition and the number of documents it contains does not meet the second preset termination condition either, continue to apply steps 301 to 311 recursively to that subtree from its root node.
When the bisection has stopped for every subtree, the total number of subtrees is the number of classes of documents (i.e. the number of classes into which the documents to be classified are divided), and the number of documents each class contains is the number of nodes of the corresponding subtree at the point when no subtree is cut any further.
By means of the technical solution of the above embodiment, the dynamic tree-cutting strategy obtains subtrees by cutting at different heights; based on binary-tree traversal and the thresholds of the predetermined termination conditions, even deep subtrees can be split effectively, and the sizes of the document classes are well balanced.
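The recursion of steps 301 to 311 can be sketched compactly, under the assumption that the tree is represented as nested (height, left, right) tuples with document ids at the leaves; h_max and n_min play the roles of the first and second preset termination conditions:

```python
def leaves(tree):
    """Document ids contained in a subtree (a leaf is a bare id)."""
    if not isinstance(tree, tuple):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)

def dynamic_cut(tree, h_max, n_min):
    """Recursively bisect the hierarchical clustering tree: a subtree
    whose height is <= h_max, or which holds <= n_min documents,
    becomes one final class; otherwise it is cut into its two children
    and each child is processed the same way."""
    if not isinstance(tree, tuple):
        return [leaves(tree)]
    height, left, right = tree
    docs = leaves(tree)
    if height <= h_max or len(docs) <= n_min:
        return [docs]
    return dynamic_cut(left, h_max, n_min) + dynamic_cut(right, h_max, n_min)

# The root is tall; the (0, 1) and (2, 3) pairs are tight.
tree = (1.0, (0.2, 0, 1), (0.9, (0.1, 2, 3), 4))
classes = dynamic_cut(tree, h_max=0.3, n_min=1)
# → [[0, 1], [2, 3], [4]]
```

Because each branch stops independently, the cuts land at different heights, which is exactly what a single-height cut of a dendrogram cannot do.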
To better understand the above technical solution, it is described in detail below with reference to a specific embodiment.
Referring to Fig. 4, a flowchart of the steps of yet another embodiment of a document classification method of the present invention is shown. The method may specifically include the following steps.
Step 401: input the n documents to be clustered Di (i = 1, 2, ..., n) (for example doc1, doc2, doc3, ..., docn) and apply word-segmentation preprocessing, obtaining the corpus.
Step 403a: perform feature engineering on the words of the n documents to be clustered, obtaining the feature clusters of the word2vec deep neural network.
Specifically, first train the word2vec model and express each word in the corpus as a one-dimensional real-valued vector wi (i = 1, 2, ..., m) of length d, where m is the number of words in the corpus; then cluster these word vectors wi to generate T sets of related phrases Fi = {wj, ...} (i = 1, 2, ..., T). Each generated set Fi represents one feature and contains the vectors of several words {word1, word2, ...} from similar contexts.
Step 403b: perform feature engineering on the n documents to be clustered, obtaining the feature clusters of NER entity recognition.
Specifically, use named-entity recognition (NER) to uniformly replace the person-name phrases in the n documents to be clustered with the feature PERSON (covering, for example, {name1, name2, ...}), replace place words with LOCATION (covering, for example, {loc1, loc2, ...}), and so on.
Finally, merge the feature clusters of the deep network model with the feature clusters of the NER named entities, obtaining Fi (i = 1, 2, ..., T+2) as the feature set for analyzing the text.
Step 405: generate the document term-vector matrix according to the feature set.
Specifically, according to the generated feature set Fi (i = 1, 2, ..., T+2), convert the set of documents Di (i = 1, 2, ..., n) into the inverse-feature-frequency matrix TfIdf-feature. This matrix differs from the traditional TfIdf matrix as follows: each column of the traditional TfIdf matrix represents only a single word, whereas in the TfIdf matrix used in this embodiment each column represents a feature set, i.e. a set containing multiple words that are close within a specific context, rather than a single word. Each real-valued vector in the matrix represents the distribution of one document under a certain feature, and the values in each column represent the documents' weights under that feature set.
For example, if the word "film" occurs twice in document D1 and "film" occurs in 10 documents, then the weight of the word "film" for document D1 in the TfIdf-feature matrix is 2*1/10.
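That worked example in code; note that the patent's illustrative weighting multiplies the term frequency by the plain reciprocal of the document frequency, not by its logarithm:

```python
tf = 2                   # "film" occurs twice in document D1
df = 10                  # "film" occurs in 10 documents overall
weight = tf * (1 / df)   # the patent's illustrative weighting, 2*1/10
# → 0.2
```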
Step 407: cluster the feature-set matrix, generating the hierarchical clustering tree.
The method of generating the hierarchical clustering tree can be described as follows. In the initial state, the N objects to be clustered are divided into N classes; at each iteration, the inter-class distances are obtained by computing the cosine similarity between subclasses, and the two closest classes are merged. Iteration continues until all N objects have been merged into one class, and the merging process constitutes a hierarchical clustering tree.
Specifically, the vector similarity between any two documents di and dj can be computed by cosine similarity, e.g. Similarity(di, dj) = cosine(vi, vj), where vi and vj are the vector data of the rows corresponding to documents di and dj in the TfIdf-feature matrix; the hierarchical clustering tree is then generated according to the similarity distance (i.e. the dissimilarity 1 - Similarity(di, dj)).
Step 409: use the dynamic cutting strategy to cut the hierarchical clustering tree.
For any tree Tk, denote the two subtrees of the binary tree as Tk1 and Tk2, and bisect Tk; after cutting, compute the heights hk1 and hk2 of the two subtrees, and judge whether each subtree meets either of the two termination conditions: hki <= hmax or Nki <= Nmin (i = 1, 2). If either condition is met, the traversal and cutting of that subtree terminates; if neither is met, the subtree Tki continues to be cut, this strategy being applied recursively. The recursion stops when the termination conditions are reached, generating K* subclasses in total; the vector of subtree heights is (h1, h2, ..., hK*), and the numbers of documents contained in the subsets are {N1, N2, ..., NK*} respectively, all satisfying hk <= hmax or Nk <= Nmin. Thus, when the termination conditions are reached, i.e. when no subtree is cut any further, the set of documents to be clustered Di (i = 1, 2, ..., n) has been cut into K* subclasses Ck (k = 1, 2, ..., K*); each subclass Ck contains Nk documents, namely all the nodes comprised by the corresponding cut subtree.
In addition, in another embodiment, after the cutting of all subtrees is complete, the cut subtrees may further be merged on the basis of Chinese semantics, generating classes of documents that are semantically close and easy to understand and distinguish.
In the prior art when hierarchical clustering tree is cut, it is all to predefine number K of subclass or identical
Height of tree degree height at cut.The shortcoming of this Cut Stratagem is the subclass number in K the subclass that cutting generates
Nk(k=1,2 ..K) very different.The dynamically cutting tree strategy of the present embodiment is then based on to binary tree traversal traversal of binary tree and in advance
The threshold value of the end condition determined (includes that the threshold value of subtree height reachesOr subclass number reaches Nmin) in difference
Highly place's cutting obtains subtree, promotes the cutting effect of subtree, makes document classification effect obvious;And the embodiment of the present invention can be
On the basis of dynamic cutting tree, the name entity according to description object is such as: name PERSON, tissue ORGANIZATION etc. are to newly
The subclass generated carries out merging semantically, increases the intelligibility of clustering documents.
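As a sketch of the entity-based merge just described, subsets produced by the dynamic cut could be merged when they share the same dominant named-entity class (PERSON, ORGANIZATION, etc.). The dominant-class heuristic and the data layout here are illustrative assumptions, not the patent's exact merge rule.

```python
from collections import Counter
from typing import Dict, List, Tuple

def dominant_entity(entity_lists: List[List[str]]) -> str:
    """Most frequent named-entity class over all documents of one subset."""
    counts = Counter(e for ents in entity_lists for e in ents)
    return counts.most_common(1)[0][0] if counts else "NONE"

def merge_by_entity(subsets: List[Tuple[List[int], List[List[str]]]]
                    ) -> Dict[str, List[int]]:
    """subsets: (document ids, per-document entity classes) for each subset
    from the dynamic cut; subsets with the same dominant entity class are
    merged into one class of classified documents."""
    merged: Dict[str, List[int]] = {}
    for doc_ids, entity_lists in subsets:
        merged.setdefault(dominant_entity(entity_lists), []).extend(doc_ids)
    return merged
```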
It should be noted that the method embodiments are, for simplicity of description, expressed as a series of combined actions. However, those skilled in the art will appreciate that the embodiments of the present invention are not limited by the described order of actions, since according to embodiments of the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art will also appreciate that the embodiments described in this specification are preferred embodiments, and the actions involved are not necessarily required by embodiments of the present invention.
Referring to Fig. 5, a structural block diagram of an embodiment of a document classification apparatus of the present invention is shown, which may specifically include the following modules:
a first conversion module 51, configured to convert each word segment in the documents to be classified into a vector by training a deep neural network language model;
a clustering module 52, configured to generate similar-word-segment sets by clustering the vectors, wherein each similar-word-segment set includes a plurality of vectors representing the same feature;
a second conversion module 53, configured to convert the documents to be classified into a feature frequency inverse document matrix according to the sets of features;
a third conversion module 54, configured to convert the feature frequency inverse document matrix into a hierarchical clustering tree by calculating the similarity between the vectors of any two documents to be classified in the matrix;
a cutting module 55, configured to dynamically cut the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents.
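The conversion performed by the second conversion module 53 can be illustrated with a standard tf-idf weighting, one common way to realize an m*n feature frequency inverse document matrix. The toy corpus and the exact weighting formula below are assumptions for illustration; the patent only specifies that each element is a document's weighted value under a feature.

```python
import math
from typing import List

def tfidf_matrix(docs: List[List[str]], features: List[str]) -> List[List[float]]:
    """m x n matrix: rows are documents, columns are features; each element
    is the document's tf-idf weight under that feature."""
    m = len(docs)
    df = {f: sum(f in d for d in docs) for f in features}  # document frequency
    matrix = []
    for d in docs:
        row = []
        for f in features:
            tf = d.count(f) / len(d)                      # feature frequency
            idf = math.log(m / df[f]) if df[f] else 0.0   # inverse doc freq
            row.append(tf * idf)                          # weight of doc under f
        matrix.append(row)
    return matrix

# documents whose word segments have already been mapped to feature ids
docs = [["f1", "f1", "f2"], ["f2", "f3"], ["f1", "f3", "f3"]]
features = sorted({f for d in docs for f in d})
tfidf = tfidf_matrix(docs, features)                      # 3 x 3 matrix
```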
By means of the above technical solution of the embodiments of the present invention, the word segments of documents are characterized by means of a deep neural network model, the vectors of similar word segments are clustered, and subsequent classification is performed on the basis of the features obtained by clustering. The contextual information of word segments in their specific context is thus taken into account during document classification, so that each class of documents has high semantic intelligibility and semantic recognizability. In addition, the embodiments of the present invention cut the hierarchical clustering tree at different heights based on preset termination conditions, avoiding the problem of widely differing document counts among the classes, and can classify dynamically according to the number of documents contained in each subset, making the classification of documents more reasonable.
In another embodiment, referring to Fig. 6, a structural block diagram of another embodiment of a document classification apparatus of the present invention is shown, which may specifically further include the following modules:
a word segmentation module 50, configured to perform word segmentation on the documents to be classified to obtain the set of word segments contained in each document to be classified;
the first conversion module 51, which is identical to the first conversion module 51 of the embodiment shown in Fig. 5 and is not described again here;
the clustering module 52, configured to cluster the vectors and take vectors whose clustering-result difference is less than a preset difference value as one similar-word-segment set, wherein a similar-word-segment set includes a plurality of vectors under similar contexts, and each similar-word-segment set represents one feature;
a replacement module 56, configured to replace the word segments belonging to different named-entity classes in the documents to be classified with different entity sets respectively, wherein each entity set represents one feature;
the second conversion module 53, which is identical to the second conversion module 53 of the embodiment shown in Fig. 5 and is not described again here;
wherein the feature frequency inverse document matrix is an m*n feature frequency inverse document matrix, in which m is the number of documents to be classified and n is the number of features; further, each element in the feature frequency inverse document matrix represents the weighted value of one document to be classified under one feature.
The third conversion module 54 includes the following submodules:
a first calculation submodule 54a, configured to calculate the cosine similarity between the vectors of any two documents to be classified in the feature frequency inverse document matrix;
an aggregation submodule 54b, configured to combine the vectors of the two documents to be classified with the maximum cosine similarity in the feature frequency inverse document matrix to generate the hierarchical clustering tree.
The cutting module 55 includes the following submodules:
a cutting submodule 55a, configured to perform, on the hierarchical clustering tree, a binary cut starting from the root node to obtain two subtrees;
a second calculation submodule 55b, configured to calculate the height of each subtree and the number of documents it contains;
a first judgment submodule 55c, configured to judge, for each subtree, whether the height of the subtree meets a first preset termination condition; wherein the first judgment submodule 55c judges, for each subtree, whether the height of the subtree is less than or equal to a preset termination height, and if so the condition is met, otherwise it is not met;
a second judgment submodule 55d, configured to judge, for each subtree, whether the number of documents contained in the subtree meets a second preset termination condition; wherein the second judgment submodule 55d judges, for each subtree, whether the number of documents contained in the subtree is less than or equal to a preset termination quantity, and if so the condition is met, otherwise it is not met;
a stopping submodule 55e, configured to stop, for each subtree, continuing to perform the binary cut on the subtree when it is judged that the height of the subtree meets the first preset termination condition or that the number of documents contained in the subtree meets the second preset termination condition;
the cutting submodule 55a being further configured to, for each subtree, when it is judged that the height of the subtree does not meet the first preset termination condition and the number of documents contained in the subtree does not meet the second preset termination condition either, continue to perform the binary cut on the subtree starting from its root node;
wherein, when the binary cutting step has been stopped for every subtree, the total number of subtrees generated is the number of classes of the classified documents.
A determination module 57 is configured to determine the length of the vectors based on the total number of word segments of the documents to be classified.
Herein, the number of classes of the classified documents is the total number of subtrees generated when the binary cut has been stopped for every subtree, and the number of documents contained in each class of classified documents is the number of all nodes contained in the corresponding subtree.
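The behavior of submodules 54a and 54b can be sketched as follows: repeatedly find the pair of vectors with the maximum cosine similarity and combine them, growing the hierarchical clustering tree bottom-up. Representing a merged pair by its centroid is an illustrative assumption; the patent does not fix the combination rule.

```python
import math
from typing import List, Tuple, Union

Tree = Union[int, Tuple["Tree", "Tree"]]   # leaf doc id, or (left, right) pair

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def build_tree(vectors: List[List[float]]) -> Tree:
    # each cluster carries its tree so far and a centroid vector
    clusters = [(i, list(v)) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        # 54a: find the most cosine-similar pair of clusters
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        (ti, ci), (tj, cj) = clusters[i], clusters[j]
        # 54b: combine the pair into one node of the clustering tree
        merged = ((ti, tj), [(x + y) / 2 for x, y in zip(ci, cj)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

For example, with three document vectors of which the first two point in nearly the same direction, those two are merged first and the third joins last.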
By means of the technical solutions of the above embodiments of the present invention, a depth model is used to take the information of contextual word order into account, so that the expressiveness of the features is significantly improved. The deep neural network language model is combined with named-entity features, and clustering yields feature sets consisting of similar phrases under specific contexts, unlike the prior art, which does not account for the relationship between document subsets and the objects they describe. By characterizing both similar word segments and Chinese named entities as features, the documents within each class are close both in context and in semantics, yielding good classification results. In addition, the feature frequency inverse document matrix is generated based on the feature sets, so that each column of the matrix corresponds to one feature, i.e., one set of similar phrases, and the classification results are tied to the actual contexts of the word segments. Furthermore, each element value in the matrix is the weighted value of a document under a certain feature, so that each final class contains documents under similar contexts, making the classification of documents more reasonable and easier for people to comprehend.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another.
Those skilled in the art should appreciate that embodiments of the present invention may be provided as a method, a device, or a computer program product. Therefore, the embodiments of the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing terminal equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal equipment produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, which realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment, such that a series of operational steps are executed on the computer or other programmable terminal equipment to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable terminal equipment provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once aware of the basic inventive concept, can make further changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present invention.
Finally, it should also be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. In the absence of further limitation, an element defined by the phrase "including a …" does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the element.
The document classification method and document classification apparatus provided by the present invention have been introduced in detail above. Specific examples have been used herein to set forth the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.
Claims (22)
1. A document classification method, characterized by comprising:
converting each word segment in documents to be classified into a vector by training a deep neural network language model;
generating similar-word-segment sets by clustering the vectors, wherein each similar-word-segment set includes a plurality of vectors representing the same feature;
converting the documents to be classified into a feature frequency inverse document matrix according to the sets of features;
converting the feature frequency inverse document matrix into a hierarchical clustering tree by calculating the similarity between the vectors of any two documents to be classified in the matrix;
dynamically cutting the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents.
2. The method according to claim 1, characterized in that, before the step of converting each word segment in the documents to be classified into a vector by training a deep neural network language model, the method further comprises:
performing word segmentation on the documents to be classified to obtain the set of word segments contained in each document to be classified.
3. The method according to claim 1, characterized in that the feature frequency inverse document matrix is an m*n feature frequency inverse document matrix, wherein m is the number of the documents to be classified and n is the number of the features.
4. The method according to claim 3, characterized in that each element in the feature frequency inverse document matrix represents the weighted value of one document to be classified under one feature.
5. The method according to claim 1, characterized in that the length of the vectors is determined based on the total number of word segments of the documents to be classified.
6. The method according to claim 1, characterized in that the step of generating similar-word-segment sets by clustering the vectors comprises:
clustering the vectors, and forming one similar-word-segment set from vectors whose clustering-result difference is less than a preset difference value, wherein the similar-word-segment set includes a plurality of word-segment vectors under similar contexts.
7. The method according to claim 1, characterized in that, before the step of converting the documents to be classified into a feature frequency inverse document matrix according to the sets of features, the method further comprises:
replacing the word segments belonging to different named-entity classes in the documents to be classified with different entity sets respectively, wherein each entity set represents one feature.
8. The method according to claim 1, characterized in that the step of converting the feature frequency inverse document matrix into a hierarchical clustering tree by calculating the similarity between the vectors of any two documents to be classified in the matrix comprises:
calculating the cosine similarity between the vectors of any two documents to be classified in the feature frequency inverse document matrix;
combining the vectors of the two documents to be classified with the maximum cosine similarity in the feature frequency inverse document matrix to generate the hierarchical clustering tree.
9. The method according to claim 1, characterized in that the step of dynamically cutting the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents comprises:
performing, on the hierarchical clustering tree, a binary cut starting from the root node to obtain two subtrees;
calculating the height of each subtree and the number of documents it contains;
judging, for each subtree, whether the height of the subtree meets a first preset termination condition;
judging, for each subtree, whether the number of documents contained in the subtree meets a second preset termination condition;
for each subtree, if it is judged that the height of the subtree meets the first preset termination condition or that the number of documents contained in the subtree meets the second preset termination condition, stopping performing the binary cut on the subtree;
for each subtree, if it is judged that the height of the subtree does not meet the first preset termination condition and that the number of documents contained in the subtree does not meet the second preset termination condition either, continuing to perform the binary cutting step on the subtree starting from its root node;
wherein, when the binary cutting step has been stopped for every subtree, the total number of subtrees generated is the number of classes of the classified documents.
10. The method according to claim 9, characterized in that the step of judging, for each subtree, whether the height of the subtree meets the first preset termination condition comprises:
judging, for each subtree, whether the height of the subtree is less than or equal to a preset termination height; if so, the condition is met, otherwise it is not met;
and the step of judging, for each subtree, whether the number of documents contained in the subtree meets the second preset termination condition comprises:
judging, for each subtree, whether the number of documents contained in the subtree is less than or equal to a preset termination quantity; if so, the condition is met, otherwise it is not met.
11. The method according to claim 9, characterized in that the number of documents contained in each class of classified documents is the number of all nodes contained in the corresponding subtree.
12. A document classification apparatus, characterized by comprising:
a first conversion module, configured to convert each word segment in documents to be classified into a vector by training a deep neural network language model;
a clustering module, configured to generate similar-word-segment sets by clustering the vectors, wherein each similar-word-segment set includes a plurality of vectors representing the same feature;
a second conversion module, configured to convert the documents to be classified into a feature frequency inverse document matrix according to the sets of features;
a third conversion module, configured to convert the feature frequency inverse document matrix into a hierarchical clustering tree by calculating the similarity between the vectors of any two documents to be classified in the matrix;
a cutting module, configured to dynamically cut the hierarchical clustering tree at different heights based on preset termination conditions to obtain classified documents.
13. The apparatus according to claim 12, characterized in that the apparatus further comprises:
a word segmentation module, configured to perform word segmentation on the documents to be classified to obtain the set of word segments contained in each document to be classified.
14. The apparatus according to claim 12, characterized in that the feature frequency inverse document matrix is an m*n feature frequency inverse document matrix, wherein m is the number of the documents to be classified and n is the number of the features.
15. The apparatus according to claim 14, characterized in that each element in the feature frequency inverse document matrix represents the weighted value of one document to be classified under one feature.
16. The apparatus according to claim 12, characterized in that the apparatus further comprises:
a determination module, configured to determine the length of the vectors based on the total number of word segments of the documents to be classified.
17. The apparatus according to claim 12, characterized in that the clustering module is configured to cluster the vectors and form one similar-word-segment set from vectors whose clustering-result difference is less than a preset difference value, wherein the similar-word-segment set includes a plurality of vectors under similar contexts.
18. The apparatus according to claim 12, characterized in that the apparatus further comprises:
a replacement module, configured to replace the word segments belonging to different named-entity classes in the documents to be classified with different entity sets respectively, wherein each entity set represents one feature.
19. The apparatus according to claim 12, characterized in that the third conversion module comprises:
a first calculation submodule, configured to calculate the cosine similarity between the vectors of any two documents to be classified in the feature frequency inverse document matrix;
an aggregation submodule, configured to combine the vectors of the two documents to be classified with the maximum cosine similarity in the feature frequency inverse document matrix to generate the hierarchical clustering tree.
20. The apparatus according to claim 12, characterized in that the cutting module comprises:
a cutting submodule, configured to perform, on the hierarchical clustering tree, a binary cut starting from the root node to obtain two subtrees;
a second calculation submodule, configured to calculate the height of each subtree and the number of documents it contains;
a first judgment submodule, configured to judge, for each subtree, whether the height of the subtree meets a first preset termination condition;
a second judgment submodule, configured to judge, for each subtree, whether the number of documents contained in the subtree meets a second preset termination condition;
a stopping submodule, configured to stop, for each subtree, performing the binary cut on the subtree if it is judged that the height of the subtree meets the first preset termination condition or that the number of documents contained in the subtree meets the second preset termination condition;
the cutting submodule being further configured to, for each subtree, if it is judged that the height of the subtree does not meet the first preset termination condition and that the number of documents contained in the subtree does not meet the second preset termination condition either, continue to perform the binary cut on the subtree starting from its root node;
wherein, when the binary cutting step has been stopped for every subtree, the total number of subtrees generated is the number of classes of the classified documents.
21. The apparatus according to claim 20, characterized in that the first judgment submodule is configured to judge, for each subtree, whether the height of the subtree is less than or equal to a preset termination height; if so, the condition is met, otherwise it is not met;
and the second judgment submodule is configured to judge, for each subtree, whether the number of documents contained in the subtree is less than or equal to a preset termination quantity; if so, the condition is met, otherwise it is not met.
22. The apparatus according to claim 20, characterized in that the number of documents contained in each class of classified documents is the number of all nodes contained in the corresponding subtree.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610519971.6A CN106126734B (en) | 2016-07-04 | 2016-07-04 | The classification method and device of document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106126734A true CN106126734A (en) | 2016-11-16 |
CN106126734B CN106126734B (en) | 2019-06-28 |
Family
ID=57469267
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106934068A (en) * | 2017-04-10 | 2017-07-07 | 江苏东方金钰智能机器人有限公司 | The method that robot is based on the semantic understanding of environmental context |
CN107391674A (en) * | 2017-07-21 | 2017-11-24 | 北京神州泰岳软件股份有限公司 | A kind of new class method for digging and device |
CN108647996A (en) * | 2018-04-11 | 2018-10-12 | 中山大学 | A kind of personalized recommendation method and system based on Spark |
CN109992673A (en) * | 2019-04-10 | 2019-07-09 | 广东工业大学 | A kind of knowledge mapping generation method, device, equipment and readable storage medium storing program for executing |
CN110427614A (en) * | 2019-07-16 | 2019-11-08 | 深圳追一科技有限公司 | Construction method, device, electronic equipment and the storage medium of paragraph level |
CN110781272A (en) * | 2019-09-10 | 2020-02-11 | 杭州云深科技有限公司 | Text matching method and device and storage medium |
CN110941958A (en) * | 2019-11-15 | 2020-03-31 | 腾讯云计算(北京)有限责任公司 | Text category labeling method and device, electronic equipment and storage medium |
CN111026920A (en) * | 2019-12-17 | 2020-04-17 | 深圳云天励飞技术有限公司 | File merging method and device, electronic equipment and storage medium |
CN111177375A (en) * | 2019-12-16 | 2020-05-19 | 医渡云(北京)技术有限公司 | Electronic document classification method and device |
CN111552805A (en) * | 2020-04-16 | 2020-08-18 | 重庆大学 | Question and answer system question and sentence intention identification method |
WO2021000675A1 (en) * | 2019-07-04 | 2021-01-07 | 平安科技(深圳)有限公司 | Method and apparatus for machine reading comprehension of chinese text, and computer device |
CN112487190A (en) * | 2020-12-13 | 2021-03-12 | 天津大学 | Method for extracting relationships between entities from text based on self-supervision and clustering technology |
CN112487194A (en) * | 2020-12-17 | 2021-03-12 | 平安消费金融有限公司 | Document classification rule updating method, device, equipment and storage medium |
CN112948633A (en) * | 2021-04-01 | 2021-06-11 | 北京奇艺世纪科技有限公司 | Content tag generation method and device and electronic equipment |
CN113312903A (en) * | 2021-05-27 | 2021-08-27 | 云南大学 | Method and system for constructing word stock of 5G mobile service product |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090112865A1 (en) * | 2007-10-26 | 2009-04-30 | Vee Erik N | Hierarchical structure entropy measurement methods and systems |
CN103092931A (en) * | 2012-12-31 | 2013-05-08 | 武汉传神信息技术有限公司 | Multi-strategy combined document automatic classification method |
CN103106262A (en) * | 2013-01-28 | 2013-05-15 | 新浪网技术(中国)有限公司 | Method and device of file classification and generation of support vector machine model |
US20140012849A1 (en) * | 2012-07-06 | 2014-01-09 | Alexander Ulanov | Multilabel classification by a hierarchy |
US20140324871A1 (en) * | 2013-04-30 | 2014-10-30 | Wal-Mart Stores, Inc. | Decision-tree based quantitative and qualitative record classification |
CN105630931A (en) * | 2015-12-22 | 2016-06-01 | 浪潮软件集团有限公司 | Document classification method and device |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106934068A (en) * | 2017-04-10 | 2017-07-07 | 江苏东方金钰智能机器人有限公司 | The method that robot is based on the semantic understanding of environmental context |
CN107391674A (en) * | 2017-07-21 | 2017-11-24 | 北京神州泰岳软件股份有限公司 | A kind of new class method for digging and device |
CN107391674B (en) * | 2017-07-21 | 2020-04-10 | 中科鼎富(北京)科技发展有限公司 | New type mining method and device |
CN108647996A (en) * | 2018-04-11 | 2018-10-12 | 中山大学 | A kind of personalized recommendation method and system based on Spark |
CN108647996B (en) * | 2018-04-11 | 2022-04-19 | 中山大学 | Spark-based personalized recommendation method and system |
CN109992673A (en) * | 2019-04-10 | 2019-07-09 | 广东工业大学 | A kind of knowledge mapping generation method, device, equipment and readable storage medium storing program for executing |
WO2021000675A1 (en) * | 2019-07-04 | 2021-01-07 | 平安科技(深圳)有限公司 | Method and apparatus for machine reading comprehension of chinese text, and computer device |
CN110427614A (en) * | 2019-07-16 | 2019-11-08 | 深圳追一科技有限公司 | Construction method, device, electronic equipment and the storage medium of paragraph level |
CN110427614B (en) * | 2019-07-16 | 2023-08-08 | 深圳追一科技有限公司 | Paragraph hierarchy construction method and device, electronic device and storage medium |
CN110781272A (en) * | 2019-09-10 | 2020-02-11 | 杭州云深科技有限公司 | Text matching method and device and storage medium |
CN110941958A (en) * | 2019-11-15 | 2020-03-31 | 腾讯云计算(北京)有限责任公司 | Text category labeling method and device, electronic equipment and storage medium |
CN111177375A (en) * | 2019-12-16 | 2020-05-19 | 医渡云(北京)技术有限公司 | Electronic document classification method and device |
CN111026920A (en) * | 2019-12-17 | 2020-04-17 | 深圳云天励飞技术有限公司 | File merging method and device, electronic equipment and storage medium |
CN111552805A (en) * | 2020-04-16 | 2020-08-18 | 重庆大学 | Question intention recognition method for question answering systems |
CN112487190B (en) * | 2020-12-13 | 2022-04-19 | 天津大学 | Method for extracting relationships between entities from text based on self-supervision and clustering technology |
CN112487190A (en) * | 2020-12-13 | 2021-03-12 | 天津大学 | Method for extracting relationships between entities from text based on self-supervision and clustering technology |
CN112487194A (en) * | 2020-12-17 | 2021-03-12 | 平安消费金融有限公司 | Document classification rule updating method, device, equipment and storage medium |
CN112948633A (en) * | 2021-04-01 | 2021-06-11 | 北京奇艺世纪科技有限公司 | Content tag generation method and device and electronic equipment |
CN112948633B (en) * | 2021-04-01 | 2023-09-05 | 北京奇艺世纪科技有限公司 | Content tag generation method and device and electronic equipment |
CN113312903A (en) * | 2021-05-27 | 2021-08-27 | 云南大学 | Method and system for constructing word stock of 5G mobile service product |
CN113312903B (en) * | 2021-05-27 | 2022-04-19 | 云南大学 | Method and system for constructing word stock of 5G mobile service product |
Also Published As
Publication number | Publication date |
---|---|
CN106126734B (en) | 2019-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106126734A (en) | Document classification method and device | |
Rathi et al. | Sentiment analysis of tweets using machine learning approach | |
CN106844658B (en) | Automatic construction method and system of Chinese text knowledge graph | |
Ghosh et al. | A tutorial review on Text Mining Algorithms | |
Hong et al. | The feature selection method based on genetic algorithm for efficient of text clustering and text classification | |
CN108052593A (en) | A topic keyword extraction method based on word vectors and network structure |
US20090037440A1 (en) | Streaming Hierarchical Clustering | |
CN106570128A (en) | Mining algorithm based on association rule analysis | |
Archambeau et al. | Latent IBP compound Dirichlet allocation | |
Jantawan et al. | A comparison of filter and wrapper approaches with data mining techniques for categorical variables selection | |
Mishra et al. | Text document clustering on the basis of inter passage approach by using K-means | |
WO2022183923A1 (en) | Phrase generation method and apparatus, and computer readable storage medium | |
CN106874469A (en) | A news summary generation method and system |
Yun et al. | An efficient approach for mining weighted approximate closed frequent patterns considering noise constraints | |
Olivas et al. | An application of the FIS-CRM model to the FISS metasearcher: Using fuzzy synonymy and fuzzy generality for representing concepts in documents | |
Pham et al. | An efficient method for mining top-K closed sequential patterns | |
CN109829054A (en) | A text classification method and system |
Altinel et al. | A simple semantic kernel approach for SVM using higher-order paths | |
Wen et al. | Ontology learning by clustering based on fuzzy formal concept analysis | |
Yang | Algorithm study of new association rules and classification rules in data mining | |
Keyan et al. | Multi-document and multi-lingual summarization using neural networks | |
Kabir et al. | A novel approach to mining maximal frequent itemsets based on genetic algorithm | |
Eclarin et al. | A novel feature hashing with efficient collision resolution for bag-of-words representation of text data | |
Rakib et al. | Improving short text clustering by similarity matrix sparsification | |
Nürnberger et al. | User modelling for interactive user-adaptive collection structuring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||