CN110019806B

CN110019806B - Document clustering method and device

Info

Publication number: CN110019806B
Application number: CN201711423310.4A
Authority: CN
Inventors: 符晶晶; 盛家波
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Information Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Information Technology Co Ltd
Priority date: 2017-12-25
Filing date: 2017-12-25
Publication date: 2021-08-06
Anticipated expiration: 2037-12-25
Also published as: CN110019806A

Abstract

The invention discloses a document clustering method and device. The method comprises: determining an importance value of each word included in a candidate word set of each document among the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation on the document, and the importance value represents the degree of association between the word and the document in which it appears; determining, in the candidate word set of each document, at least one word whose importance value is within a preset range; forming the at least one word into a tuple of the document, wherein the tuple is used to cluster the document; and determining the similarity between the tuples of all the documents to be clustered, and aggregating the documents into at least one cluster according to the similarity, wherein the similarity between the tuples of documents in the same cluster is within a set range.

Description

Document clustering method and device

Technical Field

The invention relates to the technical field of natural language processing, in particular to a document clustering method and device.

Background

With the continuing development of Natural Language Processing (NLP) and the rapid growth in the number of documents, document retrieval has become increasingly laborious. To make it easier for users to find documents, document clustering is receiving growing attention. Document clustering groups similar documents into the same category according to features such as the categories and occurrence frequencies of the words the documents contain.

At present, documents are mainly clustered as follows: word segmentation is performed on a document, and the document is then clustered, according to the words obtained by segmentation, using a distance-based clustering algorithm such as the K-means algorithm, or a bag-of-words-based method such as the Latent Dirichlet Allocation (LDA) model. However, word segmentation yields a large number of words, and these usually include words irrelevant to the topic of the document, so clustering directly on the segmented words gives inaccurate results.

Therefore, the technical problem of inaccurate document clustering exists in the prior art.

Disclosure of Invention

The embodiment of the invention provides a document clustering method and device, which are used for solving the technical problem of inaccurate document clustering in the prior art.

Therefore, the technical scheme provided by the embodiment of the invention is as follows:

In a first aspect, a document clustering method is provided, including:

determining an importance value of each word included in a candidate word set of each document among the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation on the document, and the importance value represents the degree of association between the word and the document in which it appears;

determining, in the candidate word set of each document, at least one word whose importance value is within a preset range;

forming the at least one word into a tuple of the document, wherein the tuple is used to cluster the document;

determining the similarity between the tuples of all the documents to be clustered, and aggregating the documents into at least one cluster according to the similarity, wherein the similarity between the tuples of documents in the same cluster is within a set range.

Optionally, the method further includes:

acquiring a title of each document;

performing word segmentation processing on the title of each document;

and obtaining the candidate word set of each document according to the word segmentation result of the document.

Optionally, obtaining the candidate word set of each document according to the word segmentation result of the document includes:

performing part-of-speech filtering on the word segmentation result of each document to obtain the target words in the document whose parts of speech are nouns and/or verbs;

and forming the target words of each document into the candidate word set of the document.

Optionally, determining the similarity between the tuples of all the documents to be clustered includes:

obtaining a word vector model of the tuple of each document;

determining the similarity between the word vector models of the tuples of all the documents to be clustered.

Optionally, obtaining the word vector model of the tuple of each document includes:

obtaining a word vector of each word in the tuple of each document;

and obtaining the word vector model of the tuple of each document according to the word vectors of the words in the tuple.

In a second aspect, an embodiment of the present invention further provides an apparatus for document clustering, including:

a first determining unit, configured to determine an importance value of each word included in a candidate word set of each document among the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation on the document, and the importance value represents the degree of association between the word and the document in which it appears;

a second determining unit, configured to determine, in the candidate word set of each document, at least one word whose importance value is within a preset range;

a composition unit, configured to form the at least one word into a tuple of the document, wherein the tuple is used to cluster the document;

and a third determining unit, configured to determine the similarity between the tuples of all the documents to be clustered and to aggregate the documents into at least one cluster according to the similarity, wherein the similarity between the tuples of documents in the same cluster is within a set range.

Optionally, the method further includes:

acquiring a title of each document;

performing word segmentation processing on the title of each document;

Optionally, determining the similarity between the multi-element groups of all the documents to be clustered includes:

obtaining a word vector model of the multi-element group of each document;

Optionally, obtaining a word vector model of a tuple of each document includes:

obtaining a word vector of each word in the multi-tuple of each document;

In a third aspect, an embodiment of the present invention further provides an apparatus for document clustering, including:

at least one processor, and

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, and the at least one processor performs a method of document clustering as described above by executing the instructions stored by the memory.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium:

the computer readable storage medium stores computer instructions which, when executed on a computer, cause the computer to perform a method of document clustering as set forth in the first aspect or any one of the optional embodiments of the first aspect.

In the embodiments of the invention, at least one word with a larger importance value, that is, a higher degree of association with the document, is selected to form the tuple used to calculate the similarity between documents, so that the tuple selected for each document to be clustered represents the topic of the whole document as closely as possible.

Drawings

FIG. 1 is a flowchart of a document clustering method according to an embodiment of the present invention;

FIG. 2 is a flow chart of obtaining alternative words for each document according to an embodiment of the present invention;

FIG. 3 is another flow chart of obtaining alternative words for each document according to an embodiment of the present invention;

FIG. 4 is a flowchart of determining document similarity in an embodiment of the present invention;

FIG. 5 is a flowchart of obtaining a word vector model of a document tuple according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an apparatus for document clustering according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.

Referring to fig. 1, a document clustering method provided in an embodiment of the present invention includes:

Step S101: determining an importance value of each word included in a candidate word set of each document among the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation on the document, and the importance value represents the degree of association between the word and the document in which it appears;

Step S102: determining, in the candidate word set of each document, at least one word whose importance value is within a preset range;

Step S103: forming the at least one word into a tuple of the document, wherein the tuple is used to cluster the document;

Step S104: determining the similarity between the tuples of all the documents to be clustered, and aggregating the documents into at least one cluster according to the similarity, wherein the similarity between the tuples of documents in the same cluster is within a set range.

Referring to fig. 2, in the embodiment of the present invention, the set of candidate words of each document in the documents to be clustered in step S101 may be obtained as follows:

step S201: acquiring a title of each document;

step S202: performing word segmentation processing on the title of each document;

step S203: and obtaining the alternative word set of each document according to the word segmentation processing result of each document.

The documents to be clustered include a plurality of documents, for example 5, 50, or 100. In this embodiment, taking 5 documents to be clustered as an example, the title of each of the 5 documents may be obtained. It is assumed here that the title of document 1 is "a general processing method for switch network faults"; the title of document 2 is "a method for processing computer network faults"; the title of document 3 is "a general processing method for computer network faults"; the title of document 4 is "a switch network setup tutorial"; and the title of document 5 is "a summary of common network faults of switches".

After the titles of documents 1 to 5 are obtained, word segmentation may be performed on each title. In practical applications, the title of each document may be segmented with the Jieba word segmentation method to obtain the words of the title and to label each word with its part of speech. Other word segmentation methods, such as SnowNLP, may of course also be used.

It is assumed here that performing word segmentation and part-of-speech tagging on the title of document 1 with the Jieba method gives: switch (noun), network (noun), fault (noun), of (auxiliary word), general (nominal verb), processing (verb), method (noun), where the content in parentheses is the part-of-speech tag of the word. The words obtained by segmenting document 1, namely "switch", "network", "fault", "of", "general", "processing", and "method", then constitute the candidate word set of document 1.

Referring to fig. 3, in the embodiment of the present invention, step S203 may be further specifically implemented in the following manner:

step S2031: performing part-of-speech filtering on the word segmentation processing result of each document to obtain target words of which the parts of speech are nouns and/or verbs in each document;

step S2032: and forming the target words of each document into the alternative word set of each document.

That is, after document 1 is segmented with the Jieba method, part-of-speech filtering can be applied to the resulting words: words whose parts of speech are adverbs, adjectives, conjunctions, interjections, nominal verbs, and the like are filtered out, and the target words whose parts of speech are nouns and/or verbs are retained. For document 1, "of" (auxiliary word) and "general" (nominal verb) are removed, and the retained target words are: switch (noun), network (noun), fault (noun), processing (verb), method (noun). The retained target words form the candidate word set of document 1, which therefore includes the words "switch", "network", "fault", "processing", and "method".
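The part-of-speech filtering of steps S2031 and S2032 can be sketched as follows. This is only an illustrative sketch: the tag codes ("n", "v", "vn", "uj") are assumed to follow a Jieba-like convention, the tagged input is written by hand rather than produced by a segmenter, and the function name `filter_by_pos` is hypothetical.

```python
# Tags assumed for illustration: "n" noun, "v" verb,
# "vn" nominal verb, "uj" auxiliary word.
KEPT_TAGS = {"n", "v"}


def filter_by_pos(tagged_words, kept_tags=KEPT_TAGS):
    """Return the words whose tag is in kept_tags, preserving their order."""
    return [word for word, tag in tagged_words if tag in kept_tags]


# Hand-written stand-in for the tagged segmentation result of document 1.
doc1_tagged = [
    ("switch", "n"), ("network", "n"), ("fault", "n"), ("of", "uj"),
    ("general", "vn"), ("processing", "v"), ("method", "n"),
]

candidate_set = filter_by_pos(doc1_tagged)
print(candidate_set)
# ['switch', 'network', 'fault', 'processing', 'method']
```

Keeping the set of retained tags as a parameter makes it easy to experiment with other choices, for example nouns only.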

It is assumed here that performing word segmentation and part-of-speech tagging on the title of document 2 with the Jieba method gives: computer (noun), network (noun), fault (noun), of (auxiliary word), processing (verb), method (noun). Part-of-speech filtering is then applied: words whose parts of speech are adverbs, adjectives, conjunctions, interjections, nominal verbs, and the like are filtered out, and the target words whose parts of speech are nouns and/or verbs are retained. The retained target words form the candidate word set of document 2, which includes the words "computer", "network", "fault", "processing", and "method".

It is assumed here that performing word segmentation and part-of-speech tagging on the title of document 3 with the Jieba method gives: computer (noun), network (noun), fault (noun), of (auxiliary word), general (nominal verb), processing (verb), method (noun). After the same part-of-speech filtering, the retained target words form the candidate word set of document 3, which includes the words "computer", "network", "fault", "processing", and "method".

It is assumed here that performing word segmentation and part-of-speech tagging on the title of document 4 with the Jieba method gives: switch (noun), network (noun), setup (verb), tutorial (noun). Since every word in the title of document 4 is a noun or a verb, no word is removed by the part-of-speech filtering, and the candidate word set of document 4 includes the words "switch", "network", "setup", and "tutorial".

It is assumed here that performing word segmentation and part-of-speech tagging on the title of document 5 with the Jieba method gives: switch (noun), common (nominal verb), of (auxiliary word), network (noun), fault (noun), summary (noun). After the same part-of-speech filtering, the retained target words form the candidate word set of document 5, which includes the words "switch", "network", "fault", and "summary".

After step S101 is performed, step S102 may be performed: determining, in the candidate word set of each document, at least one word whose importance value is within a preset range; followed by step S103: forming the at least one word into a tuple of the document, wherein the tuple is used to cluster the document.

After the candidate word set of document 1 is obtained, the importance value of each word in the set may be calculated. In practical applications this may be done with, but is not limited to, the TF-IDF algorithm. In this embodiment the TF-IDF algorithm is used, and it is assumed here that the importance values calculated for the words in the candidate word set of document 1 are as follows:

The importance value of "switch" is 0.9, that of "network" is 0.7, that of "fault" is 0.8, that of "processing" is 0.6, and that of "method" is 0.3.
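As a rough illustration of how TF-IDF importance values could be computed over candidate word sets, the sketch below uses one common unsmoothed variant (term frequency times the logarithm of the inverse document frequency). The function name `tf_idf` and the exact weighting are assumptions; the patent does not state which TF-IDF variant is used, and the scores produced here do not reproduce the assumed values quoted in the text.

```python
import math
from collections import Counter


def tf_idf(candidate_sets):
    """Return one {word: score} dict per document.

    tf  = occurrences of the word in the document / document length
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(candidate_sets)
    doc_freq = Counter(w for words in candidate_sets for w in set(words))
    scores = []
    for words in candidate_sets:
        counts = Counter(words)
        total = len(words)
        scores.append({
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return scores


# Candidate word sets of documents 1 to 5 from the example.
candidate_sets = [
    ["switch", "network", "fault", "processing", "method"],
    ["computer", "network", "fault", "processing", "method"],
    ["computer", "network", "fault", "processing", "method"],
    ["switch", "network", "setup", "tutorial"],
    ["switch", "network", "fault", "summary"],
]

scores = tf_idf(candidate_sets)
for word, value in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word}: {value:.3f}")
```

With this variant, a word that occurs in every document (here "network") receives a score of zero, which is why practical implementations often smooth the inverse document frequency or compute it over a larger background corpus.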

In this embodiment, after the importance value of each word in the candidate word set of a document is calculated, the words may be sorted by importance value in either descending or ascending order. Here the words are sorted in descending order to obtain the ranking of the words in the candidate word set of document 1, and at least one of the highest-ranked words is then selected according to this ranking to form the tuple of document 1.

For example, the word ranked first (that is, the word with the largest importance value) in the candidate word set of document 1 is selected to form a 1-tuple of document 1; the two highest-ranked words are selected to form a 2-tuple of document 1; the three highest-ranked words are selected to form a triple of document 1; and so on.

In this embodiment, when the triple of a document is formed from three highest-ranked words of its candidate word set, and the candidate word set includes both nouns and verbs, the selected triple may be required to include at least one noun and one verb.

For example, the candidate word set of document 1 includes both nouns and verbs, so the two highest-ranked nouns and the highest-ranked verb, namely "switch", "fault", and "processing", may be selected to form the triple of document 1. Alternatively, the three words with the highest importance values in the candidate word set of document 1, namely "switch", "fault", and "network", may be selected to form the triple. In this embodiment it is assumed that the selected triple includes at least one noun and one verb, that is, the triple of document 1 consists of "switch", "fault", and "processing".
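The two selection strategies described for document 1 can be sketched as follows, using the importance values assumed in the text. The function names `top_k` and `noun_verb_triple`, and the fallback to plain top-3 selection when a set lacks two nouns or a verb, are illustrative assumptions rather than details stated in the patent.

```python
def top_k(importance, k=3):
    """Return the k words with the largest importance values."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return tuple(ranked[:k])


def noun_verb_triple(importance, pos):
    """Return the two highest-ranked nouns followed by the highest-ranked verb.

    Falls back to plain top-3 selection when the candidate word set
    does not contain at least two nouns and one verb (an assumption).
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    nouns = [w for w in ranked if pos[w] == "noun"]
    verbs = [w for w in ranked if pos[w] == "verb"]
    if len(nouns) < 2 or not verbs:
        return tuple(ranked[:3])
    return (nouns[0], nouns[1], verbs[0])


# Importance values and parts of speech assumed for document 1 in the text.
doc1_importance = {
    "switch": 0.9, "network": 0.7, "fault": 0.8,
    "processing": 0.6, "method": 0.3,
}
doc1_pos = {
    "switch": "noun", "network": "noun", "fault": "noun",
    "processing": "verb", "method": "noun",
}

print(top_k(doc1_importance))                       # ('switch', 'fault', 'network')
print(noun_verb_triple(doc1_importance, doc1_pos))  # ('switch', 'fault', 'processing')
```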

Similarly, after the candidate word set of document 2 is obtained, the importance value of each word in the set is calculated with the TF-IDF algorithm. It is assumed here that the results are as follows:

The importance value of "computer" is 0.95, that of "network" is 0.72, that of "fault" is 0.8, that of "processing" is 0.6, and that of "method" is 0.4.

Sorting the words in descending order of importance value gives the ranking of the candidate word set of document 2: "computer", "fault", "network", "processing", "method".

According to this ranking, at least one of the highest-ranked words in the candidate word set of document 2 is selected to form the tuple of document 2. In this embodiment, the triple of document 2 is formed from three of the highest-ranked words.

The candidate word set of document 2 includes both nouns and verbs, so the triple of document 2 may likewise be required to include at least one noun and one verb. For example, the two highest-ranked nouns and the highest-ranked verb, namely "computer", "fault", and "processing", are selected to form the triple of document 2. In this embodiment it is assumed that the selected triple includes at least one noun and one verb, that is, the triple of document 2 consists of "computer", "fault", and "processing".

Similarly, after the candidate word set of document 3 is obtained, the importance value of each word in the set is calculated with the TF-IDF algorithm. It is assumed here that the results are as follows:

The importance value of "computer" is 0.91, that of "network" is 0.7, that of "fault" is 0.8, that of "processing" is 0.6, and that of "method" is 0.2.

Sorting the words in descending order of importance value gives the ranking of the candidate word set of document 3: "computer", "fault", "network", "processing", "method".

According to this ranking, at least one of the highest-ranked words in the candidate word set of document 3 is selected to form the tuple of document 3. The candidate word set of document 3 includes both nouns and verbs, so the selected triple may be required to include at least one noun and one verb: for example, the two highest-ranked nouns and the highest-ranked verb, namely "computer", "fault", and "processing", are selected to form the triple of document 3. Alternatively, the three words with the highest importance values, namely "computer", "fault", and "network", may be selected to form the triple of document 3.

In this embodiment it is assumed that the selected triple includes at least one noun and one verb, that is, the triple of document 3 consists of "computer", "fault", and "processing".

Similarly, after the candidate word set of document 4 is obtained, the importance value of each word in the set is calculated with the TF-IDF algorithm. It is assumed here that the results are as follows:

The importance value of "switch" is 0.92, that of "network" is 0.65, that of "setup" is 0.6, and that of "tutorial" is 0.5.

Sorting the words in descending order of importance value gives the ranking of the candidate word set of document 4: "switch", "network", "setup", "tutorial".

According to this ranking, at least one of the highest-ranked words in the candidate word set of document 4 is selected to form the tuple of document 4. In this embodiment, the triple of document 4 is formed from three of the highest-ranked words.

The candidate word set of document 4 includes both nouns and verbs, so the selected triple may be required to include at least one noun and one verb: for example, the two highest-ranked nouns and the highest-ranked verb, namely "switch", "network", and "setup", are selected to form the triple of document 4. The three words with the highest importance values in the candidate word set of document 4 may also be selected to form the triple; with the importance values assumed above, these are likewise "switch", "network", and "setup".

In this embodiment it is assumed that the selected triple includes at least one noun and one verb, that is, the triple of document 4 consists of "switch", "network", and "setup".

Similarly, after the candidate word set of document 5 is obtained, the importance value of each word in the set is calculated with the TF-IDF algorithm. It is assumed here that the results are as follows:

The importance value of "switch" is 0.88, that of "network" is 0.7, that of "fault" is 0.8, and that of "summary" is 0.65.

Sorting the words in descending order of importance value gives the ranking of the candidate word set of document 5: "switch", "fault", "network", "summary".

According to this ranking, at least one of the highest-ranked words in the candidate word set of document 5 is selected to form the tuple of document 5.

All words in the candidate word set of document 5 are nouns, so the three nouns with the highest importance values, namely "switch", "fault", and "network", are selected to form the triple of document 5.

After step S103 is performed, step S104 may be performed: determining the similarity between the tuples of all the documents to be clustered, and aggregating the documents into at least one cluster according to the similarity, wherein the similarity between the tuples of documents in the same cluster is within a set range.

Referring to FIG. 4, determining the similarity between the tuples of all the documents to be clustered in step S104 may be implemented as follows:

Step S301: obtaining a word vector model of the tuple of each document;

Step S302: determining the similarity between the word vector models of the tuples of all the documents to be clustered.

Referring to FIG. 5, step S301 may be further implemented as follows:

Step S303: obtaining a word vector of each word in the tuple of each document;

Step S304: obtaining the word vector model of the tuple of each document according to the word vectors of the words in the tuple.
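Steps S303 and S304 do not specify how the word vectors of a tuple are combined into its word vector model. One simple possibility, shown here purely as an assumption, is to take the element-wise mean of the word vectors. The three-dimensional vectors below are made-up toy values, not the output of any trained model, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def tuple_vector(words, word_vectors):
    """Combine the word vectors of a tuple into a single vector by averaging."""
    return mean_vector([word_vectors[w] for w in words])


# Toy word vectors (hypothetical values chosen only for illustration).
toy_vectors = {
    "switch": [3.0, 0.0, 0.0],
    "fault": [0.0, 3.0, 0.0],
    "processing": [0.0, 0.0, 3.0],
}

doc1_vector = tuple_vector(("switch", "fault", "processing"), toy_vectors)
print(doc1_vector)  # [1.0, 1.0, 1.0]
```

Averaging yields one fixed-length vector per document regardless of tuple size, which keeps the later similarity computation simple; concatenation or weighting by importance value would be other plausible choices.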

In this embodiment, after the triple of each of documents 1 to 5 is obtained, the similarity between the triples of documents 1 to 5 may be calculated. In practice, before the similarity is calculated, the word vector of each word in the triple of each document may first be obtained.

For ease of description, the word vector model composed of the word vectors, obtained with the Word2vec tool, of the words in the triple "switch", "fault", "processing" of document 1 is called word vector model 1; the word vector model composed of the word vectors of the words in the triple "computer", "fault", "processing" of document 2 is called word vector model 2; the word vector model composed of the word vectors of the words in the triple "computer", "fault", "processing" of document 3 is called word vector model 3; the word vector model composed of the word vectors of the words in the triple "switch", "network", "setup" of document 4 is called word vector model 4; and the word vector model composed of the word vectors of the words in the triple "switch", "fault", "network" of document 5 is called word vector model 5.

The similarity between the word vector models of the triples of documents 1 to 5 is calculated using the cosine of the included angle. Assume here that the calculated similarities are as shown in Table one below.
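The cosine of the included angle between two vectors is their dot product divided by the product of their lengths. A minimal sketch follows. It assumes each word vector model is a flat list of floats, and the function name `cosine_similarity` and the zero-vector convention are illustrative choices rather than part of the described method.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0, the vectors are orthogonal
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0, the vectors point the same way
```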

In practice, a threshold may be set in advance, and documents whose triples have word vector models with a similarity within the set threshold range are grouped into the same cluster. The threshold may be selected according to actual needs. Taking a threshold of 0.8 as an example, documents whose triples have word vector models with a similarity greater than or equal to 0.8 are grouped into the same cluster. As Table one shows, the similarity between the word vector models of the triples of document 1 and document 5 is 0.9, so document 1 and document 5 are grouped into the same cluster; the similarity between the word vector models of the triples of document 2 and document 3 is 0.97, so document 2 and document 3 are grouped into the same cluster; and the similarities between document 4 and all other documents are below the set threshold of 0.8, so document 4 forms a cluster by itself.

Table one:

	Document 1	Document 2	Document 3	Document 4	Document 5
Document 1	1	0.5	0.48	0.7	0.9
Document 2	0.5	1	0.97	0.4	0.39
Document 3	0.48	0.97	1	0.3	0.42
Document 4	0.7	0.4	0.3	1	0.6
Document 5	0.9	0.39	0.42	0.6	1
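The threshold grouping described for Table one can be sketched as follows. The text only says that documents whose similarity is at least the threshold fall into the same cluster. Merging such pairs transitively with a union-find structure is an assumption of this sketch, and the name `cluster_by_threshold` is illustrative.

```python
from typing import Dict, List


def cluster_by_threshold(sim: List[List[float]], threshold: float) -> List[List[int]]:
    """Group 1-based document numbers whose pairwise similarity is >= threshold."""
    n = len(sim)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i + 1)
    return list(groups.values())


# Similarity values copied from Table one.
table_one = [
    [1.0, 0.5, 0.48, 0.7, 0.9],
    [0.5, 1.0, 0.97, 0.4, 0.39],
    [0.48, 0.97, 1.0, 0.3, 0.42],
    [0.7, 0.4, 0.3, 1.0, 0.6],
    [0.9, 0.39, 0.42, 0.6, 1.0],
]

clusters = cluster_by_threshold(table_one, 0.8)
print(clusters)  # [[1, 5], [2, 3], [4]]
```

With the threshold of 0.8, this reproduces the grouping given in the text: documents 1 and 5 together, documents 2 and 3 together, and document 4 alone.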

Therefore, in the embodiment of the present invention, the importance value of each word in the alternative word set of each document to be clustered, that is, its degree of association with the document, is calculated. Based on the calculated importance values, at least one word with a larger importance value, that is, a word highly associated with the document, is selected to form the multi-tuple used to calculate the similarity between documents. This ensures that the multi-tuple selected for each document to be clustered best represents the subject of the whole document. Calculating the similarity between documents from multi-tuples selected in this way allows similar documents to be gathered into the same cluster more accurately, which effectively solves the technical problem of inaccurate document clustering in the prior art and improves the accuracy of document clustering.

Furthermore, in the embodiment of the present invention, after the multi-tuple that best represents the subject of the whole document is selected, similarity is calculated on the word vector model formed from the word vectors of the words in that multi-tuple. Synonyms in the documents to be clustered can thus be effectively identified, which avoids the prior-art problem of low clustering accuracy caused by synonyms going unrecognized and being treated as different entities. The present invention therefore has the beneficial effect of further improving the accuracy of document clustering.

Furthermore, in the embodiment of the present invention, the title of the document can be extracted, and the nouns and/or verbs in the title best identify the subject of the whole document. Word segmentation is therefore performed on the title to obtain a candidate word set of the document formed by the nouns and/or verbs in the title. This effectively avoids interference from irrelevant words in the clustering algorithm, so the accuracy of document clustering is further improved and the clustering efficiency can also be improved.
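The noun and verb filtering of a segmented title can be sketched as follows. The sketch assumes the title has already been segmented and part-of-speech tagged by some external tool, and that noun tags start with "n" and verb tags start with "v". The tag convention, the function name `candidate_words`, and the sample tokens are all hypothetical.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, part-of-speech tag)


def candidate_words(tagged_title: List[TaggedToken]) -> List[str]:
    """Keep only the words of a tagged title whose tag marks a noun or a verb."""
    keep_prefixes = ("n", "v")  # assumed tag convention: n* = noun, v* = verb
    return [word for word, tag in tagged_title if tag.startswith(keep_prefixes)]


# Hypothetical tagged title, invented for illustration.
tagged = [("switch", "n"), ("of", "p"), ("fault", "n"), ("processing", "v"), ("the", "r")]
print(candidate_words(tagged))  # ['switch', 'fault', 'processing']
```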

Based on the same inventive concept, an embodiment of the present invention provides an apparatus for document clustering. For the specific implementation of the document clustering method by the apparatus, refer to the description of the method embodiment above; repeated details are not described again. Referring to FIG. 6, the apparatus includes:

a first determining unit 10, configured to determine an importance value of a word included in an alternative word set of each document in the documents to be clustered, where the alternative word set includes a word obtained after performing word segmentation processing on each document, and the importance value is used to indicate a degree of association between the word and a document in which the word is located;

a second determining unit 11, configured to determine at least one word in the candidate word set of each document, where an importance value is within a preset range;

a composition unit 12, configured to compose the at least one phrase into a tuple of each document, where the tuple is used to complete clustering on each document;

a third determining unit 13, configured to determine the similarity between the multi-tuples of all documents in the documents to be clustered, and aggregate all documents in the documents to be clustered into at least one cluster according to the similarity, where the similarity between the multi-tuples of the documents included in the same cluster is within a set range.

Optionally, the method further includes:

acquiring a title of each document;

performing word segmentation processing on the title of each document;

obtaining a word vector model of the multi-element group of each document;

Optionally, obtaining a word vector model of a tuple of each document includes:

obtaining a word vector of each word in the multi-tuple of each document;

Based on the same inventive concept, an embodiment of the present invention further provides an apparatus for document clustering, including:

at least one processor, and

a memory coupled to the at least one processor;

Based on the same inventive concept, the embodiment of the present invention further provides a computer-readable storage medium:

the computer readable storage medium stores computer instructions which, when executed on a computer, cause the computer to perform a method of document clustering as described above.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method of clustering documents, comprising:

forming the at least one phrase into a tuple of each document, wherein the tuple is used for completing clustering on each document; the multi-tuple is a triple, and the triple at least comprises a noun and a verb;

obtaining a word vector of each word in the multi-tuple of each document, and obtaining a word vector model of the multi-tuple of each document according to the word vector of each word in the multi-tuple of each document, wherein the word vector of each word is obtained by calculating through a Word2vector tool;

determining the similarity among the word vector models of the multi-element groups of all the documents in the documents to be clustered, and aggregating all the documents in the documents to be clustered into at least one cluster according to the similarity, wherein the similarity among the word vector models of the multi-element groups of the documents in the same cluster is within a set range.

2. The method of claim 1, wherein the method further comprises:

acquiring a title of each document;

performing word segmentation processing on the title of each document;

3. The method of claim 2, wherein obtaining the set of word candidates for each document according to the word segmentation processing result of each document comprises:

4. An apparatus for document clustering, comprising:

a composition unit, configured to compose the at least one phrase into a tuple of each document, where the tuple is used to complete clustering on each document; the multi-tuple is a triple, and the triple at least comprises a noun and a verb;

a third determining unit, configured to obtain a word vector of each word in the tuple of each document, and obtain a word vector model of the tuple of each document according to the word vector of each word in the tuple of each document, where the word vector of each word is obtained by calculating with a Word2vector tool;

5. The device of claim 4, further comprising an acquisition unit to:

acquiring a title of each document;

performing word segmentation processing on the title of each document;

6. An apparatus for document clustering, comprising:

at least one processor, and

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor performing the method of any one of claims 1-3 by executing the instructions stored by the memory.

7. A computer-readable storage medium characterized by:

the computer readable storage medium stores computer instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1-3.