CN110019806B - Document clustering method and device - Google Patents


Info

Publication number
CN110019806B
Authority
CN
China
Prior art keywords
document
word
documents
tuple
importance value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711423310.4A
Other languages
Chinese (zh)
Other versions
CN110019806A (en)
Inventor
符晶晶
盛家波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Information Technology Co Ltd
Priority to CN201711423310.4A
Publication of CN110019806A
Application granted
Publication of CN110019806B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document clustering method and device. The method comprises: determining an importance value of each word included in a candidate word set of each document among the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation on the document, and the importance value represents the degree of association between the word and the document in which it appears; determining, in the candidate word set of each document, at least one word whose importance value is within a preset range; forming the at least one word into a tuple of the document, wherein the tuple is used for clustering the document; and determining the similarity among the tuples of all the documents to be clustered, and aggregating the documents into at least one cluster according to the similarity, wherein the similarity among the tuples of the documents in the same cluster is within a set range.

Description

Document clustering method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a document clustering method and device.
Background
With the continuous development of Natural Language Processing (NLP) and the rapid growth in the number of documents, querying documents has become increasingly laborious. To make it easier for users to search for documents, document clustering is attracting growing attention. Document clustering groups similar documents into the same category according to, for example, the categories and occurrence frequencies of the words the documents contain.
At present, the process of clustering documents mainly includes performing word segmentation on a document and then, according to the words obtained by word segmentation, clustering the document either with a distance-based clustering algorithm, such as the K-means algorithm, or with a bag-of-words-based clustering method, such as the Latent Dirichlet Allocation (LDA) model. However, word segmentation of a document yields a large number of words, and these usually include words irrelevant to the document's theme, so the clustering result is inaccurate when the words obtained by word segmentation are used directly for document clustering.
Therefore, the technical problem of inaccurate document clustering exists in the prior art.
Disclosure of Invention
The embodiment of the invention provides a document clustering method and device, which are used for solving the technical problem of inaccurate document clustering in the prior art.
Therefore, the technical scheme provided by the embodiment of the invention is as follows:
In a first aspect, a document clustering method is provided, including:
determining an importance value of each word included in a candidate word set of each document among the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation on the document, and the importance value represents the degree of association between the word and the document in which it appears;
determining, in the candidate word set of each document, at least one word whose importance value is within a preset range;
forming the at least one word into a tuple of each document, wherein the tuple is used for clustering the document;
determining the similarity among the tuples of all the documents to be clustered, and aggregating the documents into at least one cluster according to the similarity, wherein the similarity among the tuples of the documents in the same cluster is within a set range.
Optionally, the method further includes:
acquiring a title of each document;
performing word segmentation processing on the title of each document;
and obtaining the candidate word set of each document according to the word segmentation result of each document.
Optionally, obtaining the candidate word set of each document according to the word segmentation result of each document includes:
performing part-of-speech filtering on the word segmentation result of each document to obtain the target words in each document whose parts of speech are nouns and/or verbs;
and forming the target words of each document into the candidate word set of each document.
Optionally, determining the similarity among the tuples of all the documents to be clustered includes:
obtaining a word vector model of the tuple of each document;
and determining the similarity among the word vector models of the tuples of all the documents to be clustered.
Optionally, obtaining the word vector model of the tuple of each document includes:
obtaining a word vector of each word in the tuple of each document;
and obtaining the word vector model of the tuple of each document according to the word vector of each word in the tuple of each document.
In a second aspect, an embodiment of the present invention further provides an apparatus for document clustering, including:
a first determining unit, configured to determine an importance value of each word included in a candidate word set of each document among the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation on the document, and the importance value represents the degree of association between the word and the document in which it appears;
a second determining unit, configured to determine, in the candidate word set of each document, at least one word whose importance value is within a preset range;
a composition unit, configured to form the at least one word into a tuple of each document, wherein the tuple is used for clustering the document;
and a third determining unit, configured to determine the similarity among the tuples of all the documents to be clustered and to aggregate the documents into at least one cluster according to the similarity, wherein the similarity among the tuples of the documents in the same cluster is within a set range.
Optionally, the method further includes:
acquiring a title of each document;
performing word segmentation processing on the title of each document;
and obtaining the candidate word set of each document according to the word segmentation result of each document.
Optionally, obtaining the candidate word set of each document according to the word segmentation result of each document includes:
performing part-of-speech filtering on the word segmentation result of each document to obtain the target words in each document whose parts of speech are nouns and/or verbs;
and forming the target words of each document into the candidate word set of each document.
Optionally, determining the similarity among the tuples of all the documents to be clustered includes:
obtaining a word vector model of the tuple of each document;
and determining the similarity among the word vector models of the tuples of all the documents to be clustered.
Optionally, obtaining the word vector model of the tuple of each document includes:
obtaining a word vector of each word in the tuple of each document;
and obtaining the word vector model of the tuple of each document according to the word vector of each word in the tuple of each document.
In a third aspect, an embodiment of the present invention further provides an apparatus for document clustering, including:
at least one processor, and
a memory coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the at least one processor performs a method of document clustering as described above by executing the instructions stored by the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium:
the computer readable storage medium stores computer instructions which, when executed on a computer, cause the computer to perform a method of document clustering as set forth in the first aspect or any one of the optional embodiments of the first aspect.
In the embodiment of the invention, at least one word with a larger importance value, that is, a higher degree of association with the document, is selected to form the tuple used for calculating the similarity between documents, so that the tuple selected for each document to be clustered represents the theme of the whole document as closely as possible.
Drawings
FIG. 1 is a flowchart of a document clustering method according to an embodiment of the present invention;
FIG. 2 is a flowchart of obtaining the candidate word set of each document according to an embodiment of the present invention;
FIG. 3 is another flowchart of obtaining the candidate word set of each document according to an embodiment of the present invention;
FIG. 4 is a flowchart of determining document similarity in an embodiment of the present invention;
FIG. 5 is a flowchart of obtaining a word vector model of a document tuple according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for document clustering according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
Referring to fig. 1, a document clustering method provided in an embodiment of the present invention includes:
step S101: determining an importance value of each word included in a candidate word set of each document among the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation on the document, and the importance value represents the degree of association between the word and the document in which it appears;
step S102: determining, in the candidate word set of each document, at least one word whose importance value is within a preset range;
step S103: forming the at least one word into a tuple of each document, wherein the tuple is used for clustering the document;
step S104: determining the similarity among the tuples of all the documents to be clustered, and aggregating the documents into at least one cluster according to the similarity, wherein the similarity among the tuples of the documents in the same cluster is within a set range.
Referring to fig. 2, in the embodiment of the present invention, the set of candidate words of each document in the documents to be clustered in step S101 may be obtained as follows:
step S201: acquiring a title of each document;
step S202: performing word segmentation processing on the title of each document;
step S203: obtaining the candidate word set of each document according to the word segmentation result of each document.
The documents to be clustered comprise a plurality of documents, for example 5, 50, or 100. In this embodiment of the present invention, taking 5 documents to be clustered as an example, the title of each of the 5 documents may be obtained. It is assumed here that the title of document 1 is "A general handling method for switch network faults"; the title of document 2 is "A handling method for computer network faults"; the title of document 3 is "A general handling method for computer network faults"; the title of document 4 is "Switch network setup tutorial"; and the title of document 5 is "Summary of common switch network faults".
After the titles of documents 1 to 5 are obtained, word segmentation may be performed on the title of each document. In practical applications, the title of each document may be segmented with the Jieba word segmentation method to obtain the words of the title, and the part of speech of each word is tagged. Other word segmentation methods, such as the SnowNLP word segmentation method, may of course also be used.
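The segmentation step can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: `segment_titles`, `jieba_tagger`, and `whitespace_tagger` are hypothetical helper names, and the tagger is passed in as a parameter so the flow can be tried without the third-party jieba package installed. `jieba.posseg.cut` is jieba's part-of-speech tagging call; the tag strings it returns follow jieba's own tag set.

```python
from typing import Callable, Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # (word, part-of-speech tag) pairs


def jieba_tagger(text: str) -> Tagged:
    """Segment and tag text with jieba (third-party, so imported lazily)."""
    import jieba.posseg as pseg
    return [(pair.word, pair.flag) for pair in pseg.cut(text)]


def whitespace_tagger(text: str) -> Tagged:
    """Stand-in tagger for experiments: splits on spaces, tags every word 'x'."""
    return [(word, "x") for word in text.split()]


def segment_titles(titles: Dict[int, str],
                   tagger: Callable[[str], Tagged] = jieba_tagger) -> Dict[int, Tagged]:
    """Apply the tagger to every document title, keyed by document id."""
    return {doc_id: tagger(title) for doc_id, title in titles.items()}
```

With jieba installed, calling `segment_titles(titles)` on the original Chinese titles would use the default tagger; the stand-in tagger only demonstrates the data shape.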
It is assumed here that the result of performing word segmentation and part-of-speech tagging on the title of document 1 with the Jieba word segmentation method is: switch (noun), network (noun), fault (noun), of (auxiliary word), general (verbal noun), handling (verb), method (noun), where the content in parentheses is the part-of-speech tag of the word. The words obtained by word segmentation of document 1, namely "switch", "network", "fault", "of", "general", "handling", and "method", then constitute the candidate word set of document 1.
Referring to fig. 3, in the embodiment of the present invention, step S203 may be further specifically implemented in the following manner:
step S2031: performing part-of-speech filtering on the word segmentation processing result of each document to obtain target words of which the parts of speech are nouns and/or verbs in each document;
step S2032: forming the target words of each document into the candidate word set of each document.
That is, after document 1 is segmented with the Jieba word segmentation method, part-of-speech filtering may be performed on the resulting words: words whose parts of speech are adverbs, adjectives, conjunctions, interjections, verbal nouns, and the like are filtered out, and the target words whose parts of speech are nouns and/or verbs are retained. For document 1, "of" (auxiliary word) and "general" (verbal noun) are removed, and the retained target words are: switch (noun), network (noun), fault (noun), handling (verb), method (noun). The retained target words form the candidate word set of document 1, that is, the candidate word set of document 1 includes the words "switch", "network", "fault", "handling", and "method".
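The part-of-speech filter for document 1 can be illustrated with a short sketch. The tagged tokens below are English glosses of the example above, and the tag labels (`"noun"`, `"verb"`, and so on) and the names `TAGGED_DOC1`, `KEPT_POS`, and `candidate_words` are assumptions made for illustration, not tags or identifiers from any particular segmenter.

```python
from typing import Iterable, List, Set, Tuple

# Tagged tokens for document 1, as English glosses of the worked example.
TAGGED_DOC1 = [
    ("switch", "noun"), ("network", "noun"), ("fault", "noun"),
    ("of", "auxiliary"), ("general", "verbal noun"),
    ("handling", "verb"), ("method", "noun"),
]

# Only nouns and verbs are kept as target words.
KEPT_POS = {"noun", "verb"}


def candidate_words(tagged: Iterable[Tuple[str, str]],
                    kept_pos: Set[str] = KEPT_POS) -> List[str]:
    """Return the words whose part of speech is in kept_pos, in original order."""
    return [word for word, pos in tagged if pos in kept_pos]


print(candidate_words(TAGGED_DOC1))
```

Using an allow-list of kept tags, rather than a deny-list of filtered tags, means any tag not explicitly listed is dropped, which matches the "nouns and/or verbs are retained" wording.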
It is assumed here that the result of performing word segmentation and part-of-speech tagging on the title of document 2 with the Jieba word segmentation method is: computer (noun), network (noun), fault (noun), of (auxiliary word), handling (verb), method (noun). Part-of-speech filtering is then performed on these words: words whose parts of speech are adverbs, adjectives, conjunctions, interjections, verbal nouns, and the like are filtered out, and the target words whose parts of speech are nouns and/or verbs are retained and form the candidate word set of document 2. That is, the candidate word set of document 2 includes the words "computer", "network", "fault", "handling", and "method".
It is assumed here that the result of performing word segmentation and part-of-speech tagging on the title of document 3 with the Jieba word segmentation method is: computer (noun), network (noun), fault (noun), of (auxiliary word), general (verbal noun), handling (verb), method (noun). After the same part-of-speech filtering, the retained target words whose parts of speech are nouns and/or verbs form the candidate word set of document 3. That is, the candidate word set of document 3 includes the words "computer", "network", "fault", "handling", and "method".
It is assumed here that the result of performing word segmentation and part-of-speech tagging on the title of document 4 with the Jieba word segmentation method is: switch (noun), network (noun), setup (verb), tutorial (noun). The same part-of-speech filtering is applied; since every word in the title of document 4 is a noun or a verb, the candidate word set of document 4 includes the words "switch", "network", "setup", and "tutorial".
It is assumed here that the result of performing word segmentation and part-of-speech tagging on the title of document 5 with the Jieba word segmentation method is: switch (noun), common (verbal noun), of (auxiliary word), network (noun), fault (noun), summary (noun). After the same part-of-speech filtering, the retained target words whose parts of speech are nouns and/or verbs form the candidate word set of document 5. That is, the candidate word set of document 5 includes the words "switch", "network", "fault", and "summary".
After step S101 is performed, the method in the embodiment of the present invention may perform step S102: determining, in the candidate word set of each document, at least one word whose importance value is within a preset range; and step S103: forming the at least one word into a tuple of each document, wherein the tuple is used for clustering the document.
After the candidate word set of document 1 is obtained, the method in this embodiment may calculate the importance value of each word in the candidate word set of document 1. In practical applications, the importance value may be calculated with the TF-IDF algorithm, although the method is not limited to TF-IDF. In this embodiment of the present invention, the TF-IDF algorithm is taken as an example, and it is assumed that the importance values calculated for the words in the candidate word set of document 1 are as follows:
the importance value of "switch" is 0.9, the importance value of "network" is 0.7, the importance value of "fault" is 0.8, the importance value of "handling" is 0.6, and the importance value of "method" is 0.3.
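For readers unfamiliar with TF-IDF, the sketch below shows one common variant (term frequency times the logarithm of the inverse document frequency) applied to the five candidate word sets. The text does not state which TF-IDF variant or reference corpus is used, so the formula, the `tf_idf` name, and the `CORPUS` constant are assumptions; the scores it produces are not expected to reproduce the illustrative values assumed above.

```python
import math
from typing import Dict, List

# Candidate word sets of documents 1 to 5 (English glosses of the example).
CORPUS = [
    ["switch", "network", "fault", "handling", "method"],
    ["computer", "network", "fault", "handling", "method"],
    ["computer", "network", "fault", "handling", "method"],
    ["switch", "network", "setup", "tutorial"],
    ["switch", "network", "fault", "summary"],
]


def tf_idf(doc_words: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each word of doc_words; doc_words is assumed to be one of the corpus entries."""
    n_docs = len(corpus)
    scores = {}
    for word in set(doc_words):
        tf = doc_words.count(word) / len(doc_words)       # share of the document
        df = sum(1 for words in corpus if word in words)  # documents containing the word
        scores[word] = tf * math.log(n_docs / df)
    return scores


scores_doc1 = tf_idf(CORPUS[0], CORPUS)
```

In this variant a word that occurs in every document, such as "network" here, scores zero, which is the intended effect: words shared by all documents say little about any one document's theme.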
In the embodiment of the present invention, after the importance value of each word in the candidate word set of a document is calculated, the words may be ranked by importance value in descending or ascending order. Taking descending order as an example, the ranking of the words in the candidate word set of document 1 is obtained, and at least one of the top-ranked words is then selected according to this ranking to form the tuple of document 1.
For example, the first-ranked word (that is, the word with the largest importance value) in the candidate word set of document 1 is selected to form a 1-tuple of document 1; the two top-ranked words are selected to form a 2-tuple of document 1; the three top-ranked words are selected to form a triple of document 1; and so on.
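The ranking and selection just described can be sketched as below, using the importance values assumed for document 1. `DOC1_IMPORTANCE` and `top_k_tuple` are hypothetical names introduced for illustration.

```python
from typing import Dict, Tuple

# Importance values assumed in the text for document 1 (English glosses).
DOC1_IMPORTANCE = {
    "switch": 0.9, "network": 0.7, "fault": 0.8, "handling": 0.6, "method": 0.3,
}


def top_k_tuple(importance: Dict[str, float], k: int) -> Tuple[str, ...]:
    """Rank words by descending importance value and keep the first k as a tuple."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return tuple(ranked[:k])


print(top_k_tuple(DOC1_IMPORTANCE, 1))  # 1-tuple
print(top_k_tuple(DOC1_IMPORTANCE, 2))  # 2-tuple
print(top_k_tuple(DOC1_IMPORTANCE, 3))  # triple
```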
In the embodiment of the present invention, when the triple of a document is formed from the top-ranked words in its candidate word set, and the candidate word set includes both nouns and verbs, the selection may additionally ensure that the triple includes at least one noun and one verb.
For example, the candidate word set of document 1 includes both nouns and verbs, so the two top-ranked nouns and the top-ranked verb may be selected, that is, "switch", "fault", and "handling" are selected to form the triple of document 1. Alternatively, the three top-ranked words in the candidate word set of document 1 regardless of part of speech, that is, "switch", "fault", and "network", may be selected to form the triple of document 1. In the embodiment of the present invention, it is assumed that the selected triple includes at least one noun and one verb, that is, the triple of document 1 consists of "switch", "fault", and "handling".
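A minimal sketch of the noun-and-verb selection rule follows. It assumes the rule is "two top-ranked nouns plus the top-ranked verb, falling back to the three top-ranked words when the candidate set lacks nouns or verbs"; the fallback and the names `select_triple`, `DOC1_SCORES`, and `DOC1_POS` are assumptions for illustration.

```python
from typing import Dict, Tuple

DOC1_SCORES = {"switch": 0.9, "network": 0.7, "fault": 0.8, "handling": 0.6, "method": 0.3}
DOC1_POS = {"switch": "noun", "network": "noun", "fault": "noun",
            "handling": "verb", "method": "noun"}


def select_triple(scores: Dict[str, float], pos: Dict[str, str]) -> Tuple[str, ...]:
    """Pick the two top-ranked nouns and the top-ranked verb when both kinds exist."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    nouns = [w for w in ranked if pos[w] == "noun"]
    verbs = [w for w in ranked if pos[w] == "verb"]
    if nouns and verbs:
        return tuple(nouns[:2] + verbs[:1])
    # Assumed fallback: no noun/verb mix available, so take the three top-ranked words.
    return tuple(ranked[:3])


print(select_triple(DOC1_SCORES, DOC1_POS))
```

The constraint trades a slightly lower total importance ("handling" at 0.6 replaces "network" at 0.7) for a triple that captures both what the document is about and what is being done.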
Similarly, after the candidate word set of document 2 is obtained, the importance value of each word in it is calculated with the TF-IDF algorithm. It is assumed here that the calculated importance values are as follows:
the importance value of "computer" is 0.95, the importance value of "network" is 0.72, the importance value of "fault" is 0.8, the importance value of "handling" is 0.6, and the importance value of "method" is 0.4.
Ranking the words by importance value in descending order gives the ranking of the candidate word set of document 2, that is: "computer", "fault", "network", "handling", "method".
According to this ranking, at least one top-ranked word in the candidate word set of document 2 is selected to form the tuple of document 2. In this embodiment, three words are selected to form a triple of document 2.
The candidate word set of document 2 includes both nouns and verbs, so the triple of document 2 may likewise be required to include at least one noun and one verb. For example, the two top-ranked nouns and the top-ranked verb are selected, that is, "computer", "fault", and "handling" are selected to form the triple of document 2. In the embodiment of the present invention, it is assumed that the selected triple of document 2 includes at least one noun and one verb, that is, the triple of document 2 consists of "computer", "fault", and "handling".
Similarly, after the candidate word set of document 3 is obtained, the importance value of each word in it is calculated with the TF-IDF algorithm. It is assumed here that the calculated importance values are as follows:
the importance value of "computer" is 0.91, the importance value of "network" is 0.7, the importance value of "fault" is 0.8, the importance value of "handling" is 0.6, and the importance value of "method" is 0.2.
Ranking the words by importance value in descending order gives the ranking of the candidate word set of document 3, that is: "computer", "fault", "network", "handling", "method".
According to this ranking, at least one top-ranked word in the candidate word set of document 3 is selected to form the tuple of document 3. The candidate word set of document 3 includes both nouns and verbs, so the selected triple may be required to include at least one noun and one verb. For example, the two top-ranked nouns and the top-ranked verb, that is, "computer", "fault", and "handling", are selected to form the triple of document 3. Alternatively, the three top-ranked words regardless of part of speech, that is, "computer", "fault", and "network", may be selected to form the triple of document 3.
In the embodiment of the present invention, it is assumed that the selected triple includes at least one noun and one verb, that is, the triple of document 3 consists of "computer", "fault", and "handling".
Similarly, after the candidate word set of document 4 is obtained, the importance value of each word in it is calculated with the TF-IDF algorithm. It is assumed here that the calculated importance values are as follows:
the importance value of "switch" is 0.92, the importance value of "network" is 0.65, the importance value of "setup" is 0.6, and the importance value of "tutorial" is 0.5.
Ranking the words by importance value in descending order gives the ranking of the candidate word set of document 4, that is: "switch", "network", "setup", "tutorial".
According to this ranking, at least one top-ranked word in the candidate word set of document 4 is selected to form the tuple of document 4. In this embodiment, three words are selected to form a triple of document 4.
The candidate word set of document 4 includes both nouns and verbs, so the selected triple may be required to include at least one noun and one verb. For example, the two top-ranked nouns and the top-ranked verb are selected, that is, "switch", "network", and "setup" are selected to form the triple of document 4. Because "setup" is a verb, these are also the three top-ranked words in the candidate word set of document 4 regardless of part of speech.
In the embodiment of the present invention, it is assumed that the selected triple is guaranteed to include at least one noun and one verb, that is, the triple of document 4 consists of "switch", "network", and "setup".
Similarly, after the candidate word set of document 5 is obtained, the importance value of each word in that set is calculated using the TF-IDF algorithm. Assume the results are as follows:
the importance value of "switch" is 0.88, the importance value of "network" is 0.7, the importance value of "failure" is 0.8, and the importance value of "summary" is 0.65.
Ranking the words from the highest importance value to the lowest gives the following order for the candidate word set of document 5: "switch", "failure", "network", "summary".
Based on this ranking, at least one of the top-ranked words in the candidate word set of document 5 is selected to form the tuple of document 5.
All the words in the candidate word set of document 5 are nouns, so the three top-ranked nouns, "switch", "failure" and "network", can be selected to form the triple of document 5.
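The selection rule for documents 4 and 5 can be sketched as below. The helper `select_triple` and the part-of-speech tags `"n"` and `"v"` are hypothetical names chosen for illustration. The sketch assumes the candidate set has already been filtered to nouns and verbs, and it enforces the noun-and-verb requirement only when the candidate set actually contains both.

```python
def select_triple(ranked, pos, size=3):
    """Pick `size` top-ranked words, keeping a noun and a verb if possible.

    ranked: words sorted from highest to lowest importance value.
    pos:    {word: "n" or "v"} part-of-speech tag per word (assumed tags).
    """
    chosen = list(ranked[:size])
    for tag in ("n", "v"):
        if any(pos[w] == tag for w in chosen):
            continue  # this part of speech is already represented
        # Highest-ranked word with the missing part of speech, if any.
        best = next((w for w in ranked if pos[w] == tag), None)
        if best is not None and chosen:
            chosen[-1] = best  # swap out the lowest-ranked chosen word
    return chosen


doc4 = select_triple(
    ["switch", "network", "setup", "tutorial"],
    {"switch": "n", "network": "n", "setup": "v", "tutorial": "n"},
)
doc5 = select_triple(
    ["switch", "failure", "network", "summary"],
    {"switch": "n", "failure": "n", "network": "n", "summary": "n"},
)
```

For document 5, whose candidate words are all nouns, no verb is available, so the function simply returns the three top-ranked nouns.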
After step S103 is performed, step S104 may be performed: determining the similarity between the tuples of all the documents to be clustered, and aggregating all the documents to be clustered into at least one cluster according to the similarity, where the similarity between the tuples of the documents in the same cluster is within a set range.
Referring to fig. 4, determining the similarity between the tuples of all the documents to be clustered in step S104 may be implemented as follows:
step S301: obtaining a word vector model of the tuple of each document;
step S302: determining the similarity between the word vector models of the tuples of all the documents to be clustered.
Referring to fig. 5, step S301 may further be implemented as follows:
step S303: obtaining a word vector of each word in the tuple of each document;
step S304: obtaining a word vector model of the tuple of each document according to the word vector of each word in the tuple of each document.
In the embodiment of the present invention, after the triple of each of documents 1 to 5 is obtained, the similarity between the triples of documents 1 to 5 can be calculated. In practice, before this similarity is calculated, the word vector of each word in the triple of each of documents 1 to 5 is obtained first.
For convenience of description, the word vector model composed of the word vectors, obtained with the Word2vector tool, of the words in the triple "switch", "fault", "processing" of document 1 is referred to as word vector model 1; the word vector model composed of the word vectors of the words in the triple "computer", "failure", "processing" of document 2 is referred to as word vector model 2; the word vector model composed of the word vectors of the words in the triple "computer", "failure", "processing" of document 3 is referred to as word vector model 3; the word vector model composed of the word vectors of the words in the triple "switch", "network", "setup" of document 4 is referred to as word vector model 4; and the word vector model composed of the word vectors of the words in the triple "switch", "failure", "network" of document 5 is referred to as word vector model 5.
The similarity between the word vector models of the triples of documents 1 to 5 is calculated as the cosine of the included angle between them. Assume that the calculated similarities are as shown in Table one below.
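The cosine measure can be sketched as follows, assuming each word vector model has been reduced to a single numeric vector. The function names are illustrative. Because the values in Table one are assumed in the text rather than computed, this sketch does not attempt to reproduce them.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


def similarity_matrix(models):
    """Pairwise cosine similarities between all word vector models."""
    return [[cosine_similarity(a, b) for b in models] for a in models]
```

The resulting matrix is symmetric with ones on the diagonal for non-zero vectors, which matches the layout of Table one.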
In practice, a threshold can be set in advance, and documents whose triples have word vector models with a similarity within the set threshold range are grouped into the same cluster. The threshold can be selected according to actual needs. Taking a threshold of 0.8 as an example, documents whose triples have word vector models with a similarity greater than or equal to 0.8 are grouped into the same cluster. As Table one shows, the similarity between the word vector models of the triples of document 1 and document 5 is 0.9, so document 1 and document 5 can be grouped into the same cluster. The similarity between the word vector models of the triples of document 2 and document 3 is 0.97, so document 2 and document 3 can be grouped into the same cluster. The similarities between document 4 and all the other documents are below the set threshold of 0.8, so document 4 forms a cluster by itself.
Table one:
             Document 1   Document 2   Document 3   Document 4   Document 5
Document 1   1            0.5          0.48         0.7          0.9
Document 2   0.5          1            0.97         0.4          0.39
Document 3   0.48         0.97         1            0.3          0.42
Document 4   0.7          0.4          0.3          1            0.6
Document 5   0.9          0.39         0.42         0.6          1
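The threshold-based grouping can be sketched with the values of Table one. The text does not say how chains of similar documents are handled, so this sketch assumes that any two documents linked by a similarity at or above the threshold end up in the same cluster (connected components, implemented with a small union-find). The function name is hypothetical.

```python
def cluster_by_threshold(sim, threshold=0.8):
    """Group documents whose pairwise similarity is >= threshold.

    sim is a symmetric matrix; documents are numbered from 1.
    Assumption: links are transitive (connected components).
    """
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i + 1)
    return sorted(clusters.values())


table_one = [
    [1.0, 0.5, 0.48, 0.7, 0.9],
    [0.5, 1.0, 0.97, 0.4, 0.39],
    [0.48, 0.97, 1.0, 0.3, 0.42],
    [0.7, 0.4, 0.3, 1.0, 0.6],
    [0.9, 0.39, 0.42, 0.6, 1.0],
]
clusters = cluster_by_threshold(table_one, threshold=0.8)
# clusters == [[1, 5], [2, 3], [4]]
```

With a threshold of 0.8, only the pairs (document 1, document 5) and (document 2, document 3) are linked, so document 4 remains alone, as described in the text.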
Therefore, in the embodiment of the present invention, the importance value of each word in the candidate word set of each document to be clustered, that is, its degree of association with the document, is calculated. Based on these importance values, at least one word with a larger importance value, that is, a high degree of association with the document, is selected to form the tuple used to calculate the similarity between documents. This ensures that the tuple selected for each document to be clustered best represents the topic of the whole document. Calculating the similarity between documents from tuples selected in this way therefore groups similar documents into the same cluster more accurately, which effectively solves the technical problem of inaccurate document clustering in the prior art and improves the accuracy of document clustering.
Furthermore, after the tuple that best represents the topic of the whole document is selected, a word vector model is formed from the word vectors of the words in that tuple and used for the similarity calculation. Synonyms in the documents to be clustered can thus be identified effectively, which avoids the prior-art problem of low clustering accuracy caused by synonyms being treated as different entities, and further improves the accuracy of document clustering.
Furthermore, in the embodiment of the present invention the title of the document can be extracted, and the nouns and/or verbs in the title best identify the topic of the whole document. Word segmentation is therefore performed on the title to obtain a candidate word set consisting of the nouns and/or verbs in the title, which effectively avoids interference from irrelevant words in the clustering algorithm. This further improves the accuracy of document clustering and can also improve its efficiency.
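The title-based construction of the candidate word set can be sketched as a part-of-speech filter. The sketch assumes the title has already been segmented and tagged by some word segmentation tool, which the text does not name. The tag names (`"n"` for nouns, `"v"` for verbs, `"u"` for a particle) and the example title are hypothetical.

```python
def candidate_words(tagged_title, keep=("n", "v")):
    """Keep the nouns and verbs of a segmented, POS-tagged title.

    tagged_title: list of (word, tag) pairs in title order.
    keep:         tag prefixes treated as noun or verb (assumed tag set).
    """
    seen = set()
    result = []
    for word, tag in tagged_title:
        # str.startswith accepts a tuple of prefixes.
        if tag.startswith(keep) and word not in seen:
            seen.add(word)
            result.append(word)
    return result


# Hypothetical segmentation result for a title like that of document 4.
tagged = [("switch", "n"), ("of", "u"), ("network", "n"),
          ("setup", "v"), ("tutorial", "n")]
candidates = candidate_words(tagged)
```

Filtering at this stage keeps function words out of the importance calculation, so they cannot enter a tuple later.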
Based on the same inventive concept, an embodiment of the present invention provides an apparatus for document clustering. For the specific implementation of the document clustering method of the apparatus, refer to the description of the method embodiment above; repeated details are omitted. Referring to fig. 6, the apparatus includes:
a first determining unit 10, configured to determine an importance value of a word included in a candidate word set of each document in the documents to be clustered, where the candidate word set includes words obtained by performing word segmentation processing on each document, and the importance value indicates the degree of association between the word and the document in which the word is located;
a second determining unit 11, configured to determine at least one word whose importance value is within a preset range in the candidate word set of each document;
a composition unit 12, configured to compose the at least one word into a tuple of each document, where the tuple is used to complete clustering of each document;
a third determining unit 13, configured to determine the similarity between the tuples of all the documents to be clustered, and aggregate all the documents to be clustered into at least one cluster according to the similarity, where the similarity between the tuples of the documents in the same cluster is within a set range.
Optionally, the method further includes:
acquiring a title of each document;
performing word segmentation processing on the title of each document;
and obtaining the candidate word set of each document according to the word segmentation processing result of each document.
Optionally, obtaining the candidate word set of each document according to the word segmentation processing result of each document includes:
performing part-of-speech filtering on the word segmentation processing result of each document to obtain the target words in each document whose parts of speech are nouns and/or verbs;
and forming the target words of each document into the candidate word set of each document.
Optionally, determining the similarity between the tuples of all the documents to be clustered includes:
obtaining a word vector model of the tuple of each document;
determining the similarity between the word vector models of the tuples of all the documents to be clustered.
Optionally, obtaining a word vector model of the tuple of each document includes:
obtaining a word vector of each word in the tuple of each document;
and obtaining a word vector model of the tuple of each document according to the word vector of each word in the tuple of each document.
Based on the same inventive concept, an embodiment of the present invention further provides an apparatus for document clustering, including:
at least one processor, and
a memory coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the at least one processor performs a method of document clustering as described above by executing the instructions stored by the memory.
Based on the same inventive concept, the embodiment of the present invention further provides a computer-readable storage medium:
the computer readable storage medium stores computer instructions which, when executed on a computer, cause the computer to perform a method of document clustering as described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (7)

1. A method of clustering documents, comprising:
determining an importance value of a word included in a candidate word set of each document in the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation processing on each document, and the importance value represents the degree of association between the word and the document in which the word is located;
determining at least one word whose importance value is within a preset range in the candidate word set of each document;
forming the at least one word into a tuple of each document, wherein the tuple is used for completing clustering of each document; the tuple is a triple, and the triple comprises at least one noun and one verb;
obtaining a word vector of each word in the tuple of each document, and obtaining a word vector model of the tuple of each document according to the word vector of each word in the tuple of each document, wherein the word vector of each word is calculated by a Word2vector tool;
determining the similarity between the word vector models of the tuples of all the documents to be clustered, and aggregating all the documents to be clustered into at least one cluster according to the similarity, wherein the similarity between the word vector models of the tuples of the documents in the same cluster is within a set range.
2. The method of claim 1, wherein the method further comprises:
acquiring a title of each document;
performing word segmentation processing on the title of each document;
and obtaining the candidate word set of each document according to the word segmentation processing result of each document.
3. The method of claim 2, wherein obtaining the candidate word set of each document according to the word segmentation processing result of each document comprises:
performing part-of-speech filtering on the word segmentation processing result of each document to obtain the target words in each document whose parts of speech are nouns and/or verbs;
and forming the target words of each document into the candidate word set of each document.
4. An apparatus for document clustering, comprising:
a first determining unit, configured to determine an importance value of a word included in a candidate word set of each document in the documents to be clustered, wherein the candidate word set includes words obtained by performing word segmentation processing on each document, and the importance value represents the degree of association between the word and the document in which the word is located;
a second determining unit, configured to determine at least one word whose importance value is within a preset range in the candidate word set of each document;
a composition unit, configured to compose the at least one word into a tuple of each document, wherein the tuple is used to complete clustering of each document; the tuple is a triple, and the triple comprises at least one noun and one verb;
a third determining unit, configured to obtain a word vector of each word in the tuple of each document, and obtain a word vector model of the tuple of each document according to the word vector of each word in the tuple of each document, wherein the word vector of each word is calculated by a Word2vector tool;
and to determine the similarity between the word vector models of the tuples of all the documents to be clustered, and aggregate all the documents to be clustered into at least one cluster according to the similarity, wherein the similarity between the word vector models of the tuples of the documents in the same cluster is within a set range.
5. The device of claim 4, further comprising an acquisition unit to:
acquiring a title of each document;
performing word segmentation processing on the title of each document;
and obtaining the candidate word set of each document according to the word segmentation processing result of each document.
6. An apparatus for document clustering, comprising:
at least one processor, and
a memory coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the at least one processor performing the method of any one of claims 1-3 by executing the instructions stored by the memory.
7. A computer-readable storage medium characterized by:
the computer readable storage medium stores computer instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1-3.
CN201711423310.4A 2017-12-25 2017-12-25 Document clustering method and device Active CN110019806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711423310.4A CN110019806B (en) 2017-12-25 2017-12-25 Document clustering method and device

Publications (2)

Publication Number Publication Date
CN110019806A CN110019806A (en) 2019-07-16
CN110019806B true CN110019806B (en) 2021-08-06

Family

ID=67187021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711423310.4A Active CN110019806B (en) 2017-12-25 2017-12-25 Document clustering method and device

Country Status (1)

Country Link
CN (1) CN110019806B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FI20185865A1 (en) * 2018-10-13 2020-04-14 Iprally Tech Oy Method of training a natural language search system, search system and corresponding use
CN110888981B (en) * 2019-10-30 2022-11-01 深圳价值在线信息科技股份有限公司 Title-based document clustering method and device, terminal equipment and medium
CN110991168B (en) 2019-12-05 2024-05-17 京东方科技集团股份有限公司 Synonym mining method, synonym mining device, and storage medium
CN111325015B (en) * 2020-02-19 2024-01-30 南瑞集团有限公司 Document duplicate checking method and system based on semantic analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464898A (en) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 Method for extracting feature word of text
CN101819573A (en) * 2009-09-15 2010-09-01 电子科技大学 Self-adaptive network public opinion identification method
CN103870447A (en) * 2014-03-11 2014-06-18 北京优捷信达信息科技有限公司 Keyword extracting method based on implied Dirichlet model
CN104778204A (en) * 2015-03-02 2015-07-15 华南理工大学 Multi-document subject discovery method based on two-layer clustering
CN107145568A (en) * 2017-05-04 2017-09-08 成都华栖云科技有限公司 A kind of quick media event clustering system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8156056B2 (en) * 2007-04-03 2012-04-10 Fernando Luege Mateos Method and system of classifying, ranking and relating information based on weights of network links
EP2188743A1 (en) * 2007-09-12 2010-05-26 ReputationDefender, Inc. Identifying information related to a particular entity from electronic sources

Also Published As

Publication number Publication date
CN110019806A (en) 2019-07-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
  Address after: 100032 Beijing Finance Street, No. 29, Xicheng District
  Applicant after: CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd.
  Applicant after: CHINA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.
  Address before: 100032 Beijing Finance Street, No. 29, Xicheng District
  Applicant before: China Mobile Communications Corp.
  Applicant before: CHINA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.
TA01 Transfer of patent application right
  Effective date of registration: 20200318
  Address after: Room 1006, building 16, yard 16, Yingcai North Third Street, future science city, Changping District, Beijing 102209
  Applicant after: China Mobile Information Technology Co.,Ltd.
  Applicant after: CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd.
  Address before: 100032 Beijing Finance Street, No. 29, Xicheng District
  Applicant before: CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd.
  Applicant before: CHINA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.
GR01 Patent grant