CN111680131A - Document clustering method and system based on semantics and computer equipment - Google Patents


Info

Publication number
CN111680131A
Authority
CN
China
Prior art keywords
document
word
matrix
words
word frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010576446.4A
Other languages
Chinese (zh)
Other versions
CN111680131B (en)
Inventor
余显学 (Yu Xianxue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd filed Critical Ping An Bank Co Ltd
Priority to CN202010576446.4A
Publication of CN111680131A
Application granted
Publication of CN111680131B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis

Abstract

The invention relates to the technical field of artificial intelligence and provides a semantic-based document clustering method comprising the following steps: acquiring an input document and preprocessing it; performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document to construct a word frequency-inverse document matrix; inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model to obtain a similarity matrix adapted to the word frequency-inverse document matrix; performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix; and performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster, and labeling each document in the document cluster according to the feature words contained in the word cluster. The invention solves the problem of low accuracy of document clustering results in the prior art. The invention also relates to the field of blockchains; the natural language processing model can be stored on a blockchain.

Description

Document clustering method and system based on semantics and computer equipment
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a document clustering method and system based on semantics, computer equipment and a computer readable storage medium.
Background
With the popularity of the internet, information is growing at a rapid pace; we face more and more text to process every day and can easily be submerged in this ocean of information. With the development of artificial intelligence, the demand for personalized services keeps increasing, and providing personalized services from massive text information is a major challenge for delivering better services and experiences. The foundation for meeting this challenge is automatic text classification. Text clustering technology has been widely applied in personalized news recommendation, text sentiment analysis, text information filtering, and the like.
Existing solutions such as NMF, LSA, PLSA, and LDA divide documents by topic by constructing a document-word matrix and then building a document-topic-word three-layer model on top of it. However, such schemes classify text topics purely according to word frequency statistics and cannot take semantics into account. For example, suppose three documents contain the evaluation terms "excellent", "outstanding", and "poor" respectively. If a TF-IDF document-word matrix is established from only these three words, the matrix contains no semantic parameters, and the output is that the three documents belong to the same class, i.e. the pairwise similarities between the three documents are all the same; yet based on semantic analysis, the first two documents are actually more similar to each other.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide a semantic-based document clustering method, a semantic-based document clustering system, a computer device, and a computer-readable storage medium, which solve the problem of low clustering accuracy in the prior art.
In order to achieve the above object, an embodiment of the present invention provides a semantic-based document clustering method, including the following steps:
acquiring an input document and preprocessing the input document to obtain a processed input document;
performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency;
inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix matched with a word frequency-inverse document matrix, wherein the similarity matrix comprises similarity values among the words;
performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster, and performing label assignment on each document in the document cluster according to the feature words contained in the word cluster and performing associated storage on the document and the corresponding label.
Preferably, the step of obtaining an input document and preprocessing the input document to obtain a processed input document includes:
acquiring an input document;
performing word segmentation processing on the input document to obtain a first intermediate document;
and traversing each word in the first intermediate document after the word segmentation processing, and removing stop words in the first intermediate document to obtain the processed input document.
Preferably, the step of performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained by calculation includes:
traversing text data contained in each document in the input documents, and calculating word frequency corresponding to the words according to the number of times of the words appearing in the text and the total word number of the text;
obtaining the inverse document frequency corresponding to the word according to the total number of the documents contained in the input document and the number of the documents containing the word;
and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency.
Preferably, the inputting the word used in the word frequency statistics as an object into a pre-stored natural language processing model includes:
and taking the words adopted in the word frequency statistics as objects, and inputting the words into the pre-stored natural language processing model in the same word order as the word order mapped by the row-direction or column-direction elements of the word frequency-inverse document matrix.
Preferably, the semantic propagation is performed on the word frequency-inverse document matrix according to the similarity matrix to obtain the second word frequency-inverse document matrix as follows:
A′=A*Net
wherein, A' is a second word frequency-inverse document matrix, A is a word frequency-inverse document matrix, and Net is a similarity matrix.
Preferably, the step of inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model to obtain a similarity matrix adapted to the word frequency-inverse document matrix, where the similarity matrix includes similarity values between the words includes:
inputting the words adopted in the word frequency statistics as objects into a pre-stored natural language processing model, wherein the natural language processing model is pre-stored in a block chain;
the natural language processing model generates word frequency vectors according to the input sequence of the words and the word frequencies corresponding to the words;
the natural language processing model applies a preset similarity function to the word frequency vectors corresponding to different words, and calculates the similarity values between the words;
the natural language processing model integrates the similarity values among the words through a preset matrix generator to generate a similarity matrix matched with the word frequency-inverse document matrix;
the similarity function calculation formula is as follows:
cosθ = (x1·x2 + y1·y2) / ( sqrt(x1² + y1²) · sqrt(x2² + y2²) )
wherein the vector of word A is {x1, y1}, the vector of word B is {x2, y2}, and cosθ is the similarity value.
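As an illustrative sketch (not part of the claimed method), the cosine similarity function above can be computed for word vectors as follows; the two vectors used are made-up values:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two-dimensional word vectors {x1, y1} and {x2, y2}, as in the formula above.
word_a = [1.0, 2.0]
word_b = [2.0, 4.0]   # parallel vector, so the similarity is 1
print(cosine_similarity(word_a, word_b))
```

Orthogonal vectors give similarity 0, matching the treatment of semantically unrelated words in the examples that follow.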
Preferably, the step of tagging the document cluster according to the feature words contained in the word cluster and storing the document and the corresponding tags in association includes:
according to the feature words contained in the word cluster, storing the feature words in association with the document cluster as secondary labels;
and performing subject-word lookup in a preset vocabulary table according to the feature words, querying the subject words associated with the feature words in the vocabulary table, and storing the subject words in association with the document cluster as primary labels.
In order to achieve the above object, an embodiment of the present invention further provides a document clustering system based on semantics, including:
the preprocessing module is used for acquiring an input document and preprocessing the input document to obtain a processed input document;
the matrix module is used for performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained through calculation;
the similarity module is used for inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix matched with a word frequency-inverse document matrix, and the similarity matrix comprises similarity values among the words;
the semantic propagation module is used for performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and the clustering module is used for performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster, and labels are given to all documents in the document cluster according to the feature words related in the word cluster and are stored in an associated manner.
To achieve the above object, an embodiment of the present invention further provides a computer device, where the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the computer program, when executed by the processor, implements the steps of the semantic-based document clustering method as described above.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor to cause the at least one processor to execute the steps of the semantic-based document clustering method as described above.
According to the semantic-based document clustering method, system, computer device, and computer-readable storage medium provided by the embodiments of the invention, the similarity between words is calculated as semantic information, a similarity matrix adapted to the word frequency-inverse document matrix is generated, and semantic propagation is completed through the matrix product of the two, thereby solving the prior-art problem that document clustering lacks consideration of semantics and the precision of actual clustering results is therefore not high.
Drawings
FIG. 1 is a schematic flow chart of a semantic-based document clustering method according to a first embodiment of the present invention;
FIG. 2 is a schematic flow chart of step 100 in another embodiment of the semantic-based document clustering method of the present invention;
FIG. 3 is a schematic flow chart of step 200 in another embodiment of the semantic-based document clustering method of the present invention;
FIG. 4 is a schematic flow chart diagram illustrating step 300 of another embodiment of a semantic-based document clustering method according to the present invention;
FIG. 5 is a flow chart illustrating step 500 of another embodiment of the semantic-based document clustering method of the present invention;
FIG. 6 is a schematic diagram of program modules of a second embodiment of the semantic-based document clustering system of the present invention;
fig. 7 is a schematic diagram of a hardware structure of a third embodiment of the computer apparatus according to the present invention.
Detailed Description
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects, meaning that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following associated objects.
It should be understood that although the terms first, second, etc. may be used to describe specified keywords in embodiments of the present invention, the specified keywords should not be limited to these terms. These terms are only used to distinguish specified keywords from each other. For example, a first specified keyword may also be referred to as a second specified keyword, and similarly, a second specified keyword may also be referred to as a first specified keyword, without departing from the scope of embodiments of the present invention.
The word "if" as used herein may be interpreted as "upon", "when", "in accordance with a determination", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)", depending on the context.
Example one
Referring to FIG. 1, a flowchart illustrating steps of a semantic-based document clustering method according to an embodiment of the present invention is shown. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed.
The method comprises the following specific steps:
step 100, acquiring an input document and preprocessing the input document to obtain a processed input document;
specifically, the document represents an article or a sentence or paragraph around a central meaning, and the input document may be one document or a plurality of documents.
The processor pulls the input document along the preset path information, or sets a buffer area, the server places the newly uploaded document into the buffer area, the processor in the server periodically extracts all documents in the buffer area according to the set pulling frequency, and the documents are used as the input document to be subjected to a series of preprocessing to finish clustering output.
Step 200, performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained through calculation;
Specifically, the word frequency is the frequency with which a word appears in a sentence or passage. Compared with the raw occurrence count, the word frequency evaluates more objectively the contribution of a word to the semantics of the text: the higher the word frequency, the greater the contribution to the semantics. The inverse document frequency measures the general importance of a word: a word that appears in many documents has lower importance, and conversely higher importance.
Step 300, inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix adapted to the word frequency-inverse document matrix, wherein the similarity matrix comprises similarity values among the words;
Specifically, a preset natural language processing model is trained on a large Chinese sentence corpus; the invention preferably uses a BERT model. Training masks some words in a sentence and then uses the model to predict the masked words; since this training technique is prior art, it is not repeated herein. By embedding words into vectors with the deep-learning natural language processing model BERT, the semantic similarity between words can be calculated. For example, suppose training yields a similarity of 0.8 between the words "excellent" and "outstanding", while each has similarity 0 with "poor"; the word-word similarity matrix Net is then constructed as:
Net =
    1    0.8  0
    0.8  1    0
    0    0    1
For example, suppose the ratings contained in three documents are, respectively: "excellent", "outstanding", and "poor". If the word frequency-inverse document matrix A is built from only these three words in the documents:
A =
    1  0  0
    0  1  0
    0  0  1
where the three rows A(1,:), A(2,:), A(3,:) represent the documents containing "excellent", "outstanding", and "poor" respectively, then the pairwise similarities calculated between the three documents are all the same, and the three documents cannot be distinguished effectively.
Therefore, based on the semantic information of words, a word network connection matrix Net is constructed from word-word similarity. Each value in the matrix lies in [0,1] and represents the similarity between two words; the larger the value, the more similar the two words.
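A minimal sketch of how such a word-word network connection matrix Net could be assembled from word vectors; the embedding values below are invented for illustration and would in practice come from a model such as BERT:

```python
import numpy as np

# Hypothetical word embeddings (illustrative values only; a real system
# would obtain these from a pre-trained language model such as BERT).
embeddings = {
    "excellent":   np.array([0.9, 0.4, 0.1]),
    "outstanding": np.array([0.8, 0.5, 0.2]),
    "poor":        np.array([-0.7, 0.1, 0.9]),
}
words = ["excellent", "outstanding", "poor"]  # same order as the matrix columns

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

n = len(words)
net = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = cos(embeddings[words[i]], embeddings[words[j]])
        net[i, j] = max(s, 0.0)   # clip negatives so Net stays in [0, 1]
print(np.round(net, 2))
```

With these made-up vectors, "excellent" and "outstanding" come out highly similar while both are unrelated to "poor", reproducing the structure of the Net matrix in the example.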
Step 400, performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
Specifically, following the above example, semantic propagation is performed on the matrix A with the matrix Net to obtain the second word frequency-inverse document matrix A′.
The formula of semantic propagation is designed as follows:
A′=A*Net
Continuing the above example, the output result is:
A′ =
    1    0.8  0
    0.8  1    0
    0    0    1
At this time, the similarity between A′(1,:) (the first row of the matrix, i.e. document 1) and A′(2,:) is improved and is clearly larger than that between A′(1,:) and A′(3,:); the degree of improvement is related to the connection weight of "excellent" and "outstanding" in the word-word network, thereby effectively reducing the distance between document 1 and document 2.
In the prior art, only word frequency is considered when documents are divided, and the semantics of words are not utilized. When the three-layer topic classification model is constructed, information is lost in decomposing the high-dimensional sparse document-word matrix. Moreover, the similarity between two documents is measured over the whole bag-of-words space, while the actual similarity between two documents may rest only on a few locally representative words; counting the frequencies of the many other words introduces noise interference. To remedy these defects, this proposal designs a similarity matrix over the words in the bag of words and applies one round of semantic propagation to the generated word frequency-inverse document matrix to obtain a second word frequency-inverse document matrix, thereby solving the prior-art problem that documents with the same semantics but different words cannot be distinguished.
Step 500, performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster; assigning a label to each document in the document cluster according to the feature words contained in the word cluster, and storing the document in association with the corresponding label.
Bidirectional clustering (biclustering) algorithms are mainly based on statistics, graph theory, heuristic search, grid fitting, numerical rearrangement after discretization, and the like. This scheme preferably performs bidirectional clustering in a graph-partition-based manner, clustering the rows and columns of the samples simultaneously, so that samples that are similar only locally can be mined, for example two documents that are highly similar on some words but are pushed apart by a large number of unrelated words and therefore cannot otherwise be separated. After clustering with the biclustering algorithm, bi-clusters are obtained from the original document-word matrix; each bi-cluster clearly indicates which words caused the documents to be grouped into the same cluster, so local information can be mined. After this clearly interpretable clustering is completed, at least one document cluster is obtained, and the feature words it contains are embedded as labels in the header of the document cluster.
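As an illustrative sketch of the bidirectional clustering step, the following uses scikit-learn's SpectralCoclustering, one possible biclustering algorithm (the patent does not prescribe this specific implementation), on a toy document-word matrix with two obvious bi-clusters:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy document-word matrix: documents 0-1 share words 0-1,
# documents 2-3 share words 2-3 (small noise keeps the matrix connected).
X = np.array([[5.0, 4.0, 0.1, 0.1],
              [4.0, 5.0, 0.1, 0.1],
              [0.1, 0.1, 5.0, 4.0],
              [0.1, 0.1, 4.0, 5.0]])

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)
print(model.row_labels_)     # cluster assignment of each document
print(model.column_labels_)  # cluster assignment of each word
```

Each bi-cluster here pairs a document cluster with the word cluster responsible for it, which is exactly the (document cluster, word cluster) output the method relies on for labeling.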
According to the document clustering method based on the semantics, provided by the embodiment of the invention, the similarity between words is calculated as the semantic information, the similarity matrix matched with the word frequency-inverse document matrix is generated, and the semantic propagation is completed by the product and the operation result of the two, so that the problem of low precision of the actual clustering result caused by the lack of semantic consideration of document clustering in the prior art is solved.
Optionally, referring to fig. 2, in another embodiment, the step 100 of obtaining an input document and preprocessing the input document to obtain a processed input document further includes:
step 110, acquiring an input document;
step 120, performing word segmentation processing on the input document to obtain a first intermediate document;
step 130, traversing each word in the first intermediate document after the word segmentation processing, and removing the stop word in the first intermediate document to obtain the processed input document.
Specifically, the first processing of the input document is word segmentation: word-graph scanning is performed on the input document based on a preset prefix dictionary, segmentation symbols are added to the words identified in the scanning, and the input document containing the segmentation symbols is obtained as the first intermediate document.
The preset prefix dictionary arranges words in order of prefix inclusion. For example, if words beginning with "上" (shang) appear in the dictionary, all words with that prefix appear in the same part, such as "上海" (Shanghai) and further "上海市" (Shanghai City), forming a hierarchical inclusion structure. During word-graph scanning, when the character "上" is recognized, the phrases associated with "上" in the prefix dictionary are used as the matching basis against the following characters of the original sentence, and when a phrase such as "上海市" is matched successfully, a segmentation symbol is added after its last character.
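A minimal sketch of prefix-dictionary segmentation as described above, using forward maximum matching over a toy dictionary; the dictionary, sentence, and maximum word length are illustrative assumptions (production tokenizers build a full prefix dictionary and a word graph):

```python
# Toy prefix dictionary; a real system would load a large lexicon.
DICTIONARY = {"上海", "上海市", "天气", "很", "好"}
MAX_WORD_LEN = 3  # longest dictionary entry, in characters

def segment(sentence):
    """Forward maximum matching: at each position, take the longest
    dictionary word starting there, falling back to a single character."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words

print(segment("上海市天气很好"))  # ['上海市', '天气', '很', '好']
```

Note how "上海市" wins over its prefix "上海" because the longest match is tried first, mirroring the hierarchical inclusion structure of the prefix dictionary.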
In addition, in order to improve processing response speed and reduce calculation time, the word segmentation in the invention is designed to be performed by a tokenizer in units of words, as in the following example.
The input document contains the text: "Are you curious about tokenization? " \
"Let's see how it works! " \
"We need to analyze a couple of sentences " \
"with punctuations to see it in action."
The result output using a for statement is:
#1 Are
#2 you
#3 curious
#4 about
# ...
#28 action
The final output word list containing 28 words is then the first intermediate document.
After the word segmentation processing is finished, stop-word removal is performed on the first intermediate document: a stop-word vocabulary is stored in the server in advance, the first intermediate document is matched against this vocabulary, and the matched stop words are removed to obtain the processed input document.
Exemplary stop words are function words that carry no significant information, such as the Chinese particles "的" and "地".
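A minimal sketch of the stop-word removal step; the stop-word list here is a small illustrative subset, not the vocabulary actually stored on the server:

```python
# Illustrative stop-word list (a real deployment would load the
# server's full pre-stored stop-word vocabulary).
STOP_WORDS = {"的", "地", "了", "是", "and", "the", "of"}

def remove_stop_words(tokens):
    """Drop every token that matches the stop-word vocabulary."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["平安", "的", "科技", "是", "excellent"]
print(remove_stop_words(tokens))  # ['平安', '科技', 'excellent']
```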
Optionally, referring to fig. 3, in another embodiment, in step 200, performing word frequency statistics and inverse document frequency calculation on each word included in the processed input document, and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained through calculation includes:
step 210, traversing text data contained in each document in the input documents, and calculating word frequency corresponding to the word according to the number of times of the word appearing in the text and the total word number of the text;
step 220, obtaining the inverse document frequency corresponding to the word according to the total number of the documents contained in the input document and the number of the documents containing the word;
step 230 constructs a word frequency-inverse document matrix based on the calculated word frequency and inverse document frequency.
Specifically, the calculation formulas of the word frequency and the inverse document frequency are as follows:
TF(w) = (number of occurrences of word w in the text) / (total number of words in the text)
IDF(w) = log( (total number of documents in the input) / (number of documents containing word w) )
and each element in the constructed word frequency-inverse document matrix is the word frequency and the inverse document frequency.
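The TF and IDF calculations can be sketched directly; the three single-word documents mirror the running example, and the IDF variant used (natural log, no smoothing) is an assumption:

```python
import math

# Preprocessed documents reduced to their evaluation words, as in the example.
docs = [
    ["excellent"],      # document 1
    ["outstanding"],    # document 2
    ["poor"],           # document 3
]
vocab = ["excellent", "outstanding", "poor"]

def tf(word, doc):
    """Occurrences of the word divided by total words in the document."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of total documents over documents containing the word."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

matrix = [[tf(w, d) * idf(w, docs) for w in vocab] for d in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

Each word appears in exactly one of the three documents, so the resulting matrix is diagonal, matching the 1/0 occurrence pattern shown in Table 1.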
Exemplary word frequencies are as follows (rows: documents; columns: "excellent", "outstanding", "poor"):
    1  0  0
    0  1  0
    0  0  1
The source input documents are:
Document 1: "Ping An Technology is an excellent company"
Document 2: "Ping An's financial products are outstanding"
Document 3: "XXX is very poor"
Because the characteristic information of the rows and the columns is hidden in the matrix, the hidden information of the matrix is further explained by the following table:
              excellent   outstanding   poor
Document 1        1            0          0
Document 2        0            1          0
Document 3        0            0          1
TABLE 1
"excellent" appears 1 time in document 1, 0 times in document 2, and 0 times in document 3; therefore, the first column of the matrix is 1, 0, 0.
"outstanding" appears 0 times in document 1, 1 time in document 2, and 0 times in document 3; therefore, the second column of the matrix is 0, 1, 0.
"poor" appears 0 times in document 1, 0 times in document 2, and 1 time in document 3; therefore, the third column of the matrix is 0, 0, 1.
An exemplary code example is as follows (reconstructed here as a runnable scikit-learn sketch; the library choice is an assumption, since the original listing was garbled):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'The brown dog is running.',
    'The black dog is in the black room.',
    'Running in the room is forbidden.',
]

# build word frequency
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# obtaining all feature names
print(vectorizer.get_feature_names_out())
# ['black' 'brown' 'dog' 'forbidden' 'in' 'is' 'room' 'running' 'the']

# outputting word frequency
print(counts.toarray())
# [[0 1 1 0 0 1 0 1 1]
#  [2 0 1 0 1 1 1 0 2]
#  [0 0 0 1 1 1 1 1 1]]

The matrix representation at the code layer is a multi-row array structure.
Optionally, in another embodiment, inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model includes:
taking the words used in the word frequency statistics as objects, and inputting them into the pre-stored natural language processing model in the same word order as the word order mapped by the row-direction elements or the column-direction elements in the word frequency-inverse document matrix.
Specifically, following the example in the above embodiment, the generated word frequency-inverse document matrix is a 3 × 3 matrix whose hidden mapping relation is given in table 1, and the word order mapped along each row of the matrix is {excellent, outstanding, poor};

therefore, the word order input to the natural language processing model must also be {excellent, outstanding, poor}. This design ensures that the generated similarity matrix is fully aligned with the word frequency-inverse document matrix, avoiding unnecessary calculation errors that would lower the clustering precision.
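A minimal numpy sketch of this alignment; the word vectors below are invented toy values, and the word names follow the three-document example:

```python
import numpy as np

# Toy word vectors (illustrative values, not from the patent).
vectors = {"excellent": [1.0, 0.9], "outstanding": [0.9, 1.0], "poor": [0.1, 0.0]}

def similarity_matrix(words):
    """Pairwise cosine similarities with rows/columns in the given word order."""
    v = np.array([vectors[w] for w in words])
    u = v / np.linalg.norm(v, axis=1, keepdims=True)  # normalize each row
    return u @ u.T

# Use the same word order as the TF-IDF matrix columns, so that element
# (i, j) of Net refers to the same word pair in both matrices.
columns = ["excellent", "outstanding", "poor"]
net = similarity_matrix(columns)
```

If the model were fed the words in any other order, element (i, j) of the similarity matrix would no longer describe the word pair at columns i and j of the TF-IDF matrix, which is exactly the misalignment the passage above warns against.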
Optionally, in another embodiment, the semantic propagation performed on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix uses the following calculation formula:
A′=A*Net
wherein A' is the second word frequency-inverse document matrix, A is the word frequency-inverse document matrix, Net is the similarity matrix, and * denotes matrix multiplication.
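A numpy sketch of the propagation step, using the identity-patterned matrix from the example above; the numeric values in Net are assumptions for illustration:

```python
import numpy as np

# Word frequency-inverse document matrix A (rows: documents, columns: words).
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Similarity matrix Net: the first two words are semantically close (0.8),
# the third is unrelated to both. Values are illustrative only.
Net = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

# Semantic propagation A' = A * Net: document 1, which only contained the
# first word, now also carries weight on the semantically close second word.
A2 = A @ Net
```

The effect is that documents sharing no literal words, but using semantically similar ones, end up with overlapping rows in A', which is what allows the subsequent clustering to group them together.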
Optionally, referring to fig. 4, in another embodiment, step 300 of inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model to obtain a similarity matrix adapted to the word frequency-inverse document matrix, where the similarity matrix includes similarity values between the words, includes:
step 310, inputting the words adopted in the word frequency statistics as objects into a pre-stored natural language processing model, wherein the natural language processing model is pre-stored in a block chain;
Specifically, the words used in the word frequency statistics may be encapsulated to obtain encapsulated word input data, and the encapsulated words are uploaded to a natural language processing model stored in a blockchain; the natural language processing model is not limited to being stored in the blockchain and may also be stored in a distributed server or a local server. Uploading the summary information to the natural language processing model stored in the blockchain ensures security and transparency for the user. A user device may download the summary information from the blockchain to verify whether the encapsulated word input data has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step 320, the natural language processing model generates word frequency vectors according to the input order of the words and the word frequencies corresponding to the words;
step 330, the natural language processing model calculates word frequency vectors corresponding to different words through a preset similarity function, and calculates similarity values between the words;
step 340, the natural language processing model integrates the similarity values among the words through a preset matrix generator to generate a similarity matrix matched with the word frequency-inverse document matrix;
The similarity function is calculated as:

    cos θ = (x1·x2 + y1·y2) / ( sqrt(x1² + y1²) · sqrt(x2² + y2²) )

where the vector of word A is {x1, y1}, the vector of word B is {x2, y2}, and cos θ is the similarity value.
Specifically, after a word used in word frequency statistics is input into a natural language model, the natural language processing model converts the word into a word frequency vector, and then calculates the similarity between the words, and when the word is converted into the word frequency vector, the vector calibration corresponding to the word needs to be completed by means of a pre-stored dictionary.
Illustratively, given the dictionary {Ping An, technology},
the input words are: 1. Ping An; 2. technology.
The word vector of "Ping An" is then expressed as {1, 0},
and the word vector of "technology" is expressed as {0, 1}.
To calculate the similarity of the two words, their vectors are input into the similarity function to obtain the similarity value.
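A sketch of the similarity calculation on such one-hot dictionary vectors (pure Python; no patent-specific API is implied):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v_a = [1, 0]  # one-hot vector of the first dictionary word
v_b = [0, 1]  # one-hot vector of the second dictionary word

print(cosine(v_a, v_b))  # orthogonal one-hot vectors -> 0.0
```

Pure one-hot vectors are always orthogonal, so this example yields similarity 0; it is the word-frequency weighting described in steps 320-330 that produces the non-trivial similarity values the method relies on.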
Optionally, referring to fig. 5, in step 500, the step of assigning labels to the documents in the document cluster according to the feature words contained in the word cluster and storing the documents and the corresponding labels in an associated manner includes:
step 510, according to the feature words contained in the word cluster, performing associated storage on the feature words as secondary labels in the document cluster;
step 520, performing subject term query on a preset vocabulary according to the feature terms, querying subject terms associated with the feature terms in the vocabulary, and performing associated storage by using the subject terms as primary tags of the document clusters.
Specifically, in addition to the dictionary for word vectorization provided in the above embodiment, the server is further provided with a topic word dictionary for label assignment. The topic word dictionary has a two-level structure: the first level contains topic words set by technicians according to requirements, such as sports, finance, domestic, foreign, and real estate; the second level is a sub-level of the first and contains detail feature words surrounding each topic, for example, the second-level words under sports include football, basketball, table tennis, and the like.
After the bidirectional clustering result is obtained, each word cluster in a bi-cluster contains a plurality of feature words, and the documents are partitioned into clusters, each paired with a word cluster from the other clustering dimension. For example, document cluster 1 corresponds to word cluster 1, meaning that each document in document cluster 1 is associated with each word in word cluster 1. The words in word cluster 1 are assigned as secondary labels of each document in document cluster 1; then each word in word cluster 1 is matched against the preset topic word dictionary, the topic words related to word cluster 1 are queried, and the related topic words are assigned as primary labels of each document in document cluster 1.
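A sketch of this two-level label assignment; the dictionary contents and the function name are illustrative, not taken from the patent:

```python
# Two-level topic word dictionary: first level = topic words,
# second level = detail feature words surrounding each topic.
topic_dict = {
    "sports": {"football", "basketball", "table tennis"},
    "finance": {"stock", "fund", "interest rate"},
}

def assign_labels(word_cluster):
    """Return (primary, secondary) labels for every document in the
    document cluster paired with this word cluster."""
    secondary = set(word_cluster)  # the feature words themselves
    # A topic becomes a primary label when its second-level vocabulary
    # overlaps the word cluster.
    primary = {topic for topic, feats in topic_dict.items()
               if feats & secondary}
    return primary, secondary
```

For instance, a word cluster {football, basketball} yields the secondary labels {football, basketball} and the primary label {sports} for every document in the paired document cluster.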
Example two
Referring to FIG. 6, a schematic diagram of program modules of the semantic-based document clustering system of the present invention is shown.
In this embodiment, the semantic-based document clustering system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the above-described semantic-based document clustering method of the present invention. The program modules referred to in the embodiments of the present invention are a series of computer program instruction segments capable of performing specific functions, and are better suited than the program itself for describing the execution process of the semantic-based document clustering system 20 in the storage medium. The following description specifically introduces the functions of the program modules of this embodiment:
the preprocessing module 200 is configured to obtain an input document and preprocess the input document to obtain a processed input document;
a matrix module 210, configured to perform word frequency statistics and inverse document frequency calculation on each word included in the processed input document, and construct a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency;
a similarity module 220, configured to input the words used in the word frequency statistics as objects into a pre-stored natural language processing model, to obtain a similarity matrix adapted to the word frequency-inverse document matrix, where the similarity matrix includes similarity values between the words;
the semantic propagation module 230 is configured to perform semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and a clustering module 240, configured to perform bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, where the bi-cluster includes a document cluster and a word cluster, and perform label assignment on each document in the document cluster according to a feature word related in the word cluster and perform associated storage on the document and a corresponding label.
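The patent does not name a concrete biclustering algorithm; spectral co-clustering is one standard choice, sketched here with scikit-learn on a toy matrix:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy second word frequency-inverse document matrix: documents 0-1 use
# words 0-1, documents 2-3 use words 2-3. The small background values
# keep all row/column sums positive, which the algorithm requires.
A2 = np.array([[5.0, 5.0, 0.1, 0.1],
               [5.0, 5.0, 0.1, 0.1],
               [0.1, 0.1, 5.0, 5.0],
               [0.1, 0.1, 5.0, 5.0]])

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(A2)
doc_clusters = model.row_labels_      # document cluster of each row
word_clusters = model.column_labels_  # word cluster of each column
```

Each bi-cluster then pairs the documents of one row group with the words of one column group, giving exactly the document cluster / word cluster pairs that the clustering module labels and stores.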
In an exemplary embodiment, the preprocessing module 200 is further configured to obtain an input document; performing word segmentation processing on the input document to obtain a first intermediate document; traversing each word in the first intermediate document after the word segmentation processing, and removing stop words in the first intermediate document to obtain the processed input document.
In an exemplary embodiment, the matrix module 210 is further configured to traverse text data included in each document of the input documents, and calculate word frequencies corresponding to the words according to the number of times that the words appear in the text and the total number of words in the text; obtaining the inverse document frequency corresponding to the word according to the total number of the documents contained in the input document and the number of the documents containing the word; and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency.
In an exemplary embodiment, the similarity module 220 is further configured to input a word used in the word frequency statistics as an object into a pre-stored natural language processing model, where the natural language processing model is pre-stored in a block chain; the natural language processing model generates word frequency vectors according to the input sequence of the words and the word frequencies corresponding to the words; the natural language processing model calculates word frequency vectors corresponding to different words through a preset similarity function, and similarity values among the words are calculated; the natural language processing model integrates the similarity values among the words through a preset matrix generator to generate a similarity matrix matched with the word frequency-inverse document matrix;
in an exemplary embodiment, the clustering module 240 is further configured to perform, according to the feature words included in the word clusters, associated storage of the feature words as secondary tags in the document clusters; and performing subject word query on a preset vocabulary table according to the characteristic words, querying subject words related to the characteristic words in the vocabulary table, and performing related storage by using the subject words as primary labels of the document clusters.
EXAMPLE III
Fig. 7 is a schematic diagram of a hardware architecture of a computer device according to a third embodiment of the present invention. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction. The computer device 2 may be a Personal Digital Assistant (PDA), a smart phone, a notebook computer, a netbook, a Personal computer, and other similar devices. As shown, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and a semantic-based document clustering system 20, communicatively coupled to each other via a system bus. Wherein:
In the present embodiment, the memory 21 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 2. Of course, the memory 21 may also comprise both internal and external storage units of the computer device 2. In this embodiment, the memory 21 is generally used for storing the operating system installed on the computer device 2 and various application software, such as the program code of the semantic-based document clustering system 20 in the first embodiment. In addition, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to execute the program code stored in the memory 21 or process data, for example, execute the document clustering system 20 based on semantics to implement the document clustering method based on semantics in the first embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is typically used for establishing a communication connection between the computer device 2 and other electronic apparatuses. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, establish a data transmission channel and a communication connection between the computer device 2 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a Global System of Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth (Bluetooth), Wi-Fi, and the like.
It is noted that fig. 7 only shows the computer device 2 with components 20-23, but it is to be understood that not all of the shown components are required to be implemented, and that more or less components may be implemented instead.
In this embodiment, the semantic-based document clustering system 20 stored in the memory 21 may be further divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to accomplish the present invention.
Example four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, or an application store, on which a computer program is stored that implements corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment is used for storing the semantic-based document clustering system 20, and when executed by a processor, implements the semantic-based document clustering method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above embodiment method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or used directly or indirectly in other related fields, are included in the scope of the present invention.

Claims (10)

1. A document clustering method based on semantics is characterized by comprising the following steps:
acquiring an input document and preprocessing the input document to obtain a processed input document;
performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency;
inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix matched with a word frequency-inverse document matrix, wherein the similarity matrix comprises similarity values among the words;
performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster, performing label assignment on each document in the document cluster according to the feature words contained in the word cluster, and performing associated storage on the document and the corresponding label.
2. The semantic-based document clustering method of claim 1, wherein the step of obtaining input documents and preprocessing the input documents to obtain processed input documents comprises:
acquiring an input document;
performing word segmentation processing on the input document to obtain a first intermediate document;
traversing each word in the first intermediate document after the word segmentation processing, and removing stop words in the first intermediate document to obtain the processed input document.
3. The semantic-based document clustering method according to claim 1, wherein the step of performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency comprises:
traversing text data contained in each document in the input documents, and calculating word frequency corresponding to the words according to the number of times of the words appearing in the text and the total word number of the text;
obtaining the inverse document frequency corresponding to the word according to the total number of the documents contained in the input document and the number of the documents containing the word;
and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency.
4. The semantic-based document clustering method according to claim 1, wherein the inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model comprises:
and taking the words adopted in the word frequency statistics as objects, and inputting the words into a pre-stored natural language processing model according to the word sequence which is the same as the word sequence mapped by the row-direction elements or the column-direction elements in the word frequency inverse document matrix.
5. The semantic-based document clustering method according to claim 1, wherein the semantic propagation is performed on the word frequency-inverse document matrix according to the similarity matrix, and a calculation formula for obtaining a second word frequency-inverse document matrix is as follows:
A′=A*Net
wherein, A' is a second word frequency-inverse document matrix, A is a word frequency-inverse document matrix, and Net is a similarity matrix.
6. The semantic-based document clustering method according to claim 4, wherein the step of inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model to obtain a similarity matrix adapted to a word frequency-inverse document matrix, wherein the similarity matrix includes similarity values between the words comprises:
inputting the words adopted in the word frequency statistics as objects into a pre-stored natural language processing model, wherein the natural language processing model is pre-stored in a block chain;
the natural language processing model generates word frequency vectors according to the input sequence of the words and the word frequencies corresponding to the words;
the natural language processing model calculates word frequency vectors corresponding to different words through a preset similarity function, and similarity values among the words are calculated;
the natural language processing model integrates the similarity values among the words through a preset matrix generator to generate a similarity matrix matched with the word frequency-inverse document matrix;
the similarity function calculation formula is as follows:

    cos θ = (x1·x2 + y1·y2) / ( sqrt(x1² + y1²) · sqrt(x2² + y2²) )

wherein the vector of word A is {x1, y1}, the vector of word B is {x2, y2}, and cos θ is the similarity value.
7. The semantic-based document clustering method according to claim 1, wherein the steps of tagging the document clusters according to the feature words contained in the word clusters and storing the documents and the corresponding tags in association comprise:
according to the feature words contained in the word cluster, the feature words are used as secondary labels in the document cluster for associated storage;
and performing subject word query on a preset vocabulary table according to the characteristic words, querying subject words associated with the characteristic words in the vocabulary table, and performing associated storage by using the subject words as primary labels of the document clusters.
8. A semantic-based document clustering system, comprising:
the preprocessing module is used for acquiring an input document and preprocessing the input document to obtain a processed input document;
the matrix module is used for carrying out word frequency statistics and inverse document frequency calculation on each word contained in the processed input document and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained through calculation;
the similarity module is used for inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix matched with the word frequency-inverse document matrix, and the similarity matrix comprises similarity values among the words;
the semantic propagation module is used for performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and the clustering module is used for performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one double cluster, wherein the double cluster comprises a document cluster and a word cluster, labels are given to all documents in the document cluster according to the feature words related in the word cluster, and the documents and the corresponding labels are stored in an associated manner.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the semantic-based document clustering method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored therein a computer program executable by at least one processor to cause the at least one processor to perform the semantic-based document clustering method according to any one of claims 1 to 7.
CN202010576446.4A 2020-06-22 2020-06-22 Document clustering method and system based on semantics and computer equipment Active CN111680131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010576446.4A CN111680131B (en) 2020-06-22 2020-06-22 Document clustering method and system based on semantics and computer equipment


Publications (2)

Publication Number Publication Date
CN111680131A true CN111680131A (en) 2020-09-18
CN111680131B CN111680131B (en) 2022-08-12

Family

ID=72456124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010576446.4A Active CN111680131B (en) 2020-06-22 2020-06-22 Document clustering method and system based on semantics and computer equipment

Country Status (1)

Country Link
CN (1) CN111680131B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347246A (en) * 2020-10-15 2021-02-09 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectral decomposition
CN112446361A (en) * 2020-12-16 2021-03-05 上海芯翌智能科技有限公司 Method and equipment for cleaning training data
CN113342970A (en) * 2020-11-24 2021-09-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN117010010A (en) * 2023-06-01 2023-11-07 湖南信安数字科技有限公司 Multi-server cooperation high-security storage method based on blockchain

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005063159A (en) * 2003-08-13 2005-03-10 Fuji Xerox Co Ltd Document cluster processor and document cluster processing method
CN104598532A (en) * 2014-12-29 2015-05-06 中国联合网络通信有限公司广东省分公司 Information processing method and device
CN107590218A (en) * 2017-09-01 2018-01-16 南京理工大学 The efficient clustering method of multiple features combination Chinese text based on Spark
CN108197111A (en) * 2018-01-10 2018-06-22 华南理工大学 A kind of text automatic abstracting method based on fusion Semantic Clustering
CN108595706A (en) * 2018-05-10 2018-09-28 中国科学院信息工程研究所 A kind of document semantic representation method, file classification method and device based on theme part of speech similitude
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 A kind of patent text modeling method based on word2vec and semantic similarity
CN110825877A (en) * 2019-11-12 2020-02-21 中国石油大学(华东) Semantic similarity analysis method based on text clustering
US10685183B1 (en) * 2018-01-04 2020-06-16 Facebook, Inc. Consumer insights analysis using word embeddings


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
钱志森 等: "半监督语义动态文本聚类算法", 《电子科技大学学报》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347246A (en) * 2020-10-15 2021-02-09 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectral decomposition
CN112347246B (en) * 2020-10-15 2024-04-02 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectrum decomposition
CN113342970A (en) * 2020-11-24 2021-09-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN112446361A (en) * 2020-12-16 2021-03-05 上海芯翌智能科技有限公司 Method and equipment for cleaning training data
CN117010010A (en) * 2023-06-01 2023-11-07 湖南信安数字科技有限公司 Multi-server cooperation high-security storage method based on blockchain
CN117010010B (en) * 2023-06-01 2024-02-13 湖南信安数字科技有限公司 Multi-server cooperation high-security storage method based on blockchain

Also Published As

Publication number Publication date
CN111680131B (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN111680131B (en) Document clustering method and system based on semantics and computer equipment
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
CN103870973B (en) Information push, searching method and the device of keyword extraction based on electronic information
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN102831184B (en) According to the method and system text description of social event being predicted to social affection
CN110598000A (en) Relationship extraction and knowledge graph construction method based on deep learning model
CN110196893A (en) Method, device and storage medium for reviewing non-subjective test questions based on text similarity
US11782928B2 (en) Computerized information extraction from tables
CN107992542A (en) Topic-model-based similar article recommendation method
CN106372061A (en) Short text similarity calculation method based on semantics
CN105095444A (en) Information acquisition method and device
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN108319734A (en) Method for automatically constructing a product feature structure tree based on linear combination
CN109255012A (en) Implementation method and device for machine reading comprehension
CN106649250A (en) Method and device for identifying emotional new words
CN111813905A (en) Corpus generation method and device, computer equipment and storage medium
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN114997288A (en) Design resource association method
CN114048305A (en) Plan recommendation method for administrative penalty documents based on graph convolution neural network
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN115017320A (en) E-commerce text clustering method and system combining bag-of-words model and deep learning model
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN107908749B (en) Character retrieval system and method based on search engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant