CN111680131A - Document clustering method and system based on semantics and computer equipment - Google Patents


Info

Publication number
CN111680131A
Authority
CN
China
Prior art keywords
document
word
matrix
words
word frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010576446.4A
Other languages
Chinese (zh)
Other versions
CN111680131B (en)
Inventor
余显学 (Yu Xianxue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd filed Critical Ping An Bank Co Ltd
Priority to CN202010576446.4A
Publication of CN111680131A
Application granted
Publication of CN111680131B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis

Abstract

The invention relates to the technical field of artificial intelligence and provides a semantic-based document clustering method comprising the following steps: acquiring an input document and preprocessing it; performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document to construct a word frequency-inverse document matrix; inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model to obtain a similarity matrix adapted to the word frequency-inverse document matrix; performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix; and performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster, and labeling each document in the document cluster according to the feature words contained in the word cluster. The invention solves the problem of low accuracy of document clustering results in the prior art. The invention also relates to the field of blockchains; the natural language processing model can be stored on a blockchain.

Description

Document clustering method and system based on semantics and computer equipment
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a document clustering method and system based on semantics, computer equipment and a computer readable storage medium.
Background
With the popularity of the internet, information is growing at a rapid pace; we face more and more text to process every day and can easily be submerged in this ocean of information. With the development of artificial intelligence, the demand for personalized services keeps increasing, and providing personalized services from massive text information is a major challenge for delivering better services and experiences. The foundation for meeting this challenge is automatic text classification. Text clustering technology has been widely applied in personalized news recommendation, text sentiment analysis, text information filtering, and the like.
Existing solutions such as NMF, LSA, PLSA, and LDA divide documents by topic by constructing a document-word matrix and then building a document-topic-word three-layer model on top of it. However, such schemes classify text topics purely according to word frequency statistics and cannot take semantics into account. For example, suppose three documents contain the evaluation terms "excellent", "outstanding", and "poor" respectively. If a TF-IDF document-word matrix is established from only these three words, the matrix contains no semantic parameters, and the output is that the three documents belong to the same class, i.e. the pairwise similarities between the three documents are all the same; yet based on semantic analysis, the first two documents are actually more similar to each other.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide a semantic-based document clustering method, a semantic-based document clustering system, a computer device, and a computer-readable storage medium, which solve the problem of low clustering accuracy in the prior art.
In order to achieve the above object, an embodiment of the present invention provides a semantic-based document clustering method, including the following steps:
acquiring an input document and preprocessing the input document to obtain a processed input document;
performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency;
inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix matched with a word frequency-inverse document matrix, wherein the similarity matrix comprises similarity values among the words;
performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster, and performing label assignment on each document in the document cluster according to the feature words contained in the word cluster and performing associated storage on the document and the corresponding label.
Preferably, the step of obtaining an input document and preprocessing the input document to obtain a processed input document includes:
acquiring an input document;
performing word segmentation processing on the input document to obtain a first intermediate document;
and traversing each word in the first intermediate document after the word segmentation processing, and removing stop words in the first intermediate document to obtain the processed input document.
Preferably, the step of performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained by calculation includes:
traversing text data contained in each document in the input documents, and calculating word frequency corresponding to the words according to the number of times of the words appearing in the text and the total word number of the text;
obtaining the inverse document frequency corresponding to the word according to the total number of the documents contained in the input document and the number of the documents containing the word;
and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency.
Preferably, the inputting the word used in the word frequency statistics as an object into a pre-stored natural language processing model includes:
and taking the words adopted in the word frequency statistics as objects, and inputting the words into the pre-stored natural language processing model in the same word order as the word order mapped by the row-direction or column-direction elements of the word frequency-inverse document matrix.
Preferably, the semantic propagation is performed on the word frequency-inverse document matrix according to the similarity matrix to obtain the second word frequency-inverse document matrix as follows:
A′=A*Net
wherein, A' is a second word frequency-inverse document matrix, A is a word frequency-inverse document matrix, and Net is a similarity matrix.
Preferably, the step of inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model to obtain a similarity matrix adapted to the word frequency-inverse document matrix, where the similarity matrix includes similarity values between the words includes:
inputting the words adopted in the word frequency statistics as objects into a pre-stored natural language processing model, wherein the natural language processing model is pre-stored in a block chain;
the natural language processing model generates word frequency vectors according to the input sequence of the words and the word frequencies corresponding to the words;
the natural language processing model applies a preset similarity function to the word frequency vectors corresponding to different words, and calculates the similarity values between the words;
the natural language processing model integrates the similarity values among the words through a preset matrix generator to generate a similarity matrix matched with the word frequency-inverse document matrix;
the similarity function calculation formula is as follows:
cosθ = (x1·x2 + y1·y2) / ( sqrt(x1² + y1²) · sqrt(x2² + y2²) )
wherein the vector of word A is {x1, y1}, the vector of word B is {x2, y2}, and cosθ is the similarity value.
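As an illustrative sketch (not part of the claimed method), the cosine similarity function above can be computed for word vectors as follows; the two vectors used are made-up values:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two-dimensional word vectors {x1, y1} and {x2, y2}, as in the formula above.
word_a = [1.0, 2.0]
word_b = [2.0, 4.0]   # parallel vector, so the similarity is 1
print(cosine_similarity(word_a, word_b))
```

Orthogonal vectors give similarity 0, matching the treatment of semantically unrelated words in the examples that follow.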
Preferably, the step of tagging the document cluster according to the feature words contained in the word cluster and storing the document and the corresponding tags in association includes:
according to the feature words contained in the word cluster, storing the feature words in association with the document cluster as secondary labels;
and performing subject-word lookup in a preset vocabulary table according to the feature words, querying the subject words associated with the feature words in the vocabulary table, and storing the subject words in association with the document cluster as primary labels.
In order to achieve the above object, an embodiment of the present invention further provides a document clustering system based on semantics, including:
the preprocessing module is used for acquiring an input document and preprocessing the input document to obtain a processed input document;
the matrix module is used for performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained through calculation;
the similarity module is used for inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix matched with a word frequency-inverse document matrix, and the similarity matrix comprises similarity values among the words;
the semantic propagation module is used for performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and the clustering module is used for performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster, and labels are given to all documents in the document cluster according to the feature words related in the word cluster and are stored in an associated manner.
To achieve the above object, an embodiment of the present invention further provides a computer device, where the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the computer program, when executed by the processor, implements the steps of the semantic-based document clustering method as described above.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor to cause the at least one processor to execute the steps of the semantic-based document clustering method as described above.
According to the semantic-based document clustering method, system, computer device, and computer-readable storage medium provided by the embodiments of the invention, the similarity between words is calculated as semantic information, a similarity matrix adapted to the word frequency-inverse document matrix is generated, and semantic propagation is completed through the matrix product of the two, thereby solving the prior-art problem that document clustering lacks consideration of semantics and the precision of actual clustering results is therefore not high.
Drawings
FIG. 1 is a schematic flow chart of a semantic-based document clustering method according to a first embodiment of the present invention;
FIG. 2 is a schematic flow chart of step 100 in another embodiment of the semantic-based document clustering method of the present invention;
FIG. 3 is a schematic flow chart of step 200 in another embodiment of the semantic-based document clustering method of the present invention;
FIG. 4 is a schematic flow chart diagram illustrating step 300 of another embodiment of a semantic-based document clustering method according to the present invention;
FIG. 5 is a flow chart illustrating step 500 of another embodiment of the semantic-based document clustering method of the present invention;
FIG. 6 is a schematic diagram of program modules of a second embodiment of the semantic-based document clustering system of the present invention;
fig. 7 is a schematic diagram of a hardware structure of a third embodiment of the computer apparatus according to the present invention.
Detailed Description
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects, meaning that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following associated objects.
It should be understood that although the terms first, second, etc. may be used to describe specified keywords in embodiments of the present invention, the specified keywords should not be limited to these terms. These terms are only used to distinguish specified keywords from each other. For example, a first specified keyword may also be referred to as a second specified keyword, and similarly, a second specified keyword may also be referred to as a first specified keyword, without departing from the scope of embodiments of the present invention.
The word "if" as used herein may be interpreted as "upon", "when", "in accordance with a determination", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)", depending on the context.
Example one
Referring to FIG. 1, a flowchart illustrating steps of a semantic-based document clustering method according to an embodiment of the present invention is shown. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed.
The method comprises the following specific steps:
step 100, acquiring an input document and preprocessing the input document to obtain a processed input document;
specifically, the document represents an article or a sentence or paragraph around a central meaning, and the input document may be one document or a plurality of documents.
The processor pulls the input document along the preset path information, or sets a buffer area, the server places the newly uploaded document into the buffer area, the processor in the server periodically extracts all documents in the buffer area according to the set pulling frequency, and the documents are used as the input document to be subjected to a series of preprocessing to finish clustering output.
Step 200, performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained through calculation;
Specifically, the word frequency is the frequency with which a word appears in a sentence or passage. Compared with the raw occurrence count, the word frequency evaluates more objectively the contribution of a word to the semantics of the text: the higher the word frequency, the greater the contribution to the semantics. The inverse document frequency measures the general importance of a word: a word that appears in many documents has lower importance, and conversely higher importance.
Step 300, inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix adapted to the word frequency-inverse document matrix, wherein the similarity matrix comprises similarity values among the words;
Specifically, a preset natural language processing model is trained on a large Chinese sentence corpus; the invention preferably uses a BERT model. Training masks some words in a sentence and then uses the model to predict the masked words; since this training technique is prior art, it is not repeated herein. By embedding words into vectors with the deep-learning natural language processing model BERT, the semantic similarity between words can be calculated. For example, suppose training yields a similarity of 0.8 between the words "excellent" and "outstanding", while each has similarity 0 with "poor"; the word-word similarity matrix Net is then constructed as:
Net =
    1    0.8  0
    0.8  1    0
    0    0    1
For example, suppose the ratings contained in three documents are, respectively: "excellent", "outstanding", and "poor". If the word frequency-inverse document matrix A is built from only these three words in the documents:
A =
    1  0  0
    0  1  0
    0  0  1
where the three rows A(1,:), A(2,:), A(3,:) represent the documents containing "excellent", "outstanding", and "poor" respectively, then the pairwise similarities calculated between the three documents are all the same, and the three documents cannot be distinguished effectively.
Therefore, based on the semantic information of words, a word network connection matrix Net is constructed from word-word similarity. Each value in the matrix lies in [0,1] and represents the similarity between two words; the larger the value, the more similar the two words.
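A minimal sketch of how such a word-word network connection matrix Net could be assembled from word vectors; the embedding values below are invented for illustration and would in practice come from a model such as BERT:

```python
import numpy as np

# Hypothetical word embeddings (illustrative values only; a real system
# would obtain these from a pre-trained language model such as BERT).
embeddings = {
    "excellent":   np.array([0.9, 0.4, 0.1]),
    "outstanding": np.array([0.8, 0.5, 0.2]),
    "poor":        np.array([-0.7, 0.1, 0.9]),
}
words = ["excellent", "outstanding", "poor"]  # same order as the matrix columns

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

n = len(words)
net = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = cos(embeddings[words[i]], embeddings[words[j]])
        net[i, j] = max(s, 0.0)   # clip negatives so Net stays in [0, 1]
print(np.round(net, 2))
```

With these made-up vectors, "excellent" and "outstanding" come out highly similar while both are unrelated to "poor", reproducing the structure of the Net matrix in the example.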
Step 400, performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
Specifically, following the above example, semantic propagation is performed on the matrix A with the matrix Net to obtain the second word frequency-inverse document matrix A′.
The formula of semantic propagation is designed as follows:
A′=A*Net
Continuing the above example, the output result is:
A′ =
    1    0.8  0
    0.8  1    0
    0    0    1
At this time, the similarity between A′(1,:) (the first row of the matrix, i.e. document 1) and A′(2,:) is improved and is clearly larger than that between A′(1,:) and A′(3,:); the degree of improvement is related to the connection weight of "excellent" and "outstanding" in the word-word network, thereby effectively reducing the distance between document 1 and document 2.
In the prior art, only word frequency is considered when documents are divided, and the semantics of words are not utilized. When the three-layer topic classification model is constructed, information is lost in decomposing the high-dimensional sparse document-word matrix. Moreover, the similarity between two documents is measured over the whole bag-of-words space, while the actual similarity between two documents may rest only on a few locally representative words; counting the frequencies of the many other words introduces noise interference. To remedy these defects, this proposal designs a similarity matrix over the words in the bag of words and applies one round of semantic propagation to the generated word frequency-inverse document matrix to obtain a second word frequency-inverse document matrix, thereby solving the prior-art problem that documents with the same semantics but different words cannot be distinguished.
Step 500, performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster; assigning a label to each document in the document cluster according to the feature words contained in the word cluster, and storing the document in association with the corresponding label.
Bidirectional clustering (biclustering) algorithms are mainly based on statistics, graph theory, heuristic search, grid fitting, numerical rearrangement after discretization, and the like. This scheme preferably performs bidirectional clustering in a graph-partition-based manner, clustering the rows and columns of the samples simultaneously, so that samples that are similar only locally can be mined, for example two documents that are highly similar on some words but are pushed apart by a large number of unrelated words and therefore cannot otherwise be separated. After clustering with the biclustering algorithm, bi-clusters are obtained from the original document-word matrix; each bi-cluster clearly indicates which words caused the documents to be grouped into the same cluster, so local information can be mined. After this clearly interpretable clustering is completed, at least one document cluster is obtained, and the feature words it contains are embedded as labels in the header of the document cluster.
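As an illustrative sketch of the bidirectional clustering step, the following uses scikit-learn's SpectralCoclustering, one possible biclustering algorithm (the patent does not prescribe this specific implementation), on a toy document-word matrix with two obvious bi-clusters:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy document-word matrix: documents 0-1 share words 0-1,
# documents 2-3 share words 2-3 (small noise keeps the matrix connected).
X = np.array([[5.0, 4.0, 0.1, 0.1],
              [4.0, 5.0, 0.1, 0.1],
              [0.1, 0.1, 5.0, 4.0],
              [0.1, 0.1, 4.0, 5.0]])

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)
print(model.row_labels_)     # cluster assignment of each document
print(model.column_labels_)  # cluster assignment of each word
```

Each bi-cluster here pairs a document cluster with the word cluster responsible for it, which is exactly the (document cluster, word cluster) output the method relies on for labeling.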
According to the document clustering method based on the semantics, provided by the embodiment of the invention, the similarity between words is calculated as the semantic information, the similarity matrix matched with the word frequency-inverse document matrix is generated, and the semantic propagation is completed by the product and the operation result of the two, so that the problem of low precision of the actual clustering result caused by the lack of semantic consideration of document clustering in the prior art is solved.
Optionally, referring to fig. 2, in another embodiment, the step 100 of obtaining an input document and preprocessing the input document to obtain a processed input document further includes:
step 110, acquiring an input document;
step 120, performing word segmentation processing on the input document to obtain a first intermediate document;
step 130, traversing each word in the first intermediate document after the word segmentation processing, and removing the stop word in the first intermediate document to obtain the processed input document.
Specifically, the first processing of the input document is word segmentation: word-graph scanning is performed on the input document based on a preset prefix dictionary, segmentation symbols are added to the words identified in the scanning, and the input document containing the segmentation symbols is obtained as the first intermediate document.
The preset prefix dictionary arranges words in order of prefix inclusion. For example, if words beginning with "上" (shang) appear in the dictionary, all words with that prefix appear in the same part, such as "上海" (Shanghai) and further "上海市" (Shanghai City), forming a hierarchical inclusion structure. During word-graph scanning, when the character "上" is recognized, the phrases associated with "上" in the prefix dictionary are used as the matching basis against the following characters of the original sentence, and when a phrase such as "上海市" is matched successfully, a segmentation symbol is added after its last character.
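A minimal sketch of prefix-dictionary segmentation as described above, using forward maximum matching over a toy dictionary; the dictionary, sentence, and maximum word length are illustrative assumptions (production tokenizers build a full prefix dictionary and a word graph):

```python
# Toy prefix dictionary; a real system would load a large lexicon.
DICTIONARY = {"上海", "上海市", "天气", "很", "好"}
MAX_WORD_LEN = 3  # longest dictionary entry, in characters

def segment(sentence):
    """Forward maximum matching: at each position, take the longest
    dictionary word starting there, falling back to a single character."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words

print(segment("上海市天气很好"))  # ['上海市', '天气', '很', '好']
```

Note how "上海市" wins over its prefix "上海" because the longest match is tried first, mirroring the hierarchical inclusion structure of the prefix dictionary.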
In addition, in order to improve processing response speed and reduce calculation time, the word segmentation in the invention is designed to be performed by a tokenizer in units of words, as in the following example.
The input document contains the text: "Are you curious about tokenization? " \
"Let's see how it works! " \
"We need to analyze a couple of sentences " \
"with punctuations to see it in action."
The result output using a for statement is:
#1 Are
#2 you
#3 curious
#4 about
# ...
#28 action
The final output word list containing 28 words is then the first intermediate document.
After the word segmentation processing is finished, stop-word removal is performed on the first intermediate document: a stop-word vocabulary is stored in the server in advance, the first intermediate document is matched against this vocabulary, and the matched stop words are removed to obtain the processed input document.
Exemplary stop words are function words that carry no significant information, such as the Chinese particles "的" and "地".
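A minimal sketch of the stop-word removal step; the stop-word list here is a small illustrative subset, not the vocabulary actually stored on the server:

```python
# Illustrative stop-word list (a real deployment would load the
# server's full pre-stored stop-word vocabulary).
STOP_WORDS = {"的", "地", "了", "是", "and", "the", "of"}

def remove_stop_words(tokens):
    """Drop every token that matches the stop-word vocabulary."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["平安", "的", "科技", "是", "excellent"]
print(remove_stop_words(tokens))  # ['平安', '科技', 'excellent']
```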
Optionally, referring to fig. 3, in another embodiment, in step 200, performing word frequency statistics and inverse document frequency calculation on each word included in the processed input document, and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained through calculation includes:
step 210, traversing text data contained in each document in the input documents, and calculating word frequency corresponding to the word according to the number of times of the word appearing in the text and the total word number of the text;
step 220, obtaining the inverse document frequency corresponding to the word according to the total number of the documents contained in the input document and the number of the documents containing the word;
step 230 constructs a word frequency-inverse document matrix based on the calculated word frequency and inverse document frequency.
Specifically, the calculation formulas of the word frequency and the inverse document frequency are as follows:
TF(w) = (number of occurrences of word w in the text) / (total number of words in the text)
IDF(w) = log( (total number of documents in the input) / (number of documents containing word w) )
and each element in the constructed word frequency-inverse document matrix is the word frequency and the inverse document frequency.
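The TF and IDF calculations can be sketched directly; the three single-word documents mirror the running example, and the IDF variant used (natural log, no smoothing) is an assumption:

```python
import math

# Preprocessed documents reduced to their evaluation words, as in the example.
docs = [
    ["excellent"],      # document 1
    ["outstanding"],    # document 2
    ["poor"],           # document 3
]
vocab = ["excellent", "outstanding", "poor"]

def tf(word, doc):
    """Occurrences of the word divided by total words in the document."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of total documents over documents containing the word."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

matrix = [[tf(w, d) * idf(w, docs) for w in vocab] for d in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

Each word appears in exactly one of the three documents, so the resulting matrix is diagonal, matching the 1/0 occurrence pattern shown in Table 1.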
Exemplary word frequencies are as follows (rows: documents; columns: "excellent", "outstanding", "poor"):
    1  0  0
    0  1  0
    0  0  1
The source input documents are:
Document 1: "Ping An Technology is an excellent company"
Document 2: "Ping An's financial products are outstanding"
Document 3: "XXX is very poor"
Because the characteristic information of the rows and the columns is hidden in the matrix, the hidden information of the matrix is further explained by the following table:
              excellent   outstanding   poor
Document 1        1            0          0
Document 2        0            1          0
Document 3        0            0          1
TABLE 1
"excellent" appears 1 time in document 1, 0 times in document 2, and 0 times in document 3; therefore, the first column of the matrix is 1, 0, 0.
"outstanding" appears 0 times in document 1, 1 time in document 2, and 0 times in document 3; therefore, the second column of the matrix is 0, 1, 0.
"poor" appears 0 times in document 1, 0 times in document 2, and 1 time in document 3; therefore, the third column of the matrix is 0, 0, 1.
An exemplary code example is as follows (reconstructed here as a runnable scikit-learn sketch; the library choice is an assumption, since the original listing was garbled):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'The brown dog is running.',
    'The black dog is in the black room.',
    'Running in the room is forbidden.',
]

# build word frequency
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# obtaining all feature names
print(vectorizer.get_feature_names_out())
# ['black' 'brown' 'dog' 'forbidden' 'in' 'is' 'room' 'running' 'the']

# outputting word frequency
print(counts.toarray())
# [[0 1 1 0 0 1 0 1 1]
#  [2 0 1 0 1 1 1 0 2]
#  [0 0 0 1 1 1 1 1 1]]

The matrix representation at the code layer is a multi-row array structure.
Optionally, in another embodiment, inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model includes:
taking the words used in the word frequency statistics as objects, and inputting them into the pre-stored natural language processing model in the same word order as the word order mapped by the row-direction elements or the column-direction elements in the word frequency-inverse document matrix.
Specifically, following the example in the above embodiment, the generated word frequency-inverse document matrix is a 3 × 3 matrix whose hidden mapping relation is given in table 1, and the word order mapped along each row of the matrix is {excellent, outstanding, poor};

therefore, the word order input to the natural language processing model must also be {excellent, outstanding, poor}. This design ensures that the generated similarity matrix is fully aligned with the word frequency-inverse document matrix, avoiding unnecessary calculation errors that would lower the clustering precision.
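A minimal numpy sketch of this alignment; the word vectors below are invented toy values, and the word names follow the three-document example:

```python
import numpy as np

# Toy word vectors (illustrative values, not from the patent).
vectors = {"excellent": [1.0, 0.9], "outstanding": [0.9, 1.0], "poor": [0.1, 0.0]}

def similarity_matrix(words):
    """Pairwise cosine similarities with rows/columns in the given word order."""
    v = np.array([vectors[w] for w in words])
    u = v / np.linalg.norm(v, axis=1, keepdims=True)  # normalize each row
    return u @ u.T

# Use the same word order as the TF-IDF matrix columns, so that element
# (i, j) of Net refers to the same word pair in both matrices.
columns = ["excellent", "outstanding", "poor"]
net = similarity_matrix(columns)
```

If the model were fed the words in any other order, element (i, j) of the similarity matrix would no longer describe the word pair at columns i and j of the TF-IDF matrix, which is exactly the misalignment the passage above warns against.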
Optionally, in another embodiment, the semantic propagation performed on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix uses the following calculation formula:
A′=A*Net
wherein A' is the second word frequency-inverse document matrix, A is the word frequency-inverse document matrix, Net is the similarity matrix, and * denotes matrix multiplication.
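A numpy sketch of the propagation step, using the identity-patterned matrix from the example above; the numeric values in Net are assumptions for illustration:

```python
import numpy as np

# Word frequency-inverse document matrix A (rows: documents, columns: words).
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Similarity matrix Net: the first two words are semantically close (0.8),
# the third is unrelated to both. Values are illustrative only.
Net = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

# Semantic propagation A' = A * Net: document 1, which only contained the
# first word, now also carries weight on the semantically close second word.
A2 = A @ Net
```

The effect is that documents sharing no literal words, but using semantically similar ones, end up with overlapping rows in A', which is what allows the subsequent clustering to group them together.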
Optionally, referring to fig. 4, in another embodiment, step 300 of inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model to obtain a similarity matrix adapted to the word frequency-inverse document matrix, where the similarity matrix includes similarity values between the words, includes:
step 310, inputting the words adopted in the word frequency statistics as objects into a pre-stored natural language processing model, wherein the natural language processing model is pre-stored in a block chain;
Specifically, the words used in the word frequency statistics may be encapsulated to obtain encapsulated word input data, and the encapsulated words are uploaded to a natural language processing model stored in a blockchain; the natural language processing model is not limited to being stored in the blockchain and may also be stored in a distributed server or a local server. Uploading the summary information to the natural language processing model stored in the blockchain ensures security and transparency for the user. A user device may download the summary information from the blockchain to verify whether the encapsulated word input data has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step 320, the natural language processing model generates word frequency vectors according to the input order of the words and the word frequencies corresponding to the words;
step 330, the natural language processing model calculates word frequency vectors corresponding to different words through a preset similarity function, and calculates similarity values between the words;
step 340, the natural language processing model integrates the similarity values among the words through a preset matrix generator to generate a similarity matrix matched with the word frequency-inverse document matrix;
The similarity function is calculated as:

    cos θ = (x1·x2 + y1·y2) / ( sqrt(x1² + y1²) · sqrt(x2² + y2²) )

where the vector of word A is {x1, y1}, the vector of word B is {x2, y2}, and cos θ is the similarity value.
Specifically, after a word used in word frequency statistics is input into a natural language model, the natural language processing model converts the word into a word frequency vector, and then calculates the similarity between the words, and when the word is converted into the word frequency vector, the vector calibration corresponding to the word needs to be completed by means of a pre-stored dictionary.
Illustratively, given the dictionary {Ping An, technology},
the input words are: 1. Ping An; 2. technology.
The word vector of "Ping An" is then expressed as {1, 0},
and the word vector of "technology" is expressed as {0, 1}.
To calculate the similarity of the two words, their vectors are input into the similarity function to obtain the similarity value.
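A sketch of the similarity calculation on such one-hot dictionary vectors (pure Python; no patent-specific API is implied):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v_a = [1, 0]  # one-hot vector of the first dictionary word
v_b = [0, 1]  # one-hot vector of the second dictionary word

print(cosine(v_a, v_b))  # orthogonal one-hot vectors -> 0.0
```

Pure one-hot vectors are always orthogonal, so this example yields similarity 0; it is the word-frequency weighting described in steps 320-330 that produces the non-trivial similarity values the method relies on.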
Optionally, referring to fig. 5, in step 500, the step of assigning labels to the documents in the document cluster according to the feature words contained in the word cluster and storing the documents and the corresponding labels in an associated manner includes:
step 510, according to the feature words contained in the word cluster, performing associated storage on the feature words as secondary labels in the document cluster;
step 520, performing subject term query on a preset vocabulary according to the feature terms, querying subject terms associated with the feature terms in the vocabulary, and performing associated storage by using the subject terms as primary tags of the document clusters.
Specifically, in addition to the dictionary for word vectorization provided in the above embodiment, the server is further provided with a topic word dictionary for label assignment. The topic word dictionary has a two-level structure: the first level contains topic words set by technicians according to requirements, such as sports, finance, domestic, foreign, and real estate; the second level is a sub-level of the first and contains detail feature words surrounding each topic, for example, the second-level words under sports include football, basketball, table tennis, and the like.
After the bidirectional clustering result is obtained, each word cluster in a bi-cluster contains a plurality of feature words, and the documents are partitioned into clusters, each paired with a word cluster from the other clustering dimension. For example, document cluster 1 corresponds to word cluster 1, meaning that each document in document cluster 1 is associated with each word in word cluster 1. The words in word cluster 1 are assigned as secondary labels of each document in document cluster 1; then each word in word cluster 1 is matched against the preset topic word dictionary, the topic words related to word cluster 1 are queried, and the related topic words are assigned as primary labels of each document in document cluster 1.
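A sketch of this two-level label assignment; the dictionary contents and the function name are illustrative, not taken from the patent:

```python
# Two-level topic word dictionary: first level = topic words,
# second level = detail feature words surrounding each topic.
topic_dict = {
    "sports": {"football", "basketball", "table tennis"},
    "finance": {"stock", "fund", "interest rate"},
}

def assign_labels(word_cluster):
    """Return (primary, secondary) labels for every document in the
    document cluster paired with this word cluster."""
    secondary = set(word_cluster)  # the feature words themselves
    # A topic becomes a primary label when its second-level vocabulary
    # overlaps the word cluster.
    primary = {topic for topic, feats in topic_dict.items()
               if feats & secondary}
    return primary, secondary
```

For instance, a word cluster {football, basketball} yields the secondary labels {football, basketball} and the primary label {sports} for every document in the paired document cluster.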
Example two
Referring to FIG. 6, a schematic diagram of program modules of the semantic-based document clustering system of the present invention is shown.
In this embodiment, the semantic-based document clustering system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the above-described semantic-based document clustering method of the present invention. The program modules referred to in the embodiments of the present invention are a series of computer program instruction segments capable of performing specific functions, and are better suited than the program itself for describing the execution process of the semantic-based document clustering system 20 in the storage medium. The following description specifically introduces the functions of the program modules of this embodiment:
the preprocessing module 200 is configured to obtain an input document and preprocess the input document to obtain a processed input document;
a matrix module 210, configured to perform word frequency statistics and inverse document frequency calculation on each word included in the processed input document, and construct a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency;
a similarity module 220, configured to input the words used in the word frequency statistics as objects into a pre-stored natural language processing model, to obtain a similarity matrix adapted to the word frequency-inverse document matrix, where the similarity matrix includes similarity values between the words;
the semantic propagation module 230 is configured to perform semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and a clustering module 240, configured to perform bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, where the bi-cluster includes a document cluster and a word cluster, and perform label assignment on each document in the document cluster according to a feature word related in the word cluster and perform associated storage on the document and a corresponding label.
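The patent does not name a concrete biclustering algorithm; spectral co-clustering is one standard choice, sketched here with scikit-learn on a toy matrix:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy second word frequency-inverse document matrix: documents 0-1 use
# words 0-1, documents 2-3 use words 2-3. The small background values
# keep all row/column sums positive, which the algorithm requires.
A2 = np.array([[5.0, 5.0, 0.1, 0.1],
               [5.0, 5.0, 0.1, 0.1],
               [0.1, 0.1, 5.0, 5.0],
               [0.1, 0.1, 5.0, 5.0]])

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(A2)
doc_clusters = model.row_labels_      # document cluster of each row
word_clusters = model.column_labels_  # word cluster of each column
```

Each bi-cluster then pairs the documents of one row group with the words of one column group, giving exactly the document cluster / word cluster pairs that the clustering module labels and stores.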
In an exemplary embodiment, the preprocessing module 200 is further configured to obtain an input document; performing word segmentation processing on the input document to obtain a first intermediate document; traversing each word in the first intermediate document after the word segmentation processing, and removing stop words in the first intermediate document to obtain the processed input document.
In an exemplary embodiment, the matrix module 210 is further configured to traverse text data included in each document of the input documents, and calculate word frequencies corresponding to the words according to the number of times that the words appear in the text and the total number of words in the text; obtaining the inverse document frequency corresponding to the word according to the total number of the documents contained in the input document and the number of the documents containing the word; and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency.
In an exemplary embodiment, the similarity module 220 is further configured to input a word used in the word frequency statistics as an object into a pre-stored natural language processing model, where the natural language processing model is pre-stored in a block chain; the natural language processing model generates word frequency vectors according to the input sequence of the words and the word frequencies corresponding to the words; the natural language processing model calculates word frequency vectors corresponding to different words through a preset similarity function, and similarity values among the words are calculated; the natural language processing model integrates the similarity values among the words through a preset matrix generator to generate a similarity matrix matched with the word frequency-inverse document matrix;
in an exemplary embodiment, the clustering module 240 is further configured to perform, according to the feature words included in the word clusters, associated storage of the feature words as secondary tags in the document clusters; and performing subject word query on a preset vocabulary table according to the characteristic words, querying subject words related to the characteristic words in the vocabulary table, and performing related storage by using the subject words as primary labels of the document clusters.
EXAMPLE III
Fig. 7 is a schematic diagram of a hardware architecture of a computer device according to a third embodiment of the present invention. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction. The computer device 2 may be a Personal Digital Assistant (PDA), a smart phone, a notebook computer, a netbook, a Personal computer, and other similar devices. As shown, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and a semantic-based document clustering system 20, communicatively coupled to each other via a system bus. Wherein:
In the present embodiment, the memory 21 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 2. Of course, the memory 21 may also comprise both internal and external storage units of the computer device 2. In this embodiment, the memory 21 is generally used for storing the operating system installed on the computer device 2 and various application software, such as the program code of the semantic-based document clustering system 20 in the first embodiment. In addition, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to execute the program code stored in the memory 21 or process data, for example, execute the document clustering system 20 based on semantics to implement the document clustering method based on semantics in the first embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is typically used for establishing a communication connection between the computer device 2 and other electronic apparatuses. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, establish a data transmission channel and a communication connection between the computer device 2 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a Global System of Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth (Bluetooth), Wi-Fi, and the like.
It is noted that fig. 7 only shows the computer device 2 with components 20-23, but it is to be understood that not all of the shown components are required to be implemented, and that more or less components may be implemented instead.
In this embodiment, the semantic-based document clustering system 20 stored in the memory 21 may be further divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to accomplish the present invention.
Example four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, or an application store, on which a computer program is stored that implements corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment is used for storing the semantic-based document clustering system 20, and when executed by a processor, implements the semantic-based document clustering method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above embodiment method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or used directly or indirectly in other related fields, are included in the scope of the present invention.

Claims (10)

1. A document clustering method based on semantics is characterized by comprising the following steps:
acquiring an input document and preprocessing the input document to obtain a processed input document;
performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency;
inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix matched with a word frequency-inverse document matrix, wherein the similarity matrix comprises similarity values among the words;
performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one bi-cluster, wherein the bi-cluster comprises a document cluster and a word cluster, performing label assignment on each document in the document cluster according to the feature words contained in the word cluster, and performing associated storage on the document and the corresponding label.
2. The semantic-based document clustering method of claim 1, wherein the step of obtaining input documents and preprocessing the input documents to obtain processed input documents comprises:
acquiring an input document;
performing word segmentation processing on the input document to obtain a first intermediate document;
traversing each word in the first intermediate document after the word segmentation processing, and removing stop words in the first intermediate document to obtain the processed input document.
3. The semantic-based document clustering method according to claim 1, wherein the step of performing word frequency statistics and inverse document frequency calculation on each word contained in the processed input document, and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency comprises:
traversing text data contained in each document in the input documents, and calculating word frequency corresponding to the words according to the number of times of the words appearing in the text and the total word number of the text;
obtaining the inverse document frequency corresponding to the word according to the total number of the documents contained in the input document and the number of the documents containing the word;
and constructing a word frequency-inverse document matrix according to the calculated word frequency and inverse document frequency.
4. The semantic-based document clustering method according to claim 1, wherein the inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model comprises:
and taking the words adopted in the word frequency statistics as objects, and inputting the words into a pre-stored natural language processing model according to the word sequence which is the same as the word sequence mapped by the row-direction elements or the column-direction elements in the word frequency inverse document matrix.
5. The semantic-based document clustering method according to claim 1, wherein the semantic propagation is performed on the word frequency-inverse document matrix according to the similarity matrix, and a calculation formula for obtaining a second word frequency-inverse document matrix is as follows:
A′=A*Net
wherein, A' is a second word frequency-inverse document matrix, A is a word frequency-inverse document matrix, and Net is a similarity matrix.
6. The semantic-based document clustering method according to claim 4, wherein the step of inputting the words used in the word frequency statistics as objects into a pre-stored natural language processing model to obtain a similarity matrix adapted to a word frequency-inverse document matrix, wherein the similarity matrix includes similarity values between the words comprises:
inputting the words adopted in the word frequency statistics as objects into a pre-stored natural language processing model, wherein the natural language processing model is pre-stored in a block chain;
the natural language processing model generates word frequency vectors according to the input sequence of the words and the word frequencies corresponding to the words;
the natural language processing model calculates word frequency vectors corresponding to different words through a preset similarity function, and similarity values among the words are calculated;
the natural language processing model integrates the similarity values among the words through a preset matrix generator to generate a similarity matrix matched with the word frequency-inverse document matrix;
the similarity function calculation formula is as follows:

    cos θ = (x1·x2 + y1·y2) / ( sqrt(x1² + y1²) · sqrt(x2² + y2²) )

wherein the vector of word A is {x1, y1}, the vector of word B is {x2, y2}, and cos θ is the similarity value.
7. The semantic-based document clustering method according to claim 1, wherein the steps of tagging the document clusters according to the feature words contained in the word clusters and storing the documents and the corresponding tags in association comprise:
according to the feature words contained in the word cluster, the feature words are used as secondary labels in the document cluster for associated storage;
and performing subject word query on a preset vocabulary table according to the characteristic words, querying subject words associated with the characteristic words in the vocabulary table, and performing associated storage by using the subject words as primary labels of the document clusters.
8. A semantic-based document clustering system, comprising:
the preprocessing module is used for acquiring an input document and preprocessing the input document to obtain a processed input document;
the matrix module is used for carrying out word frequency statistics and inverse document frequency calculation on each word contained in the processed input document and constructing a word frequency-inverse document matrix according to the word frequency and the inverse document frequency obtained through calculation;
the similarity module is used for inputting the words adopted in the word frequency statistics into a pre-stored natural language processing model as objects to obtain a similarity matrix matched with the word frequency-inverse document matrix, and the similarity matrix comprises similarity values among the words;
the semantic propagation module is used for performing semantic propagation on the word frequency-inverse document matrix according to the similarity matrix to obtain a second word frequency-inverse document matrix;
and the clustering module is used for performing bidirectional clustering on the second word frequency-inverse document matrix to obtain at least one double cluster, wherein the double cluster comprises a document cluster and a word cluster, labels are given to all documents in the document cluster according to the feature words related in the word cluster, and the documents and the corresponding labels are stored in an associated manner.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the semantic-based document clustering method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored therein a computer program executable by at least one processor to cause the at least one processor to perform the semantic-based document clustering method according to any one of claims 1 to 7.
CN202010576446.4A 2020-06-22 2020-06-22 Document clustering method and system based on semantics and computer equipment Active CN111680131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010576446.4A CN111680131B (en) 2020-06-22 2020-06-22 Document clustering method and system based on semantics and computer equipment


Publications (2)

Publication Number Publication Date
CN111680131A true CN111680131A (en) 2020-09-18
CN111680131B CN111680131B (en) 2022-08-12

Family

ID=72456124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010576446.4A Active CN111680131B (en) 2020-06-22 2020-06-22 Document clustering method and system based on semantics and computer equipment

Country Status (1)

Country Link
CN (1) CN111680131B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347246A (en) * 2020-10-15 2021-02-09 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectral decomposition
CN112446361A (en) * 2020-12-16 2021-03-05 上海芯翌智能科技有限公司 Method and equipment for cleaning training data
CN113342970A (en) * 2020-11-24 2021-09-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN117010010A (en) * 2023-06-01 2023-11-07 湖南信安数字科技有限公司 Multi-server cooperation high-security storage method based on blockchain

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005063159A (en) * 2003-08-13 2005-03-10 Fuji Xerox Co Ltd Document cluster processor and document cluster processing method
CN104598532A (en) * 2014-12-29 2015-05-06 中国联合网络通信有限公司广东省分公司 Information processing method and device
CN107590218A (en) * 2017-09-01 2018-01-16 南京理工大学 The efficient clustering method of multiple features combination Chinese text based on Spark
CN108197111A (en) * 2018-01-10 2018-06-22 华南理工大学 A kind of text automatic abstracting method based on fusion Semantic Clustering
CN108595706A (en) * 2018-05-10 2018-09-28 中国科学院信息工程研究所 A kind of document semantic representation method, file classification method and device based on theme part of speech similitude
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 A kind of patent text modeling method based on word2vec and semantic similarity
CN110825877A (en) * 2019-11-12 2020-02-21 中国石油大学(华东) Semantic similarity analysis method based on text clustering
US10685183B1 (en) * 2018-01-04 2020-06-16 Facebook, Inc. Consumer insights analysis using word embeddings


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
钱志森 等: "半监督语义动态文本聚类算法", 《电子科技大学学报》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347246A (en) * 2020-10-15 2021-02-09 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectral decomposition
CN112347246B (en) * 2020-10-15 2024-04-02 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectrum decomposition
CN113342970A (en) * 2020-11-24 2021-09-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN112446361A (en) * 2020-12-16 2021-03-05 上海芯翌智能科技有限公司 Method and equipment for cleaning training data
CN117010010A (en) * 2023-06-01 2023-11-07 湖南信安数字科技有限公司 Multi-server cooperation high-security storage method based on blockchain
CN117010010B (en) * 2023-06-01 2024-02-13 湖南信安数字科技有限公司 Multi-server cooperation high-security storage method based on blockchain

Also Published As

Publication number Publication date
CN111680131B (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN111680131B (en) Document clustering method and system based on semantics and computer equipment
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
CN103870973B (en) Information push, searching method and the device of keyword extraction based on electronic information
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN102831184B (en) According to the method and system text description of social event being predicted to social affection
CN110598000A (en) Relationship extraction and knowledge graph construction method based on deep learning model
CN110196893A (en) Method, device and storage medium for reviewing non-subjective test questions based on text similarity
US11782928B2 (en) Computerized information extraction from tables
CN107992542A (en) Topic-model-based similar article recommendation method
CN106372061A (en) Short text similarity calculation method based on semantics
CN105095444A (en) Information acquisition method and device
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN108319734A (en) Method for automatically constructing a product feature structure tree based on linear combination
CN109255012A (en) Implementation method and device for machine reading comprehension
CN106649250A (en) Method and device for identifying emotional new words
CN111813905A (en) Corpus generation method and device, computer equipment and storage medium
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN114997288A (en) Design resource association method
CN114048305A (en) Plan recommendation method for administrative penalty documents based on graph convolution neural network
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN115017320A (en) E-commerce text clustering method and system combining bag-of-words model and deep learning model
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN107908749B (en) Character retrieval system and method based on search engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant