CN110941704A

CN110941704A - Text content similarity analysis method

Info

Publication number: CN110941704A
Application number: CN201911282234.9A
Authority: CN
Inventors: 朱玉怀; 谢赟; 韩欣; 黄海清; 吴新野
Original assignee: Shanghai Tak Billiton Information Technology Ltd By Share Ltd
Current assignee: Shanghai Tak Billiton Information Technology Ltd By Share Ltd
Priority date: 2019-12-13
Filing date: 2019-12-13
Publication date: 2020-03-31
Anticipated expiration: 2039-12-13
Also published as: CN110941704B

Abstract

The invention discloses a text content similarity analysis method comprising the following steps: acquiring text files; processing the text files and dividing them into a target data set and a basic data set; performing Chinese word segmentation on the text files in the basic data set to obtain a word segmentation list, generating a bag of words from the word segmentation list, and storing the bag of words in a first model file; generating a corpus from the word segmentation list, applying the tf-idf algorithm to the corpus to obtain an evaluation set, and storing the evaluation set in a second model file; calculating sparse-matrix similarity for the evaluation set, and storing the calculation result in a third model file; and performing Chinese word segmentation on the text files in the target data set, calling the first model file, the second model file and the third model file, and calculating the similarity between the text files in the target data set and the text files in the basic data set. The method can quickly determine how similar one text is to other texts and helps uncover the value of text files.

Description

Text content similarity analysis method

Technical Field

The invention relates to the technical field of data analysis, in particular to a text content similarity analysis method.

Background

In the current internet era there is a large volume of text files (text data) of many kinds, and this data is difficult to handle for three reasons: 1. Text data occupies a large amount of storage space and consumes a large amount of memory during computation. 2. The content of text data is relatively disorganized and not easy to process. 3. Valuable information is not easily discovered in text data.

By analyzing text, however, similar matters, similar personnel behaviors, repeated events, associated personnel and the like can be found conveniently, and the uses of text data can be explored continuously.

Disclosure of Invention

The invention aims to provide a text content similarity analysis method, which can quickly acquire the similarity between one text and other texts and explore the value of a text file.

The technical solution that achieves this aim is as follows:

A method for text content similarity analysis, comprising:

step S1, acquiring each text file;

step S2, processing each text file, removing web page tags, special characters and stop words, and dividing the text files into a target data set and a basic data set;

step S3, performing Chinese word segmentation on the text files in the basic data set to obtain a word segmentation list, generating a bag of words from the word segmentation list, and storing the bag of words in a first model file; generating a corpus from the word segmentation list, applying the tf-idf algorithm to the corpus to obtain an evaluation set (tf is the term frequency, that is, how often a given keyword appears in a text; idf is the inverse document frequency, which down-weights words that are common across the texts but say little about any individual text file), and storing the evaluation set in a second model file; calculating sparse-matrix similarity for the evaluation set, and storing the calculation result in a third model file;

step S4, performing Chinese word segmentation on the text files in the target data set, calling the first model file, the second model file and the third model file, calculating the similarity between the text files in the target data set and the text files in the basic data set, and returning the similar text files and their similarity values.
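The tf-idf weighting named in step S3 can be written out as follows. The patent does not state which tf-idf variant it uses, so this is one common formulation given only as an illustration; the symbols N (number of text files in the basic data set), df(t) (number of those files containing term t) and w(t, d) are notation introduced here, not taken from the source.

```latex
\[
\mathrm{tf}(t, d) = \text{number of occurrences of term } t \text{ in text } d, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
w(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]
```

Under this formulation a word that appears in every one of the N texts has idf = log 1 = 0 and therefore contributes nothing to the weight, which matches the stated purpose of reducing words that are common but uninformative.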

Preferably, the step S1 includes: automatically acquiring the content of visited pages from a website by using crawler code written in Python (a cross-platform computer programming language), and writing the data into files, wherein each file name serves as the unique identifier of the file.

Preferably, the step S2 includes: reading a text file, removing web page tags by regular-expression matching, removing special characters by replacing them with empty strings, and removing stop words to obtain a processed text file; and randomly selecting a small number of text files as the target data set and using the remaining files as the basic data set.

Preferably, the step S3 includes:

loading the directory of the basic data set, traversing the files in the directory, storing the file names in a specified file, performing Chinese word segmentation on the contents of all the specified texts, storing the results in a specified word segmentation list, converting the word segmentation list into dictionary-format data serving as the bag of words, and storing the bag of words in the first model file;

traversing the word segmentation list, encoding the words, and storing them in a specified dictionary as the corpus; applying the tf-idf algorithm to the corpus to obtain the evaluation set, and storing the evaluation set in the second model file;

and calculating sparse-matrix similarity for the evaluation set, and storing the calculation result in the third model file.

Preferably, calculating the sparse-matrix similarity means: compressing the sparse matrix into a dense matrix and performing a matrix dot product to obtain the cosine similarity.
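The statement that a matrix dot product yields the cosine similarity holds when every row vector has unit length. The relation can be sketched as below, where A denotes the dense matrix whose rows a_i are the tf-idf vectors of the text files; A, S, u and v are notation assumed here for illustration, and the unit-length condition is an assumption that the source does not spell out.

```latex
\[
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
S = A A^{\top}
\ \text{ with } \
S_{ij} = \cos(a_i, a_j)
\ \text{ when } \lVert a_i \rVert = 1 \text{ for every row } a_i \text{ of } A
\]
```

If the rows are not normalised beforehand, each entry of the product has to be divided by the two row norms to obtain the cosine value.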

Preferably, the step S4 includes:

loading the directory of the target data set, traversing the files in the directory, storing the file names in a specified file, and performing Chinese word segmentation on the contents of all the specified texts;

and calling the first model file to encode the segmented words, calling the second model file to process the encoded words, calling the third model file to calculate the similarity values between the text files in the target data set and the text files in the basic data set, sorting the similarity values in descending order, and returning the similar files and their similarity values.

Preferably, after the step S4, the accuracy of the result is verified manually.

Preferably, the method further includes:

step S5, continuously acquiring text files from the network, processing them and storing them in the directory of the target data set, checking for each text file in the target data set whether it already exists in the directory of the basic data set and, if it does not, moving it into the directory of the basic data set, and periodically regenerating the first model file, the second model file and the third model file.

The beneficial effects of the invention are as follows: the text files are divided into a target data set and a basic data set, model files are generated, and the similarity between a target text and the basic texts is calculated by using the model files, so that the similarity between one text and other texts is obtained quickly and the value of the text files can be explored. In addition, the model files are continuously expanded, which makes the analysis results more comprehensive and accurate.

Drawings

FIG. 1 is a flow chart of a method of text content similarity analysis of the present invention;

FIG. 2 is a flow chart illustrating text file processing in the present invention;

FIG. 3 is a flow chart of generating a model file in the present invention;

FIG. 4 is a flow chart of text similarity calculation in the present invention;

FIG. 5 is a flow chart illustrating the process of expanding a model file according to the present invention;

FIG. 6 is a diagram illustrating a calculation result of text similarity in the present invention.

Detailed Description

The invention will be further explained with reference to the drawings.

Referring to FIG. 1, the text content similarity analysis method of the present invention includes the following steps:

Step S1, acquiring each text file: automatically acquiring the content of visited pages from a website by using crawler code written in Python, and writing the data into files, wherein each file name serves as the unique identifier of the file.

Step S2, processing each text file, removing web page tags, special characters and stop words, and dividing the text files into a target data set and a basic data set. As shown in FIG. 2, this includes: reading a text file, removing web page tags by regular-expression matching, removing special characters (such as &nbsp;, line feeds, tabs and the like) by replacing them with empty strings, and removing stop words (words that carry essentially no meaning), so that the text file becomes clean, usable text data (one text file is one piece of text data); and randomly selecting a small number of text files as the target data set (verification data) and storing them in a specified directory, and storing the remaining files as the basic data set (comparison data) in a specified directory.
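A minimal sketch of step S1 follows, using only the Python standard library. The patent says only that the file name is the unique identifier of the file; deriving that identifier from a SHA-1 hash of the URL is an assumption made here, and `document_id`, `save_document` and `crawl` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def document_id(url: str) -> str:
    """Derive a stable identifier from the URL (assumption: SHA-1 of the URL)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def save_document(url: str, text: str, out_dir: str) -> Path:
    """Write one page to <out_dir>/<document_id>.txt and return the path."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{document_id(url)}.txt"
    path.write_text(text, encoding="utf-8")
    return path


def crawl(urls, out_dir: str) -> None:
    """Fetch each URL and store its raw content; cleaning happens in step S2."""
    for url in urls:
        with urlopen(url, timeout=10) as response:
            text = response.read().decode("utf-8", errors="replace")
        save_document(url, text, out_dir)
```

Because the identifier depends only on the URL, fetching the same page twice overwrites the earlier file instead of creating a duplicate.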
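The cleaning and splitting in step S2 can be sketched roughly as below. The stop-word list is a placeholder, since the patent does not enumerate its stop words, and removing stop words from an already segmented token list (rather than from the raw string) is an assumption of this sketch. The function names and the `seed` parameter are likewise hypothetical.

```python
import html
import random
import re

# Placeholder stop words; the patent does not list the ones it uses.
STOP_WORDS = {"的", "了", "是"}

TAG_RE = re.compile(r"<[^>]+>")              # web page tags
SPECIAL_RE = re.compile(r"[\r\n\t\u00a0]")   # line feeds, tabs, non-breaking spaces


def clean_text(raw: str) -> str:
    """Remove tags, then replace special characters with the empty string."""
    without_tags = TAG_RE.sub("", raw)
    unescaped = html.unescape(without_tags)  # turns &nbsp; into U+00A0
    return SPECIAL_RE.sub("", unescaped)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def split_datasets(file_names, n_target, seed=0):
    """Randomly pick n_target files as the target set; the rest form the basic set."""
    names = sorted(file_names)
    rng = random.Random(seed)
    target = set(rng.sample(names, n_target))
    base = [name for name in names if name not in target]
    return sorted(target), base
```

Fixing the seed keeps the split reproducible between runs, which makes the later manual verification easier to repeat.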

Step S3, generating the model files. As shown in FIG. 3, this includes:

Loading the directory of the basic data set, traversing the files in the directory, storing the file names in a specified file (txt.txt), performing Chinese word segmentation (jieba segmentation) on the contents of all the specified texts, storing the results in a specified word segmentation list, converting the word segmentation list into dictionary-format data serving as the bag of words, and storing the bag of words in the first model file (ditt.dit).

Traversing the word segmentation list, encoding the words (for example, the first word is 0, the next word is 1, and so on in sequence), and storing them in a specified dictionary as the corpus; applying the tf-idf algorithm to the corpus to obtain the evaluation set, and saving the evaluation set into the second model file (tfidf.

Calculating sparse-matrix similarity for the evaluation set (compressing the sparse matrix into a dense matrix and performing a matrix dot product, the result of which is precisely the cosine similarity), and storing the calculation result into the third model file (index.
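The segmentation and dictionary-building step above might look like the following sketch. To stay dependency-free it uses a whitespace tokenizer as a stand-in for jieba segmentation, and it saves the dictionary as JSON; both choices, along with the function names, are assumptions and not the format of the patent's first model file.

```python
import json
from pathlib import Path


def tokenize(text: str):
    """Stand-in segmenter: splits on whitespace instead of running jieba."""
    return text.split()


def build_dictionary(token_lists):
    """Map each distinct token to an integer id in order of first appearance."""
    token_to_id = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


def save_dictionary(token_to_id, path: str) -> None:
    """Persist the dictionary as JSON (assumed format for this sketch)."""
    Path(path).write_text(
        json.dumps(token_to_id, ensure_ascii=False), encoding="utf-8"
    )


def load_dictionary(path: str):
    """Read back a dictionary written by save_dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Persisting the dictionary matters because step S4 has to encode the target texts with exactly the same word-to-id mapping that was built from the basic data set.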
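The encoding and tf-idf step can be illustrated with a small pure-Python sketch. It assumes idf = log(N / df) and L2-normalised output vectors; the patent does not specify the tf-idf variant, and `to_bow`, `fit_idf` and `tfidf_vector` are hypothetical names.

```python
import math
from collections import Counter


def to_bow(tokens, token_to_id):
    """Encode tokens as sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


def fit_idf(corpus):
    """Compute idf = log(N / df) for every token id that occurs in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(token_id for bow in corpus for token_id, _ in bow)
    return {token_id: math.log(n_docs / df) for token_id, df in doc_freq.items()}


def tfidf_vector(bow, idf):
    """Return a sparse, L2-normalised tf-idf vector as {token_id: weight}."""
    weights = {tid: count * idf.get(tid, 0.0) for tid, count in bow}
    weights = {tid: w for tid, w in weights.items() if w > 0.0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {tid: w / norm for tid, w in weights.items()}
```

Normalising each vector to unit length here is what later allows a plain dot product to serve as the cosine similarity.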
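A rough sketch of the densify-then-multiply computation follows. It represents each sparse row as a `{column: value}` dict and normalises rows before multiplying, which is the condition under which the dot product equals the cosine similarity; the data layout and function names are assumptions of this sketch.

```python
import math


def to_dense(sparse_rows, n_features):
    """Expand {column: value} rows into full-length lists."""
    return [[row.get(j, 0.0) for j in range(n_features)] for row in sparse_rows]


def l2_normalize(row):
    """Scale a row to unit length; all-zero rows are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row


def similarity_matrix(sparse_rows, n_features):
    """Cosine similarity of every row pair, computed as dense A times A transposed."""
    dense = [l2_normalize(row) for row in to_dense(sparse_rows, n_features)]
    return [[sum(a * b for a, b in zip(u, v)) for v in dense] for u in dense]
```

For tf-idf vectors, whose entries are non-negative, every value in the resulting matrix lies between 0 and 1, with 1 on the diagonal for non-empty rows.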

Step S4, performing Chinese word segmentation on the text files in the target data set, calling the first model file, the second model file and the third model file, calculating the similarity between the text files in the target data set and the text files in the basic data set, and returning the similar text files and their similarity values. As shown in FIG. 4, this includes:

Calling the first model file to encode the segmented words, calling the second model file to process the encoded words, calling the third model file to calculate the similarity value (0-1) between each text file in the target data set and the text files in the basic data set, sorting the similarity values in descending order, and returning the similar files and their similarity values. An example result is shown in FIG. 6.

Verification: verification is mainly manual, namely opening the corresponding files, comparing their contents, and judging whether the computed result is broadly consistent with the actual situation.

Step S5, expanding the model files. As shown in FIG. 5, text files are continuously acquired from the network, processed and stored in the directory of the target data set; for each text file in the target data set it is checked whether the file already exists in the directory of the basic data set and, if it does not, the file is moved into the directory of the basic data set; the first model file, the second model file and the third model file are then regenerated periodically. Because the content of the model files keeps growing as more text files are added, the analysis results become more comprehensive and potentially similar files are less likely to be missed.
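The ranking part of step S4 can be sketched as below, assuming the target text and every basic text have already been turned into sparse, unit-length tf-idf vectors of the form `{token_id: weight}`. The function name `rank_similar` and the `top_n` parameter are hypothetical.

```python
def rank_similar(query_vec, base_vecs, top_n=5):
    """Score one target vector against every basic vector and sort descending.

    query_vec: {token_id: weight} for the target text (unit length assumed).
    base_vecs: {file_name: {token_id: weight}} for the basic data set.
    Returns up to top_n (file_name, similarity) pairs, most similar first.
    """
    scores = []
    for name, vec in base_vecs.items():
        # Sparse dot product: only token ids present in the query contribute.
        score = sum(weight * vec.get(tid, 0.0) for tid, weight in query_vec.items())
        scores.append((name, score))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]
```

A call such as `rank_similar({0: 1.0}, {"a.txt": {0: 1.0}, "b.txt": {0: 0.6, 1: 0.8}})` would place `a.txt` first, since it shares its whole weight with the query.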
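The file-moving part of step S5 might be sketched as follows. It assumes that "already exists in the basic data set" is decided by file name (the unique identifier from step S1) and that the text files carry a `.txt` extension; both are assumptions, and the function names are hypothetical. Regenerating the three model files afterwards is left to the step S3 routine.

```python
import shutil
from pathlib import Path


def select_new_files(target_names, base_names):
    """Return the target file names that are not yet in the basic data set."""
    existing = set(base_names)
    return sorted(name for name in target_names if name not in existing)


def merge_new_files(target_dir: str, base_dir: str):
    """Move files that are new by name from target_dir into base_dir."""
    target = Path(target_dir)
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    target_names = [p.name for p in target.glob("*.txt")]
    base_names = [p.name for p in base.glob("*.txt")]
    moved = select_new_files(target_names, base_names)
    for name in moved:
        shutil.move(str(target / name), str(base / name))
    return moved
```

Returning the list of moved names gives a simple signal for when regeneration is worthwhile: if nothing was moved, the existing model files are still current.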

The above embodiments are provided only to illustrate the present invention and not to limit it. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, which is defined by the claims.

Claims

1. A method for analyzing text content similarity is characterized by comprising the following steps:

step S1, acquiring each text file;

step S3, performing Chinese word segmentation on the text files in the basic data set to obtain a word segmentation list, generating a bag of words from the word segmentation list, and storing the bag of words in a first model file; generating a corpus from the word segmentation list, applying the tf-idf algorithm to the corpus to obtain an evaluation set, and storing the evaluation set in a second model file; calculating sparse-matrix similarity for the evaluation set, and storing the calculation result in a third model file;

2. The method for analyzing similarity of text contents according to claim 1, wherein said step S1 includes: automatically acquiring the content of visited pages from a website by using crawler code written in Python, and writing the data into files, wherein each file name serves as the unique identifier of the file.

3. The method for analyzing similarity of text contents according to claim 1, wherein said step S2 includes: reading a text file, removing web page tags by regular-expression matching, removing special characters by replacing them with empty strings, and removing stop words to obtain a processed text file; and randomly selecting a small number of text files as the target data set and using the remaining files as the basic data set.

4. The method for analyzing similarity of text contents according to claim 1, wherein said step S3 includes:

5. The method for analyzing similarity of text contents according to claim 4, wherein calculating the sparse-matrix similarity means: compressing the sparse matrix into a dense matrix and performing a matrix dot product to obtain the cosine similarity.

6. The method for analyzing similarity of text contents according to claim 1, wherein said step S4 includes:

7. The method for analyzing similarity of text contents according to claim 1, wherein after the step S4, the accuracy of the result is verified manually.

8. The method for text content similarity analysis according to claim 1, further comprising: