CN110941704A - Text content similarity analysis method - Google Patents
- Publication number
- CN110941704A
- Authority
- CN
- China
- Prior art keywords
- file
- text
- data set
- similarity
- storing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for analyzing text content similarity, comprising the following steps: acquiring text files; processing the text files and dividing them into a target data set and a basic data set; performing Chinese word segmentation on the text files in the basic data set to obtain a word segmentation list, generating a bag of words from the word segmentation list, and storing the bag of words in a first model file; generating a corpus from the word segmentation list, applying the tf-idf algorithm to the corpus to obtain a comment set, and storing the comment set in a second model file; calculating the sparse-matrix similarity of the comment set and storing the result in a third model file; and performing Chinese word segmentation on the text files in the target data set, calling the first, second and third model files, and calculating the similarity between the text files in the target data set and those in the basic data set. The method quickly determines how similar one text is to other texts and helps uncover the value of the text files.
Description
Technical Field
The invention relates to the technical field of data analysis, in particular to a text content similarity analysis method.
Background
In the current internet era, there are large numbers of text files (text data) of many kinds, and they are difficult to handle: 1. Text data occupies a large amount of storage space and a large amount of memory during computation. 2. The content of text data is relatively disordered and not easy to process. 3. Valuable information is not easily discovered in text data.
Analyzing the text makes it easier to find similar things, similar personnel behaviors, repeated events, associated personnel and the like, so that the usefulness of the text data can be explored continuously.
Disclosure of Invention
The invention aims to provide a text content similarity analysis method, which can quickly acquire the similarity between one text and other texts and explore the value of a text file.
The technical scheme for realizing the purpose is as follows:
a method for text content similarity analysis, comprising:
step S1, acquiring each text file;
step S2, processing each text file, removing web page tags, special characters and stop words, and dividing the text files into a target data set and a basic data set;
step S3, performing Chinese word segmentation on the text files in the basic data set to obtain a word segmentation list, generating a bag of words from the word segmentation list, and storing the bag of words in a first model file; generating a corpus from the word segmentation list, applying the tf-idf algorithm to the corpus to obtain a comment set (tf is the frequency with which a given keyword appears in a text, and idf is the inverse document frequency, which down-weights words that are common across texts but contribute little to distinguishing a text file), and storing the comment set in a second model file; calculating the sparse-matrix similarity of the comment set, and storing the calculation result in a third model file;
step S4, performing Chinese word segmentation on the text files in the target data set, calling the first model file, the second model file and the third model file, calculating the similarity between the text files in the target data set and the text files in the basic data set, and returning the similar text files and their similarity values.
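The tf-idf weighting named in step S3 can be illustrated with a minimal sketch. The source does not state which tf-idf variant is used, so the formulas here (tf as count divided by document length, idf as the natural log of N/df) and the function name `tf_idf` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy, already-segmented documents (illustrative data only).
docs = [
    ["text", "similarity", "analysis"],
    ["text", "file", "storage"],
    ["text", "similarity", "score"],
]
weights = tf_idf(docs)
```

A word that occurs in every document ("text" here) receives weight zero, which is the down-weighting of common words that the description attributes to idf.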
Preferably, the step S1 includes: using crawler code written in Python (a cross-platform programming language) to automatically fetch content from websites and write it to files, wherein each file name serves as the unique identifier of the file.
Preferably, the step S2 includes: reading a text file, removing web page tags by regular-expression matching, removing special characters by replacing them with empty strings, and removing stop words to obtain a processed text file; and randomly selecting a small number of text files as the target data set and using the remaining files as the basic data set.
Preferably, the step S3 includes:
loading the directory of the basic data set, traversing the files in the directory, storing the file names in a specified file, performing Chinese word segmentation on the contents of all the texts, storing the results in a specified word segmentation list, converting the word segmentation list into dictionary-format data serving as the bag of words, and storing the bag of words in the first model file;
traversing the word segmentation list, encoding the words, and storing them in a specified dictionary as the corpus; applying the tf-idf algorithm to the corpus to obtain the comment set, and storing the comment set in the second model file;
and calculating the sparse-matrix similarity of the comment set, and storing the calculation result in the third model file.
Preferably, calculating the sparse-matrix similarity means: compressing the sparse matrix into a dense matrix and performing a matrix dot product to obtain the cosine similarity.
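A dot product equals cosine similarity only when the vectors have unit length. The sketch below therefore assumes each row is L2-normalized before the matrix product; the source does not spell out that step, and the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_matrix(rows):
    """Pairwise cosine similarity: dot products of unit-length rows."""
    unit = [l2_normalize(r) for r in rows]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

# Dense rows standing in for tf-idf vectors (illustrative values only).
rows = [
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],  # same direction as row 0
    [0.0, 3.0, 0.0],  # shares no terms with row 0
]
sims = cosine_matrix(rows)
```

Rows pointing in the same direction score close to 1 regardless of length, and rows with no shared terms score 0.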
Preferably, the step S4 includes:
loading the directory of the target data set, traversing the files in the directory, storing the file names in a specified file, and performing Chinese word segmentation on the contents of all the texts;
and calling the first model file to encode the segmented words, calling the second model file to process the encoded words, calling the third model file to calculate the similarity values between the text files in the target data set and the text files in the basic data set, sorting the similarity values in descending order, and returning the similar files and their similarity values.
Preferably, after the step S4, it is verified manually whether the result is accurate.
Preferably, the method further comprises the following steps:
and step S5, continuously acquiring text files from the network, storing them in the directory of the target data set after processing, checking for each text file in the target data set whether it already exists in the directory of the basic data set, moving it into the directory of the basic data set if it does not, and periodically regenerating the first model file, the second model file and the third model file.
The invention has the beneficial effects that: according to the method, the text file is divided into the target data set and the basic data set, the model file is generated, the similarity between the target text and the basic text is calculated by utilizing the model file, the similarity between one text and other texts is rapidly acquired, and the value of the text file is explored. Meanwhile, the model file is continuously expanded, so that the analysis result is more comprehensive and accurate.
Drawings
FIG. 1 is a flow chart of a method of text content similarity analysis of the present invention;
FIG. 2 is a flow chart illustrating text file processing in the present invention;
FIG. 3 is a flow chart of generating a model file in the present invention;
FIG. 4 is a flow chart of text similarity calculation in the present invention;
FIG. 5 is a flow chart illustrating the process of expanding a model file according to the present invention;
FIG. 6 is a diagram illustrating a calculation result of text similarity in the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
Referring to fig. 1, the method for analyzing similarity of text contents of the present invention includes the following steps:
Step S1, acquiring each text file: Python crawler code automatically fetches content from websites and writes it to files, wherein each file name serves as the unique identifier of the file.
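The source does not say how the unique file name is derived. One possible scheme, shown here purely as an assumption, is to hash the source URL; the function `save_text` and the example URL are hypothetical, and no network access is performed.

```python
import hashlib
import tempfile
from pathlib import Path

def save_text(content, source_url, out_dir):
    """Write fetched content to a file named after a hash of its URL.

    Hypothetical naming scheme: the SHA-1 hex digest of the URL is used
    as the unique identifier, so refetching a page overwrites its file.
    """
    file_id = hashlib.sha1(source_url.encode("utf-8")).hexdigest()
    path = Path(out_dir) / f"{file_id}.txt"
    path.write_text(content, encoding="utf-8")
    return path

out_dir = tempfile.mkdtemp()
first = save_text("page body", "https://example.com/a", out_dir)
again = save_text("page body, refetched", "https://example.com/a", out_dir)
```

With this choice, the same page always maps to the same file, which also makes the later existence check in step S5 a simple file-name comparison.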
Step S2, processing each text file, removing web page tags, special characters and stop words, and dividing the text files into a target data set and a basic data set. As shown in FIG. 2, this includes: reading a text file, removing web page tags by regular-expression matching, removing special characters (such as &nbsp;, line feeds, tabs and the like) by replacing them with empty strings, and removing stop words (words that carry essentially no meaning), so that each text file becomes clean, usable text data (one text file is one piece of text data); then randomly selecting a small number of text files as the target data set (verification data) and storing them in a specified directory, and storing the remaining files as the basic data set (comparison data) in its own specified directory.
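The cleaning and splitting in step S2 can be sketched as follows. The tag pattern, the stop-word list and the function names are assumptions for illustration. The source replaces special characters with an empty string, which suits Chinese text; this English demo substitutes a space instead so that the words stay separated.

```python
import html
import random
import re

TAG_RE = re.compile(r"<[^>]+>")   # assumed pattern for web page tags
STOP_WORDS = {"the", "a", "of"}   # hypothetical stop-word list

def clean_text(raw, stop_words=STOP_WORDS):
    """Strip tags, special characters and stop words from one text."""
    text = TAG_RE.sub(" ", raw)    # remove tags by regular-expression matching
    text = html.unescape(text)     # turns &nbsp; into "\xa0"
    for ch in ("\xa0", "\n", "\t", "\r"):
        text = text.replace(ch, " ")   # a space here; the source uses ""
    words = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(words)

def split_datasets(names, n_target, seed=0):
    """Randomly pick n_target files as the target set; the rest are the base set."""
    rng = random.Random(seed)
    target = set(rng.sample(sorted(names), n_target))
    base = [n for n in names if n not in target]
    return sorted(target), base

cleaned = clean_text("<p>The value&nbsp;of\ta <b>text</b> file\n</p>")
target, base = split_datasets(
    ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"], n_target=2
)
```

Fixing the random seed keeps the split reproducible, which helps when the manual verification step compares results across runs.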
Step S3, generating the model files. As shown in FIG. 3, this includes:
Loading the directory of the basic data set, traversing the files in the directory, storing the file names in a specified file (txt.txt), performing Chinese word segmentation (jieba segmentation) on the contents of all the texts, storing the results in a specified word segmentation list, converting the word segmentation list into dictionary-format data serving as the bag of words, and storing the bag of words in the first model file (ditt.dit).
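A minimal sketch of building and persisting the dictionary-format bag of words is given below. It takes already-segmented token lists as input so that it runs with the standard library only; the source names jieba for segmentation, which is not imported here. The JSON storage format, the file name `dictionary.json` and the function name are assumptions.

```python
import json
import tempfile
from pathlib import Path

def build_dictionary(token_lists):
    """Map each distinct token to an integer id in order of first appearance."""
    token_to_id = {}
    for tokens in token_lists:
        for token in tokens:
            # setdefault only assigns a new id the first time a token is seen.
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

# Already-segmented texts (illustrative; a Chinese segmenter would produce these).
token_lists = [["text", "similarity", "analysis"], ["text", "file"]]
dictionary = build_dictionary(token_lists)

# Persist the dictionary as the "first model file" (hypothetical name and format).
model_path = Path(tempfile.mkdtemp()) / "dictionary.json"
model_path.write_text(json.dumps(dictionary, ensure_ascii=False), encoding="utf-8")
reloaded = json.loads(model_path.read_text(encoding="utf-8"))
```

Saving the mapping means step S4 can encode new texts with exactly the same ids without rescanning the basic data set.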
Traversing the word segmentation list, encoding the words (for example, the first word is 0, the next word is 1, and so on in sequence), and storing them in a specified dictionary as the corpus; then applying the tf-idf algorithm to the corpus to obtain the comment set, and saving the comment set in the second model file (tfidf.
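The encoding step can be sketched as turning each segmented text into sparse (word id, count) pairs. This representation and the function name `to_bow` are assumptions; the source only says the words are encoded in sequence and stored as the corpus.

```python
from collections import Counter

def to_bow(tokens, token_to_id):
    """Encode one segmented text as sorted (word id, count) pairs.

    Tokens missing from the dictionary are skipped.
    """
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

# Ids as they would come from a previously built dictionary (illustrative).
token_to_id = {"text": 0, "similarity": 1, "analysis": 2, "file": 3}
corpus = [
    to_bow(doc, token_to_id)
    for doc in (["text", "similarity", "text"], ["file", "unknown"])
]
```

Only non-zero entries are stored, so the corpus stays small even when the dictionary contains many words.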
Calculating the sparse-matrix similarity of the comment set (compressing the sparse matrix into a dense matrix and performing a matrix dot product, which yields the cosine similarity), and storing the calculation result in the third model file (index.
Step S4, performing Chinese word segmentation on the text files in the target data set, calling the first model file, the second model file and the third model file, calculating the similarity between the text files in the target data set and the text files in the basic data set, and returning the similar text files and their similarity values. As shown in FIG. 4, this includes:
loading the directory of the target data set, traversing the files in the directory, storing the file names in a specified file, and performing Chinese word segmentation on the contents of all the texts;
and calling the first model file to encode the segmented words, calling the second model file to process the encoded words, calling the third model file to calculate the similarity values (0-1) between the text files in the target data set and the text files in the basic data set, sorting the similarity values in descending order, and returning the similar files and their similarity values, as shown for example in FIG. 6.
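The final ranking in step S4 can be sketched as scoring one target vector against every base vector and sorting in descending order. The sparse {word id: weight} representation, the file names and the function names are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {word id: weight}."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_similar(query_vec, base_vecs):
    """Return (file name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in base_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Weighted vectors for the basic data set (illustrative values only).
base_vecs = {
    "a.txt": {0: 1.0, 1: 2.0},
    "b.txt": {2: 3.0},
    "c.txt": {0: 1.0, 1: 1.0},
}
ranking = rank_similar({0: 1.0, 1: 2.0}, base_vecs)
```

Because the weights are non-negative, every score falls between 0 and 1, matching the value range given in the description.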
Verification: verification is mainly manual, namely opening the corresponding files, comparing their contents, and judging whether the computed comparison result is basically consistent with the actual situation.
Step S5, expanding the model files. As shown in FIG. 5, text files are continuously acquired from the network and, after processing, stored in the directory of the target data set; each text file in the target data set is checked against the directory of the basic data set and, if it does not exist there, moved into the directory of the basic data set; and the first model file, the second model file and the third model file are periodically regenerated. As the content of the model files keeps growing, the analysis results become more complete (with more text files added over time, the analysis is more comprehensive and potentially similar files are less likely to be missed).
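The expansion step can be sketched as a directory comparison. The sketch assumes that "exists in the basic data set" is judged by file name (the file name being the unique identifier from step S1); the directory layout and the function name are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def promote_new_files(target_dir, base_dir):
    """Move files from target_dir into base_dir when base_dir lacks that name.

    Assumption: existence is judged by file name only.
    """
    target_dir, base_dir = Path(target_dir), Path(base_dir)
    existing = {p.name for p in base_dir.iterdir()}
    moved = []
    for path in sorted(target_dir.iterdir()):
        if path.is_file() and path.name not in existing:
            shutil.move(str(path), str(base_dir / path.name))
            moved.append(path.name)
    return moved

# Temporary directories standing in for the two data set directories.
root = Path(tempfile.mkdtemp())
target_dir, base_dir = root / "target", root / "base"
target_dir.mkdir()
base_dir.mkdir()
(base_dir / "known.txt").write_text("old", encoding="utf-8")
(target_dir / "known.txt").write_text("old", encoding="utf-8")
(target_dir / "fresh.txt").write_text("new", encoding="utf-8")
moved = promote_new_files(target_dir, base_dir)
```

After files are promoted, the three model files would be regenerated on a schedule so that later queries also see the new texts.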
The above embodiments are provided only for illustrating the present invention and not for limiting the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention, and therefore all equivalent technical solutions should also fall within the scope of the present invention, and should be defined by the claims.
Claims (8)
1. A method for analyzing text content similarity is characterized by comprising the following steps:
step S1, acquiring each text file;
step S2, processing each text file, removing web page tags, special characters and stop words, and dividing the text files into a target data set and a basic data set;
step S3, performing Chinese word segmentation on the text files in the basic data set to obtain a word segmentation list, generating a bag of words from the word segmentation list, and storing the bag of words in a first model file; generating a corpus from the word segmentation list, applying the tf-idf algorithm to the corpus to obtain a comment set, and storing the comment set in a second model file; calculating the sparse-matrix similarity of the comment set, and storing the calculation result in a third model file;
step S4, performing Chinese word segmentation on the text files in the target data set, calling the first model file, the second model file and the third model file, calculating the similarity between the text files in the target data set and the text files in the basic data set, and returning the similar text files and their similarity values.
2. The method for analyzing similarity of text contents according to claim 1, wherein said step S1 includes: using written Python crawler code to automatically fetch content from websites and write it to files, wherein each file name serves as the unique identifier of the file.
3. The method for analyzing similarity of text contents according to claim 1, wherein said step S2 includes: reading a text file, removing web page tags by regular-expression matching, removing special characters by replacing them with empty strings, and removing stop words to obtain a processed text file; and randomly selecting a small number of text files as the target data set and using the remaining files as the basic data set.
4. The method for analyzing similarity of text contents according to claim 1, wherein said step S3 includes:
loading the directory of the basic data set, traversing the files in the directory, storing the file names in a specified file, performing Chinese word segmentation on the contents of all the texts, storing the results in a specified word segmentation list, converting the word segmentation list into dictionary-format data serving as the bag of words, and storing the bag of words in the first model file;
traversing the word segmentation list, encoding the words, and storing them in a specified dictionary as the corpus; applying the tf-idf algorithm to the corpus to obtain the comment set, and storing the comment set in the second model file;
and calculating the sparse-matrix similarity of the comment set, and storing the calculation result in the third model file.
5. The method for analyzing similarity of text contents according to claim 4, wherein calculating the sparse-matrix similarity means: compressing the sparse matrix into a dense matrix and performing a matrix dot product to obtain the cosine similarity.
6. The method for analyzing similarity of text contents according to claim 1, wherein said step S4 includes:
loading the directory of the target data set, traversing the files in the directory, storing the file names in a specified file, and performing Chinese word segmentation on the contents of all the texts;
and calling the first model file to encode the segmented words, calling the second model file to process the encoded words, calling the third model file to calculate the similarity values between the text files in the target data set and the text files in the basic data set, sorting the similarity values in descending order, and returning the similar files and their similarity values.
7. The method for analyzing similarity of text contents according to claim 1, wherein after the step S4, whether the result is accurate is verified manually.
8. The method for text content similarity analysis according to claim 1, further comprising:
and step S5, continuously acquiring text files from the network, storing them in the directory of the target data set after processing, checking for each text file in the target data set whether it already exists in the directory of the basic data set, moving it into the directory of the basic data set if it does not, and periodically regenerating the first model file, the second model file and the third model file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911282234.9A CN110941704B (en) | 2019-12-13 | 2019-12-13 | Text content similarity analysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911282234.9A CN110941704B (en) | 2019-12-13 | 2019-12-13 | Text content similarity analysis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110941704A (en) | 2020-03-31 |
CN110941704B (en) | 2023-11-03 |
Family
ID=69910777
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911282234.9A Active CN110941704B (en) | 2019-12-13 | 2019-12-13 | Text content similarity analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110941704B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110941704B (en) | 2023-11-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||