CN110941704A - Text content similarity analysis method - Google Patents


Info

Publication number
CN110941704A
Authority
CN
China
Prior art keywords
file
text
data set
similarity
storing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911282234.9A
Other languages
Chinese (zh)
Other versions
CN110941704B (en)
Inventor
朱玉怀
谢赟
韩欣
黄海清
吴新野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tak Billiton Information Technology Ltd By Share Ltd
Original Assignee
Shanghai Tak Billiton Information Technology Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tak Billiton Information Technology Ltd By Share Ltd
Priority to CN201911282234.9A
Publication of CN110941704A
Application granted
Publication of CN110941704B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/334 Query execution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text content similarity analysis method comprising the following steps: acquiring text files; processing the text files and dividing them into a target data set and a basic data set; performing Chinese word segmentation on the text files in the basic data set to obtain a word segmentation list, generating a bag of words from the word segmentation list, and storing the bag of words in a first model file; generating a corpus from the word segmentation list, applying the tf-idf algorithm to the corpus to obtain a comment set, and storing the comment set in a second model file; computing sparse-matrix similarity for the comment set and storing the result in a third model file; and performing Chinese word segmentation on the text files in the target data set, calling the first, second and third model files, and calculating the similarity between the text files in the target data set and those in the basic data set. The method can quickly determine how similar one text is to other texts and helps explore the value of text files.

Description

Text content similarity analysis method
Technical Field
The invention relates to the technical field of data analysis, in particular to a text content similarity analysis method.
Background
In the current internet era there is a large and varied body of text files (text data), and this data is difficult to handle: 1. Text data occupies a large amount of storage space and a large amount of memory during computation. 2. The content of text data is relatively disorderly and not easy to process. 3. Valuable information is not easily discovered in text data.
Analyzing text makes it easier to find similar matters, similar behaviors of people, repeated events, associated people and the like, so the usefulness of text data can be explored continuously.
Disclosure of Invention
The invention aims to provide a text content similarity analysis method, which can quickly acquire the similarity between one text and other texts and explore the value of a text file.
The technical scheme for realizing the purpose is as follows:
A text content similarity analysis method, comprising:
step S1, acquiring text files;
step S2, processing each text file to remove web page tags, special characters and stop words, and dividing the text files into a target data set and a basic data set;
step S3, performing Chinese word segmentation on the text files in the basic data set to obtain a word segmentation list, generating a bag of words from the word segmentation list, and storing the bag of words in a first model file; generating a corpus from the word segmentation list, applying the tf-idf algorithm to the corpus to obtain a comment set (tf is the frequency with which a keyword appears in a text; idf is the inverse document frequency, which down-weights words that are common across texts but say little about any particular text file), and storing the comment set in a second model file; computing sparse-matrix similarity for the comment set and storing the result in a third model file;
step S4, performing Chinese word segmentation on the text files in the target data set, calling the first, second and third model files, calculating the similarity between the text files in the target data set and the text files in the basic data set, and returning the similar text files and their similarity values.
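For reference, one common textbook form of the tf-idf weighting named in step S3 is written out below. The patent does not state which variant or normalization it uses, so the symbols ($n_{t,d}$, $N$, $\mathrm{df}_t$, $w_{t,d}$) and the exact form are assumptions made for illustration only.

```latex
\[
\mathrm{tf}_{t,d} = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}_{t} = \log\frac{N}{\mathrm{df}_{t}}, \qquad
w_{t,d} = \mathrm{tf}_{t,d}\cdot\mathrm{idf}_{t}
\]
```

Here $n_{t,d}$ is the number of times word $t$ occurs in text $d$, $N$ is the number of texts in the basic data set, and $\mathrm{df}_t$ is the number of those texts that contain $t$. Under this form a word that occurs in every text has $\mathrm{idf}_t = \log 1 = 0$, which is the down-weighting of common words that the step describes.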
Preferably, the step S1 includes: using crawler code written in Python (a cross-platform computer programming language) to automatically fetch the content of visited pages from a website and write the data to a file, the file name serving as the unique identifier of the file.
Preferably, the step S2 includes: reading a text file, removing web page tags by regular-expression matching, removing special characters by replacing them with the empty string, and removing stop words to obtain a processed text file; and randomly taking a small number of text files as the target data set and the rest as the basic data set.
Preferably, the step S3 includes:
loading the directory of the basic data set, traversing the files in the directory, storing the file names in a specified file, performing Chinese word segmentation on the contents of all the texts, storing the results in a specified word segmentation list, converting the word segmentation list into dictionary-format data serving as the bag of words, and storing the bag of words in the first model file;
traversing the word segmentation list, encoding the words, and storing them in a specified dictionary as the corpus; applying the tf-idf algorithm to the corpus to obtain the comment set, and storing the comment set in the second model file;
and computing sparse-matrix similarity for the comment set, and storing the result in the third model file.
Preferably, computing the sparse-matrix similarity means: compressing the sparse matrix into a dense matrix and performing a matrix dot product to obtain the cosine similarity.
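The link between a matrix dot product and cosine similarity can be made explicit as follows. This is a standard identity rather than text from the patent; the symbols $A$, $\hat{A}$ and $S$ are introduced here for illustration, and the identity assumes each row vector is first scaled to unit length.

```latex
\[
\cos(\mathbf{a}_i,\mathbf{a}_j)
  = \frac{\mathbf{a}_i\cdot\mathbf{a}_j}{\lVert\mathbf{a}_i\rVert\,\lVert\mathbf{a}_j\rVert},
\qquad
\hat{\mathbf{a}}_i = \frac{\mathbf{a}_i}{\lVert\mathbf{a}_i\rVert},
\qquad
S = \hat{A}\,\hat{A}^{\top}
\;\Longrightarrow\;
S_{ij} = \hat{\mathbf{a}}_i\cdot\hat{\mathbf{a}}_j = \cos(\mathbf{a}_i,\mathbf{a}_j)
\]
```

Each row $\mathbf{a}_i$ of $A$ is the tf-idf vector of one text in the basic data set. Because tf-idf weights are non-negative, every entry of $S$ lies between 0 and 1.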
Preferably, the step S4 includes:
loading the directory of the target data set, traversing the files in the directory, storing the file names in a specified file, and performing Chinese word segmentation on the contents of all the texts;
and calling the first model file to encode the segmented words, calling the second model file to process the encoded words, calling the third model file to calculate the similarity values between the text files in the target data set and the text files in the basic data set, sorting the similarity values in descending order, and returning the similar files and their similarity values.
Preferably, after the step S4, the accuracy of the result is verified manually.
Preferably, the method further comprises the following steps:
step S5, continuously acquiring text files from the network and, after processing, storing them in the directory of the target data set; checking, for each text file in the target data set, whether it already exists in the directory of the basic data set and, if not, moving it into the directory of the basic data set; and periodically regenerating the first model file, the second model file and the third model file.
The invention has the following beneficial effects: the text files are divided into a target data set and a basic data set, model files are generated, and the similarity between a target text and the basic texts is calculated from the model files, so that the similarity between one text and other texts is obtained quickly and the value of the text files can be explored. Meanwhile, the model files are continuously expanded, so that the analysis results become more comprehensive and accurate.
Drawings
FIG. 1 is a flow chart of a method of text content similarity analysis of the present invention;
FIG. 2 is a flow chart illustrating text file processing in the present invention;
FIG. 3 is a flow chart of generating a model file in the present invention;
FIG. 4 is a flow chart of text similarity calculation in the present invention;
FIG. 5 is a flow chart illustrating the process of expanding a model file according to the present invention;
FIG. 6 is a diagram illustrating a calculation result of text similarity in the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
Referring to fig. 1, the method for analyzing similarity of text contents of the present invention includes the following steps:
Step S1, acquiring text files: crawler code written in Python automatically fetches the content of visited pages from a website and writes the data to a file, the file name serving as the unique identifier of the file.
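The storage half of step S1 might be sketched as follows. The patent does not say how the unique identifier is derived, so hashing the page URL with SHA-1 is an assumption, as are the function names `file_id_for` and `save_page`; the network fetch itself is left out.

```python
import hashlib
from pathlib import Path


def file_id_for(url: str) -> str:
    """Derive a stable identifier from the page URL (assumed scheme, not from the patent)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def save_page(url: str, content: str, out_dir: str) -> Path:
    """Write fetched page content to <out_dir>/<identifier>.txt and return the path."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{file_id_for(url)}.txt"
    path.write_text(content, encoding="utf-8")
    return path
```

Because the identifier depends only on the URL, fetching the same page twice overwrites the same file instead of creating a duplicate.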
Step S2, processing each text file to remove web page tags, special characters and stop words, and dividing the text files into a target data set and a basic data set. As shown in FIG. 2, this includes: reading a text file, removing web page tags by regular-expression matching, removing special characters (such as &nbsp;, line feeds and tab characters) by replacing them with the empty string, and removing stop words (words that carry essentially no meaning), so that the text file becomes clean, usable text data (one text file is one piece of text data); and randomly taking a small number of text files as the target data set (verification data) and storing them in a specified directory, and storing the rest as the basic data set (comparison data) in a specified directory.
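The cleaning and splitting in step S2 could look roughly like the sketch below. The regular expressions, the placeholder stop-word list and the names `clean_text`, `remove_stop_words` and `split_datasets` are assumptions for illustration; the patent gives no concrete patterns or stop-word list.

```python
import random
import re

TAG_RE = re.compile(r"<[^>]+>")          # web page tags such as <p> or <br/>
ENTITY_RE = re.compile(r"&[a-zA-Z]+;")   # HTML entities such as &nbsp;
STOP_WORDS = {"the", "a", "of"}          # placeholder list, not taken from the patent


def clean_text(raw: str) -> str:
    """Remove tags and special characters by replacing them with the empty string."""
    text = TAG_RE.sub("", raw)
    text = ENTITY_RE.sub("", text)
    return text.replace("\n", "").replace("\t", "")


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def split_datasets(file_names, n_target, seed=0):
    """Randomly pick n_target files as the target set; the rest form the basic set."""
    rng = random.Random(seed)
    target = set(rng.sample(sorted(file_names), n_target))
    base = [name for name in file_names if name not in target]
    return sorted(target), base
```

Replacing line feeds with the empty string follows the description literally; it suits Chinese text, which has no spaces between words, but would glue words together in space-delimited languages.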
Step S3, generating the model files. As shown in FIG. 3, this includes:
loading the directory of the basic data set, traversing the files in the directory, storing the file names in a specified file (txt.txt), performing Chinese word segmentation (jieba segmentation) on the contents of all the texts, storing the results in a specified word segmentation list, converting the word segmentation list into dictionary-format data serving as the bag of words, and storing the bag of words in the first model file (ditt.dit).
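A minimal sketch of turning the word segmentation list into dictionary-format data and persisting it is given below. It takes already-segmented token lists as input; in the described method the tokens would come from jieba segmentation. The helper names `build_dictionary`, `save_model` and `load_model`, and the use of `pickle` as the on-disk format, are assumptions rather than details from the patent.

```python
import pickle


def build_dictionary(token_lists):
    """Map each distinct token to an integer id in first-seen order."""
    token2id = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id


def save_model(obj, path):
    """Persist a model object (here, the dictionary) to a file."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_model(path):
    """Load a previously saved model object."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Saving the dictionary separately means a later run can encode new texts with the same ids without re-reading the basic data set.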
Traversing the word segmentation list, encoding the words (for example, the first word is 0, the next word is 1, and so on in sequence), and storing them in a specified dictionary as the corpus; applying the tf-idf algorithm to the corpus to obtain the comment set, and saving the comment set in the second model file (tfidf.
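The encoding and weighting just described can be illustrated in plain Python as below. The sketch assumes the textbook weighting tf × log(N/df) with no further normalization; the patent does not specify the variant, and the function name `tfidf_vectors` and its return layout are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return (token2id, idf, vectors); each vector is a sparse dict {word id: weight}."""
    # Encode words in sequence: the first new word gets 0, the next 1, and so on.
    token2id = {}
    for tokens in token_lists:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))

    # Document frequency: the number of texts in which each word id appears.
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update({token2id[tok] for tok in tokens})
    idf = {tid: math.log(n_docs / count) for tid, count in doc_freq.items()}

    # Term frequency times inverse document frequency, per text.
    vectors = []
    for tokens in token_lists:
        counts = Counter(token2id[tok] for tok in tokens)
        total = len(tokens)
        vectors.append({tid: (c / total) * idf[tid] for tid, c in counts.items()})
    return token2id, idf, vectors
```

A word that appears in every text receives weight 0 under this form, so it contributes nothing to later similarity scores.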
Computing sparse-matrix similarity for the comment set (compressing the sparse matrix into a dense matrix and performing a matrix dot product, which gives the cosine similarity when the vectors have unit length), and storing the result in the third model file (index.
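The dot-product view of cosine similarity can be sketched as follows. Unlike the description, which compresses the sparse matrix into a dense one, this illustration keeps each vector as a sparse dict; that substitution, and the names `l2_normalize`, `sparse_dot` and `similarity_matrix`, are assumptions made to keep the example dependency-free.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector {id: weight} to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {k: w / norm for k, w in vec.items()}


def sparse_dot(a, b):
    """Dot product of two sparse vectors, iterating over the shorter one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[k] for k, w in a.items() if k in b)


def similarity_matrix(vectors):
    """Pairwise cosine similarity: dot products of unit-length vectors."""
    unit = [l2_normalize(v) for v in vectors]
    return [[sparse_dot(u, v) for v in unit] for u in unit]
```

Normalizing once up front is what lets a plain dot product stand in for the full cosine formula on every later comparison.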
Step S4, performing Chinese word segmentation on the text files in the target data set, calling the first, second and third model files, calculating the similarity between the text files in the target data set and the text files in the basic data set, and returning the similar text files and their similarity values. As shown in FIG. 4, this includes:
loading the directory of the target data set, traversing the files in the directory, storing the file names in a specified file, and performing Chinese word segmentation on the contents of all the texts;
and calling the first model file to encode the segmented words, calling the second model file to process the encoded words, calling the third model file to calculate the similarity value (0-1) between the text files in the target data set and the text files in the basic data set, sorting the similarity values in descending order, and returning the similar files and their similarity values. An example result is shown in FIG. 6.
Verification: this is mainly done manually, by opening the corresponding files, comparing their contents, and judging whether the computed result is basically consistent with the actual result.
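The lookup in step S4 might be sketched as below: a target text is encoded with the saved dictionary, weighted with the saved idf values, compared against every basic-set vector, and the results are sorted in descending order. The data layout (`token2id`, `idf`, `base_vectors`, `base_names`) and the function names are hypothetical stand-ins for the three model files, and words unseen in the basic data set are simply ignored, which is an assumption.

```python
import math
from collections import Counter


def query_vector(tokens, token2id, idf):
    """Encode a segmented target text and weight it with the stored idf values."""
    ids = [token2id[tok] for tok in tokens if tok in token2id]  # unknown words are skipped
    counts = Counter(ids)
    total = len(ids)
    return {tid: (c / total) * idf[tid] for tid, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(tokens, token2id, idf, base_vectors, base_names, top_n=5):
    """Return up to top_n (file name, similarity) pairs, most similar first."""
    query = query_vector(tokens, token2id, idf)
    scored = [(name, cosine(query, vec)) for name, vec in zip(base_names, base_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

With non-negative tf-idf weights the scores fall between 0 and 1, matching the range given in the description.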
Step S5, expanding the model files. As shown in FIG. 5, text files are continuously obtained from the network and, after processing, stored in the directory of the target data set. Each text file in the target data set is checked against the directory of the basic data set; if it does not already exist there, it is moved into the directory of the basic data set, and the first model file, the second model file and the third model file are periodically regenerated. As the content of the model files keeps growing, the analysis results become more complete (later text files keep being added, so that files which may be similar are not missed).
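The file-moving part of step S5 can be sketched as follows. The description does not say how existence in the basic data set is decided; comparing by file name (the unique identifier from step S1) is an assumption, as is the function name `promote_new_files`. Regenerating the model files afterwards is left to the caller.

```python
import shutil
from pathlib import Path


def promote_new_files(target_dir, base_dir):
    """Move files not yet present in base_dir from target_dir; return the moved names."""
    target = Path(target_dir)
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(target.iterdir()):
        if not path.is_file():
            continue
        destination = base / path.name
        if not destination.exists():  # existence is judged by file name (assumption)
            shutil.move(str(path), str(destination))
            moved.append(path.name)
    return moved
```

A non-empty return value is a natural trigger for the periodic regeneration of the three model files.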
The above embodiments are provided only for illustrating the present invention and not for limiting the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention, and therefore all equivalent technical solutions should also fall within the scope of the present invention, and should be defined by the claims.

Claims (8)

1. A text content similarity analysis method, characterized by comprising:
step S1, acquiring text files;
step S2, processing each text file to remove web page tags, special characters and stop words, and dividing the text files into a target data set and a basic data set;
step S3, performing Chinese word segmentation on the text files in the basic data set to obtain a word segmentation list, generating a bag of words from the word segmentation list, and storing the bag of words in a first model file; generating a corpus from the word segmentation list, applying the tf-idf algorithm to the corpus to obtain a comment set, and storing the comment set in a second model file; computing sparse-matrix similarity for the comment set and storing the result in a third model file;
step S4, performing Chinese word segmentation on the text files in the target data set, calling the first, second and third model files, calculating the similarity between the text files in the target data set and the text files in the basic data set, and returning the similar text files and their similarity values.
2. The method for analyzing similarity of text contents according to claim 1, wherein said step S1 includes: using crawler code written in Python to automatically fetch the content of visited pages from a website and write the data to a file, the file name serving as the unique identifier of the file.
3. The method for analyzing similarity of text contents according to claim 1, wherein said step S2 includes: reading a text file, removing web page tags by regular-expression matching, removing special characters by replacing them with the empty string, and removing stop words to obtain a processed text file; and randomly taking a small number of text files as the target data set and the rest as the basic data set.
4. The method for analyzing similarity of text contents according to claim 1, wherein said step S3 includes:
loading the directory of the basic data set, traversing the files in the directory, storing the file names in a specified file, performing Chinese word segmentation on the contents of all the texts, storing the results in a specified word segmentation list, converting the word segmentation list into dictionary-format data serving as the bag of words, and storing the bag of words in the first model file;
traversing the word segmentation list, encoding the words, and storing them in a specified dictionary as the corpus; applying the tf-idf algorithm to the corpus to obtain the comment set, and storing the comment set in the second model file;
and computing sparse-matrix similarity for the comment set, and storing the result in the third model file.
5. The method for analyzing similarity of text contents according to claim 4, wherein computing the sparse-matrix similarity means: compressing the sparse matrix into a dense matrix and performing a matrix dot product to obtain the cosine similarity.
6. The method for analyzing similarity of text contents according to claim 1, wherein said step S4 includes:
loading the directory of the target data set, traversing the files in the directory, storing the file names in a specified file, and performing Chinese word segmentation on the contents of all the texts;
and calling the first model file to encode the segmented words, calling the second model file to process the encoded words, calling the third model file to calculate the similarity values between the text files in the target data set and the text files in the basic data set, sorting the similarity values in descending order, and returning the similar files and their similarity values.
7. The method for analyzing similarity of text contents according to claim 1, wherein after the step S4, the accuracy of the result is verified manually.
8. The method for text content similarity analysis according to claim 1, further comprising:
step S5, continuously acquiring text files from the network and, after processing, storing them in the directory of the target data set; checking, for each text file in the target data set, whether it already exists in the directory of the basic data set and, if not, moving it into the directory of the basic data set; and periodically regenerating the first model file, the second model file and the third model file.
CN201911282234.9A 2019-12-13 2019-12-13 Text content similarity analysis method Active CN110941704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911282234.9A CN110941704B (en) 2019-12-13 2019-12-13 Text content similarity analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911282234.9A CN110941704B (en) 2019-12-13 2019-12-13 Text content similarity analysis method

Publications (2)

Publication Number Publication Date
CN110941704A (en) 2020-03-31
CN110941704B (en) 2023-11-03

Family

ID=69910777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911282234.9A Active CN110941704B (en) 2019-12-13 2019-12-13 Text content similarity analysis method

Country Status (1)

Country Link
CN (1) CN110941704B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160125872A1 (en) * 2014-11-05 2016-05-05 At&T Intellectual Property I, L.P. System and method for text normalization using atomic tokens
CN104408033A (en) * 2014-11-25 2015-03-11 中国人民解放军国防科学技术大学 Text message extracting method and system
CN108573411A (en) * 2018-04-17 2018-09-25 重庆理工大学 Depth sentiment analysis and multi-source based on user comment recommend the mixing of view fusion to recommend method
CN109858028A (en) * 2019-01-30 2019-06-07 神思电子技术股份有限公司 A kind of short text similarity calculating method based on probabilistic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
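徐敏; 李广建: "Research on hot topic detection in short texts based on word-frequency mean fluctuation and a probabilistic language model", 情报杂志 (Journal of Intelligence) *
李心蕾; 王昊; 刘小敏; 邓三鸿: "A comparative study of text vectorization methods for microblog short-text classification", 数据分析与知识发现 (Data Analysis and Knowledge Discovery) *
王义真; 郑啸; 后盾; 胡昊: "SVM-based sentiment classification of short texts with high-dimensional mixed features", 计算机技术与发展 (Computer Technology and Development) *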

Also Published As

Publication number Publication date
CN110941704B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN106649818B (en) Application search intention identification method and device, application search method and server
CN109726274B (en) Question generation method, device and storage medium
CN108108426B (en) Understanding method and device for natural language question and electronic equipment
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN111291177A (en) Information processing method and device and computer storage medium
CN115563287A (en) Data processing system for obtaining associated object
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
CN111651986A (en) Event keyword extraction method, device, equipment and medium
CN110866102A (en) Search processing method
CN111125295A (en) Method and system for obtaining food safety question answers based on LSTM
CN111241310A (en) Deep cross-modal Hash retrieval method, equipment and medium
CN111078839A (en) Structured processing method and processing device for referee document
CN112667775A (en) Keyword prompt-based retrieval method and device, electronic equipment and storage medium
CN112667780A (en) Comment information generation method and device, electronic equipment and storage medium
CN110019820B (en) Method for detecting time consistency of complaints and symptoms of current medical history in medical records
CN114510923B (en) Text theme generation method, device, equipment and medium based on artificial intelligence
CN111984845A (en) Website wrongly-written character recognition method and system
CN113420542B (en) Dialogue generation method, device, electronic equipment and storage medium
CN113934834A (en) Question matching method, device, equipment and storage medium
CN114528413A (en) Knowledge graph updating method, system and readable storage medium supported by crowdsourced marking
CN112632395A (en) Search recommendation method and device, server and computer-readable storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN110413996B (en) Method and device for constructing zero-index digestion corpus
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant