CN116028592A - Text data screening association method and system - Google Patents

Publication number
CN116028592A
Authority
CN
China
Prior art keywords
comparison
screening
information
word
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111247550.XA
Other languages
Chinese (zh)
Inventor
邱方孝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Feizide Information Co ltd
Original Assignee
Feizide Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Feizide Information Co ltd filed Critical Feizide Information Co ltd
Priority to CN202111247550.XA priority Critical patent/CN116028592A/en
Publication of CN116028592A publication Critical patent/CN116028592A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Abstract

A method and system for screening and associating text data. After word breaking, screening, association processing, and integration are performed on a plurality of pieces of comparison text data, an association index file is built from adjacent screened words, so that brief information on the comparison text data can be sorted out quickly and the originality of text data to be compared can then be analyzed.

Description

Text data screening association method and system
Technical Field
The invention relates to a method and a system for screening and associating text data, and in particular to a method and a system that quickly sort out brief information of text data based on screened words that are adjacent to one another in the text, and that can analyze the originality of text data to be compared against comparison text data.
Background
In recent years, incidents of paper plagiarism have emerged one after another, and the public has begun to worry about the originality of papers. Although many systems for detecting plagiarism by comparing papers and articles are available on the market, these systems check for plagiarism from a stance of suspicion toward the authors who publish their research, which is unfair to those authors. Moreover, some institutions even require authors to submit a plagiarism comparison report first, and allow a paper to be published only if its similarity is within a certain proportion. Authors must therefore use such tools to prove that they have not plagiarized others, which reflects distrust of the authors and is quite inappropriate. The inventor believes that paper publication should be approached from the opposite direction, by positively providing a tool for detecting originality, so that a publishing institution can set an originality proportion as a reference for paper quality management.
Regarding plagiarism comparison systems: the problem of paper plagiarism in academic research has grown more serious in recent years, and as the issue has continued to draw attention, plagiarism detection has become increasingly important. Plagiarism falls mainly into the following categories:
1. Copy/paste (clone) plagiarism: copying without modification.
2. Paraphrasing plagiarism: copying a paragraph while swapping vocabulary or altering sentence structure or grammatical style.
3. Metaphor plagiarism: using metaphor to express another person's idea more clearly.
4. Idea plagiarism: borrowing ideas or solutions from other sources and presenting them as one's own research.
5. Self-plagiarism (recycled plagiarism): publishing an already published article again as a new research result.
6. Citation plagiarism: the source is properly cited, but the description is similar to the original content in wording, or even in sentence structure and grammar.
Among these categories, unmodified copying (of a whole text or of fragments) and paragraph rewriting draw the most criticism, because the plagiarism can be seen clearly by comparing the paper with the plagiarized source.
As early as 1995, scholars studied copy detection for digital documents. With the development of natural language processing and hardware, many different methods have been proposed in recent years. In the field of plagiarism detection, they fall mainly into the following approaches:
1. String-based methods. This is the most common approach to paper plagiarism detection. The paper to be compared is checked against an existing paper database, and the proportion of plagiarism is judged by searching for matching strings, so the method can tell the user which paragraphs and sentences are plagiarized. Shrestha and Solorio (2013) detected plagiarism using n-grams of stop words, named entities, and all words, checking whether the paper under test shares too many n-grams with articles in the text database. Nguyen et al. (2016) proposed detecting plagiarism in Vietnamese articles using substring n-grams. Such methods have three drawbacks: (a) if the paper contains text that is not in the paper database, similar sentences cannot be matched, so the copied content is not detected; (b) a user can evade detection by changing words or swapping word order, so similar words and sentences are not detected; (c) because the method compares strings, a very long input string is easily diluted, which lowers the plagiarism similarity.
2. Vector-based methods. These extract lexical and syntactic features and represent them as vectors rather than strings. Similarity is usually measured with the Jaccard coefficient, the Dice coefficient, the overlap coefficient, cosine similarity, and the like. Mahdavi et al. used a vector space model to detect whether Persian articles were plagiarized, converting the articles into TF-IDF representations and comparing their similarity. Jiffriya et al. (2013) proposed converting articles into vectors, clustering them with the K-means algorithm, and then detecting plagiarism within the clusters based on tri-grams. The drawback of this approach is that the importance of a word in an article is measured by word frequency; an important word may not occur often enough, which leads to poor comparison results, and the calculation cannot reflect position information or the importance of a word's context.
3. Syntax-based methods. These detect plagiarism using syntactic features such as parts of speech, sentence dependency trees, and word-to-word relations, using parts of speech to represent sentence structure and compute similarity. They can find passages with similar sentence structure, but cannot detect plagiarism by paragraph rewriting, word extraction and replacement, or changes of sentence structure. Syntax-based methods have the following drawbacks: (a) Chinese grammar is much more complex than English grammar, so comparison results for Chinese papers are extremely poor; (b) because plagiarized content is detected through syntactic features, the method may find passages that share the same syntax but contain no plagiarized text, leading to misjudgment.
4. Semantic-based methods. These let the system understand the semantics of paragraphs by converting articles into vectors, and can detect changes in word order and between active and passive voice, but cannot locate the plagiarized paragraphs and sentences. Torres (2009) proposed building a dictionary to assist plagiarism detection, and Resnik (1999) used external resources to assist with semantics. Detecting plagiarism semantically makes it possible to find semantically similar papers, but the plagiarized paragraphs and vocabulary cannot be identified, so the plagiarism cannot be verified.
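For reference, the similarity measures named under the vector-based methods can be sketched as follows. This is an illustrative sketch of the textbook definitions, not code from any of the cited works; the function names and the choice of word sets and word-count vectors as inputs are assumptions made for the example.

```python
import math
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """2 * |A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def cosine(a_words: list, b_words: list) -> float:
    """Cosine similarity between raw word-count vectors."""
    va, vb = Counter(a_words), Counter(b_words)
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

All four measures depend only on which words occur and how often, which is consistent with the drawback noted above: word position and context are not reflected.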
After many years of research, experiments, and improvements, the inventor has developed the present invention.
Disclosure of Invention
The invention aims to provide a text data screening and associating method capable of rapidly sorting out brief information of text data.
The method for achieving this aim comprises the following steps: S11, performing word breaking on a piece of text data based on a word-breaking vocabulary library to generate word-breaking information; S12, screening the word-breaking information to generate screened word-breaking information, the screened word-breaking information containing two or more screened words; S13, performing association processing on the screened word-breaking information to generate a plurality of pieces of association sequence information, each piece of association sequence information consisting of two or more screened words that are adjacent to one another.
Preferably, a step S110 is performed before step S11. Step S110 is: collecting the author-defined keywords in the text data to establish a professional keyword vocabulary library, and importing the professional keyword vocabulary library into the word-breaking vocabulary library, so as to obtain association sequence information closer to the intended meaning of the text data.
Preferably, in step S12, synonym processing may be performed after the screening and before the subsequent steps. The synonym processing is: checking the screened words obtained from the screening for synonyms, and converting the synonyms into a standard term.
Still another object of the present invention is to provide a text data screening and associating system capable of rapidly sorting out brief information of text data.
The system for achieving the above purpose comprises: a storage module for storing a word-breaking vocabulary library; a word-breaking processing module for performing word breaking on a piece of text data to generate word-breaking information; a screening processing module for screening the word-breaking information to generate screened word-breaking information; and an association processing module for performing association processing on the screened word-breaking information to generate a plurality of pieces of association sequence information.
Still another object of the present invention is to provide a screening and associating method for rapidly sorting out brief information of a plurality of comparison text data, and integrating brief information of each comparison text data together, so as to conveniently analyze originality of text data to be compared.
The method for achieving this aim comprises the following steps: S21, establishing comparison set information from two or more pieces of comparison text data; S22, performing word breaking on the plurality of pieces of comparison text data based on a word-breaking vocabulary library to generate comparison word-breaking information for each; S23, screening the plurality of pieces of comparison word-breaking information to generate comparison screened word-breaking information for each, each piece of comparison screened word-breaking information containing two or more screened words; S24, performing association processing on the plurality of pieces of comparison screened word-breaking information to generate a plurality of pieces of comparison association sequence information for each, each piece of comparison association sequence information consisting of two or more screened words that are adjacent to one another; S25, integrating all the comparison association sequence information to establish an association index file.
Preferably, a step S220 is performed before step S22. Step S220 is: collecting the author-defined keywords in some or all of the plurality of pieces of comparison text data and the text data to be compared to establish a professional keyword vocabulary library, and importing the professional keyword vocabulary library into the word-breaking vocabulary library, so as to obtain an association index file closer to the intended meaning of the text data.
Preferably, steps S26 to S29 are performed after step S25. Step S26 is: performing word breaking, screening, and association processing on the text data to be compared to generate a plurality of pieces of association sequence information to be compared. Step S27 is: comparing each piece of association sequence information to be compared with the association index file, and finding every piece of comparison text data that has comparison association sequence information identical to the association sequence information to be compared. Step S28 is: establishing an intersection sequence in which all comparison association sequence information identical to the association sequence information to be compared is arranged in order. Step S29 is: analyzing each piece of comparison text data that shares identical association sequence information with the text data to be compared, so as to analyze the originality of the text data to be compared.
Preferably, in step S23, synonym processing may be performed after the screening and before the subsequent steps, so as to improve the effectiveness of the association comparison.
Still another object of the present invention is to provide a screening and associating system for rapidly sorting out brief information of a plurality of comparison text data, and integrating brief information of each comparison text data together, so as to conveniently analyze originality of text data to be compared.
The system for achieving the above purpose comprises: a storage module for storing a word-breaking vocabulary library and comparison set information; a word-breaking processing module for performing word breaking on each piece of comparison text data in the comparison set information to generate comparison word-breaking information for each; a screening processing module for screening the plurality of pieces of comparison word-breaking information to generate comparison screened word-breaking information for each; an association processing module for performing association processing on the comparison screened word-breaking information to generate a plurality of pieces of comparison association sequence information for each; and an integration module for integrating all the comparison association sequence information to establish an association index file.
Preferably, the word-breaking processing module, the screening processing module, and the association processing module also perform word breaking, screening, and association processing on a piece of text data to be compared to generate a plurality of pieces of association sequence information to be compared, and the text data screening and associating system further comprises: a comparison module for comparing each piece of association sequence information to be compared with the association index file and finding every piece of comparison text data that has comparison association sequence information identical to the association sequence information to be compared; an intersection module for arranging in order all comparison association sequence information identical to the association sequence information to be compared, thereby establishing an intersection sequence; and an analysis module for analyzing each piece of comparison text data that shares identical association sequence information with the text data to be compared.
To achieve the foregoing and other objects, and in accordance with the purpose of the invention, as embodied and broadly described, a preferred embodiment of the present invention is illustrated and described below.
Drawings
FIG. 1 is a flowchart of a text data screening association method according to a first embodiment of the present invention;
FIG. 2 is a block diagram of one embodiment of a system that can automatically perform the method of the first embodiment;
FIG. 3 is a flowchart of a text data screening association method according to a second embodiment of the present invention;
FIG. 4 is a block diagram of one embodiment of a system that can automatically perform the method of the second embodiment.
Reference numerals and signs
100, 100a: text data screening and associating system
1, 1a: storage module
2, 2a: word-breaking processing module
3, 3a: screening processing module
4, 4a: association processing module
5a: integration module
6a: comparison module
7a: intersection module
8a: analysis module
9, 9a: word breaking system
Detailed Description
FIGS. 1 and 2 show a first embodiment of the present invention. As shown in FIG. 1, the text data screening and associating method of the present invention comprises the following steps: S11, performing word breaking on a piece of text data based on a word-breaking vocabulary library to generate word-breaking information; S12, screening the word-breaking information to generate screened word-breaking information, the screened word-breaking information containing two or more screened words; S13, performing association processing on the screened word-breaking information to generate a plurality of pieces of association sequence information, each piece of association sequence information consisting of two or more screened words that are adjacent to one another. In this way, brief information of the text data can be sorted out quickly. The steps are described in detail below.
Step S11 performs word breaking on a piece of text data based on a word-breaking vocabulary library to generate word-breaking information.
The text data may be any kind of published text, such as a master's or doctoral thesis, an academic paper, or an ordinary article or sentence. For lengthy text data such as a paper, the paper can be treated directly as one piece of text data, or it can be segmented into a plurality of pieces of text data. Segmentation can be done in many ways, illustrated as follows. For example, a piece of text data can be divided into a plurality of pieces at characters such as a line feed, consecutive spaces, an exclamation mark (!), a colon (:), a tilde (~), or a question mark (?). A piece of text data can also be divided according to its chapters and sections. Segmentation can also be used together with the word-breaking vocabulary library, taking a predetermined number of selected words, such as ten, twenty, …, as one segment, so that one piece of text data is divided into a plurality of pieces of text data.
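Two of the segmentation options described above can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function names, the exact delimiter set (including the full-width punctuation variants), and the default segment size are assumptions.

```python
import re

# Assumed delimiter set: line feeds, runs of two or more spaces, and the
# ASCII and full-width forms of ! : ~ ?
_DELIMITERS = re.compile(r"\n+| {2,}|[!:~?！：～？]")


def split_by_delimiters(text: str) -> list[str]:
    """Divide one piece of text data into pieces at delimiter characters."""
    return [part.strip() for part in _DELIMITERS.split(text) if part.strip()]


def split_by_word_count(words: list[str], size: int = 10) -> list[list[str]]:
    """Divide an already word-broken text into segments of `size` words."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [words[i:i + size] for i in range(0, len(words), size)]
```

The second function expects the output of word breaking, which matches the description of using segmentation together with the word-breaking vocabulary library.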
Word breaking converts the text data into word-breaking information according to the words recorded in the word-breaking vocabulary library. The words in the word-breaking vocabulary library may be classified by part of speech, for example common noun (Na), foreign word (FW), active transitive verb (VC), active intransitive verb (VA), place noun (Nc), proper noun (Nb), stative causative verb (VHC), colon (COLONCATEGORY), and so on.
Step S12 screens the word-breaking information to generate screened word-breaking information; the screened word-breaking information contains two or more screened words. The screening retains the meaningful parts of speech in the word-breaking information and removes the others, for example retaining common nouns (Na), foreign words (FW), active transitive verbs (VC), active intransitive verbs (VA), place nouns (Nc), proper nouns (Nb), stative causative verbs (VHC), and so on. All words that remain after the screening are collectively referred to as screened words.
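The screening of step S12 can be sketched as a filter on part-of-speech tags. The sketch below assumes the word-breaking step has already produced a list of words and that each word's tag can be looked up in a lexicon; the function names, the toy lexicon, and the tags `DE` and `Caa` used for the removed function words are illustrative assumptions.

```python
# Parts of speech retained by the screening, as listed in the description.
KEPT_TAGS = frozenset({"Na", "FW", "VC", "VA", "Nc", "Nb", "VHC"})


def tag_words(words, lexicon):
    """Attach a part-of-speech tag to each word; unknown words get 'UNKNOWN'."""
    return [(word, lexicon.get(word, "UNKNOWN")) for word in words]


def screen(tagged_words, kept_tags=KEPT_TAGS):
    """Keep only the words whose tag is in kept_tags (the screened words)."""
    return [word for word, tag in tagged_words if tag in kept_tags]


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {
    "analyze": "VC", "originality": "Na", "of": "DE",
    "paper": "Na", "and": "Caa", "system": "Na",
}
tagged = tag_words(["analyze", "originality", "of", "paper", "and", "system"], toy_lexicon)
screened = screen(tagged)  # function words tagged DE and Caa are removed
```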
Step S13 performs association processing on the screened word-breaking information to generate a plurality of pieces of association sequence information; each piece of association sequence information consists of two or more screened words that are adjacent to one another. Because the association processing combines two or more adjacent screened words, text data in the same field but with different technical features can be distinguished to a certain extent; in particular, differences between pieces of text data that share most of their keywords can be distinguished.
The first embodiment of the invention quickly sorts out association sequence information that is closer to the meaning of the text data than keywords alone. It can be applied to other people's text data or to the user's own text data, so that brief information of the text data is sorted out quickly and the text data can be conveniently analyzed and used.
As shown in FIG. 1, step S110 may be performed before step S11. Step S110 is: collecting the author-defined keywords in the text data to establish a professional keyword vocabulary library, and importing the professional keyword vocabulary library into the word-breaking vocabulary library. Text data such as papers generally have author-defined keywords, which include proper names, scientific and technical terms, and so on. If the author-defined keywords are imported into the word-breaking vocabulary library before word breaking and the subsequent steps are performed, association sequence information closer to the intended meaning of the text data can be obtained.
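The association processing of step S13 amounts to taking every run of n adjacent screened words. A minimal sketch follows, assuming the screened words are held in a list in their original order; the function name and the sample words are hypothetical.

```python
def association_sequences(screened_words, n=2):
    """Return every run of n adjacent screened words, in order of appearance."""
    if n < 2:
        raise ValueError("a sequence needs two or more adjacent screened words")
    return [
        tuple(screened_words[i:i + n])
        for i in range(len(screened_words) - n + 1)
    ]


# Hypothetical screened words, for illustration only.
words = ["paper", "originality", "analysis", "method", "system"]
pairs = association_sequences(words)         # four adjacent pairs
triples = association_sequences(words, n=3)  # three adjacent triples
```

Two texts that share the keywords "originality" and "method" but use them in different contexts produce different pairs, which is how adjacent sequences distinguish texts that keywords alone would not.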
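The effect of step S110 can be illustrated with a simple dictionary-based word breaker. The description does not say which word-breaking algorithm is used; the sketch below assumes forward maximum matching purely for illustration, and the vocabulary, the keyword, and the sample text (written without spaces to mimic unsegmented text) are hypothetical.

```python
def forward_max_match(text, vocabulary):
    """Greedy longest-match word breaking; unmatched characters become single-character words."""
    max_len = max(map(len, vocabulary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


base_vocabulary = {"deep", "learning", "model"}
author_keywords = ["deeplearning", "deeplearning"]  # duplicates collapse in the set union

# Step S110: import the author-defined keywords into the word-breaking vocabulary.
merged_vocabulary = base_vocabulary | set(author_keywords)

before = forward_max_match("deeplearningmodel", base_vocabulary)
after = forward_max_match("deeplearningmodel", merged_vocabulary)
```

With the keyword imported, the author's term survives as a single word instead of being broken into two generic words, so the sequences built from it stay closer to the author's intent.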
FIG. 2 shows one embodiment of a text data screening and associating system that can automatically execute the text data screening and associating method of the first embodiment. As shown in FIG. 2, the present invention provides a text data screening and associating system 100, which includes: a storage module 1 for storing a word-breaking vocabulary library; a word-breaking processing module 2 for performing word breaking on a piece of text data to generate word-breaking information; a screening processing module 3 for screening the word-breaking information to generate screened word-breaking information; and an association processing module 4 for performing association processing on the screened word-breaking information to generate a plurality of pieces of association sequence information. The storage module 1, the word-breaking processing module 2, the screening processing module 3, the association processing module 4, and so on can be implemented on one or more computers and/or cloud servers. When the text data screening and associating system 100 is implemented on a cloud server, a corresponding web page may be provided so that a user can input text data and obtain a plurality of pieces of association sequence information (not shown).
FIGS. 3 and 4 show a second embodiment of the present invention. As shown in FIGS. 3 and 4, the text data screening and associating method of the present invention comprises the following steps: S21, establishing comparison set information from two or more pieces of comparison text data; S22, performing word breaking on the plurality of pieces of comparison text data based on a word-breaking vocabulary library to generate comparison word-breaking information for each; S23, screening the plurality of pieces of comparison word-breaking information to generate comparison screened word-breaking information for each, each piece of comparison screened word-breaking information containing two or more screened words; S24, performing association processing on the plurality of pieces of comparison screened word-breaking information to generate a plurality of pieces of comparison association sequence information for each, each piece of comparison association sequence information consisting of two or more screened words that are adjacent to one another; S25, integrating all the comparison association sequence information to establish an association index file. In this way, brief information of a plurality of pieces of comparison text data can be sorted out quickly and integrated, so that the originality of text data to be compared can be conveniently analyzed.
Step S21 establishes comparison set information from two or more pieces of comparison text data. The comparison set information may include various text data, such as some or all of the theses included in the Taiwan master's and doctoral thesis knowledge value-added system. In addition, different comparison set information may be established for different scopes, such as electronics, mechanics, texts within a ten-year period, …. The comparison text data in the second embodiment is the same kind of data as the text data in the first embodiment and may be any kind of published text, such as a master's or doctoral thesis, an academic paper, or an ordinary article or sentence. The difference is that in the second embodiment the text data to be compared must be compared with each piece of comparison text data one by one, so different names are used to distinguish them.
Steps S22 to S24 perform word breaking, screening, and association processing on each piece of comparison text data in the comparison set information, so as to generate comparison word-breaking information, comparison screened word-breaking information, and a plurality of pieces of comparison association sequence information.
Step S25 integrates all the comparison association sequence information to establish an association index file. The established association index file can be conveniently compared with the text data to be compared, so that the originality of the text data to be compared can be conveniently analyzed.
As shown in FIG. 3, step S220 may be performed before step S22. Step S220 is: collecting the author-defined keywords in some or all of the comparison text data and the text data to be compared to establish a professional keyword vocabulary library, and importing the professional keyword vocabulary library into the word-breaking vocabulary library, thereby obtaining an association index file closer to the original meaning of the text data. In addition, duplicates can be removed when the professional keyword vocabulary library is compiled, which increases processing efficiency.
According to the second embodiment of the invention, the brief information of the comparison text data can be quickly arranged, and the brief information of the comparison text data can be further integrated together, so that a user can conveniently compare and analyze the text data to be compared. For example, the originality of the text data to be compared is analyzed through the following steps S26 to S29.
Step S26 performs word breaking, screening, and association processing on a piece of text data to be compared to generate a plurality of pieces of association sequence information to be compared. Steps S22 to S24 and step S26 are processed in the same way as steps S11 to S13, so the comparison association sequence information and the association sequence information to be compared have corresponding forms and can be compared conveniently. In addition, in S12, S23, and/or S26, synonym processing may be performed after the screening and before the subsequent steps. The synonym processing is: checking the screened words obtained from the screening for synonyms, and converting some or all of the synonyms (except special words unsuitable for synonym processing) into a standard term, which improves the effectiveness of the association comparison. For example, "cold air" and "air conditioner" are both converted to "cold air". The comparison association sequence information and the association sequence information to be compared may each consist of two or more screened words that are adjacent to one another. The more screened words each sequence contains, the more readily it reflects the concept of the corresponding text data, but the matching condition may become so restrictive that comparison text data similar to the text data to be compared cannot be found. Therefore, the comparison association sequence information and the association sequence information to be compared are basically formed from two adjacent screened words. When the number of pieces of comparison text data in the comparison set information is extremely large, sequences of, for example, three or more adjacent screened words may be used in order to increase the analysis speed.
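The synonym processing described above can be sketched as a table lookup applied to the screened words before association processing. The table entry follows the example in the text; the function name and the `protected` set, which stands in for the special words unsuitable for synonym processing, are assumptions.

```python
def normalize_synonyms(screened_words, synonyms, protected=frozenset()):
    """Replace each synonym with its standard term, skipping protected words."""
    return [
        word if word in protected else synonyms.get(word, word)
        for word in screened_words
    ]


# Synonym table based on the example in the description.
synonym_table = {"air conditioner": "cold air"}

normalized = normalize_synonyms(
    ["air conditioner", "control", "cold air"], synonym_table
)
untouched = normalize_synonyms(
    ["air conditioner"], synonym_table, protected={"air conditioner"}
)
```

After normalization, two texts that describe the same thing with different words yield identical adjacent sequences, so the same index entry matches both.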
Step S27 compares each piece of association sequence information to be compared with the association index file and finds every piece of comparison text data that has comparison association sequence information identical to the association sequence information to be compared. With this text data screening and associating method, the association between the text data to be compared and each piece of comparison text data can be analyzed quickly, so that the originality of the text data to be compared can be conveniently analyzed. In addition, the association index file has a simple format to which new association sequence information can easily be added, which overcomes the drawback of conventional inverted databases that often require the system to be rebuilt when new data is added.
The way the word breaking … and other processing is performed is outlined in the following examples. The numbering in the examples is provided for convenience of description only and should not be taken to limit the invention. In step S21, the comparison set information is established and each piece of comparison text data is numbered sequentially; for example, the first piece of comparison text data is denoted ID1. The comparison set information is a set storing ID1, ID2, …, IDn.
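Steps S27 and S28 can be sketched as lookups in a dictionary that maps each comparison association sequence to the labels of the comparison texts containing it. The index contents, the label format (document ID followed by `S` and a sequence number), and the function name are assumptions for illustration.

```python
def compare_with_index(query_sequences, index):
    """Return (matches per comparison text, intersection sequence in query order)."""
    matches = {}
    intersection = []
    for seq in query_sequences:
        labels = index.get(seq, [])
        if labels and seq not in intersection:
            intersection.append(seq)
        for label in labels:
            doc_id = label.rpartition("S")[0]  # "ID1S2" -> "ID1"
            found = matches.setdefault(doc_id, [])
            if seq not in found:
                found.append(seq)
    return matches, intersection


# Hypothetical association index, for illustration only.
toy_index = {
    ("paper", "originality"): ["ID1S2", "ID2S1"],
    ("originality", "analysis"): ["ID2S2"],
    ("method", "system"): ["ID1S5", "ID2S4"],
}
query = [("paper", "originality"), ("originality", "check"), ("method", "system")]
matches, intersection = compare_with_index(query, toy_index)
```

Each lookup is a single dictionary access, so the cost of comparing one text grows with the number of its sequences rather than with the size of the comparison set.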
Example 1
ID1 Originality latest inspection and measuring method and system of bloos related papers
ID2 Paper originality analysis method and system
Step S22 performs word breaking processing.
Example 2
Figure BDA0003321015750000091
Figure BDA0003321015750000101
Step S23 performs the screening processing, and each screened broken word may be numbered sequentially; for example, the first screened broken word retained from ID1 is labeled ID1tp1.
Example 3
Figure BDA0003321015750000102
Step S24 performs the relevance processing, and each piece of relevance sequence information may be numbered sequentially; for example, the first piece of relevance sequence information of ID1 is denoted ID1S1.
Example 4
Figure BDA0003321015750000103
Figure BDA0003321015750000111
In step S25, an association index file is established. Each piece of comparison relevance sequence information can be regarded as an index (i.e., Index or Key) of the association index file, and the number of that comparison relevance sequence information can be used as the data (Data) of the association index file. When the association index file is established, any piece of comparison relevance sequence information may be identical to another piece (for example, ID1S2 and ID2S1). Thus, one index can reference a plurality of different data; the number of data may be large, and the total stored data length increases as more comparison text data is added.
Example 5
Figure BDA0003321015750000112
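The index/data relationship of step S25 can be illustrated with a small Python sketch. It assumes each relevance sequence is stored as a tuple of words and that the index file is held in memory as a dictionary; the function build_index and the sample sequences are hypothetical and are not the contents of the patent's example tables.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sequence = Tuple[str, ...]


def build_index(docs: Dict[str, List[Sequence]]) -> Dict[Sequence, List[str]]:
    """Map each relevance sequence (index/key) to the labels that hold it (data)."""
    index: Dict[Sequence, List[str]] = defaultdict(list)
    for doc_id, sequences in docs.items():
        for position, seq in enumerate(sequences, start=1):
            # Label such as "ID1S2": second sequence of comparison text ID1.
            index[seq].append(f"{doc_id}S{position}")
    return dict(index)


# Hypothetical comparison relevance sequences, for illustration only.
docs = {
    "ID1": [("survey", "paper"), ("paper", "originality"), ("originality", "check")],
    "ID2": [("paper", "originality"), ("originality", "analysis")],
}
index = build_index(docs)
print(index[("paper", "originality")])  # one key referencing two data entries
```

Because each key simply carries a list of labels, adding a new comparison text only appends to existing lists or creates new keys, which matches the point that the index file can be extended without restructuring.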
In step S26, word breaking … is performed on the text data to be compared, which can be labeled IDx.
Example 6
Figure BDA0003321015750000113
Figure BDA0003321015750000121
Step S27 searches the association index file using the to-be-compared relevance sequence information as the index and reads all the data stored under the same index.
Example 7
Figure BDA0003321015750000122
In step S28, an intersection sequence is established: all the comparison relevance sequence information that is identical to the to-be-compared relevance sequence information is arranged in order (i.e., classified and sorted).
Example 8
{ID1S2,ID1S3,ID1S5,ID2S1,ID2S4,……}
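Steps S27 and S28 can be sketched together in Python as a lookup followed by a sort. This is an illustrative sketch only: the in-memory INDEX, its contents, and the helper names lookup, label_key, and intersection_sequence are assumptions, and the labels are assumed to follow the "ID<number>S<number>" pattern used in the examples.

```python
import re
from typing import Dict, List, Tuple

Sequence = Tuple[str, ...]

# Hypothetical association index: relevance sequence -> labels of matching entries.
INDEX: Dict[Sequence, List[str]] = {
    ("paper", "originality"): ["ID1S2", "ID2S1"],
    ("method", "system"): ["ID2S4", "ID1S5"],
    ("originality", "analysis"): ["ID2S2"],
}


def lookup(query_sequences: List[Sequence]) -> List[str]:
    """Step S27: read every label stored under a key shared with the query."""
    hits: List[str] = []
    for seq in query_sequences:
        hits.extend(INDEX.get(seq, []))
    return hits


def label_key(label: str) -> Tuple[int, int]:
    """Split a label such as 'ID1S2' into (text number, sequence number)."""
    match = re.fullmatch(r"ID(\d+)S(\d+)", label)
    if match is None:
        raise ValueError(f"unexpected label: {label}")
    return int(match.group(1)), int(match.group(2))


def intersection_sequence(query_sequences: List[Sequence]) -> List[str]:
    """Step S28: arrange the matching labels in order, without duplicates."""
    return sorted(set(lookup(query_sequences)), key=label_key)


query = [("method", "system"), ("paper", "originality"), ("new", "idea")]
print(intersection_sequence(query))
```

Sorting on the numeric parts rather than on the raw strings keeps, for instance, ID10 after ID2, which a plain string sort would not.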
Step S29 analyzes each piece of comparison text data that shares relevance sequence information with the text data to be compared, thereby generating an originality analysis result of the text data to be compared relative to each piece of comparison text data. Many analysis methods are available; for example, a statistical analysis method can be used to analyze the similarity reference proportion of each piece of comparison text data in the comparison set information, and a common measure such as the Dice coefficient can be applied. Alternatively, a simple and easily understood method can be used for a basic analysis.
Example 9
Figure BDA0003321015750000123
Figure BDA0003321015750000131
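As one possible reading of the Dice coefficient mentioned for step S29, the similarity between two sets of relevance sequences A and B is 2|A∩B| / (|A| + |B|). The Python sketch below assumes that each text is represented by the set of its relevance sequences (repeated sequences are counted once); the function names and the sample data are hypothetical and do not reproduce the patent's example table.

```python
from typing import Dict, Set, Tuple

Sequence = Tuple[str, ...]


def dice_coefficient(a: Set[Sequence], b: Set[Sequence]) -> float:
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|); 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def originality_report(
    query: Set[Sequence], corpus: Dict[str, Set[Sequence]]
) -> Dict[str, float]:
    """Score the text to be compared against each comparison text."""
    return {doc_id: dice_coefficient(query, seqs) for doc_id, seqs in corpus.items()}


# Hypothetical relevance sequences, for illustration only.
query = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
corpus = {
    "ID1": {("a", "b"), ("b", "c"), ("x", "y"), ("y", "z")},
    "ID2": {("a", "b"), ("p", "q"), ("q", "r"), ("r", "s")},
}
report = originality_report(query, corpus)
print(report)
```

A higher score indicates more shared relevance sequences and therefore, under this reading, lower originality relative to that comparison text.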
With this text data screening association method, the relevance between the text data to be compared and each piece of comparison text data can be analyzed rapidly, and the originality of the text data to be compared can then be analyzed.
FIG. 4 shows one embodiment of a text data screening association system capable of automatically executing the text data screening association method of the second embodiment. As shown in FIG. 4, the present invention provides a text data screening association system 100a, which includes: a storage module 1a for storing a word-breaking vocabulary library and comparison set information; a word breaking processing module 2a for performing word breaking processing on each piece of comparison text data in the comparison set information to generate respective comparison word breaking information; a screening processing module 3a for performing screening processing on the plurality of pieces of comparison word breaking information to generate respective comparison screening word breaking information; a relevance processing module 4a for performing relevance processing on the plurality of pieces of comparison screening word breaking information to generate, for each, a plurality of pieces of comparison relevance sequence information; and an integration module 5a for integrating all the comparison relevance sequence information to establish an association index file.
In addition, the word breaking processing module 2a, the screening processing module 3a, and the relevance processing module 4a may further perform word breaking processing, screening processing, and relevance processing on text data to be compared to generate a plurality of pieces of to-be-compared relevance sequence information, and the text data screening association system 100a further includes: a comparison module 6a for comparing each piece of to-be-compared relevance sequence information with the association index file and finding every piece of comparison text data having comparison relevance sequence information identical to the to-be-compared relevance sequence information; an intersection module 7a for arranging in order all the comparison relevance sequence information that is identical to the to-be-compared relevance sequence information to establish an intersection sequence; and an analysis module 8a for analyzing each piece of comparison text data that shares relevance sequence information with the text data to be compared.
The storage module 1a, the word breaking processing module 2a, the screening processing module 3a, the relevance processing module 4a, the integration module 5a, the comparison module 6a, the intersection module 7a, the analysis module 8a, and the like can be built in one or more computers and/or cloud servers. When the text data screening association system 100a is built in a cloud server, a corresponding web page may be provided, and the user can obtain an originality analysis result (not shown) after inputting the text data to be compared.
In addition, the parts related to word breaking processing described above, such as steps S11 and S22 and the word-breaking vocabulary library, may employ known word breaking systems 9 and 9a, such as CKIP developed by Academia Sinica in Taiwan, or publicly available computer program code, so as to save cost.
As described above, the text data may be any of various kinds of published text data. For example, a large piece of text such as a paper may be treated directly as one piece of text data, or the paper may be segmented to form a plurality of pieces of text data. The pieces of text data formed by segmentation can further be associated with one another to form a unified originality analysis result. For example, a paper is numbered IDa1, and its segments (e.g., segmented by chapter) are numbered IDa2 to IDan; that is, not only is the whole paper treated as one piece of text data, but each segment (each chapter) of the paper can also be treated as a piece of text data. Thus, the analysis yields not only the originality analysis result of the text data to be compared relative to the paper, but also its originality analysis result relative to each segment (each chapter) of the paper.
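The numbering scheme for a segmented paper can be sketched as follows. This is a minimal Python illustration, assuming the paper has already been split into chapter strings; the function segment_paper and the sample chapters are hypothetical.

```python
from typing import Dict, List


def segment_paper(paper_label: str, chapters: List[str]) -> Dict[str, str]:
    """Register the whole paper as <label>1 and each chapter as <label>2, <label>3, ..."""
    records: Dict[str, str] = {f"{paper_label}1": "\n".join(chapters)}
    for number, chapter in enumerate(chapters, start=2):
        records[f"{paper_label}{number}"] = chapter
    return records


# Hypothetical paper with two chapters, for illustration only.
records = segment_paper("IDa", ["Introduction text", "Method text"])
for label, text in records.items():
    print(label, repr(text))
```

Each record can then be processed as an independent piece of comparison text data, so a query can be scored against the whole paper and against each chapter separately.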
The foregoing description of the embodiments of the present invention is provided for illustration only and is not intended to limit the scope of the invention; that is, modifications made within the scope of the appended claims should be regarded as falling within the scope of the invention.

Claims (10)

1. A text data screening association method, characterized by comprising the following steps:
S11, performing word breaking processing on text data based on a word-breaking vocabulary library to generate word breaking information;
S12, performing screening processing on the word breaking information to generate screening word breaking information, wherein the screening word breaking information comprises two or more screened broken words;
S13, performing relevance processing on the screening word breaking information to generate a plurality of pieces of relevance sequence information, wherein each piece of relevance sequence information is composed of two or more screened broken words that are adjacent to one another.
2. The method of claim 1, wherein a step S110 is performed before step S11, step S110 being: collecting author-defined keywords in the text data to establish a professional keyword vocabulary library, and importing the professional keyword vocabulary library into the word-breaking vocabulary library.
3. The method of claim 1, wherein in step S12, synonym processing is performed after the screening processing and before the subsequent steps, the synonym processing being: checking the screened broken words obtained by the screening processing for synonyms, and converting the synonyms into standard wording.
4. A text data screening association system, comprising:
a storage module for storing a word-breaking vocabulary library;
a word breaking processing module for performing word breaking processing on text data to generate word breaking information;
a screening processing module for performing screening processing on the word breaking information to generate screening word breaking information;
and a relevance processing module for performing relevance processing on the screening word breaking information to generate a plurality of pieces of relevance sequence information.
5. A text data screening association method, characterized by comprising the following steps:
S21, establishing comparison set information from two or more pieces of comparison text data;
S22, performing word breaking processing on the plurality of pieces of comparison text data based on a word-breaking vocabulary library to generate respective comparison word breaking information;
S23, performing screening processing on the plurality of pieces of comparison word breaking information to generate respective comparison screening word breaking information, wherein each piece of comparison screening word breaking information comprises two or more screened broken words;
S24, performing relevance processing on the plurality of pieces of comparison screening word breaking information to generate, for each, a plurality of pieces of comparison relevance sequence information, wherein each piece of comparison relevance sequence information is composed of two or more screened broken words that are adjacent to one another;
S25, integrating all the comparison relevance sequence information to establish an association index file.
6. The method of claim 5, wherein steps S26 to S29 are performed after step S25; step S26 is: performing word breaking processing, screening processing, and relevance processing on text data to be compared to generate a plurality of pieces of to-be-compared relevance sequence information; step S27 is: comparing each piece of to-be-compared relevance sequence information with the association index file, and finding every piece of comparison text data having comparison relevance sequence information identical to the to-be-compared relevance sequence information; step S28 is: establishing an intersection sequence by arranging in order all the comparison relevance sequence information that is identical to the to-be-compared relevance sequence information; step S29 is: analyzing each piece of comparison text data that shares relevance sequence information with the text data to be compared.
7. The method of claim 6, wherein a step S220 is performed before step S22, step S220 being: collecting author-defined keywords from some or all of the comparison text data and the text data to be compared to establish a professional keyword vocabulary library, and importing the professional keyword vocabulary library into the word-breaking vocabulary library.
8. The method of claim 5, wherein in step S23, synonym processing is performed after the screening processing and before the subsequent steps.
9. A text data screening association system, comprising:
a storage module for storing a word-breaking vocabulary library and comparison set information;
a word breaking processing module for performing word breaking processing on each piece of comparison text data in the comparison set information to generate respective comparison word breaking information;
a screening processing module for performing screening processing on the plurality of pieces of comparison word breaking information to generate respective comparison screening word breaking information;
a relevance processing module for performing relevance processing on the comparison screening word breaking information to generate, for each, a plurality of pieces of comparison relevance sequence information;
and an integration module for integrating all the comparison relevance sequence information to establish an association index file.
10. The system of claim 9, wherein the word breaking processing module, the screening processing module, and the relevance processing module further perform word breaking processing, screening processing, and relevance processing on text data to be compared to generate a plurality of pieces of to-be-compared relevance sequence information, and the system further comprises: a comparison module for comparing each piece of to-be-compared relevance sequence information with the association index file and finding every piece of comparison text data having comparison relevance sequence information identical to the to-be-compared relevance sequence information; an intersection module for arranging in order all the comparison relevance sequence information that is identical to the to-be-compared relevance sequence information, thereby establishing an intersection sequence; and an analysis module for analyzing each piece of comparison text data that shares relevance sequence information with the text data to be compared.
CN202111247550.XA 2021-10-26 2021-10-26 Text data screening association method and system Pending CN116028592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111247550.XA CN116028592A (en) 2021-10-26 2021-10-26 Text data screening association method and system

Publications (1)

Publication Number Publication Date
CN116028592A true CN116028592A (en) 2023-04-28

Family

ID=86090024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111247550.XA Pending CN116028592A (en) 2021-10-26 2021-10-26 Text data screening association method and system

Country Status (1)

Country Link
CN (1) CN116028592A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination