CN108920633B - Paper similarity detection method - Google Patents

Paper similarity detection method

Info

Publication number
CN108920633B
Authority
CN
China
Prior art keywords
text
similarity
sentence
server
paper
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810704162.1A
Other languages
Chinese (zh)
Other versions
CN108920633A (en)
Inventor
向湘杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Tongyuan Gezhi Technology Co ltd
Original Assignee
Hubei Tongyuan Gezhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Tongyuan Gezhi Technology Co ltd
Priority to CN201810704162.1A
Publication of CN108920633A
Application granted
Publication of CN108920633B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for detecting the similarity of a paper. A paper file whose similarity is to be detected is obtained, and the text content of the paper file is divided into a plurality of texts, which are compared with the text information of the papers in a paper library to obtain a first text whose similarity to the text information in the paper library exceeds a first preset threshold; the server obtains from the paper library a plurality of papers whose similarity to the text information in the first text exceeds a second preset threshold, and merges the obtained papers into a second text; and the server takes the similarity value between the first text and the second text as the detected similarity value of the paper under detection. The process can be completed automatically by the server without manual reading, which improves working efficiency.

Description

Paper similarity detection method
Technical Field
The invention belongs to the field of data processing, and particularly relates to a method for detecting the similarity of a paper.
Background
Modern society is in an age of information explosion, and massive amounts of data exist on the internet.
In the prior art, a user may need to compare two texts. For example, after receiving a manuscript, a publishing company needs to check whether the manuscript is plagiarized, such as determining whether a paper is a plagiarized article.
In the prior-art comparison method, a person reads the manuscript, summarizes its core idea, extracts keywords, and uses the keywords to search the internet. However, if the manuscript contains a large amount of information, reading it takes a great deal of time, which reduces working efficiency.
Therefore, the prior art needs further improvement.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention provides users with a method for detecting paper similarity, which overcomes the drawbacks of paper similarity detection in the prior art.
The invention provides a method for detecting paper similarity, wherein the method comprises the following steps:
a server obtains a paper file, input by a user, whose similarity is to be detected, and divides the text content contained in the paper file into a plurality of texts;
the server compares the text information of each text in turn with the text information of the papers in a paper library to obtain a first text whose similarity to the text information in the paper library exceeds a first preset threshold;
the server obtains from the paper library a plurality of papers whose similarity to the text information in the first text exceeds a second preset threshold, and merges the obtained papers into a second text;
and the server takes the similarity value between the first text and the second text as the detected similarity value of the paper under detection.
Optionally, the step of determining the similarity value between the first text and the second text as the detection similarity value of the paper to be detected further includes:
the server disassembles the first text to obtain a plurality of candidate sentences;
the server determines the importance scores of the candidate sentences;
the server extracts a target sentence with an importance score larger than a preset value as key information of the first text;
and the server compares the similarity of the key information of the first text with the key information of the second text, and judges the compared similarity value as the similarity value between the first text and the second text.
Optionally, in the step in which the server splits the first text to obtain a plurality of candidate sentences, the method for splitting the first text includes:
splitting at punctuation marks; when the punctuation mark is a pause mark (the Chinese enumeration comma "、"), a colon or a quotation mark, no split is made.
Optionally, the step of the server determining the importance score of each candidate sentence includes:
judging whether the candidate sentences contain Chinese sentences and/or webpage link addresses;
if only the Chinese sentence is contained, taking the sum of the weights of all phrases in the Chinese sentence as the importance score of the candidate sentence;
if only the webpage link address is contained, taking the sum of the weights of the page elements contained in the webpage corresponding to the webpage link address as the importance score of the candidate sentence;
and if the candidate sentences contain the Chinese sentences and the webpage link addresses, taking the weighted average of the sum of the weights of all phrases in the Chinese sentences and the sum of the weights of the page elements contained in the webpage corresponding to the webpage link addresses as the importance scores of the candidate sentences.
Optionally, the step of taking the sum of the weights of the phrases in the chinese sentence as the importance score of the candidate sentence includes:
splitting each candidate sentence into a plurality of phrases according to a semantic analysis mode;
carrying out full-text retrieval, and calculating the occurrence times of each phrase;
sequencing the phrases according to the sequence of the occurrence times from high to low, wherein each phrase is endowed with a corresponding weight according to the occurrence times, and the higher the occurrence times, the higher the weight;
and calculating the importance score of each candidate sentence according to the weight of each phrase, wherein the importance score is the sum of the weights of each phrase in the candidate sentence.
Optionally, the step of taking the sum of the weights of the page elements included in the web page corresponding to the web page link address as the importance score of the candidate sentence includes:
the server background opens a target webpage corresponding to the webpage link address;
and the server determines the importance score of the target webpage according to the page elements contained in the target webpage.
Optionally, the step of determining, by the server, the importance score of the target web page according to the page elements included in the target web page includes:
determining an importance score for the target web page using the following formula;
$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$
where S(V_i) is the importance score of the target web page, d is a damping coefficient, typically set to 0.85, In(V_i) is the set of web pages that link to the target web page, Out(V_j) is the set of web pages pointed to by the links in web page j, |Out(V_j)| is the number of elements in that set, and S(V_j) is the importance score of web page j.
Optionally, the step of comparing, by the server, the similarity between the key information of the first text and the key information of the second text includes:
calculating cosine similarity of a first sentence in the key information of the first text and a second sentence in the key information of the second text;
and if the cosine similarity is higher than the preset value, determining that the first text is approximate to the second text.
Optionally, the cosine similarity calculation method includes:
splitting the first sentence into a plurality of phrases;
splitting the second sentence into a plurality of phrases;
comparing the two groups of phrases one by one: for each phrase, recording 1 if the phrase is present in the sentence and 0 if it is absent, so as to obtain a first sequence and a second sequence;
and calculating the cosine similarity between the first sequence and the second sequence and taking the cosine similarity as the cosine similarity between the first sentence and the second sentence.
Optionally, the calculation of the cosine similarity between the first sequence and the second sequence is calculated using the following formula:
$$\cos\theta = \frac{a \cdot b}{\|a\| \, \|b\|} = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \, \sqrt{\sum_{i} b_i^2}}$$
wherein a · b denotes the sum of the products of each element of sequence a with the corresponding element of sequence b, and the denominator denotes the product of the square root of the sum of squares of all elements in sequence a and the square root of the sum of squares of all elements in sequence b.
The method has the following advantages: a paper file whose similarity is to be detected is obtained, the text content of the paper file is divided into a plurality of texts, and the texts are compared with the text information of the papers in the paper library to obtain a first text whose similarity to the text information in the paper library exceeds a first preset threshold; the server obtains from the paper library a plurality of papers whose similarity to the text information in the first text exceeds a second preset threshold, and merges the obtained papers into a second text; and the server takes the similarity value between the first text and the second text as the detected similarity value of the paper under detection. The process can be completed automatically by the server without manual reading, which improves working efficiency.
Drawings
Fig. 1 is a flowchart illustrating steps of a method for detecting similarity of papers according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a method for detecting the similarity of papers, which comprises the following steps of:
Step S101, a server obtains a paper file, input by a user, whose similarity is to be detected, and divides the text content contained in the paper file into a plurality of texts.
A user uploads the paper file to be detected to the server. After obtaining the paper file, the server divides its content by page or by chapter to obtain a plurality of texts, and stores the texts separately.
In this step, a longer paper is divided into a plurality of texts so that the comparison result can be obtained quickly and a similarity can be obtained for each text, which gives a better result than running similarity detection on the paper as a whole.
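As an illustration of step S101, the following minimal Python sketch divides a paper by page or by chapter. The patent does not say how pages or chapters are recognized, so the form-feed page separator, the `CHAPTER_HEADING` pattern and the function name `split_into_texts` are all assumptions made for this example.

```python
import re

# Hypothetical heading pattern: lines such as "Chapter 1" or "Chapter 2 Methods".
CHAPTER_HEADING = re.compile(r"^Chapter[ \t]+\d+.*$", re.MULTILINE)


def split_into_texts(paper: str) -> list[str]:
    """Divide a paper into texts by page (form feed) or, failing that, by chapter."""
    if "\f" in paper:
        # Assumption: pages are separated by form-feed characters.
        parts = paper.split("\f")
    else:
        # Cut the paper just before every chapter heading.
        starts = [m.start() for m in CHAPTER_HEADING.finditer(paper)]
        if not starts or starts[0] != 0:
            starts.insert(0, 0)  # keep any front matter before the first heading
        bounds = starts + [len(paper)]
        parts = [paper[s:e] for s, e in zip(bounds, bounds[1:])]
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]
```

A paper with neither page breaks nor recognizable headings comes back as a single text, so the later steps still work on short papers.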
Step S102, the server compares the text information of each text in turn with the text information of the papers in the paper library to obtain a first text whose similarity to the text information in the paper library exceeds a first preset threshold.
The text information of each text is compared in turn with the text information of the papers in the paper library to obtain the similarity between each text and each paper in the paper library; the papers whose similarity to the paper under detection exceeds the first preset threshold are screened out, and the texts of the paper under detection whose similarity to papers in the paper library exceeds the first preset threshold are integrated to obtain the first text. In practice, the first preset threshold may be user-defined or set by default; preferably, the first preset threshold is set to zero-five percent.
Step S103, the server obtains from the paper library a plurality of papers whose similarity to the text information in the first text exceeds a second preset threshold, and merges the obtained papers into a second text.
The papers whose similarity to the content of the first text exceeds the second preset threshold are integrated to obtain the second text. When the papers are integrated, parts repeated among the papers may be deleted, or non-text content in the files may be deleted.
Step S104, the server takes the similarity value between the first text and the second text as the detected similarity value of the paper under detection.
The similarity between the first text obtained in step S102 and the second text obtained in step S103 is calculated and taken as the similarity value of the paper under detection, that is, the repetition rate.
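Steps S102 to S104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the `jaccard` word-overlap measure is only a placeholder for whatever text similarity the server uses, and the first text is assumed to be the concatenation of the texts of the paper under detection that match at least one library paper.

```python
def jaccard(x: str, y: str) -> float:
    """Placeholder similarity: overlap of whitespace-separated word sets."""
    a, b = set(x.split()), set(y.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def detect_similarity(texts, library, first_threshold, second_threshold, sim=jaccard):
    """Return the detected similarity value for a paper already split into texts."""
    # S102: keep the texts whose similarity to some library paper exceeds
    # the first threshold, and integrate them into the first text.
    matched = [t for t in texts if any(sim(t, p) > first_threshold for p in library)]
    first_text = "\n".join(matched)

    # S103: collect the library papers whose similarity to the first text
    # exceeds the second threshold, and merge them into the second text.
    similar = [p for p in library if sim(first_text, p) > second_threshold]
    second_text = "\n".join(similar)

    # S104: the similarity between the two texts is the detection result.
    if not first_text or not second_text:
        return 0.0
    return sim(first_text, second_text)
```

Passing the similarity measure as a parameter keeps the control flow of the three steps separate from the choice of measure, which the key-information comparison described next refines.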
Specifically, in order to obtain a more accurate similarity value between the first text and the second text, the step of taking the similarity value between the first text and the second text as the detected similarity value of the paper under detection further includes:
the server splits the first text to obtain a plurality of candidate sentences; the method for splitting the first text includes: splitting at punctuation marks; when the punctuation mark is a pause mark (the Chinese enumeration comma "、"), a colon or a quotation mark, no split is made.
The server determines the importance scores of the candidate sentences;
the server extracts a target sentence with an importance score larger than a preset value as key information of the first text;
and the server compares the similarity of the key information of the first text with the key information of the second text, and judges the compared similarity value as the similarity value between the first text and the second text.
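The splitting rule can be sketched in Python as below. The set of punctuation marks that do trigger a split is an assumption (full-width comma, period, semicolon, exclamation mark and question mark), since the text only names the marks that do not: the enumeration comma, the colon and quotation marks. ASCII periods are deliberately left out so that web page link addresses stay intact.

```python
import re

# Assumed splitting punctuation (full-width forms used in Chinese text).
# The enumeration comma "、", the colon and quotation marks are absent on
# purpose, so no split happens at them.
SPLIT_RE = re.compile("[,。;!?]")


def split_candidates(text: str) -> list[str]:
    """Split a text into candidate sentences at the assumed punctuation marks."""
    return [s.strip() for s in SPLIT_RE.split(text) if s.strip()]
```

For instance, `split_candidates("甲、乙、丙都来了。")` keeps "甲、乙、丙都来了" as one candidate sentence because the enumeration comma is not a split point.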
Further, since a candidate sentence may contain information with different attributes, that is, a Chinese sentence or a web page link address, this is checked before the importance score is calculated, and the step in which the server determines the importance score of each candidate sentence includes:
judging whether the candidate sentences contain Chinese sentences and/or webpage link addresses;
if only the Chinese sentence is contained, taking the sum of the weights of all phrases in the Chinese sentence as the importance score of the candidate sentence;
if only the webpage link address is contained, taking the sum of the weights of the page elements contained in the webpage corresponding to the webpage link address as the importance score of the candidate sentence;
and if the candidate sentences contain the Chinese sentences and the webpage link addresses, taking the weighted average of the sum of the weights of all phrases in the Chinese sentences and the sum of the weights of the page elements contained in the webpage corresponding to the webpage link addresses as the importance scores of the candidate sentences.
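The three cases above can be sketched as a small dispatch function. Everything beyond the case analysis is an assumption for illustration: the regular expressions used to detect links and Chinese characters, the equal default weighting in the mixed case (the text does not give the weights), and the two scoring callbacks `phrase_score` and `page_score`, which stand in for the phrase-weight sum and the web page score.

```python
import re

# Assumed detectors: an http(s) link, and any CJK unified ideograph.
URL_RE = re.compile(r"https?://[^\s\u4e00-\u9fff]+")
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def importance_score(sentence, phrase_score, page_score, text_weight=0.5):
    """Score a candidate sentence that may hold Chinese text, links, or both."""
    urls = URL_RE.findall(sentence)
    text = URL_RE.sub(" ", sentence)      # the sentence with links removed
    has_text = bool(CJK_RE.search(text))
    if has_text and urls:
        # Mixed case: weighted average of the two partial scores.
        link_part = sum(page_score(u) for u in urls)
        return text_weight * phrase_score(text) + (1 - text_weight) * link_part
    if has_text:
        return phrase_score(text)         # only a Chinese sentence
    if urls:
        return sum(page_score(u) for u in urls)  # only link addresses
    return 0.0
```

Summing over several links in one candidate sentence is a choice made here; the text only describes a single link address per sentence.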
The step of taking the sum of the weights of all phrases in the Chinese sentence as the importance scores of the candidate sentences comprises the following steps:
splitting each candidate sentence into a plurality of phrases according to a semantic analysis mode;
carrying out full-text retrieval, and calculating the occurrence times of each phrase;
sequencing the phrases according to the sequence of the occurrence times from high to low, wherein each phrase is endowed with a corresponding weight according to the occurrence times, and the higher the occurrence times, the higher the weight;
and calculating the importance score of each candidate sentence according to the weight of each phrase, wherein the importance score is the sum of the weights of each phrase in the candidate sentence.
For example, one paper contains the following:
Today the XX Association held a work meeting in Beijing, the weather is good, about 30 degrees Celsius, no rain, and the traffic conditions are also good, at the work meeting, Chairman Zhang summarized the XX Association's work over the past year and also commended the XX Association's excellent employees.
The candidate sentences include:
A. Today the XX Association held a work meeting in Beijing;
B. the weather is good;
C. about 30 degrees Celsius;
D. no rain;
E. the traffic conditions are also good;
F. at the work meeting;
G. Chairman Zhang summarized the XX Association's work over the past year;
H. also commended the XX Association's excellent employees.
The phrases obtained by splitting include:
Today: 1 occurrence, weight 1
XX Association: 3 occurrences, weight 3
Beijing: 1 occurrence, weight 1
Held: 1 occurrence, weight 1
Work meeting: 2 occurrences, weight 2
Weather: 1 occurrence, weight 1
30 degrees Celsius: 1 occurrence, weight 1
Rain: 1 occurrence, weight 1
Traffic conditions: 1 occurrence, weight 1
Chairman Zhang: 1 occurrence, weight 1
Work over the past year: 1 occurrence, weight 1
Summarize: 1 occurrence, weight 1
Commend: 1 occurrence, weight 1
Excellent employees: 1 occurrence, weight 1
The importance scores of candidate sentences A to H are, respectively: 8, 1, 1, 1, 1, 2, 6 and 5 points.
Assuming that the preset value is 2 points, the target sentences are A, F, G and H, and the final key information is as follows: Today the XX Association held a work meeting in Beijing; at the work meeting; Chairman Zhang summarized the XX Association's work over the past year; also commended the XX Association's excellent employees.
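The worked example can be reproduced with a short Python sketch. The phrase segmentation is taken as given (the "semantic analysis" step is not specified), the phrases are written as English glosses, and the weight of a phrase is assumed to equal its number of occurrences, as in the example. Sentence F scores exactly 2 and is still selected at a preset value of 2, so the sketch uses "greater than or equal to"; that reading is an assumption drawn from the example.

```python
from collections import Counter

# Candidate sentences A to H, already segmented into phrases (English glosses).
CANDIDATES = {
    "A": ["today", "XX Association", "Beijing", "held", "work meeting"],
    "B": ["weather"],
    "C": ["30 degrees Celsius"],
    "D": ["rain"],
    "E": ["traffic conditions"],
    "F": ["work meeting"],
    "G": ["Chairman Zhang", "summarize", "XX Association", "work over the past year"],
    "H": ["commend", "XX Association", "excellent employees"],
}


def score_sentences(candidates):
    """Importance score of a sentence = sum of the weights of its phrases."""
    # Full-text count of every phrase; the count is used directly as the weight.
    weights = Counter(p for phrases in candidates.values() for p in phrases)
    return {label: sum(weights[p] for p in phrases)
            for label, phrases in candidates.items()}


def select_targets(scores, preset):
    """Labels of the sentences kept as key information (assumed: score >= preset)."""
    return [label for label, score in scores.items() if score >= preset]


scores = score_sentences(CANDIDATES)
targets = select_targets(scores, preset=2)
```

Here `scores` maps A to H to 8, 1, 1, 1, 1, 2, 6 and 5, and `targets` is the list A, F, G, H, matching the example.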
Further, the step of taking the sum of the weights of the page elements contained in the web page corresponding to the web page link address as the importance score of the candidate sentence includes:
the server background opens a target webpage corresponding to the webpage link address;
and the server determines the importance score of the target webpage according to the page elements contained in the target webpage.
The server determines the importance score of the target webpage according to the page elements contained in the target webpage, and the method comprises the following steps:
determining an importance score for the target web page using the following formula;
$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$
where S(V_i) is the importance score of the target web page, d is a damping coefficient, typically set to 0.85, In(V_i) is the set of web pages that link to the target web page, Out(V_j) is the set of web pages pointed to by the links in web page j, |Out(V_j)| is the number of elements in that set, and S(V_j) is the importance score of web page j.
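A score defined this way is usually computed by repeated substitution. The sketch below assumes the formula has the standard PageRank-style form S(V_i) = (1 - d) + d · Σ S(V_j) / |Out(V_j)| over the pages V_j in In(V_i), which is what the symbol definitions suggest. It also assumes the link graph is already available as a dictionary, since the text does not say how links are extracted from page elements. The function name and the fixed iteration count are illustrative choices.

```python
def page_scores(links, d=0.85, iterations=50):
    """Iteratively score pages; `links` maps each page to the pages it links to."""
    out = {page: set(targets) for page, targets in links.items()}
    pages = set(out) | {t for targets in out.values() for t in targets}
    # In(V_i): the pages that contain a link to page V_i.
    incoming = {p: [q for q in out if p in out[q]] for p in pages}
    scores = {p: 1.0 for p in pages}  # arbitrary starting value
    for _ in range(iterations):
        scores = {
            p: (1 - d) + d * sum(scores[q] / len(out[q]) for q in incoming[p])
            for p in pages
        }
    return scores
```

With two pages that link only to each other, both scores stay at 1.0; a page with no incoming links settles at 1 - d.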
Specifically, the step of comparing the similarity between the key information of the first text and the key information of the second text by the server includes:
calculating cosine similarity of a first sentence in the key information of the first text and a second sentence in the key information of the second text;
and if the cosine similarity is higher than the preset value, determining that the first text is approximate to the second text.
Specifically, the cosine similarity calculation method includes:
splitting the first sentence into a plurality of phrases;
splitting the second sentence into a plurality of phrases;
comparing the two groups of phrases one by one: for each phrase, recording 1 if the phrase is present in the sentence and 0 if it is absent, so as to obtain a first sequence and a second sequence;
and calculating the cosine similarity between the first sequence and the second sequence and taking the cosine similarity as the cosine similarity between the first sentence and the second sentence.
For example:
The first sentence is: Today the association held a meeting in Beijing.
The second sentence is: The association held a conference of the general law in Beijing.
[Table: the phrases of the two sentences, each marked 1 if present in a sentence and 0 if absent]
The first sequence a is (1, 1, 1, 1, 0, 1) and the second sequence b is (0, 1, 1, 1, 1, 1).
Preferably, the cosine similarity between the first sequence and the second sequence is calculated using the following formula:
$$\cos\theta = \frac{a \cdot b}{\|a\| \, \|b\|} = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \, \sqrt{\sum_{i} b_i^2}}$$
wherein a · b denotes the sum of the products of each element of sequence a with the corresponding element of sequence b, and the denominator denotes the product of the square root of the sum of squares of all elements in sequence a and the square root of the sum of squares of all elements in sequence b.
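For example, for the two sentences above the calculation is:
$$\cos\theta = \frac{1 \cdot 0 + 1 \cdot 1 + 1 \cdot 1 + 1 \cdot 1 + 0 \cdot 1 + 1 \cdot 1}{\sqrt{1^2 + 1^2 + 1^2 + 1^2 + 0^2 + 1^2} \, \sqrt{0^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{5} \cdot \sqrt{5}} = 0.8$$
The final calculation result is: 0.8.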
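The same calculation in Python, as a minimal sketch: the function name is illustrative, and returning 0.0 for an all-zero sequence is an added convention that the text does not address.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # convention: 0.0 if a sequence is all zeros


a = (1, 1, 1, 1, 0, 1)
b = (0, 1, 1, 1, 1, 1)
result = cosine_similarity(a, b)
```

For the two sequences of the example, `result` is 0.8 up to floating-point rounding: the dot product is 4 and each norm is the square root of 5.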
In summary, the invention discloses a method for detecting the similarity of a paper. A paper file whose similarity is to be detected is obtained, and the text content of the paper file is divided into a plurality of texts, which are compared with the text information of the papers in a paper library to obtain a first text whose similarity to the text information in the paper library exceeds a first preset threshold; the server obtains from the paper library a plurality of papers whose similarity to the text information in the first text exceeds a second preset threshold, and merges the obtained papers into a second text; and the server takes the similarity value between the first text and the second text as the detected similarity value of the paper under detection. The process can be completed automatically by the server without manual reading, which improves working efficiency.
It should be understood that equivalents and modifications of the technical solution and inventive concept thereof may occur to those skilled in the art, and all such modifications and alterations should fall within the scope of the appended claims.

Claims (4)

1. A method for detecting similarity of papers, the method comprising:
a server obtains a paper file, input by a user, whose similarity is to be detected, and divides the text content contained in the paper file into a plurality of texts;
the server compares the text information of each text in turn with the text information of the papers in the paper library to obtain the similarity between each text and each paper in the paper library, screens out the papers whose similarity to the paper under detection exceeds a first preset threshold, and integrates the texts of the paper under detection whose similarity to papers in the paper library exceeds the first preset threshold to obtain a first text;
the server acquires a plurality of papers of which the similarity with the character information in the first text exceeds a second preset threshold value from a paper library, and merges the acquired papers into a second text;
the server judges the similarity value between the first text and the second text as the detection similarity value of the paper to be detected;
the step of determining the similarity value between the first text and the second text as the detection similarity value of the paper to be detected further includes:
the server disassembles the first text to obtain a plurality of candidate sentences;
the server determines the importance scores of the candidate sentences;
the server extracts a target sentence with an importance score larger than a preset value as key information of the first text;
the server compares the similarity of the key information of the first text with the key information of the second text, and judges the compared similarity value as the similarity value between the first text and the second text;
in the step of the server disassembling the first text to obtain a plurality of candidate sentences, the method for disassembling the first text comprises the following steps:
disassembling according to punctuation marks; when the punctuations are semicolons, commas and periods, the punctuations are not disassembled;
the step of the server determining the importance score of each candidate sentence comprises:
judging whether the candidate sentences contain Chinese sentences and/or webpage link addresses;
if only the Chinese sentence is contained, taking the sum of the weights of all phrases in the Chinese sentence as the importance score of the candidate sentence;
if only the webpage link address is contained, taking the sum of the weights of the page elements contained in the webpage corresponding to the webpage link address as the importance score of the candidate sentence;
if the candidate sentences contain the Chinese sentences and the webpage link addresses simultaneously, taking the weighted average of the sum of the weights of all phrases in the Chinese sentences and the sum of the weights of the page elements contained in the webpage corresponding to the webpage link addresses as the importance scores of the candidate sentences;
the step of taking the sum of the weights of all phrases in the Chinese sentence as the importance scores of the candidate sentences comprises the following steps:
splitting each candidate sentence into a plurality of phrases according to a semantic analysis mode;
carrying out full-text retrieval, and calculating the occurrence times of each phrase;
sequencing the phrases according to the sequence of the occurrence times from high to low, wherein each phrase is endowed with a corresponding weight according to the occurrence times, and the higher the occurrence times, the higher the weight;
calculating the importance score of each candidate sentence according to the weight of each phrase, wherein the importance score is the sum of the weights of each phrase in the candidate sentence;
the step of taking the sum of the weights of the page elements contained in the webpage corresponding to the webpage link address as the importance score of the candidate sentence comprises the following steps:
the server background opens a target webpage corresponding to the webpage link address;
the server determines the importance score of the target webpage according to the page elements contained in the target webpage;
the server determines the importance score of the target webpage according to the page elements contained in the target webpage, and the method comprises the following steps:
determining an importance score for the target web page using the following formula;
$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$
wherein S(V_i) is the importance score of the target web page, d is a damping coefficient set to 0.85, and In(V_i) is the set of web pages that link to the target web page; Out(V_j) is the set of web pages pointed to by the links in web page j, |Out(V_j)| is the number of elements in that set, and S(V_j) is the importance score of web page j.
2. The paper similarity detection method according to claim 1, wherein the step of comparing the similarity of the key information of the first text with the key information of the second text by the server comprises:
calculating cosine similarity of a first sentence in the key information of the first text and a second sentence in the key information of the second text;
and if the cosine similarity is higher than the preset value, determining that the first text is approximate to the second text.
3. The paper similarity detection method according to claim 2, wherein the cosine similarity is calculated by:
splitting the first sentence into a plurality of phrases;
splitting the second sentence into a plurality of phrases;
comparing the two groups of phrases one by one: for each phrase, recording 1 if the phrase is present in the sentence and 0 if it is absent, so as to obtain a first sequence and a second sequence;
and calculating the cosine similarity between the first sequence and the second sequence and taking the cosine similarity as the cosine similarity between the first sentence and the second sentence.
4. The paper similarity detection method according to claim 3, wherein the calculation of cosine similarity between the first sequence and the second sequence is calculated using the following formula:
$$\cos\theta = \frac{a \cdot b}{\|a\| \, \|b\|} = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \, \sqrt{\sum_{i} b_i^2}}$$
wherein a · b denotes the sum of the products of each element of sequence a with the corresponding element of sequence b, and the denominator denotes the product of the square root of the sum of squares of all elements in sequence a and the square root of the sum of squares of all elements in sequence b.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810704162.1A CN108920633B (en) 2018-07-01 2018-07-01 Paper similarity detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810704162.1A CN108920633B (en) 2018-07-01 2018-07-01 Paper similarity detection method

Publications (2)

Publication Number Publication Date
CN108920633A (en) 2018-11-30
CN108920633B (en) 2021-12-03

Family

ID=64422822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810704162.1A Expired - Fee Related CN108920633B (en) 2018-07-01 2018-07-01 Paper similarity detection method

Country Status (1)

Country Link
CN (1) CN108920633B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635087A (en) * 2018-12-12 2019-04-16 广东小天才科技有限公司 Composition scoring method and family education equipment
CN110472201B (en) * 2019-07-26 2020-07-21 阿里巴巴集团控股有限公司 Text similarity detection method and device based on block chain and electronic equipment
US10909317B2 (en) 2019-07-26 2021-02-02 Advanced New Technologies Co., Ltd. Blockchain-based text similarity detection method, apparatus and electronic device
CN112445891A (en) * 2019-08-30 2021-03-05 智慧芽信息科技(苏州)有限公司 Text information navigation browsing method, device, server and storage medium
CN110717158B (en) * 2019-09-06 2024-03-01 冉维印 Information verification method, device, equipment and computer readable storage medium
CN111400446A (en) * 2020-03-11 2020-07-10 中国计量大学 Standard text duplicate checking method and system
CN112163418A (en) * 2020-08-31 2021-01-01 深圳市修远文化创意有限公司 Text comparison method and related device
CN113139375A (en) * 2021-04-21 2021-07-20 洛阳墨潇网络科技有限公司 Paper similarity detection method and device based on big data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562824A (en) * 2017-08-21 2018-01-09 昆明理工大学 A kind of text similarity detection method
CN107992470A (en) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 A kind of text duplicate checking method and system based on similarity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
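A comprehensive web page scoring method based on centrality and PageRank; Qiao Shaojie et al.; Journal of Southwest Jiaotong University; 20110630; Vol. 46 (No. 3); pp. 456-460 *
Research on text comparison algorithms based on sentence similarity; Yang Mao; China Master's Theses Full-text Database, Information Science and Technology; 20110315 (No. 3); pp. 12-13, 48-62 *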

Also Published As

Publication number Publication date
CN108920633A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN108920633B (en) Paper similarity detection method
CN109033212B (en) Text classification method based on similarity matching
CN108052500B (en) Text key information extraction method and device based on semantic analysis
US7983903B2 (en) Mining bilingual dictionaries from monolingual web pages
US8386240B2 (en) Domain dictionary creation by detection of new topic words using divergence value comparison
US8185532B2 (en) Method for filtering out identical or similar documents
US20040236566A1 (en) System and method for identifying special word usage in a document
US20050021323A1 (en) Method and apparatus for identifying translations
US9727556B2 (en) Summarization of a document
CN107885717B (en) Keyword extraction method and device
Layton et al. Recentred local profiles for authorship attribution
CN105550168B (en) A kind of method and apparatus of the notional word of determining object
JPH07114572A (en) Document classifying device
CN110196910B (en) Corpus classification method and apparatus
CN103377185B (en) One kind adds tagged method and device automatically for short text
CN110765767B (en) Extraction method, device, server and storage medium of local optimization keywords
CN109002508B (en) Text information crawling method based on web crawler
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN108959263B (en) Entry weight calculation model training method and device
CN109033093A (en) A kind of text interpretation method based on similarity mode
JP5339628B2 (en) Sentence classification program, method, and sentence analysis server for classifying sentences containing unknown words
CN109062981B (en) Website similarity detection method
JP5495425B2 (en) Sentence correction program, method, and sentence analysis server for correcting sentences containing unknown words
JP4047895B2 (en) Document proofing apparatus and program storage medium
JP2011113097A6 (en) Sentence correction program, method, and sentence analysis server for correcting sentences containing unknown words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211112

Address after: 430070 floors 3-6 of creative world project phase I commercial center and 5 of high-rise building 11, No. 16, yezhihu West Road, Hongshan District, Wuhan City, Hubei Province (chuangkexing Incubator - stations 21 and 22, B8 District)

Applicant after: Hubei Tongyuan Gezhi Technology Co.,Ltd.

Address before: 523073 Room 403, No. 35, Sanxiang, Xiping xiashou new village, Nancheng District, Dongguan City, Guangdong Province

Applicant before: DONGGUAN HUARUI ELECTRONIC TECHNOLOGY Co.,Ltd.

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211203
