CN117763106A - Document duplicate checking method and device, storage medium and electronic equipment - Google Patents

Info

Publication number
CN117763106A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311696616.2A
Other languages
Chinese (zh)
Other versions
CN117763106B (en)
Inventor
王猛
张智雄
于改红
叶志飞
李涵昱
刘熠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Science Library Chinese Academy Of Sciences
Original Assignee
National Science Library Chinese Academy Of Sciences
Application filed by National Science Library Chinese Academy Of Sciences
Priority to CN202311696616.2A
Publication of CN117763106A
Application granted
Publication of CN117763106B
Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a document duplicate checking method and device, a storage medium and electronic equipment. The method comprises: comparing a text to be checked with a comparison library to obtain a first screening result; where text content exists in the first screening result, performing sentence vector similarity calculation on the text content and the text to be checked to obtain a second screening result, wherein the second screening result represents a target text, within the text content, that is similar to the text to be checked, and the text to be checked and the target text each contain at least one sentence type; performing repetition calculation on the target text and the text to be checked to obtain a sentence repetition value corresponding to each sentence type of the at least one sentence type; and obtaining a text duplicate checking result of the text to be checked according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type. The embodiments of the application can improve both the efficiency and the accuracy of text duplicate checking.

Description

Document duplicate checking method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a method and apparatus for document duplicate checking, a storage medium, and an electronic device.
Background
Document duplication checking is a way of detecting the duplication rate of document texts, and the duplication checking result is determined by analyzing the similarity between different document texts.
At present, an existing duplicate checking algorithm is implemented as follows. First, the uploaded text to be checked is preprocessed. Next, features are extracted from the text to be checked. The system then compares the extracted features against a database of known sources to screen out similar content. Finally, the number of repeated words in the similar sentences or similar paragraphs is calculated and a final duplicate checking result is output. However, such an algorithm does not consider that different types of sentences in a text differ in importance, and because it relies on extracted sentence features it ignores the semantic features of the sentences, so the accuracy of its duplicate checking result cannot be guaranteed.
Therefore, how to provide a document duplicate checking method with high accuracy is a technical problem to be solved.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a storage medium, and an electronic device for document duplication checking, by which accuracy of text duplication checking can be improved, and practicality is high.
In a first aspect, some embodiments of the present application provide a document duplicate checking method, including: comparing a text to be checked with a comparison library to obtain a first screening result, wherein the first screening result represents whether text content similar to the text to be checked exists in the comparison library; where the text content exists in the first screening result, performing sentence vector similarity calculation on the text content and the text to be checked to obtain a second screening result, wherein the second screening result represents a target text, within the text content, that is similar to the text to be checked, and the text to be checked and the target text each contain at least one sentence type; performing repetition calculation on the target text and the text to be checked to obtain a sentence repetition value corresponding to each sentence type of the at least one sentence type; and obtaining a text duplicate checking result of the text to be checked according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type.
In some embodiments of the present application, the text to be checked is first screened against the comparison library to obtain the first screening result, and the text content in the first screening result is then screened a second time to obtain the second screening result. Repetition calculation is then performed on the target text in the second screening result and the text to be checked to obtain sentence repetition values corresponding to the different sentence types. Finally, the text duplicate checking result is obtained by combining these values with the weight values of the different sentence types. Because different sentence types can be given different weight values, the accuracy of text duplicate checking is improved and the practicability is high.
In some embodiments, comparing the text to be checked with the comparison library to obtain the first screening result includes: extracting keywords from the text to be checked to obtain text keywords; performing word segmentation on the text keywords to obtain important words; combining the text keywords and the important words in pairs and screening the combinations to obtain search keywords; and searching the comparison library using the search keywords as an index to obtain the first screening result.
In some embodiments of the present application, the search keywords are obtained by extracting and processing the keywords of the text to be checked, and the comparison library is searched with these search keywords to obtain the first screening result, which can improve the efficiency of text duplicate checking.
In some embodiments, performing repetition calculation on the target text and the text to be checked to obtain a sentence repetition value corresponding to each sentence type of the at least one sentence type includes: obtaining a longest common subsequence of the target text and the text to be checked, wherein there is at least one longest common subsequence; and comparing the ratio of the longest common subsequence to the text to be checked with a preset threshold to determine how the sentence repetition value is calculated, and thereby obtaining the sentence repetition value.
In some embodiments of the present application, after the longest common subsequence of the target text and the text to be checked is determined, the ratio of the longest common subsequence to the text to be checked is compared with the preset threshold to determine how the sentence repetition value is calculated, which helps ensure the accuracy and efficiency of the sentence repetition calculation.
In some embodiments, comparing the ratio of the longest common subsequence to the text to be checked with a preset threshold to determine how the sentence repetition value is calculated includes: if the ratio is not greater than the preset threshold, determining the sentence repetition value as follows: finding the longest continuous common subsequence within the longest common subsequence, and calculating the repetition of the longest continuous common subsequence relative to the text to be checked to obtain the sentence repetition value; if the ratio is greater than the preset threshold, determining the sentence repetition value as follows: calculating the repetition of the longest common subsequence relative to the text to be checked to obtain the sentence repetition value.
In some embodiments of the present application, the sentence repetition value is calculated in different ways depending on how the ratio of the longest common subsequence to the text to be checked compares with the preset threshold, which helps ensure the accuracy and efficiency of the sentence repetition calculation.
In some embodiments, before the text duplicate checking result of the text to be checked is obtained according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type, the method further includes: performing sentence classification on the text to be checked to obtain the at least one sentence type, wherein the at least one sentence type includes at least one of a research background sentence, a research objective sentence, a research method sentence, a research conclusion sentence, and a research result sentence; and setting a weight value corresponding to each sentence type of the at least one sentence type.
In some embodiments of the present application, the weight values corresponding to the different sentence types are set according to the importance of each sentence type, which provides data support for text duplicate checking and improves its accuracy.
In some embodiments, obtaining the text duplicate checking result of the text to be checked according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type includes: performing a weighted summation of the sentence repetition value corresponding to each sentence type with the weight value of that sentence type to obtain the text duplicate checking result.
Some embodiments of the application determine text duplicate checking results in a weighted summation manner, which is simple and efficient.
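The weighted summation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `text_duplication_score`, the dictionary layout, and the weight values are hypothetical, since the application does not disclose concrete weights or a data format.

```python
def text_duplication_score(repetition_by_type, weight_by_type):
    """Weighted sum of per-sentence-type repetition values.

    Both arguments map a sentence type name to a number. The names and
    weights used in the example are illustrative assumptions only.
    """
    return sum(weight_by_type[sentence_type] * repetition
               for sentence_type, repetition in repetition_by_type.items())


# Hypothetical weights: method sentences count more than background sentences.
weights = {"background": 0.1, "objective": 0.2, "method": 0.3,
           "conclusion": 0.2, "result": 0.2}
repetitions = {"background": 1.0, "method": 0.5}

score = text_duplication_score(repetitions, weights)
```

With these assumed numbers, a fully repeated background sentence contributes less to the score than a half-repeated method sentence, which is the effect the per-type weighting is meant to achieve.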
In some embodiments, the method further includes: if the text content does not exist in the first screening result, the sentence repetition value is zero.
In a second aspect, some embodiments of the present application provide a document duplicate checking device, including: a first screening module, configured to compare a text to be checked with a comparison library to obtain a first screening result, wherein the first screening result represents whether text content similar to the text to be checked exists in the comparison library; a second screening module, configured to perform, where the text content exists in the first screening result, sentence vector similarity calculation on the text content and the text to be checked to obtain a second screening result, wherein the second screening result represents a target text, within the text content, that is similar to the text to be checked, and the text to be checked and the target text each contain at least one sentence type; a repetition calculation module, configured to perform repetition calculation on the target text and the text to be checked to obtain a sentence repetition value corresponding to each sentence type of the at least one sentence type; and a duplicate checking module, configured to obtain a text duplicate checking result of the text to be checked according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type.
In a third aspect, some embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method according to any of the embodiments of the first aspect.
In a fourth aspect, some embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, can implement a method according to any of the embodiments of the first aspect.
In a fifth aspect, some embodiments of the present application provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor, is adapted to carry out the method according to any of the embodiments of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of some embodiments of the present application, the drawings that are required to be used in some embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort to a person having ordinary skill in the art.
FIG. 1 is a system diagram of document duplicate checking provided by some embodiments of the present application;
FIG. 2 is a first flowchart of the document duplicate checking method provided in some embodiments of the present application;
FIG. 3 is a second flowchart of the document duplicate checking method provided in some embodiments of the present application;
FIG. 4 is a block diagram of a document duplicate checking device provided in some embodiments of the present application;
FIG. 5 is a schematic diagram of an electronic device according to some embodiments of the present application.
Detailed Description
The technical solutions in some embodiments of the present application will be described below with reference to the drawings in some embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
In the related art, the flow of an existing document duplicate checking algorithm is as follows. First, the uploaded text to be checked is preprocessed, including word segmentation, sentence segmentation, and removal of punctuation marks, spaces, line feeds and the like, so that information can be extracted more easily later. Next, features are extracted from the text to be checked, including keywords, phrases, sentence structures and the like, for use in the subsequent comparison. The system then compares the extracted features against a database of known sources to screen out similar content. Finally, the number of repeated words in the similar sentences or similar paragraphs of the similar content is calculated and a final duplicate checking result is output.
However, existing document duplicate checking directly outputs the repetition of paragraphs as the final result, ignoring that different types of sentences in a paragraph may differ in importance. For example, in a document text, research content sentences and research method sentences tend to be more important than research background sentences. In addition, existing algorithms compare similar texts using extracted sentence features (such as keywords) and then directly perform similarity calculation on the similar sentences. On the one hand, this greatly increases the number of texts to be compared, because many texts may share the same keywords; on the other hand, sentence features such as keywords and sentence structure do not fully characterize a sentence, so text semantic similarity also needs to be compared, which makes the duplicate checking steps complex.
As can be seen from the above related art, existing document duplicate checking algorithms are not comprehensive, and neither the accuracy nor the efficiency of their results can be ensured.
In view of this, some embodiments of the present application provide a document duplicate checking method that first compares a text to be checked with a comparison library to obtain a first screening result. Sentence vector similarity calculation is then performed on the first screening result and the text to be checked to obtain a second screening result. Sentence repetition values corresponding to different sentence types are then obtained from the second screening result and the text to be checked. Finally, the weight values corresponding to the different sentence types are combined with the sentence repetition values corresponding to those sentence types to determine the final text duplicate checking result. Because the embodiments take into account that sentence types in different sections of a document differ in importance and assign them different weights, the comprehensiveness and accuracy of text duplicate checking are improved while duplicate checking efficiency is maintained, and the practicability is high.
The overall composition of the document duplicate checking system provided in some embodiments of the present application is exemplarily described below with reference to FIG. 1.
As shown in FIG. 1, some embodiments of the present application provide a document duplicate checking system including a terminal 100 and a data server 200. The terminal 100 may send the text to be checked to the data server 200. The data server 200 may compare the text to be checked with a built-in document database (as a specific example of a comparison library) to obtain a first screening result. The data server 200 then performs sentence vector similarity calculation on the text content in the first screening result and the text to be checked to obtain a second screening result. Next, the data server 200 performs repetition calculation on the target text in the second screening result and the text to be checked, and outputs the sentence repetition values corresponding to the different sentence types. Finally, the text duplicate checking result is obtained by combining the weight values corresponding to the different sentence types. Each sentence type corresponds to one sentence repetition value and one weight value.
In some embodiments of the present application, the terminal 100 may be a mobile terminal or a non-portable computer terminal, which is not specifically limited herein.
In other embodiments of the present application, if the terminal 100 holds the document database of the data server 200 and can itself compare the text to be checked with the document database, perform sentence vector similarity calculation and repetition value calculation, and obtain the text duplicate checking result, the data server 200 may be omitted. The specific arrangement may be chosen according to the actual application scenario, and the embodiments of the present application are not limited in this respect.
The implementation of document duplicate checking by the data server 200 provided in some embodiments of the present application is described below by way of example with reference to FIG. 2.
Referring to FIG. 2, FIG. 2 is a flowchart of a document duplicate checking method according to some embodiments of the present application. The method may include:
S210, comparing the text to be checked with a comparison library to obtain a first screening result, wherein the first screening result represents whether text content similar to the text to be checked exists in the comparison library.
For example, in some embodiments of the present application, the data server 200 first compares the received text to be checked, which corresponds to the document to be checked, with a document database (as a specific example of a comparison library), and confirms whether the document database contains text content similar to the document to be checked. It should be noted that the comparison library is a centrally managed library of publicly published articles, and may include text data such as published papers, journals, published textbooks, and web publications.
In some embodiments of the present application, S210 may include: extracting keywords from the text to be checked to obtain text keywords; performing word segmentation on the text keywords to obtain important words; combining the text keywords and the important words in pairs and screening the combinations to obtain search keywords; and searching the comparison library using the search keywords as an index to obtain the first screening result.
For example, in some embodiments of the present application, a keyword extraction model is used to extract keywords from the sentences in the paragraphs of the text to be checked, and the extracted text keywords are segmented to obtain important words. The keywords of the text to be checked and the segmented important words are then paired in two-word combinations and deduplicated. Two-word combination pairing traverses all pairwise combinations of the identified text keywords, removes the repeated combinations, and keeps the remaining combinations as the screened search keywords. Finally, the comparison library is searched using the search keywords as an index to obtain the first screening result. If similar content is found, the first screening result includes the similar text content; if no similar content is found, the first screening result contains no similar text content.
As a specific example of two-word combination pairing, suppose a keyword recognition algorithm (i.e., a Chinese keyword extraction model) recognizes a total of 3 text keywords in a paragraph: semantics, context, and duplicate checking. Word segmentation yields the three important words semantics, context, and duplicate checking. Traversing the pairwise combinations then gives a total of 6 combinations: [semantics, context]; [semantics, duplicate checking]; [context, semantics]; [context, duplicate checking]; [duplicate checking, semantics]; [duplicate checking, context]. After the duplicates are removed, three groups of search keywords remain: [semantics, context]; [semantics, duplicate checking]; and [context, duplicate checking]. These three groups of search keywords are used to search the comparison library for similar text content.
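The pairing and deduplication step in the example above can be sketched as follows. This is a minimal illustration, assuming the keywords have already been extracted and segmented; the function name `build_search_keywords` is hypothetical, and `itertools.combinations` is used as a shortcut that yields each unordered pair once, which gives the same result as enumerating all six ordered pairs and then removing the repeats.

```python
from itertools import combinations


def build_search_keywords(text_keywords):
    """Return the unordered two-word combinations of the given keywords."""
    # Drop repeated keywords while keeping their original order.
    unique = list(dict.fromkeys(text_keywords))
    # combinations() never yields both [a, b] and [b, a], so the result
    # is already deduplicated.
    return [list(pair) for pair in combinations(unique, 2)]


# Three keywords standing in for those of the example above.
pairs = build_search_keywords(["semantics", "context", "duplicate checking"])
# pairs == [["semantics", "context"],
#           ["semantics", "duplicate checking"],
#           ["context", "duplicate checking"]]
```

Each resulting pair would then be used as one index query against the comparison library; how that query is executed depends on the search backend and is not specified in the application.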
It can be understood that keyword indexing is the preliminary screening step. Its purpose is to use the keyword index to quickly find and collect sentences in the comparison library that may repeat the text to be checked, thereby locating similar content. Without the keyword index, the text to be checked would have to be compared directly with every document in the comparison library, which is inefficient, so preliminary screening is required. On the one hand, this improves duplicate checking efficiency; on the other hand, the collected similar text content can serve as the corpus for re-screening (i.e., the subsequent vector screening).
S220, where the text content exists in the first screening result, performing sentence vector similarity calculation on the text content and the text to be checked to obtain a second screening result, wherein the second screening result represents a target text, within the text content, that is similar to the text to be checked, and the text to be checked and the target text each contain at least one sentence type.
For example, in some embodiments of the present application, the text content in the first screening result obtained by the preliminary screening is re-screened. Specifically, a Sentence-BERT sentence vector model can be used to calculate the similarity between the text content in the first screening result and the text to be checked, so as to further obtain similar target texts. Sentence vector screening serves as the re-screening step: it calculates sentence vector similarity for the possibly repeated sentences in the text content collected by the preliminary screening, selects the target sentence (i.e., the target text) that is semantically most likely to be a repeat, and passes it on to the repetition calculation. This addresses the shortcoming of the prior art that sentence semantics cannot be fully represented when only sentence feature similarity is calculated.
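The re-screening step can be illustrated with a small sketch that ranks candidate sentences by vector similarity. The application names Sentence-BERT as the sentence vector model but does not state the similarity measure or the cut-off, so cosine similarity, the threshold of 0.8, and the function names below are assumptions; the vectors are taken as plain Python lists that some sentence embedding model has already produced.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_target_texts(query_vec, candidates, threshold=0.8):
    """Keep candidates whose similarity to the query reaches the threshold.

    candidates is a list of (text, vector) pairs from the first screening.
    The threshold value is an illustrative assumption.
    """
    scored = [(text, cosine_similarity(query_vec, vec))
              for text, vec in candidates]
    kept = [(text, score) for text, score in scored if score >= threshold]
    # Most similar candidate first.
    return sorted(kept, key=lambda item: item[1], reverse=True)


targets = select_target_texts(
    [1.0, 0.0],
    [("candidate a", [1.0, 0.0]), ("candidate b", [0.0, 1.0])],
)
```

In practice the vectors would have hundreds of dimensions and would come from the embedding model; the two-dimensional vectors here only keep the example easy to follow.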
In other embodiments of the present application, S220 may further include: and if the text content does not exist in the first screening result, the sentence repetition degree value is zero.
For example, in some embodiments of the present application, when there is no similar text content in the first screening result after the preliminary screening, it may be confirmed that there is no similar content to the text to be checked, and the sentence repetition degree may be regarded as zero.
As can be seen from the above, the preliminary screening and the re-screening differ in two respects. (1) Different processing objects: the preliminary screening uses the search keywords as an index, so its processing objects are keywords, and repeated keywords are used to locate and obtain text content that may be repeated; the re-screening uses a sentence vector model, and its processing object is the text content left after the preliminary screening. (2) Different purposes: the purpose of the preliminary screening is to quickly find and collect text content that may be repeated; the purpose of the vector screening is to further screen the preliminarily screened sentences for the repetition calculation.
S230, performing repetition calculation on the target text and the text to be checked to obtain a sentence repetition value corresponding to each sentence type of the at least one sentence type.
For example, in some embodiments of the present application, the repetition of the target text and the text to be checked is calculated using a combination of the longest common subsequence and longest continuous common subsequence algorithms. That is, for the re-screened target text, a sentence repetition score (as a specific example of a sentence repetition value) is calculated using the Longest Common Subsequence (LCS) and the Longest Continuous Common Subsequence (LCCS). It should be noted that a document is generally divided into several sections, and its sentences can be classified into types such as research background sentences, research objective sentences, research method sentences, research conclusion sentences, and research result sentences. Using the sentence type as the basic unit of the sentence repetition calculation improves the accuracy of duplicate checking.
In some embodiments of the present application, S230 may include: obtaining a longest common subsequence of the target text and the text to be checked, wherein there is at least one longest common subsequence; and comparing the ratio of the longest common subsequence to the text to be checked with a preset threshold to determine how the sentence repetition value is calculated, and thereby obtaining the sentence repetition value.
For example, in some embodiments of the present application, the LCS algorithm is applied to the target text and the text to be checked to obtain the longest common subsequence. To facilitate calculation of the duplicate checking result, the ratio of the length of the longest common subsequence to the length of the text to be checked is compared with a preset threshold as the criterion for measuring sentence repetition, and this comparison determines which specific method is used to calculate the sentence repetition value.
For ease of understanding, the meaning of "longest common subsequence" is illustrated below. For example, when the sequence 1,3,5,4,2,6,8,7 is compared with the sequence 1,4,8,6,7,5, then 1,4,6 is a common subsequence of the two, 1,4,8 is also a common subsequence, and so is 1,4,8,7. The longest common subsequence is the common subsequence that contains the most elements. In the example above, 1,4,8,7 and 1,4,6,7 are both longest common subsequences of length 4, so the longest common subsequence is not necessarily unique.
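The length of the longest common subsequence in this example can be checked with the standard dynamic-programming approach. The sketch below is a generic textbook formulation, not code taken from the application; the function name `lcs_length` is chosen for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before both elements.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one element from either a or b.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


length = lcs_length([1, 3, 5, 4, 2, 6, 8, 7], [1, 4, 8, 6, 7, 5])
# length == 4, matching 1,4,8,7 and 1,4,6,7 in the example above
```

The same function works on strings, so it can be applied character by character to a target sentence and a sentence to be checked.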
Specifically, in some embodiments of the present application, S230 may include: if the ratio is not greater than the preset threshold, determining the sentence repetition value as follows: finding the longest continuous common subsequence within the longest common subsequence; and calculating the repetition of the longest continuous common subsequence relative to the text to be checked to obtain the sentence repetition value.
For example, in some embodiments of the present application, when the ratio of the length of the longest common subsequence to the length of the text to be checked is less than or equal to the preset threshold, the LCCS algorithm is used to obtain the longest continuous common subsequence from the longest common subsequence, identify where continuous phrases are similar, and output the repeated text, so as to obtain the repetition score (as a specific example of the sentence repetition value) of the text to be checked.
For ease of understanding, the meaning of "longest continuous common subsequence" is illustrated below. Compared with the longest common subsequence, the longest continuous common subsequence additionally requires that the common subsequence be contiguous. For example, for 1,2,3,4 and 1,2,3,5, the longest continuous common subsequence is 1,2,3, because 1,2,3 is not only a common subsequence but is also contiguous in each sequence, with no other numbers in between. By contrast, 1,3,2,4 and 1,2,3,4 share the common subsequence 1,2,4, but it is not a continuous common subsequence, because its elements are separated by 3 in the original sequences.
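The contiguity requirement can be made concrete with a short sketch that finds the longest run of elements appearing consecutively in both sequences. This is the classic longest-common-substring dynamic program, computed directly on the two input sequences; the function name `longest_continuous_common` is illustrative and not from the application.

```python
def longest_continuous_common(a, b):
    """Return the longest run of elements that is contiguous in both a and b."""
    best_len, best_end = 0, 0
    # prev[j] is the length of the common run ending at the previous
    # element of a and at b[j - 1].
    prev = [0] * (len(b) + 1)
    for i, x in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return list(a[best_end - best_len:best_end])


run = longest_continuous_common([1, 2, 3, 4], [1, 2, 3, 5])
# run == [1, 2, 3]
```

For 1,3,2,4 and 1,2,3,4 the same function returns a run of length 1, since no two adjacent elements of one sequence are adjacent in the other.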
From the above, it can be appreciated that the longest contiguous common subsequence may be more demanding than the longest common subsequence, or it can be understood that: the longest common subsequence comprises the longest continuous common subsequence. When similarity calculation is performed on two similar sentences, calculation is performed on the longest public subsequence first, so as to obtain two public subsequences of the similar sentences (or called similar texts), namely repeated fragments, (there may be several repeated fragments, because the longest public subsequence is not unique); then, the number of words continuously repeated by the longest continuous public subsequence (i.e. the number of words completely continuously repeated by the text to be checked) is found out from the longest public subsequence, and the final repetition score is calculated.
However, in practical applications it has been found that the longest continuous common subsequence calculation can be evaded. For example, inserting a single word or several words into consecutively repeated text interrupts the repeated content, so that the longest continuous common subsequence becomes shorter and the calculated result becomes smaller. For example, for the sentence "weather today is good" and the same sentence with one word inserted, the longest continuous common subsequence, originally of length 6 (i.e., the whole sentence, counted in characters of the original-language text), becomes 4 (i.e., "weather is good"). Interruptions in longer sentences in particular greatly reduce the repetition degree.
In order to solve this problem, the present application first uses the ratio of the maximum common subsequence to the length of the text to be checked as the criterion for measuring sentence repetition: whenever the ratio exceeds the preset threshold, the sentence is regarded as a repeated sentence and the interruptions are ignored. Thus, in other embodiments of the present application, S230 may include: if it is determined that the ratio is greater than the preset threshold, determining the sentence repetition value as follows: calculating the repetition degree between the maximum common subsequence and the text to be checked to obtain the sentence repetition value.
For example, in some embodiments of the present application, when the ratio is greater than the preset threshold, the repetition degree may be calculated directly from the maximum common subsequence obtained by the LCS algorithm and the text to be checked, so as to obtain the repetition score of the text to be checked, without searching for the longest continuous common subsequence. For example, for "weather today is good" and the same sentence with one word inserted, a repetition score of 100% is output directly.
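The two branches of S230 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: sentences are compared character by character, the repetition degree is taken as the matched length divided by the length of the text to be checked, and the default threshold of 0.5 is a hypothetical value, since the application leaves the threshold open. For brevity the contiguous run is computed on the two sentences directly. The placeholder strings stand in for a six-character sentence and the same sentence with one character inserted.

```python
def lcs_len(a, b):
    """Length of a longest common subsequence (gaps allowed)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def lccs_len(a, b):
    """Length of a longest run that is contiguous in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def sentence_repetition(checked, target, threshold=0.5):
    """Pick the scoring branch from the LCS ratio (threshold is an assumed value)."""
    if not checked:
        return 0.0
    ratio = lcs_len(checked, target) / len(checked)
    if ratio > threshold:
        # High overlap: score with the maximum common subsequence, ignoring breaks.
        return ratio
    # Low overlap: fall back to the stricter contiguous match.
    return lccs_len(checked, target) / len(checked)


print(lccs_len("abcdef", "abXcdef"))             # prints 4
print(sentence_repetition("abcdef", "abXcdef"))  # prints 1.0
```

The inserted character lowers the contiguous match from 6 to 4, while the ratio-based branch still reports full repetition.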
In some embodiments of the present application, before performing S240, the document duplicate checking method may further include: performing sentence classification on the text to be checked to obtain the at least one sentence type, wherein the at least one sentence type comprises at least one of a research background sentence, a research objective sentence, a research method sentence, a research conclusion sentence, and a research result sentence; and setting a weight value corresponding to each sentence type in the at least one sentence type.
The final output of conventional document duplicate checking is simply the repetition degree of a paragraph, which ignores the fact that sentences of different sentence types differ in importance. For example, research content sentences and research method sentences in a paragraph tend to be more important than research background sentences and should be given higher weights. Thus, in some embodiments of the present application, different weight values need to be assigned to the sentence types of different sentences (e.g., a background sentence weight, an objective sentence weight, a question sentence weight, a method sentence weight, a result sentence weight, etc.). Specifically, a language step (rhetorical move) recognition model is used to assign weights to sentences of different sentence types in the text to be checked, so as to obtain the weight value corresponding to each sentence type.
It should be noted that the sentence types may be divided according to the actual type of text to be checked, and the weight values may be set flexibly according to what matters in the actual duplicate check; this is not limited in the present application. This is because different organizations and different businesses focus on different content. For example, universities are more concerned with overall repetition, so the weights may be set equal; enterprises and public institutions may pay more attention to results, so research result sentences may be weighted more heavily; innovative enterprises pay more attention to research methods, so research method sentences may be weighted more heavily. This customizable setting can be applied widely to different scenarios, so that each user obtains the duplicate checking result that is most accurate from that user's point of view.
S240, obtaining a text duplication checking result of the text to be checked according to the sentence duplication degree value corresponding to each sentence type and the weight value of each sentence type.
For example, in some embodiments of the present application, a more accurate text duplicate checking result may be obtained from the obtained repetition scores and the weight values set for the different sentence types.
In some embodiments of the present application, S240 may include: performing weighted summation on the sentence repetition value corresponding to each sentence type and the weight value of each sentence type to obtain the text duplicate checking result.
For example, in some embodiments of the present application, sentences of different sentence types in the text to be checked are identified by using a language step recognition algorithm, the obtained repetition scores are weighted according to the sentence type of each sentence in a paragraph, and finally the repetition result of the whole paragraph in the text to be checked (as a specific example of the text duplicate checking result) is obtained through weighted summation.
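The weighted summation of S240 can be written compactly as follows. The symbols are illustrative and do not appear in the original text: T is the set of sentence types found in the text to be checked, r_t is the sentence repetition value of type t, w_t is the weight value of type t, and R is the text duplicate checking result.

```latex
R = \sum_{t \in T} w_t \, r_t
```

When all weights are equal to 1/|T|, R reduces to the plain average of the per-type repetition values, which corresponds to the equal-weight setting described above for universities.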
The specific process of document duplicate checking provided in some embodiments of the present application is described below by way of example with reference to fig. 3.
Fig. 3 is a flowchart of a document duplicate checking method according to some embodiments of the present application.
The above-described process is exemplarily set forth below.
S310, extracting keywords from the text to be checked to obtain text keywords.
For example, taking the text to be checked in table 1 as an example, the corresponding text keywords may be obtained.
TABLE 1
S320, classifying the sentences of the text to be checked to obtain at least one sentence type.
For example, the text to be checked is processed by a language step recognition model to identify its sentence types; as shown in table 2, research background sentences and research method sentences are obtained.
TABLE 2
It should be noted that S310 and S320 may be executed synchronously or separately, and the embodiment of the present application is not limited thereto.
S330, performing word segmentation on the text keywords to obtain important words; and combining the text keywords and the important words in pairs and screening the combinations to obtain search keywords.
For example, the text keywords in table 1 are segmented into important words, and the two kinds of keywords (i.e., the text keywords and the important words) are paired and de-duplicated to obtain the search keywords shown in table 3.
TABLE 3
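The pairing and screening of S330 could look like the sketch below. Every rule in it is an assumption made for illustration: word segmentation is approximated by whitespace splitting (a real system would use a proper word segmenter), and the screening step simply drops pairs in which one term contains the other. The function name `build_search_keywords` and the sample keywords are hypothetical and are not taken from the tables of the application.

```python
from itertools import combinations


def build_search_keywords(text_keywords):
    """Pair text keywords with their constituent words (hypothetical rules)."""
    # Assumed segmentation: split each keyword phrase on whitespace.
    important_words = [w for kw in text_keywords for w in kw.split()]
    # Pool both kinds of terms, dropping duplicates while keeping first-seen order.
    pool = list(dict.fromkeys(list(text_keywords) + important_words))
    pairs = []
    for a, b in combinations(pool, 2):
        # Assumed screening rule: skip pairs where one term contains the other.
        if a in b or b in a:
            continue
        pairs.append((a, b))
    return pairs


pairs = build_search_keywords(["text similarity", "duplicate checking"])
print(len(pairs))  # prints 11
```

Because `combinations` never yields both (a, b) and (b, a), the order-insensitive de-duplication of pairs comes for free.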
S340, searching the comparison library by using the search keywords as indexes to obtain a first screening result.
For example, the comparison library is screened with the search keywords in table 3 to obtain text content similar to the text to be checked; as shown in table 4, sentences 1 to 4 are screened out.
TABLE 4
S350, performing sentence vector similarity calculation on the text content and the text to be checked to obtain a second screening result.
For example, sentence vector similarity is calculated between each of the primary screening sentences 1 to 4 in table 4 and the text to be checked to obtain the most similar sentences, namely the re-screening results 1 to 4 shown in table 5. Then, with the threshold set to 0.8 (the threshold may be set flexibly and is not limited thereto), the re-screening results whose similarity score is greater than 0.8 are taken as target texts, namely re-screening results 1 and 2 (as a specific example of the second screening result).
TABLE 5
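The second screening can be sketched as a similarity filter over sentence vectors. The application does not name the similarity measure, so cosine similarity is an assumption here, and the three-dimensional vectors are made-up stand-ins for the output of a sentence embedding model. Only the 0.8 threshold comes from the example above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_targets(query_vec, candidates, threshold=0.8):
    """Keep candidates whose similarity to query_vec is strictly above threshold."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in candidates.items()]
    return [(name, score) for name, score in scored if score > threshold]


# Made-up vectors standing in for sentence embeddings.
query = [1.0, 0.0, 1.0]
candidates = {
    "result 1": [1.0, 0.0, 1.0],
    "result 2": [1.0, 0.2, 0.9],
    "result 3": [0.0, 1.0, 0.0],
    "result 4": [1.0, 1.0, 0.0],
}
targets = select_targets(query, candidates)
print([name for name, _ in targets])  # prints ['result 1', 'result 2']
```

The strict comparison mirrors the wording "greater than 0.8"; a candidate scoring exactly 0.8 would be dropped.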
S360, obtaining the maximum common subsequence of the target text and the text to be checked, and obtaining the ratio of the maximum common subsequence to the text to be checked.
For example, the LCS algorithm is used to calculate the maximum common subsequence of each of the re-screening results 1 and 2 with the text to be checked. The ratio of the maximum common subsequence to the text to be checked is the ratio of the number of repeated words to the total number of words in the text to be checked. For example, if the number of repeated words in the maximum common subsequence is 143 and the total number of words is 240, the ratio is 143/240 × 100% ≈ 59.58%.
S370, judging whether the ratio is greater than the preset threshold; if so, executing S371; otherwise, executing S372.
S371, calculating the repetition degree between the maximum common subsequence and the text to be checked to obtain the sentence repetition value of each sentence type.
S372, searching for the longest continuous common subsequence within the maximum common subsequence; and calculating the repetition degree between the longest continuous common subsequence and the text to be checked to obtain the sentence repetition value of each sentence type.
S380, performing weighted summation on the sentence repetition value of each sentence type and the weight value of each sentence type to obtain the text duplicate checking result.
For example, the sentence types include research background sentences and research method sentences. The weight of research background sentences is set to 0.3 (as a specific example of a weight value), and the weight of research method sentences is set to 0.7. The sentence repetition value of the research background sentences is 37/139, and the sentence repetition value of the research method sentences is 106/126. Finally, text duplicate checking result = (37/139) × 0.3 + (106/126) × 0.7 ≈ 0.669.
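The arithmetic of this example can be reproduced with a few lines. The function name `weighted_repetition` and the dictionary keys are illustrative; the repetition values and weights are the ones given in the example.

```python
def weighted_repetition(repetition_by_type, weight_by_type):
    """Weighted sum of per-type sentence repetition values."""
    # A sentence type without a configured weight raises KeyError on purpose.
    return sum(repetition_by_type[t] * weight_by_type[t] for t in repetition_by_type)


repetition = {"background": 37 / 139, "method": 106 / 126}
weights = {"background": 0.3, "method": 0.7}
result = weighted_repetition(repetition, weights)
print(f"{result:.3f}")  # prints 0.669
```

Raising on a missing weight is a design choice of this sketch: it surfaces a sentence type that the classifier produced but the configuration did not anticipate, instead of silently scoring it as zero.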
It should be noted that, the specific implementation procedures of S310 to S380 may refer to the method embodiments provided above, and detailed descriptions are omitted here as appropriate to avoid repetition.
Fig. 4 illustrates a block diagram of a document duplicate checking device provided in some embodiments of the present application. It should be understood that the document duplicate checking device corresponds to the above method embodiments and can perform the steps of those embodiments; for the specific functions of the device, reference may be made to the above description, and detailed descriptions are omitted here as appropriate to avoid repetition.
The document duplicate checking device of fig. 4 includes at least one software functional module that can be stored in a memory in the form of software or firmware, or be built into the document duplicate checking device. The document duplicate checking device includes: a first screening module 410, configured to compare the text to be checked with a comparison library to obtain a first screening result, where the first screening result characterizes whether text content similar to the text to be checked exists in the comparison library; a second screening module 420, configured to perform sentence vector similarity calculation on the text content and the text to be checked if the text content exists in the first screening result, so as to obtain a second screening result, where the second screening result characterizes a target text, within the text content, that is similar to the text to be checked, and the text to be checked and the target text each contain at least one sentence type; a repetition calculation module 430, configured to perform repetition calculation on the target text and the text to be checked to obtain a sentence repetition value corresponding to each sentence type in the at least one sentence type; and a duplicate checking module 440, configured to obtain a text duplicate checking result of the text to be checked according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type.
In some embodiments of the present application, the first screening module 410 is configured to extract keywords from the text to be checked to obtain text keywords; perform word segmentation on the text keywords to obtain important words; combine the text keywords and the important words in pairs and screen the combinations to obtain search keywords; and search the comparison library by using the search keywords as indexes to obtain the first screening result.
In some embodiments of the present application, the repetition calculation module 430 is configured to obtain a maximum common subsequence of the target text and the text to be checked, where there is at least one maximum common subsequence; and to determine the calculation mode of the sentence repetition value by comparing the ratio of the maximum common subsequence to the text to be checked with a preset threshold, so as to obtain the sentence repetition value.
In some embodiments of the present application, the repetition calculation module 430 is configured to: if it is determined that the ratio is not greater than the preset threshold, determine the sentence repetition value as follows: searching for the longest continuous common subsequence within the maximum common subsequence, and calculating the repetition degree between the longest continuous common subsequence and the text to be checked to obtain the sentence repetition value; and if it is determined that the ratio is greater than the preset threshold, determine the sentence repetition value as follows: calculating the repetition degree between the maximum common subsequence and the text to be checked to obtain the sentence repetition value.
In some embodiments of the present application, the duplicate checking module 440 is configured to perform sentence classification on the text to be checked to obtain the at least one sentence type, where the at least one sentence type includes at least one of a research background sentence, a research objective sentence, a research method sentence, a research conclusion sentence, and a research result sentence; and to set a weight value corresponding to each sentence type in the at least one sentence type.
In some embodiments of the present application, the duplicate checking module 440 is configured to perform weighted summation on the sentence repetition value corresponding to each sentence type and the weight value of each sentence type to obtain the text duplicate checking result.
In some embodiments of the present application, the first screening module 410 is configured to set the sentence repetition value to zero if it is determined that the text content does not exist in the first screening result.
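One way to picture how modules 410 to 440 fit together is the sketch below. The class `DuplicateChecker`, its field names, and the callable signatures are all hypothetical; the application describes the modules functionally and does not prescribe an interface. The stub callables stand in for the real screening and scoring logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DuplicateChecker:
    # Each field stands in for one module of fig. 4 (interfaces are assumed).
    first_screen: Callable[[str], List[str]]              # module 410
    second_screen: Callable[[str, List[str]], List[str]]  # module 420
    repetition: Callable[[str, List[str]], Dict[str, float]]  # module 430
    weights: Dict[str, float]                              # used by module 440

    def check(self, text: str) -> float:
        candidates = self.first_screen(text)
        if not candidates:
            # No similar content in the comparison library: repetition is zero.
            return 0.0
        targets = self.second_screen(text, candidates)
        per_type = self.repetition(text, targets)
        # Module 440: weighted sum over sentence types (unknown types weigh 0).
        return sum(self.weights.get(t, 0.0) * r for t, r in per_type.items())


# Stub modules for demonstration only.
checker = DuplicateChecker(
    first_screen=lambda text: ["candidate sentence"],
    second_screen=lambda text, cands: cands,
    repetition=lambda text, targets: {"background": 0.5, "method": 1.0},
    weights={"background": 0.3, "method": 0.7},
)
empty_checker = DuplicateChecker(
    first_screen=lambda text: [],
    second_screen=lambda text, cands: cands,
    repetition=lambda text, targets: {},
    weights={},
)
print(empty_checker.check("any text"))  # prints 0.0
```

Passing the modules in as callables keeps the early exit for an empty first screening result in one place and lets each stage be replaced or tested on its own.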
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
Some embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program, which when executed by a processor, may implement operations of the method corresponding to any of the above-described methods provided by the above-described embodiments.
Some embodiments of the present application further provide a computer program product, where the computer program product includes a computer program, where the computer program when executed by a processor may implement operations of a method corresponding to any of the foregoing methods provided by the foregoing embodiments.
As shown in fig. 5, some embodiments of the present application provide an electronic device 500, the electronic device 500 comprising: a memory 510, a processor 520, and a computer program stored on the memory 510 and executable on the processor 520, wherein the processor 520 may implement the method of any of the embodiments described above when it reads the program from the memory 510 via the bus 530 and executes it.
Processor 520 may process digital signals and may include various computing architectures, such as a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture that implements a combination of instruction sets. In some examples, processor 520 may be a microprocessor.
Memory 510 may be used for storing instructions to be executed by processor 520 or data related to execution of the instructions. Such instructions and/or data may include code to implement some or all of the functions of one or more modules described in embodiments of the present application. The processor 520 of the disclosed embodiments may be configured to execute instructions in the memory 510 to implement the methods shown above. Memory 510 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. A document duplicate checking method, comprising:
comparing a text to be checked with a comparison library to obtain a first screening result, wherein the first screening result represents whether text content similar to the text to be checked exists in the comparison library;
under the condition that the text content exists in the first screening result, performing sentence vector similarity calculation on the text content and the text to be checked to obtain a second screening result, wherein the second screening result represents a target text, within the text content, that is similar to the text to be checked, and the text to be checked and the target text both contain at least one sentence type;
performing repetition degree calculation on the target text and the text to be checked to obtain a sentence repetition value corresponding to each sentence type in the at least one sentence type;
and acquiring a text duplicate checking result of the text to be checked according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type.
2. The method of claim 1, wherein comparing the text to be checked with the comparison library to obtain the first screening result comprises:
extracting keywords from the text to be checked to obtain text keywords;
performing word segmentation on the text keywords to obtain important words;
combining the text keywords and the important words in pairs and screening the combinations to obtain search keywords;
and searching the comparison library by using the search keywords as indexes to obtain the first screening result.
3. The method of claim 1 or 2, wherein performing repetition degree calculation on the target text and the text to be checked to obtain a sentence repetition value corresponding to each sentence type in the at least one sentence type comprises:
obtaining a maximum common subsequence of the target text and the text to be checked, wherein there is at least one maximum common subsequence;
and determining the calculation mode of the sentence repetition value by comparing the ratio of the maximum common subsequence to the text to be checked with a preset threshold, so as to obtain the sentence repetition value.
4. The method of claim 3, wherein determining the calculation mode of the sentence repetition value by comparing the ratio of the maximum common subsequence to the text to be checked with the preset threshold comprises:
if it is determined that the ratio is not greater than the preset threshold, determining the sentence repetition value as follows:
searching for the longest continuous common subsequence within the maximum common subsequence;
calculating the repetition degree between the longest continuous common subsequence and the text to be checked to obtain the sentence repetition value;
if it is determined that the ratio is greater than the preset threshold, determining the sentence repetition value as follows: calculating the repetition degree between the maximum common subsequence and the text to be checked to obtain the sentence repetition value.
5. The method according to claim 1 or 2, wherein before the text duplicate checking result of the text to be checked is acquired according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type, the method further comprises:
performing sentence classification on the text to be checked to obtain the at least one sentence type, wherein the at least one sentence type comprises at least one of a research background sentence, a research objective sentence, a research method sentence, a research conclusion sentence, and a research result sentence;
and setting a weight value corresponding to each sentence type in the at least one sentence type.
6. The method of claim 1 or 2, wherein acquiring the text duplicate checking result of the text to be checked according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type comprises:
performing weighted summation on the sentence repetition value corresponding to each sentence type and the weight value of each sentence type to obtain the text duplicate checking result.
7. The method of claim 1 or 2, wherein the method further comprises:
if the text content does not exist in the first screening result, setting the sentence repetition value to zero.
8. A document duplicate checking device, comprising:
a first screening module, configured to compare a text to be checked with a comparison library to obtain a first screening result, wherein the first screening result represents whether text content similar to the text to be checked exists in the comparison library;
a second screening module, configured to perform sentence vector similarity calculation on the text content and the text to be checked under the condition that the text content exists in the first screening result, so as to obtain a second screening result, wherein the second screening result represents a target text, within the text content, that is similar to the text to be checked, and the text to be checked and the target text both contain at least one sentence type;
a repetition degree calculation module, configured to perform repetition degree calculation on the target text and the text to be checked to obtain a sentence repetition value corresponding to each sentence type in the at least one sentence type;
and a duplicate checking module, configured to acquire a text duplicate checking result of the text to be checked according to the sentence repetition value corresponding to each sentence type and the weight value of each sentence type.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program, wherein the computer program when run by a processor performs the method according to any of claims 1-7.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and running on the processor, wherein the computer program when run by the processor performs the method of any one of claims 1-7.
CN202311696616.2A 2023-12-11 2023-12-11 Document duplicate checking method and device, storage medium and electronic equipment Active CN117763106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311696616.2A CN117763106B (en) 2023-12-11 2023-12-11 Document duplicate checking method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311696616.2A CN117763106B (en) 2023-12-11 2023-12-11 Document duplicate checking method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN117763106A true CN117763106A (en) 2024-03-26
CN117763106B CN117763106B (en) 2024-06-18

Family

ID=90311712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311696616.2A Active CN117763106B (en) 2023-12-11 2023-12-11 Document duplicate checking method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117763106B (en)

Also Published As

Publication number Publication date
CN117763106B (en) 2024-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant