CN110472203B

CN110472203B - Article duplicate checking and detecting method, device, equipment and storage medium

Info

Publication number: CN110472203B
Application number: CN201910748782.XA
Authority: CN
Inventors: 李陟
Original assignee: Shanghai Xiaoi Robot Technology Co Ltd
Current assignee: Shanghai Xiaoi Robot Technology Co Ltd
Priority date: 2019-08-14
Filing date: 2019-08-14
Publication date: 2024-03-19
Anticipated expiration: 2039-08-14
Also published as: CN110472203A

Abstract

Embodiments of the invention disclose an article duplicate-checking detection method, apparatus, device and storage medium. The method comprises: performing semantic analysis on the article to be checked, and determining at least one key sentence set corresponding to the article to be checked; acquiring at least one key description feature corresponding to each of at least one reference article; and matching each key sentence set of the article to be checked against each key description feature of each reference article, and determining the key feature similarity between the article to be checked and each reference article according to the matching result, so as to perform duplicate-checking detection on the article to be checked. By matching the core viewpoints of the article to be checked against the core viewpoints of the reference articles, the technical scheme provided by the embodiments prevents synonym replacement or reordering of article content from affecting the detection result, and improves the accuracy of article duplicate-checking detection.

Description

Article duplicate checking and detecting method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to an information processing technology, in particular to a duplicate checking and detecting method, device and equipment for an article and a storage medium.

Background

With the rapid development of network technology, network users can easily obtain research results and academic papers published online by others. Papers must be written on many occasions, for example by teachers, doctors and graduating students, and duplicate checking is usually performed to verify the originality of a paper.

An existing paper duplicate-checking system finds the similarity between the paper to be checked and papers uploaded to the network by others by comparing their text. However, some cheating software replaces large numbers of words with synonyms, so that a duplicate-checking system that relies on text comparison fails. The order of the original content may also be changed manually to interfere with the system. Both affect the accuracy of duplicate-checking detection.

Disclosure of Invention

The embodiment of the invention provides a duplicate-checking detection method, device and equipment for an article and a storage medium, so as to improve the accuracy of duplicate-checking detection of the article.

In a first aspect, an embodiment of the present invention provides a duplicate detection method for an article, where the method includes:

performing semantic analysis on an article to be checked, and determining at least one key sentence set corresponding to the article to be checked, wherein the key sentences in the same key sentence set correspond to the same article viewpoint;

acquiring at least one key description feature corresponding to each of at least one reference article, wherein different key description features correspond to different article viewpoints;

and matching each key sentence set of the article to be checked against each key description feature of each reference article, and determining the key feature similarity between the article to be checked and each reference article according to the matching result, so as to perform duplicate-checking detection on the article to be checked.

In a second aspect, an embodiment of the present invention further provides an article duplicate detection apparatus, where the apparatus includes:

a key sentence set determining module, configured to perform semantic analysis on the article to be checked and determine at least one key sentence set corresponding to the article to be checked, wherein the key sentences in the same key sentence set correspond to the same article viewpoint;

a key description feature acquisition module, configured to acquire at least one key description feature corresponding to each of at least one reference article, wherein different key description features correspond to different article viewpoints;

and a similarity determining module, configured to match each key sentence set of the article to be checked against each key description feature of each reference article, and determine the key feature similarity between the article to be checked and each reference article according to the matching result, so as to perform duplicate-checking detection on the article to be checked.

In a third aspect, an embodiment of the present invention further provides an electronic device, including:

one or more processors;

a memory for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the duplicate detection method of the article provided by any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a computer program is stored, where the program when executed by a processor implements the duplicate detection method of an article provided by any embodiment of the present invention.

According to the technical scheme, semantic analysis is performed on the article to be checked to obtain at least one key sentence set corresponding to the article; each key sentence set is matched against the acquired key description features of the reference articles; and the key feature similarity between the article to be checked and each reference article is determined according to the matching result, so as to perform duplicate-checking detection on the article to be checked. The core viewpoints of the article to be checked are thereby matched against the core viewpoints of the reference articles, which prevents synonym replacement or reordering of article content from affecting the detection result and improves the accuracy of article duplicate-checking detection.

Drawings

FIG. 1 is a flowchart of an article duplicate detection method in a first embodiment of the invention;

FIG. 2 is a flowchart of an article duplicate detection method in a second embodiment of the invention;

FIG. 3 is a flowchart of an article duplicate detection method in a third embodiment of the invention;

FIG. 4 is a flowchart of an article duplicate detection method in a fourth embodiment of the invention;

FIG. 5 is a schematic diagram of a duplicate detection device according to a fifth embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an apparatus according to a sixth embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

Example 1

FIG. 1 is a flowchart of an article duplicate-checking detection method in a first embodiment of the present invention. The technical solution of this embodiment is suitable for performing duplicate-checking detection according to key sentences extracted from an article to be checked and key description features extracted from reference articles. The method may be performed by an article duplicate-checking detection apparatus, which may be implemented in software and/or hardware and may be integrated in various general-purpose computer devices. The method specifically includes the following steps:

Step 110, performing semantic analysis on an article to be checked, and determining at least one key sentence set corresponding to the article to be checked, wherein the key sentences in the same key sentence set correspond to the same article viewpoint.

A key sentence set is a set of sentences, extracted from the article to be checked, that can represent a core viewpoint of the article.

In this embodiment, semantic analysis is performed on the article to be checked in units of sentences of a preset length, and all sentences contained in the article are classified according to their semantics; for example, sentences whose similarity is greater than a set threshold may be placed in one category. Finally, the key sentences that best represent the core viewpoints of the article are extracted from at least one category of sentences according to a set rule, and these key sentences form a key sentence set.

Step 120, acquiring at least one key description feature corresponding to each of at least one reference article, wherein different key description features correspond to different article viewpoints.

The key description features are description information corresponding to a reference article, obtained by processing the reference article in advance, and each key description feature corresponds to one main viewpoint contained in the reference article.

In this embodiment, key description features corresponding to at least one reference article to be matched with the article to be checked are obtained, so as to check and re-detect the article to be checked, where one reference article corresponds to at least one key description feature.

Step 130, matching each key sentence set of the article to be checked against each key description feature of each reference article, and determining the key feature similarity between the article to be checked and each reference article according to the matching result, so as to perform duplicate-checking detection on the article to be checked.

In this embodiment, the key sentence sets of the article to be checked obtained in step 110 are compared in turn with the key description features corresponding to each reference article, and the similarity between the article to be checked and each reference article is determined according to the comparison result. Whether the article to be checked passes duplicate-checking detection is then determined according to the similarity information, and a detection report is provided. For example, a similarity threshold may be preset, and when the similarity between the article to be checked and a certain reference article exceeds the similarity threshold, it is determined that the article to be checked does not pass duplicate-checking detection. Alternatively, the proportion of key sentences in the key sentence sets that match a reference article may be calculated, and if the proportion exceeds a preset proportion threshold, it is determined that the article to be checked does not pass duplicate-checking detection.
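The two decision rules described for step 130 can be sketched as follows. This is a minimal illustration only: the function names, the use of a strict "greater than" comparison and the sample threshold values are assumptions, not part of the disclosed method.

```python
from typing import Dict, List


def failed_references(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Return the reference articles whose similarity exceeds the threshold."""
    return sorted(ref for ref, sim in similarities.items() if sim > threshold)


def passes_check(similarities: Dict[str, float], threshold: float) -> bool:
    """The article passes only if no reference article exceeds the threshold."""
    return not failed_references(similarities, threshold)


def matched_ratio(matched_flags: List[bool]) -> float:
    """Proportion of key sentences that matched a reference article."""
    if not matched_flags:
        return 0.0
    return sum(matched_flags) / len(matched_flags)


if __name__ == "__main__":
    # Hypothetical similarities between one article and two reference articles.
    sims = {"ref_a": 0.2, "ref_b": 0.75}
    print(failed_references(sims, threshold=0.6))
    print(passes_check(sims, threshold=0.6))
    print(matched_ratio([True, False, True, True]))
```

A detection report could list the output of `failed_references` together with the per-reference similarity values.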

Example 2

FIG. 2 is a flowchart of an article duplicate-checking detection method provided in a second embodiment of the present invention. This embodiment is further refined on the basis of the foregoing embodiment, and provides specific steps for performing semantic analysis on an article to be checked and determining at least one key sentence set corresponding to the article to be checked. The article duplicate-checking detection method of the second embodiment is described below with reference to FIG. 2, and includes the following steps:

Step 210, filtering the sentences included in the article to be checked according to preset conditions to obtain a candidate key sentence set.

In this embodiment, in order to extract the core viewpoints of the article to be checked, the article is first split into sentences to obtain the set of all sentences it contains. This set may include sentences that have little to do with the core viewpoints of the article, so all sentences in the set are screened against a preset condition to filter out those sentences. The set of sentences remaining after filtering (i.e., the set obtained by deleting from the set of all sentences those that have little to do with the core viewpoints of the article) is used as the candidate key sentence set.

For example, the preset condition may be that sentences whose length does not satisfy a preset length are deleted from the sentence set, or that sentences in sections whose titles, such as "Introduction" or "Research Background and Significance", indicate that they cannot represent the core viewpoints of the article to be checked are deleted from the set; this is not specifically limited here.

Optionally, filtering the sentences included in the article to be checked according to preset conditions to obtain a candidate key sentence set includes:

splitting the article to be checked into a plurality of sentences, using punctuation marks as delimiters;

and filtering out the sentences whose length does not meet a preset effective threshold, the remaining sentences forming the candidate key sentence set.

This optional embodiment provides a specific rule for filtering the sentences included in the article to be checked to obtain a candidate key sentence set. First, the article is split according to its punctuation marks: the content between two consecutive punctuation marks of any type may be taken as a sentence, or the content between two consecutive punctuation marks of a specified type may be taken as a sentence. For example, the content between any two consecutive punctuation marks may be defined as a sentence, or the content between two consecutive commas or periods may be defined as a sentence. Splitting the article according to this rule yields the set of all sentences of the article. Second, the length of each sentence in the set is compared in turn with the effective threshold, sentences shorter than the effective threshold are deleted from the set, and the remaining sentences form the candidate key sentence set.

It can be understood that, after the article to be checked is split, the set of all sentences may include some short phrases, many of which have no explicit meaning, for example "first" or "in summary". Sentences whose length does not meet the effective threshold therefore need to be deleted, and the effective threshold can be adjusted flexibly according to the specific situation.
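The splitting and length filtering of step 210 can be sketched as below. The delimiter set, the default threshold of 12 characters and the function names are assumptions chosen for illustration; the text leaves both the punctuation types and the effective threshold open.

```python
import re
from typing import List

# Assumed delimiter set: common English and Chinese clause/sentence punctuation.
DELIMITERS = r"[,.;!?，。；！？]"


def split_sentences(text: str, delimiters: str = DELIMITERS) -> List[str]:
    """Split text at every delimiter and drop empty fragments."""
    parts = re.split(delimiters, text)
    return [part.strip() for part in parts if part.strip()]


def candidate_key_sentences(text: str, min_length: int = 12) -> List[str]:
    """Keep only sentences whose character length reaches the effective threshold."""
    return [s for s in split_sentences(text) if len(s) >= min_length]


if __name__ == "__main__":
    sample = ("First, red blood cells are the most numerous cells in blood. "
              "In summary, they carry oxygen.")
    for sentence in candidate_key_sentences(sample):
        print(sentence)
```

With this sample, short fragments such as "First" and "In summary" fall below the assumed threshold and are discarded. Note that splitting on every period would also split decimal numbers; a real implementation would need a more careful delimiter rule.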

Step 220, determining the weight coefficient corresponding to each candidate key sentence according to the chapter position and/or title position of the candidate key sentence in the article to be checked.

In general, the chapter layout of an article follows certain rules, and the probability that core viewpoints of the article appear differs from chapter to chapter. In order to identify the core viewpoints of the article accurately, the weight of each candidate key sentence is therefore set according to its position in the article.

For example, an academic paper generally includes five major parts: abstract, introduction, detailed content, summary and acknowledgments. The probability that core viewpoints of the article appear in the introduction and acknowledgments is small, so the weights of the candidate key sentences in these two parts may be set to a small value (e.g., weight 1). The probability that core viewpoints appear in the abstract, detailed content and summary is large, so the weights of the candidate key sentences in these three parts may be set to a large value (e.g., weight 5).

Step 230, performing equivalent expansion on each candidate key sentence in the candidate key sentence set according to the weight coefficient.

In this embodiment, on the basis of setting the weights of the candidate key sentences in step 220, the candidate key sentences are expanded according to the weights corresponding to the candidate key sentences, for example, the candidate key sentences appearing in the abstract portion of the article are expanded 5 times.
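Steps 220 and 230 can be sketched as follows. This sketch assumes that "expanding" a sentence means repeating it in the candidate set as many times as its weight, so that heavily weighted sentences count more in the later cluster-size statistics; that reading, the section labels and the weight table are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical section labels and weights, following the 1-versus-5 example.
SECTION_WEIGHTS: Dict[str, int] = {
    "abstract": 5,
    "body": 5,
    "summary": 5,
    "introduction": 1,
    "acknowledgments": 1,
}


def expand_by_weight(
    sentences: List[Tuple[str, str]],
    weights: Dict[str, int] = SECTION_WEIGHTS,
    default_weight: int = 1,
) -> List[str]:
    """Repeat each (section, sentence) pair according to its section weight."""
    expanded: List[str] = []
    for section, sentence in sentences:
        expanded.extend([sentence] * weights.get(section, default_weight))
    return expanded


if __name__ == "__main__":
    candidates = [
        ("abstract", "red blood cells are the most numerous blood cells"),
        ("introduction", "blood has been studied for centuries"),
    ]
    expanded = expand_by_weight(candidates)
    print(len(expanded))
```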

Step 240, clustering according to the semantics of the sentences in the candidate key sentence set to obtain at least one key sentence set of the article to be checked.

In this embodiment, the semantics of each candidate key sentence are obtained by semantic analysis, the candidate key sentences in the candidate key sentence set are clustered according to their semantics, and the clustering result is screened according to a preset condition to obtain at least one key sentence set capable of expressing the core viewpoints of the article, wherein the key sentences contained in the same key sentence set correspond to the same article viewpoint.

Optionally, clustering according to the semantics of the sentences in the candidate key sentence set to obtain at least one key sentence set of the article to be checked includes:

calculating the semantic similarity between every two candidate key sentences in the candidate key sentence set;

clustering the candidate key sentences in the candidate key sentence set according to the semantic similarity to obtain at least one cluster;

counting the number of candidate key sentences included in each cluster;

and taking each cluster whose number of candidate key sentences meets a number threshold condition as a key sentence set.

This optional embodiment provides specific steps for clustering the sentences in the candidate key sentence set to obtain key sentence sets. First, the semantic similarity between any two candidate key sentences in the candidate key sentence set is calculated. Second, the candidate key sentences are clustered according to the semantic similarity, with candidate key sentences whose semantic similarity is higher than a set threshold merged into one cluster; that is, the candidate key sentences are classified by semantics, and each class forms a cluster. Third, the number of candidate key sentences in each resulting cluster is counted. The more candidate key sentences a cluster contains, the more likely their semantics are close to a core viewpoint of the article, so the candidate key sentences in each cluster that meets the number threshold condition finally form a key sentence set. By way of example, the number threshold may be set to 5.
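The clustering procedure above can be sketched as follows. The text does not specify how semantic similarity is computed, so this sketch substitutes Jaccard overlap of word sets as a stand-in, merges any pair above the threshold into one cluster (single-link grouping via union-find), and reads "meets the number threshold condition" as "at least `min_count` sentences". All names and default values are assumptions.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Stand-in for semantic similarity: overlap of lower-cased word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def cluster_sentences(
    sentences: List[str],
    sim_threshold: float = 0.5,
    similarity: Callable[[str, str], float] = jaccard,
) -> List[List[str]]:
    """Merge every pair whose similarity exceeds the threshold into one cluster."""
    parent = list(range(len(sentences)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if similarity(sentences[i], sentences[j]) > sim_threshold:
                parent[find(i)] = find(j)

    clusters: Dict[int, List[str]] = {}
    for i, sentence in enumerate(sentences):
        clusters.setdefault(find(i), []).append(sentence)
    return list(clusters.values())


def key_sentence_sets(
    sentences: List[str], min_count: int = 5, sim_threshold: float = 0.5
) -> List[List[str]]:
    """Keep only clusters containing at least min_count sentences."""
    return [c for c in cluster_sentences(sentences, sim_threshold) if len(c) >= min_count]


if __name__ == "__main__":
    expanded = ["red cells carry oxygen"] * 3 + [
        "red cells carry oxygen well",
        "the weather is nice",
    ]
    print(key_sentence_sets(expanded, min_count=4))
```

Because weighted sentences were repeated during expansion, a sentence from a heavily weighted section contributes more to its cluster's size and is more likely to clear the number threshold.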

Step 250, obtaining at least one key description feature corresponding to at least one reference article respectively, wherein different key description features correspond to different article views respectively.

Step 260, matching each key sentence set of the article to be checked with each key description feature of each reference article, and determining the similarity of the key features between the article to be checked and each reference article according to the matching result, so as to check and re-detect the article to be checked.

According to the technical scheme of this embodiment, the sentences in the article to be checked are filtered according to set conditions to obtain a candidate key sentence set; the candidate key sentence set is expanded according to the weight coefficients corresponding to the positions of the candidate key sentences in the article; clustering is then performed according to the semantics of the sentences in the candidate key sentence set to obtain key sentence sets; and finally the key sentence sets are matched against the acquired key description features of the reference articles to determine the key feature similarity between the article to be checked and each reference article, so as to perform duplicate-checking detection on the article to be checked. On the one hand, the key sentences of the article to be checked are identified accurately according to the positions of the candidate key sentences in the article. On the other hand, the core viewpoints of the article to be checked are matched against the core viewpoints of the reference articles, which prevents synonym replacement or reordering of article content from affecting the detection result and improves the accuracy of article duplicate-checking detection.

Example 3

FIG. 3 is a flowchart of an article duplicate-checking detection method in a third embodiment of the present invention. This embodiment is further refined on the basis of the foregoing embodiments, and provides specific steps performed before the semantic analysis of the article to be checked and the determination of at least one key sentence set corresponding to the article to be checked, and specific steps performed after the acquisition of at least one key description feature corresponding to each of at least one reference article. The article duplicate-checking detection method of the third embodiment is described below with reference to FIG. 3, and includes the following steps:

Step 310, performing semantic analysis on the reference article, and determining at least one key sentence set corresponding to the reference article as a comparison key sentence set.

In this embodiment, in order to match accurately against the core viewpoints of the article to be checked, semantic analysis is performed on the reference articles, and at least one key sentence set corresponding to each reference article is obtained as a comparison key sentence set for comparison with the key sentences of the article to be checked. The comparison key sentence sets are obtained in the same way as the key sentence sets of the article to be checked; for details, refer to steps 210 to 240 in the second embodiment, which are not repeated here.

Step 320, extracting semantic features from the comparison key sentences in each comparison key sentence set to obtain the key description feature corresponding to each comparison key sentence set of the reference article.

In this embodiment, after at least one comparison key sentence set corresponding to each reference article is obtained, semantic features are extracted from the comparison key sentences included in each comparison key sentence set, and the key description feature corresponding to each comparison key sentence set of the reference article is finally obtained. For example, semantic analysis of a reference article yields 5 comparison key sentence sets; common features are extracted from the key sentences contained in each of the 5 sets to obtain the semantic features corresponding to that set; and the at least 5 semantic features corresponding to the 5 comparison key sentence sets are finally used as the key description features of the reference article, for matching against the key sentences of the article to be checked.

Optionally, extracting semantic features from the comparison key sentences in each comparison key sentence set to obtain the key description feature corresponding to each comparison key sentence set includes:

in the currently processed comparison key sentence set, taking one comparison key sentence as a standard question, and taking the comparison key sentences other than the standard question as similar questions of the standard question;

segmenting the similar questions of the standard question into words, and taking the intersection of the word segmentation results, wherein the word segmentation result of each similar question consists of the word classes to which the words of that similar question belong;

in the intersection, selecting phrases according to their frequency of occurrence to form at least one semantic expression corresponding to the standard question, wherein each phrase consists of a preset number of word classes;

and taking the formed at least one semantic expression as the key description feature corresponding to the currently processed comparison key sentence set.

A standard question is a sentence or phrase representing a main viewpoint of the reference article. The term is used here to denote a clear expression of a main viewpoint of the reference article, not an interrogative sentence in the article. The similar questions corresponding to a standard question are sentences or phrases that have the same meaning as the standard question but are expressed differently. For example, one of the main viewpoints in a reference article is that "red blood cells are the most abundant blood cells in blood", which is taken as a clearly expressed standard question; sentences such as "red blood cells in blood are more numerous than any other blood cells", which have the same meaning as the standard question but are expressed differently, are taken as similar questions of the standard question.

In this optional embodiment, a clearly and simply expressed comparison key sentence (for example, a sentence in the form of a definition) is first selected from the currently processed comparison key sentence set as the standard question, and the other key sentences are taken as similar questions of the standard question. Any known word segmentation algorithm may then be used to divide each similar question into a plurality of words, and each word is replaced by its word class. After the word segmentation results are obtained, the intersection of the word segmentation results of the similar questions of the current standard question is taken. In the intersection, phrases are selected according to their frequency of occurrence to form a plurality of semantic expressions corresponding to the standard question, and these semantic expressions are used as the key description feature corresponding to the currently processed comparison key sentence set.

Illustratively, the word segmentation results of 5 similar questions of one standard question are as follows:

[A][B][C][D][E][F][G]
[A][B][K][J][L][M]
[A][B][C][M][Q]
[A][B][C][D]
[A][E][D]

where [A], [B], [C], [D], [E], [F], [G], [K], [J], [L], [M] and [Q] are word classes in the word segmentation results. When the intersection of the five word segmentation results is taken, word class [A] appears 5 times, word class [B] appears 4 times and word class [C] appears 3 times. In this case, the more frequent word class [A] may be selected as a semantic expression of the standard question, or the phrase [A][B] may be selected as a semantic expression of the standard question, and the selected semantic expression is used as the key description feature corresponding to the currently processed comparison key sentence set.
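The frequency counts in this example can be reproduced with a short sketch. It assumes that each word class is counted once per similar question and that a phrase is a contiguous run of a fixed number of word classes; both are assumptions about how the intersection is taken, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def word_class_frequencies(segmented: List[List[str]]) -> Counter:
    """Count in how many similar questions each word class occurs."""
    counts: Counter = Counter()
    for classes in segmented:
        counts.update(set(classes))  # at most one count per similar question
    return counts


def phrase_frequencies(segmented: List[List[str]], size: int) -> Counter:
    """Count in how many similar questions each contiguous phrase of `size` classes occurs."""
    counts: Counter = Counter()
    for classes in segmented:
        phrases = {tuple(classes[i:i + size]) for i in range(len(classes) - size + 1)}
        counts.update(phrases)
    return counts


def top_phrase(segmented: List[List[str]], size: int) -> Tuple[str, ...]:
    """Return the most frequent phrase of the given size."""
    return phrase_frequencies(segmented, size).most_common(1)[0][0]


if __name__ == "__main__":
    similar_questions = [
        list("ABCDEFG"),
        list("ABKJLM"),
        list("ABCMQ"),
        list("ABCD"),
        list("AED"),
    ]
    freq = word_class_frequencies(similar_questions)
    print(freq["A"], freq["B"], freq["C"])
    print(top_phrase(similar_questions, size=2))
```

With phrase size 1 the most frequent candidate is [A]; with phrase size 2 it is [A][B], matching the two choices named in the example.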

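Optionally, extracting semantic features from the comparison key sentences in each comparison key sentence set to obtain the key description feature corresponding to each comparison key sentence set may alternatively include:

constructing a plurality of training samples according to each comparison key sentence in the currently processed comparison key sentence set;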

training the basic deep learning model by using the training sample to obtain a key feature description model;

and using the key feature description model as the key description feature corresponding to the currently processed comparison key sentence set.

In this optional embodiment, the comparison key sentences included in the comparison key sentence set are used as training samples, the samples are input into the basic deep learning model, and the model is trained to obtain a key feature description model. The resulting key feature description model can characterize the semantic features of the comparison key sentence set corresponding to the reference article, so the model is used as the key description feature corresponding to the currently processed comparison key sentence set, for matching against the key sentences of the article to be checked.

Step 330, performing semantic analysis on the article to be checked, and determining at least one key sentence set corresponding to the article to be checked, wherein the key sentences in the same key sentence set correspond to the same article viewpoint.

Step 340, acquiring at least one key description feature corresponding to each of the at least one reference article, wherein different key description features correspond to different article viewpoints.

Optionally, after the acquiring the at least one key description feature corresponding to the at least one reference article, the method further includes:

inputting each key description feature of the reference article, as reference article knowledge data, into a knowledge base of a question-answer engine.

In this optional embodiment, when the comparison key sentence sets corresponding to a reference article have been obtained by semantic analysis of the reference article, and the semantic features of the comparison key sentences have been extracted from each comparison key sentence set using a standard question generated from the comparison key sentences, so as to obtain the key description feature corresponding to each comparison key sentence set of the reference article, the obtained key description features of the reference article may be input into the knowledge base of the question-answer engine for matching against the key sentence sets of the article to be checked.

Step 350, matching each key sentence set of the article to be checked against each key description feature of each reference article, and determining the key feature similarity between the article to be checked and each reference article according to the matching result, so as to perform duplicate-checking detection on the article to be checked.

Optionally, matching each key sentence set of the article to be checked against each key description feature of each reference article, and determining the key feature similarity between the article to be checked and each reference article according to the matching result, includes:

inputting each key sentence in the key sentence set into the question-answer engine respectively;

acquiring the standard question triggering times in the knowledge data of each reference article output by the question and answer engine;

and taking the ratio of the number of standard question triggers to the total number of key sentence sets as the key feature similarity between the article to be checked and each reference article.

In this optional embodiment, after the key sentence sets of the article to be checked are obtained, each key sentence in the key sentence sets is input into the question-answer engine and matched against the reference article knowledge data stored in advance in the question-answer engine. If the matching succeeds, the standard question trigger counter for that reference article knowledge data is increased by 1. Finally, the ratio of the number of standard question triggers to the total number of key sentence sets is calculated, and this ratio is used as the key feature similarity between the article to be checked and each reference article.
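The ratio described above can be sketched without a real question-answer engine. In this sketch the engine is replaced by a plain dictionary of reference features and a toy matcher, and each key sentence set is counted as triggering at most once per reference article so that the ratio stays between 0 and 1; the text does not state how repeated triggers within one set are handled, so that choice, like every name below, is an assumption.

```python
from typing import Callable, Dict, List

Matcher = Callable[[str, str], bool]


def contains_all_terms(sentence: str, feature: str) -> bool:
    """Toy matcher: every term of the feature occurs as a word in the sentence."""
    words = sentence.lower().split()
    return all(term in words for term in feature.lower().split())


def key_feature_similarity(
    key_sentence_sets: List[List[str]],
    reference_features: Dict[str, List[str]],
    matches: Matcher = contains_all_terms,
) -> Dict[str, float]:
    """Per reference article: triggered key sentence sets / total key sentence sets."""
    result: Dict[str, float] = {}
    for ref_id, features in reference_features.items():
        triggers = 0
        for sentence_set in key_sentence_sets:
            if any(matches(s, f) for s in sentence_set for f in features):
                triggers += 1  # count each key sentence set at most once
        total = len(key_sentence_sets)
        result[ref_id] = triggers / total if total else 0.0
    return result


if __name__ == "__main__":
    sets = [
        ["red cells are the most numerous cells in blood"],
        ["plants need light to grow"],
    ]
    knowledge = {"ref_a": ["red cells blood"], "ref_b": ["stars galaxies"]}
    print(key_feature_similarity(sets, knowledge))
```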

The advantage of this arrangement is that, once the semantic expressions of the reference article have been input into the knowledge base of the question-answer engine as reference article knowledge data, a mature question-answer engine can be used directly to calculate the key feature similarity, which can greatly reduce development cost and simplify the development process.

According to the technical scheme of this embodiment, the comparison key sentence sets corresponding to the reference article are obtained through semantic analysis of the reference article, and semantic features of the comparison key sentences are extracted from the comparison key sentence sets by using a standard question or a key feature description model generated according to the comparison key sentences, so as to obtain the key description features corresponding to the comparison key sentence sets of the reference article. Finally, the key description features are matched with the key sentences in the article to be checked, so that the core viewpoints of the article to be checked are matched with the core viewpoints of the reference article, which improves the accuracy of article duplicate detection.

Example IV

Fig. 4 is a flowchart of an article duplicate detection method in a fourth embodiment of the present invention. This embodiment further refines the foregoing embodiments: it provides specific steps for matching each key sentence set of the article to be checked with each key description feature of each reference article and determining the key feature similarity between the article to be checked and each reference article according to the matching result, as well as specific steps performed after the key feature similarity has been determined. The article duplicate detection method according to the fourth embodiment of the present invention is described below with reference to fig. 4, and includes the following steps:

Step 410, carrying out semantic analysis on the article to be checked, and determining at least one key sentence set corresponding to the article to be checked, wherein the key sentences in the same key sentence set correspond to the same article viewpoint.

Step 420, obtaining at least one key description feature corresponding to at least one reference article respectively, wherein different key description features correspond to different article viewpoints respectively.

Step 430, selecting at least one key sentence from each key sentence set of the article to be checked, and matching it with each key description feature of the reference article.

In this embodiment, at least one key sentence is taken from each key sentence set corresponding to the article to be checked and compared with each key description feature of the reference article. For example, the key sentences in key sentence set 1 of the article to be checked (key sentence set 1 contains 15 key sentences) are each matched with the key description features of the reference article. Matching may be performed, for example, by calculating a similarity: when the similarity between a key sentence and a key description feature is higher than a set threshold, the key sentence is determined to match that key description feature.
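A minimal sketch of threshold-based matching is given below. The text does not name a similarity measure, so a word-level Jaccard similarity is used here purely as a lexical stand-in for whatever semantic similarity a real system would compute; the function names and the default threshold of 0.5 are assumptions made for illustration.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity; a stand-in, not a semantic measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def matched_features(
    key_sentence: str, features: List[str], threshold: float = 0.5
) -> List[int]:
    """Return the indices of the key description features whose similarity
    to the key sentence is higher than the set threshold."""
    return [
        index
        for index, feature in enumerate(features)
        if jaccard(key_sentence, feature) > threshold
    ]
```

In practice the stand-in would be replaced by a semantic similarity, since purely lexical overlap is sensitive to exactly the synonym replacement that the method is intended to tolerate.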

Step 440, determining the matched key sentence sets among the key sentence sets according to the matching result of at least one key sentence in each key sentence set and each key description feature.

In this embodiment, whether the currently processed key sentence set is a matched key sentence set is determined according to the matching results between the key sentences contained in that set and each key description feature. For example, when more than a set number of key sentences in the set are matched with the key description features, the key sentence set to which those key sentences belong is determined to be a matched key sentence set.

Optionally, determining the matched key sentence sets among the key sentence sets according to the matching result of at least one key sentence in each key sentence set and each key description feature includes:

judging whether more than a set proportion of the at least one key sentence in the currently processed key sentence set is matched with the same key description feature;

if yes, determining the currently processed key sentence set as a matched key sentence set.

In this optional embodiment, a specific manner of determining a matched key sentence set according to the matching result is provided. Specifically, the number of key sentences in the current key sentence set that are matched with a key description feature is divided by the total number of key sentences in that set, and when the resulting ratio is greater than the set proportion, the currently processed key sentence set is determined to be a matched key sentence set. Illustratively, the set proportion is 45%.
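The proportion rule can be sketched as follows. The input representation is an assumption made for illustration: each examined key sentence is represented by the index of the key description feature it matched, or `None` if it matched nothing. The default of 0.45 mirrors the illustrative 45% above.

```python
from collections import Counter
from typing import List, Optional


def is_matched_set(
    match_per_sentence: List[Optional[int]], set_proportion: float = 0.45
) -> bool:
    """Return True when more than `set_proportion` of the examined key
    sentences matched the same key description feature."""
    hits = Counter(m for m in match_per_sentence if m is not None)
    if not match_per_sentence or not hits:
        return False
    # Size of the largest group of sentences sharing one matched feature.
    best_count = hits.most_common(1)[0][1]
    return best_count / len(match_per_sentence) > set_proportion
```

For instance, if three of six key sentences match feature 0, the ratio is 0.5, which exceeds 0.45, so the set counts as matched.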

In a specific example, a key sentence may be randomly obtained from a currently processed key sentence set, and the key sentence is matched with each key description feature, and if the key sentence is successfully matched with any key description feature, the currently processed key sentence set is determined to be the matched key sentence set; if the matching of all the key description features fails, determining that the currently processed key sentence set is not the matched key sentence set.
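The random-sampling variant can be sketched in a few lines. The `matches` predicate is a hypothetical placeholder for whatever sentence-to-feature matching is used, and the optional `rng` parameter is added only so that the sampling can be made reproducible.

```python
import random
from typing import Callable, List, Optional


def is_matched_set_by_sample(
    key_sentences: List[str],
    features: List[str],
    matches: Callable[[str, str], bool],
    rng: Optional[random.Random] = None,
) -> bool:
    """Pick one key sentence at random; the set counts as matched if that
    sentence matches any key description feature."""
    if not key_sentences:
        return False
    rng = rng or random.Random()
    sentence = rng.choice(key_sentences)
    return any(matches(sentence, feature) for feature in features)
```

This variant trades accuracy for speed: only one sentence per set is compared, so the outcome depends on which sentence happens to be drawn.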

Step 450, calculating the ratio of the number of matched key sentence sets to the number of all the key sentence sets, and taking the ratio as the key feature similarity between the article to be checked and the reference article.

In this embodiment, the key feature similarity between the article to be checked and the reference article is determined by calculating the proportion of the number of the matched key sentence sets to the number of all the key sentence sets included in the article to be checked.
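As a small worked illustration of this ratio, the sketch below assumes that each key sentence set of the article to be checked has already been flagged as matched or not matched against one reference article; the function name is hypothetical.

```python
from typing import List


def key_feature_similarity(matched_flags: List[bool]) -> float:
    """Ratio of matched key sentence sets to all key sentence sets."""
    if not matched_flags:
        return 0.0
    return sum(matched_flags) / len(matched_flags)
```

With four key sentence sets of which three are matched, the key feature similarity is 3 / 4 = 0.75.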

Optionally, after determining the key feature similarity between the article to be checked and each reference article according to the matching result, the method further includes:

if it is determined that there is at least one target reference article whose key feature similarity with the article to be checked satisfies the repeated similarity threshold condition, determining that the article to be checked fails the duplicate detection; that is, when it is determined that most of the article viewpoints in the article to be checked come from a single target reference article, the article to be checked is determined to fail the duplicate detection.

In this optional embodiment, a method for determining whether the article to be checked passes the duplicate detection is provided. Specifically, according to the calculated key feature similarity, it is judged whether the key feature similarity between the article to be checked and at least one reference article satisfies a preset repeated similarity threshold condition (for example, the key feature similarity exceeds 90%); if so, it is determined that the article to be checked fails the duplicate detection.
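The single-source check can be sketched as follows, assuming the key feature similarities are held in a dictionary keyed by a reference article identifier (a hypothetical representation). The default of 0.9 mirrors the 90% example above.

```python
from typing import Dict, List


def single_source_targets(
    similarities: Dict[str, float], repeat_threshold: float = 0.9
) -> List[str]:
    """Return the reference articles whose key feature similarity exceeds
    the repeated similarity threshold; a non-empty result means the
    article to be checked fails the duplicate detection."""
    return sorted(
        article_id
        for article_id, similarity in similarities.items()
        if similarity > repeat_threshold
    )
```

Returning the offending articles rather than a bare boolean keeps the information needed to tell the user which reference article triggered the failure.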

Optionally, after determining the key feature similarity between the article to be checked and each reference article according to the matching result, the method further includes:

acquiring at least one target reference article whose key feature similarity with the article to be checked satisfies a close similarity threshold condition;

acquiring the target key sentence sets in the article to be checked that are respectively matched with each target reference article;

calculating the set union of the target key sentence sets respectively matched with the target reference articles;

and if the ratio of the number of target key sentence sets included in the set union to the total number of key sentence sets included in the article to be checked satisfies a set ratio threshold condition, determining that the article to be checked fails the duplicate detection.

In this optional embodiment, another method for determining whether the article to be checked passes the duplicate detection is provided. Specifically, the target reference articles whose key feature similarity satisfies the close similarity threshold condition (for example, the key feature similarity is greater than 40%) are first obtained. The target key sentence sets in the article to be checked that are matched with each target reference article are then counted and combined to obtain the total number of target key sentence sets matched with the target reference articles. Finally, the ratio of this number to the total number of key sentence sets included in the article to be checked is calculated, and when the ratio satisfies the set ratio threshold condition, it is determined that the article to be checked fails the duplicate detection.

That is, when it is determined that most of the article viewpoints in the article to be checked come from a plurality of target reference articles, the article to be checked is regarded as a combination of the article viewpoints of those target reference articles, and it can therefore be determined that the article to be checked fails the duplicate detection.
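The multi-source check can be sketched as below. Key sentence sets are identified by integer indices and reference articles by string identifiers, both of which are assumptions made for illustration. The close similarity default of 0.4 mirrors the 40% example; the ratio threshold of 0.8 is a hypothetical value, since the text does not give one.

```python
from typing import Dict, Set


def fails_by_combined_sources(
    similarities: Dict[str, float],
    matched_sets_by_article: Dict[str, Set[int]],
    total_sets: int,
    close_threshold: float = 0.4,
    ratio_threshold: float = 0.8,  # hypothetical value, not from the text
) -> bool:
    """Union the key sentence sets matched by every sufficiently similar
    reference article and compare the covered share with a threshold."""
    if total_sets == 0:
        return False
    covered: Set[int] = set()
    for article_id, similarity in similarities.items():
        if similarity > close_threshold:
            covered |= matched_sets_by_article.get(article_id, set())
    return len(covered) / total_sets > ratio_threshold
```

Taking the union means that a key sentence set matched by several target reference articles is counted only once, so the ratio can never exceed 1.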

Optionally, after determining the key feature similarity between the article to be checked and each reference article according to the matching result, the method further includes:

acquiring at least one target reference article whose key feature similarity with the article to be checked satisfies a comparison similarity threshold condition;

obtaining the article original sentences corresponding to the target key sentences in each target key sentence set in the article to be checked, and generating an article original sentence set corresponding to each target key sentence set;

and generating a duplicate detection report according to the key feature similarity between the article to be checked and each target reference article and the article original sentence sets.

In this optional embodiment, a method for generating a duplicate detection report according to the duplicate detection result is provided. First, the target reference articles whose key feature similarity with the article to be checked satisfies the comparison similarity threshold condition (for example, the key feature similarity is greater than 60%) are obtained. Then, the target key sentence sets in the article to be checked that are matched with each target reference article are obtained, and the original sentences in the article to be checked that correspond to each target key sentence set are determined according to the target key sentences in that set. Finally, the duplicate detection report is generated according to the key feature similarity between the article to be checked and each target reference article and the corresponding article original sentence sets. For example, the original sentence corresponding to a key sentence matched with a target reference article may be highlighted in red, the related content in the target reference article may be displayed alongside that original sentence, and the key feature similarity between the article to be checked and the target reference article may be marked.

It will be appreciated by those skilled in the art that the above-mentioned repeated similarity threshold condition, close similarity threshold condition, and comparison similarity threshold condition may be preset according to actual situations, which is not limited in this embodiment.

According to the technical scheme of this embodiment, the key sentences of the article to be checked are matched with the key description features of each reference article to obtain the matched key sentence sets, that is, the key sentence sets that match the reference article, and the ratio of the number of matched key sentence sets to the number of all the key sentence sets is then calculated to determine the similarity between the article to be checked and the reference article. Finally, whether the article to be checked passes the duplicate detection is judged by means of the repeated similarity threshold condition, or by means of the set ratio threshold condition on the number of matched key sentence sets relative to the number of all the key sentence sets, and a duplicate detection report is generated according to the matching result. In this way, the conditions for passing the duplicate detection can be set flexibly, and the user can conveniently view the specific duplicate detection results.
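One possible shape for such a report is sketched below. The data layout (dictionaries keyed by reference article identifier and by key sentence set index) and the report field names are hypothetical; the default of 0.6 mirrors the 60% example above. Presentation details such as red highlighting are left to whatever renders the report.

```python
from typing import Dict, List, Set


def build_report(
    similarities: Dict[str, float],
    matched_sets_by_article: Dict[str, Set[int]],
    original_sentences_by_set: Dict[int, List[str]],
    comparison_threshold: float = 0.6,
) -> List[dict]:
    """List, for each sufficiently similar reference article, its key
    feature similarity and the original sentences to be flagged."""
    report = []
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    for article_id, similarity in ranked:
        if similarity <= comparison_threshold:
            continue
        flagged: List[str] = []
        for set_id in sorted(matched_sets_by_article.get(article_id, set())):
            flagged.extend(original_sentences_by_set.get(set_id, []))
        report.append(
            {
                "reference_article": article_id,
                "similarity": similarity,
                "flagged_sentences": flagged,
            }
        )
    return report
```

Entries are ordered from the most similar reference article to the least similar one, so the most relevant sources appear first.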

Example five

Fig. 5 is a schematic structural diagram of an article duplicate detection device according to a fifth embodiment of the present invention. The article duplicate detection device includes: a key sentence set determining module 510, a key description feature acquisition module 520, and a similarity determining module 530.

The key sentence set determining module 510 is configured to perform semantic analysis on an article to be checked and determine at least one key sentence set corresponding to the article to be checked, where the key sentences in the same key sentence set correspond to the same article viewpoint;

the key description feature acquisition module 520 is configured to obtain at least one key description feature corresponding to at least one reference article respectively, where different key description features correspond to different article viewpoints;

the similarity determining module 530 is configured to match each key sentence set of the article to be checked with each key description feature of each reference article, and determine, according to the matching result, the key feature similarity between the article to be checked and each reference article, so as to perform duplicate detection on the article to be checked.

Optionally, the key sentence set determining module 510 includes:

the candidate key sentence acquisition unit is used for filtering sentences included in the article to be checked according to a preset rule to obtain a candidate key sentence set;

the key sentence set acquisition unit is used for clustering according to the semantics of the sentences in the candidate key sentence set to obtain at least one key sentence set of the article to be checked.

Optionally, the key sentence set acquisition unit is specifically configured to:

Optionally, the candidate key sentence acquisition unit is specifically configured to:

counting the number of candidate key sentences included in each cluster;

Optionally, the key sentence set determining module 510 further includes:

the weight coefficient determining unit is used for determining, after the sentences included in the article to be checked are filtered according to the preset rule to obtain the candidate key sentence set, weight coefficients respectively corresponding to the candidate key sentences according to the chapter positions and/or title positions of the candidate key sentences in the article to be checked;

and the equivalent expansion unit is used for carrying out equivalent expansion on each candidate key sentence in the candidate key sentence set according to the weight coefficient.

Optionally, the duplicate checking and detecting device of the article further includes:

the comparison key sentence set acquisition module is used for carrying out semantic analysis on the reference article before carrying out semantic analysis on the article to be checked to determine at least one key sentence set corresponding to the article to be checked, and determining at least one key sentence set corresponding to the reference article as a comparison key sentence set;

and the key description feature acquisition module is used for extracting semantic features of the comparison key sentences in the comparison key sentence sets to obtain key description features corresponding to the comparison key sentence sets of the reference article.

Optionally, the key description feature acquisition module is specifically configured to:

Optionally, the similarity determining module 530 includes:

the feature matching unit is used for selecting at least one key sentence from each key sentence set of the article to be checked and matching the key sentence with each key description feature of the reference article;

the matched key sentence set determining unit is used for determining the matched key sentence sets among the key sentence sets according to the matching result of at least one key sentence in each key sentence set and each key description feature;

and the similarity calculation unit is used for calculating the ratio of the number of matched key sentence sets to the number of all the key sentence sets and taking the ratio as the key feature similarity between the article to be checked and the reference article.

Optionally, the matched key sentence set determining unit is specifically configured to:

The knowledge data input module is used for inputting the key description features of each reference article into a knowledge base of the question-answering engine as reference article knowledge data after the at least one key description feature corresponding to the at least one reference article is acquired;

Optionally, the similarity determining module includes:

the key sentence input unit is used for inputting each key sentence in the key sentence set into the question-answer engine respectively;

the trigger count acquisition unit is used for acquiring the number of standard question triggers in the knowledge data of each reference article output by the question-answer engine;

and the similarity acquisition unit is used for taking the ratio of the number of standard question triggers to the number of all the key sentence sets as the key feature similarity between the article to be checked and each reference article.

The duplicate detection result determining module is used for determining, after the key feature similarity between the article to be checked and each reference article is determined according to the matching result, that the article to be checked fails the duplicate detection if there is at least one target reference article whose key feature similarity satisfies the repeated similarity threshold condition.

The target reference article acquisition module is used for acquiring, after the key feature similarity between the article to be checked and each reference article is determined according to the matching result, at least one target reference article whose key feature similarity with the article to be checked satisfies a close similarity threshold condition;

the target key sentence set acquisition module is used for acquiring the target key sentence sets in the article to be checked that are respectively matched with the target reference articles;

the set union calculation module is used for calculating the set union of the target key sentence sets respectively matched with the target reference articles;

and the duplicate detection result determining module is used for determining that the article to be checked fails the duplicate detection if the ratio of the number of target key sentence sets included in the set union to the total number of key sentence sets included in the article to be checked satisfies a set ratio threshold condition.

The target reference article acquisition module is used for acquiring, after the key feature similarity between the article to be checked and each reference article is determined according to the matching result, at least one target reference article whose key feature similarity with the article to be checked satisfies a comparison similarity threshold condition;

the article original sentence set generating module is used for acquiring the article original sentences corresponding to the target key sentences in each target key sentence set in the article to be checked and generating an article original sentence set corresponding to each target key sentence set;

and the duplicate detection report generating module is used for generating a duplicate detection report according to the key feature similarity between the article to be checked and each target reference article and the article original sentence sets.

The article duplicate checking and detecting device provided by the embodiment of the invention can execute the article duplicate checking and detecting method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.

Example six

Fig. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention. As shown in fig. 6, the electronic device includes a processor 60 and a memory 61. The number of processors 60 in the device may be one or more, and one processor 60 is taken as an example in fig. 6. The processor 60 and the memory 61 in the device may be connected by a bus or in another manner, and connection by a bus is taken as an example in fig. 6.

The memory 61, as a computer readable storage medium, may be used to store software programs, computer executable programs, and modules, such as the program instructions/modules corresponding to the article duplicate detection method in the embodiments of the present invention (for example, the key sentence set determining module 510, the key description feature acquisition module 520, and the similarity determining module 530 in the article duplicate detection device). The processor 60 executes the various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 61, that is, it implements the article duplicate detection method described above.

The method comprises the following steps:

The memory 61 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data created according to the use of the terminal, etc. In addition, the memory 61 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 61 may further include memory remotely located relative to the processor 60, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Example seven

A seventh embodiment of the present invention also provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing a method of duplicate detection of an article, the method comprising:

Of course, the storage medium provided by the embodiments of the present invention and including the computer executable instructions is not limited to the above-described method operations, and may also perform the related operations in the duplicate detection method of the article provided by any embodiment of the present invention.

From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.

It should be noted that, in the embodiment of the duplicate detection device of the above article, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.

Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims

1. The article duplicate checking and detecting method is characterized by comprising the following steps:

carrying out semantic analysis on an article to be checked, and determining at least one key sentence set corresponding to the article to be checked, wherein the key sentences in the same key sentence set correspond to the same article viewpoint; the key sentence set is a set of sentences that is extracted from the article to be checked and can represent a core viewpoint of the article;

acquiring at least one key description feature corresponding to at least one reference article respectively, wherein different key description features correspond to different article viewpoints respectively; the key description features are description information which is obtained by processing the reference articles in advance and corresponds to the reference articles, and each key description feature corresponds to one main viewpoint contained in the reference articles;

matching each key sentence set of the article to be checked with each key description feature of each reference article respectively, and determining the key feature similarity between the article to be checked and each reference article according to the matching result, so as to perform duplicate detection on the article to be checked;

wherein carrying out semantic analysis on the article to be checked and determining at least one key sentence set corresponding to the article to be checked comprises the following steps:

filtering sentences included in the article to be checked according to a preset rule to obtain a candidate key sentence set;

clustering according to the semantics of the sentences in the candidate key sentence set to obtain at least one key sentence set of the article to be checked.

2. The method of claim 1, wherein the clustering according to the semantics of the sentences in the candidate key sentence set to obtain at least one key sentence set of the article to be checked comprises:

counting the number of candidate key sentences included in each cluster;

taking the clusters in which the number of candidate key sentences satisfies a number threshold condition as the key sentence sets; and/or

the filtering of the sentences included in the article to be checked according to a preset rule to obtain a candidate key sentence set comprises the following steps:

3. The method of claim 1, wherein after filtering sentences included in the article to be checked according to a preset rule to obtain a candidate key sentence set, the method further comprises:

determining weight coefficients corresponding to the candidate key sentences respectively according to chapter positions and/or title positions of the candidate key sentences in the article to be checked;

and carrying out equivalent expansion on each candidate key sentence in the candidate key sentence set according to the weight coefficient.

4. A method according to any one of claims 1-3, further comprising, prior to the semantic analysis of the article to be checked to determine at least one key sentence set corresponding to the article to be checked:

carrying out semantic analysis on the reference article, and determining at least one key sentence set corresponding to the reference article as a comparison key sentence set;

and extracting semantic features of the comparison key sentences in the comparison key sentence sets to obtain key description features corresponding to the comparison key sentence sets of the reference article.

5. The method of claim 4, wherein extracting semantic features of the comparison key sentences in the comparison key sentence sets to obtain key description features corresponding to the comparison key sentence sets of the reference article comprises:

the at least one semantic expression is formed as a key description feature corresponding to the currently processed comparison key sentence set; and/or

extracting semantic features of the comparison key sentences in the comparison key sentence sets to obtain key description features corresponding to the comparison key sentence sets comprises the following steps:

6. The method of claim 1, wherein the matching each key sentence set of the article to be checked with each key description feature of each reference article, respectively, and determining the key feature similarity between the article to be checked and each reference article according to the matching result comprises:

selecting at least one key sentence from each key sentence set of the article to be checked and matching it with each key description feature of the reference article;

determining the matched key sentence sets among the key sentence sets according to the matching result of at least one key sentence in each key sentence set and each key description feature;

and calculating the ratio of the number of matched key sentence sets to the number of all the key sentence sets to serve as the key feature similarity between the article to be checked and the reference article.

7. An article duplicate checking and detecting device, comprising:

the key sentence set determining module is used for carrying out semantic analysis on the article to be checked and determining at least one key sentence set corresponding to the article to be checked, wherein the key sentences in the same key sentence set correspond to the same article viewpoint; the key sentence set is a set of sentences that is extracted from the article to be checked and can represent a core viewpoint of the article;

the key description feature acquisition module is used for acquiring at least one key description feature corresponding to at least one reference article respectively, wherein different key description features correspond to different article viewpoints respectively; the key description features are description information which is obtained by processing the reference articles in advance and corresponds to the reference articles, and each key description feature corresponds to one main viewpoint contained in the reference articles;

the similarity determining module is used for respectively matching each key sentence set of the article to be checked with each key description feature of each reference article, and determining the key feature similarity between the article to be checked and each reference article according to the matching result, so as to perform duplicate detection on the article to be checked;

the key sentence set determining module is further configured to filter sentences included in the article to be checked according to a preset rule to obtain a candidate key sentence set, and to cluster according to the semantics of the sentences in the candidate key sentence set to obtain at least one key sentence set of the article to be checked.

8. An electronic device, the device comprising:

one or more processors;

a memory for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method of duplicate detection of articles of any one of claims 1-6.

9. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a method of duplicate detection of an article as claimed in any one of claims 1 to 6.