CN110347806B

CN110347806B - Original text screening method, original text screening device, original text screening equipment and computer readable storage medium

Info

Publication number: CN110347806B
Application number: CN201910669863.0A
Authority: CN
Inventors: 蔡远航; 郑少杰; 付勇; 范增虎; 江旻
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2019-07-23
Filing date: 2019-07-23
Publication date: 2024-02-06
Anticipated expiration: 2039-07-23
Also published as: WO2021012958A1; CN110347806A

Abstract

The invention discloses an original text screening method comprising the following steps: after receiving a text to be screened, acquiring more than one object to be compared corresponding to the text to be screened from a preset original database; preprocessing the text to be screened to obtain more than one first clause; comparing each first clause with each object to be compared to determine the non-original clauses among the first clauses; and, if the proportion of non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, determining that the text to be screened is an original text. The invention also discloses an original text screening device, equipment and a readable storage medium. The invention splits the text to be screened into clauses, determines which clauses are non-original, and decides whether the text is original according to the proportion of non-original clauses, thereby effectively improving the precision of original text screening.

Description

Original text screening method, original text screening device, original text screening equipment and computer readable storage medium

Technical Field

The present invention relates to the field of financial technology (Fintech), and in particular, to a method, apparatus, device, and computer readable storage medium for screening original text.

Background

In recent years, with the development of financial technology (Fintech), particularly internet finance, data screening technology has been introduced into the daily business of financial institutions such as banks. In their day-to-day publicity, financial institutions such as banks must check the originality of publicity texts such as news items, advertorials and advertisements before release, to ensure that they do not plagiarize the work of others. Only by ensuring that publicity texts are original can unnecessary copyright disputes be avoided and original works be given the value they deserve. Screening texts for originality is therefore a necessary part of the external publicity work of financial institutions such as banks.

In the conventional method, before a publicity text is released, the public relations department or another publicity department of a financial institution such as a bank inputs the publicity text into a computer; the computer compares the publicity text with the texts in an original database and determines the originality of the publicity text by calculating keyword-based similarity.

However, the existing method can only judge whether the text to be screened contains plagiarism and cannot give a specific plagiarism-rate index. If the text to be screened copies one passage from each of several original texts in turn, the existing method cannot reach a plagiarism conclusion; it also has difficulty screening texts in which many subjects, pronouns and the like have been substituted. The accuracy of the existing screening method is therefore low.

Disclosure of Invention

The invention mainly aims to provide an original text screening method, an original text screening device, original text screening equipment and a computer readable storage medium, and aims to improve original text screening precision.

In order to achieve the above object, the present invention provides an original text screening method, which includes the following steps:

after receiving a text to be screened, acquiring more than one object to be compared corresponding to the text to be screened from a preset original database;

preprocessing the text to be screened to obtain more than one first clause;

comparing each first clause with each object to be compared to determine non-original clauses existing in the first clause;

And if the ratio of the non-original clause in the text to be screened is not greater than a preset plagiarism threshold, determining that the text to be screened is the original text.

Preferably, the step of obtaining, after receiving the text to be screened, one or more objects to be compared corresponding to the text to be screened in a preset original database includes:

when a text to be screened is received, determining the text length of the text to be screened, and cutting the text to be screened into character strings with the corresponding number of the text length;

and obtaining matching objects matched with the character strings from a preset original database, and selecting a preset number of objects to be compared from the matching objects.

Preferably, the step of preprocessing the text to be screened to obtain more than one first clause includes:

filtering the text to be screened based on a preset filtering rule to obtain a filtered text;

and carrying out clause on the filtered text based on a preset clause rule to obtain more than one first clause.

Preferably, the step of performing clause splitting on the filtered text based on a preset clause rule to obtain more than one first clause includes:

Based on a preset clause rule, carrying out clause on the filtered text to obtain each clause, and sequentially determining whether the word number of each clause reaches the preset word number;

if the word number of the current clause reaches the preset word number, setting the current clause as the first clause;

and if the word number of the current clause does not reach the preset word number, merging the current clause into the first clause set on the basis of the previous clause.

Preferably, the step of comparing each first clause with each object to be compared, and determining the non-original clause existing in the first clause includes:

generating a first hash value corresponding to each first clause;

a hash value set corresponding to each object to be compared is called, wherein the hash value set comprises a plurality of second hash values;

comparing the first hash values with the second hash values, and determining, among the first hash values, a third hash value whose Hamming distance from at least one second hash value is less than or equal to a first preset value;

and marking the clause corresponding to the third hash value as a non-original clause in the first clause.

Preferably, after the step of marking the clause corresponding to the third hash value as a non-original clause, the method further includes:

if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in a target object do not exceed the first preset value, calculating a first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and a second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, wherein the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 upward, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;

if the ratio of the first edit distance to the length of the (j-n)-th clause is smaller than a second preset value, marking the (i-n)-th first clause as a non-original clause; and if the ratio of the second edit distance to the length of the (j+k+m)-th clause is smaller than the second preset value, marking the (i+k+m)-th first clause as a non-original clause.

Preferably, after said determining the non-original clause existing in the first clause, the method further comprises:

counting the number of words in the non-original clauses of the text to be screened, and determining the proportion of non-original clauses in the text to be screened based on that number and the total number of words in the text to be screened.

In addition, in order to achieve the above object, the present invention also provides an original text screening apparatus, including:

the acquisition module is used for acquiring more than one object to be compared corresponding to the text to be screened from a preset original database after receiving the text to be screened;

the preprocessing module is used for preprocessing the text to be screened to obtain more than one first clause;

the first determining module is used for comparing each first clause with each object to be compared to determine non-original clauses existing in the first clause;

and the second determining module is used for determining that the text to be screened is an original text if the proportion of non-original clauses in the text to be screened is not greater than a preset plagiarism threshold.

Preferably, the acquiring module is further configured to:

Preferably, the preprocessing module is further configured to:

Preferably, the first determining module is further configured to:

generating a first hash value corresponding to each first clause;

Preferably, the first determining module is further configured to:

In addition, to achieve the above object, the present invention also provides original text screening equipment, including: a memory, a processor, and an original text screening program stored in the memory and executable on the processor, wherein the original text screening program, when executed by the processor, implements the steps of the original text screening method described above.

In addition, in order to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon an original text screening program which, when executed by a processor, implements the steps of the original text screening method described above.

According to the original text screening method of the invention, after a text to be screened is received, more than one object to be compared corresponding to the text to be screened is acquired from a preset original database; the text to be screened is preprocessed to obtain more than one first clause; each first clause is compared with each object to be compared to determine the non-original clauses among the first clauses; and, if the proportion of non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, the text to be screened is determined to be an original text. The invention splits the text to be screened into clauses, determines which clauses are non-original, and decides whether the text is original according to the proportion of non-original clauses, thereby effectively improving the precision of original text screening.

Drawings

FIG. 1 is a schematic diagram of a device architecture of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flow chart of a first embodiment of the original text screening method of the present invention;

fig. 3 is a flowchart of a second embodiment of the original text screening method of the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic device structure of a hardware running environment according to an embodiment of the present invention.

The device of the embodiment of the invention can be a PC or a server device.

As shown in fig. 1, the device may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.

It will be appreciated by those skilled in the art that the device structure shown in fig. 1 is not limiting of the device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.

As shown in fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an original text screening program.

The operating system is a program that manages and controls the hardware and software resources of the original text screening equipment and supports the operation of the network communication module, the user interface module, the original text screening program and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.

In the original text screening equipment shown in fig. 1, the processor 1001 calls the original text screening program stored in the memory 1005 and performs the operations in the embodiments of the original text screening method described below.

Based on the hardware structure, the embodiment of the original text screening method is provided.

Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the original text screening method of the present invention, where the method includes:

Step S10, after receiving a text to be screened, acquiring more than one object to be compared corresponding to the text to be screened from a preset original database;

step S20, preprocessing the text to be screened to obtain more than one first clause;

step S30, comparing each first clause with each object to be compared to determine non-original clauses existing in the first clause;

and step S40, if the duty ratio of the non-original clause in the text to be screened is not larger than a preset plagiarism threshold, determining that the text to be screened is original text.

The original text screening method of this embodiment is applied to the original text screening equipment of a financial institution such as a bank or of a banking system; for convenience of description, the original text screening equipment is hereinafter referred to as the screening device. The screening device is connected to an original database that stores the original texts on the internet; in a specific implementation, because of hardware limitations, the original database generally stores only the original texts of the last 3 years. In addition, the screening device has a built-in search module for retrieving from the original database the original objects that match the current search query; in this embodiment the search module is preferably an ES (Elasticsearch, a distributed full-text search engine) search module. The ES search module searches the original database according to the search query and returns ranked results: the higher a result is ranked, the more similar its text is to the search query. The search principle of ES is prior art and is not described here.

On receiving a text to be screened, the screening device of this embodiment selects from the original database the objects to be compared that are related to the current text, splits the text to be screened into individual clauses, and determines the proportion of non-original clauses among them in order to decide whether the text to be screened is an original text.

The following will explain each step in detail:

In this embodiment, before a financial institution such as a bank publicizes or publishes a text, the text is input into the screening device as the text to be screened, and the screening device determines whether it is an original text, so as to avoid unnecessary copyright disputes.

After receiving the text to be screened, the screening device obtains more than one object to be compared corresponding to the text to be screened in the original database, that is, the embodiment does not need to compare the current text to be screened with all original texts in the original database one by one, but screens the object to be compared related to the current text to be screened from the original database.

Specifically, step S10 includes:

In this step, on receiving a text to be screened, the screening device determines the text length of the text to be screened, that is, its length in characters, and cuts the text into a corresponding number of character strings. Specifically, a string length is preset and the text to be screened is cut according to that length: if the text length of the current text to be screened is N and the preset string length is 100, cutting yields N/100 character strings. The text must be cut because the search query of the ES search module has an upper length limit; 100 characters is the preferred preset string length.
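The cutting step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `cut_into_strings` is hypothetical, and keeping a shorter trailing remainder as its own string is an assumption, since the text only states that N/100 strings are obtained.

```python
def cut_into_strings(text: str, chunk_len: int = 100) -> list[str]:
    """Cut the text into consecutive strings of at most chunk_len characters.

    Assumption: a trailing remainder shorter than chunk_len is kept as its own string.
    """
    return [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]


# A 250-character text yields strings of 100, 100 and 50 characters.
chunks = cut_into_strings("x" * 250)
print([len(c) for c in chunks])
```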

In this step, each character string is searched as a search term, an original object matched with the current character string is obtained in a preset original database, and after all the character string searches are completed, the obtained set of original objects is the matched object.

The selection may specifically proceed as follows: from the search results for each character string, the top-ranked original objects are selected, the number selected per string being the preset number divided by the number of character strings. For example, if there are 10 character strings and 5000 original objects are to be obtained, each character string is searched and the 5000/10 = 500 top-ranked original objects are selected each time; the search results of the 10 character strings are then combined to obtain 5000 original objects.

It should be noted that the current character string may have too few search results to supply that many top-ranked original objects. For example, if character string A has only 3 search results, fewer than the 500 required, only those 3 results are taken and the missing 497 original objects are replaced by blank text.

Alternatively, if character string A has too few search results, more original objects are taken from the search results of the next character string to make up the shortfall.
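The per-string quota and the blank-text padding can be sketched as below. The `search` callable is a hypothetical stand-in for the ES search module (it is assumed to return results already ranked by similarity); no real Elasticsearch API is used, and only the padding variant is shown, not the variant that borrows results from the next string.

```python
from typing import Callable


def select_candidates(
    chunks: list[str],
    search: Callable[[str], list[str]],  # hypothetical: ranked results for one query string
    preset_total: int,
) -> list[str]:
    """Take the top (preset_total / number of strings) results for each string."""
    if not chunks:
        return []
    quota = preset_total // len(chunks)
    selected: list[str] = []
    for chunk in chunks:
        hits = search(chunk)[:quota]
        # Too few results: pad the shortfall with blank text.
        hits = hits + [""] * (quota - len(hits))
        selected.extend(hits)
    return selected


# Toy index standing in for the original database.
fake_index = {"a": ["doc1", "doc2", "doc3"], "b": ["doc4"]}
print(select_candidates(["a", "b"], lambda q: fake_index[q], preset_total=4))
```

With a quota of 2 per string, the second string has only one hit, so one blank entry is appended.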

And step S20, preprocessing the text to be screened to obtain more than one first clause.

In this embodiment, the screening device preprocesses the text to be screened to obtain more than one first clause, that is, it splits the text to be screened into clauses. Specifically, the text is split into clauses at punctuation marks, for example at periods.

And step S30, comparing each first clause with each object to be compared to determine non-original clauses in the first clause.

In this embodiment, each first clause is compared with each object to be compared, so as to determine originality of each first clause, and finally determine how many non-original clauses exist in the first clause.

Specifically, step S30 includes:

generating a first hash value corresponding to each first clause;

In this step, the screening device generates a first hash value for each first clause. The first hash value is preferably a simhash value. Simhash is a locality-sensitive hash, a text hash mapping algorithm that maps a text to a bit string of length 64. Unlike an ordinary hash algorithm, locality-sensitive hashing gives similar texts similar hash values; in this embodiment two texts are treated as similar when the Hamming distance between their hash values is 3 or less. The first hash value may of course also be an ordinary hash value from which similarity can be calculated; this embodiment uses simhash as the preferred example.

Specifically, the text to be screened is split into first clauses at punctuation marks, for example at periods. Each first clause is segmented into words to obtain its effective feature vectors, and each feature vector is assigned one of 5 weight levels from 1 to 5; the weight of a feature vector may be the number of times its word appears in the text to be screened. For example, if the current first clause is "the internet bank issues a loan through face recognition technology and big data credit rating", word segmentation yields "internet bank / issues / loan / through / face recognition technology / and / big data / credit rating", and each feature vector is then weighted: internet bank (4), issues (1), loan (5), through (1), face recognition technology (3), and (1), big data (4), credit rating (5). The number in parentheses represents the importance of the word in the current clause; the larger the number, the more important the word.

Then the hash value of each feature vector is calculated with a hash function; the hash value is a signature made up of the binary digits 0 and 1. For example, the hash value of "internet bank" is 100101 and the hash value of "loan" is 101011, so the current clause becomes a series of numbers.

All feature vectors are then weighted on the basis of their hash values, that is, W = hash × weight: a bit of 1 contributes the weight with a positive sign and a bit of 0 contributes the weight with a negative sign. For example, weighting the hash value "100101" of "internet bank" gives W(internet bank) = 100101 × 4 = 4 -4 -4 4 -4 4, and weighting the hash value "101011" of "loan" gives W(loan) = 101011 × 5 = 5 -5 5 -5 5 5. The other feature vectors are handled in the same way.

Then the weighted results of the feature vectors are accumulated position by position into a single sequence. For example, accumulating the "4 -4 -4 4 -4 4" of "internet bank" and the "5 -5 5 -5 5 5" of "loan" gives "4+5, -4-5, -4+5, 4-5, -4+5, 4+5", that is, "9 -9 1 -1 1 9".

Finally, each position of the accumulated result is set to 1 if it is greater than 0 and to 0 otherwise, which gives the simhash value of the current first clause: the result "9 -9 1 -1 1 9" becomes 101011.
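The weighting, accumulation and sign steps of this worked example can be reproduced with a short sketch. It takes the per-word hash bit strings and weights as given inputs, because the text does not specify the word-level hash function; the function name `simhash_from_features` is hypothetical, and the 6-bit width simply follows the example rather than the 64-bit length used in practice.

```python
def simhash_from_features(features: list[tuple[str, int]]) -> str:
    """Combine (hash bit string, weight) pairs into one simhash bit string."""
    width = len(features[0][0])
    totals = [0] * width
    for bits, weight in features:
        for pos, bit in enumerate(bits):
            # A 1 bit adds the weight, a 0 bit subtracts it.
            totals[pos] += weight if bit == "1" else -weight
    # Positions with a positive total become 1, all others 0.
    return "".join("1" if total > 0 else "0" for total in totals)


# "internet bank": hash 100101, weight 4; "loan": hash 101011, weight 5.
print(simhash_from_features([("100101", 4), ("101011", 5)]))  # 101011
```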

In this step, the original database stores a plurality of original objects together with, for each original object, its list of clauses and the simhash values of those clauses. The screening device can therefore retrieve from the original database the hash value set of each object to be compared; the hash value set includes a plurality of second hash values, which are preferably second simhash values.

specifically, each first simhash value is compared with each second simhash value in the object to be compared to determine the hamming distance, wherein the hamming distance is the number of different characters at the corresponding positions of two character strings, that is, the number of characters required to be replaced when one character string is converted into the other character string. For example: the hamming distance between 1011101 and 1001001 is 2.
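As a small illustration of this definition, the Hamming distance between two equal-length bit strings can be computed as follows (the function name is chosen here for illustration only):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("1011101", "1001001"))  # 2
```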

The calculated Hamming distance is compared with the first preset value to determine, among the first hash values, each third hash value whose Hamming distance from at least one second hash value is less than or equal to the first preset value. In a specific implementation the first preset value is preferably 3: when the Hamming distance is less than or equal to 3, the current first clause is determined to be plagiarized.

In this step, among the first clauses, each clause determined to be plagiarized is marked as a non-original clause, for example by displaying it in red.

In this embodiment, the proportion of non-original clauses in the text to be screened is calculated; specifically, the number of non-original clauses and the number of first clauses in the text to be screened can be counted and the proportion computed from them. It is then determined whether the proportion of non-original clauses in the text to be screened is not greater than the preset plagiarism threshold: if so, the text to be screened is determined to be an original text; if not, the text to be screened is determined to be a plagiarized text.
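The comparison and marking can be sketched as below, assuming hash values are held as Python integers and that a clause is flagged as soon as any second hash value lies within the first preset value (3 by default). The function name and the 6-bit sample values are illustrative only.

```python
def mark_non_original(
    first_hashes: list[int],
    second_hashes: list[int],
    max_distance: int = 3,
) -> list[bool]:
    """Return one flag per first clause: True if it matches any second hash."""

    def distance(a: int, b: int) -> int:
        # Hamming distance of two integers: number of 1 bits in their XOR.
        return bin(a ^ b).count("1")

    return [
        any(distance(f, s) <= max_distance for s in second_hashes)
        for f in first_hashes
    ]


# First clause differs from the stored hash in 1 bit, second clause in 5 bits.
print(mark_non_original([0b101011, 0b010100], [0b101010]))  # [True, False]
```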

Further, after said determining the non-original clause present in said first clause, further comprising:

In this step, the proportion of non-original clauses in the text to be screened can also be determined by word count: the number of words in the non-original clauses and the total number of words in the text to be screened are counted, and the former is divided by the latter. Whether the text to be screened is an original text is then determined from this proportion.

In this embodiment a plagiarism threshold, for example 80%, is preset. After the proportion of non-original clauses in the text to be screened has been calculated, it is compared with the preset plagiarism threshold: if the proportion is greater than the threshold, the text to be screened is determined to be a plagiarized text; otherwise it is an original text.
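The word-count proportion and the threshold decision can be sketched as follows. Two assumptions are made for illustration: the word count of a clause is taken to be its character count (reasonable for Chinese text, but not stated in the source), and the total is taken over the first clauses rather than the raw text. The function names are hypothetical.

```python
def non_original_ratio(clauses: list[str], flags: list[bool]) -> float:
    """Share of the text (by character count) that lies in non-original clauses."""
    total = sum(len(clause) for clause in clauses)
    copied = sum(len(clause) for clause, flagged in zip(clauses, flags) if flagged)
    return copied / total if total else 0.0


def is_original(ratio: float, threshold: float = 0.8) -> bool:
    """The text is original if the ratio is not greater than the threshold."""
    return ratio <= threshold


ratio = non_original_ratio(["abcd", "ef", "ghij"], [True, False, False])
print(ratio, is_original(ratio))
```

Here 4 of 10 characters belong to a non-original clause, so the text falls below the 80% example threshold.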

In this embodiment, after a text to be screened is received, more than one object to be compared corresponding to the text to be screened is acquired from a preset original database; the text to be screened is preprocessed to obtain more than one first clause; each first clause is compared with each object to be compared to determine the non-original clauses among the first clauses; and, if the proportion of non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, the text to be screened is determined to be an original text. The embodiment splits the text to be screened into clauses, determines which clauses are non-original, and decides whether the text is original according to the proportion of non-original clauses, thereby effectively improving the precision of original text screening.

Further, based on the first embodiment of the original text screening method, a second embodiment of the original text screening method is provided.

The second embodiment of the original text screening method differs from the first embodiment of the original text screening method in that, referring to fig. 3, the preprocessing includes filtering and clauses, and step S20 includes:

step S21, filtering the text to be screened based on a preset filtering rule to obtain a filtered text;

step S22, based on a preset clause rule, the filtered text is subjected to clause to obtain more than one first clause.

In the embodiment, filtering and clause dividing are specifically used when the text to be screened is preprocessed, so that the text to be screened is decomposed into first clauses.

The following will explain each step in detail:

and S21, filtering the text to be screened based on a preset filtering rule to obtain a filtered text.

In this embodiment, the screening device filters the text to be screened based on a preset filtering rule. The preset filtering rule is: remove meaningless symbols from the text to be screened, including HTML tags, HTML character entities, emoticon characters and the like; convert traditional Chinese in the text to simplified Chinese; and uniformly replace dashes, Chinese and English single quotation marks, Chinese and English double quotation marks, Chinese and English colons and similar symbols with Chinese commas. The purpose of this processing is to prevent differences in such symbols from affecting originality screening, as in the following variants of the same sentence:

Project leader Zhang San: "Stay true to the original aspiration and forge ahead."

Project leader Zhang San -- stay true to the original aspiration and forge ahead.

Project leader Zhang San: 'Stay true to the original aspiration and forge ahead.'

And filtering the text to be screened to obtain a filtered text so as to carry out clause on the filtered text later.
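Part of the filtering rule can be sketched with regular expressions. This is an illustrative approximation only: the patterns and the name `filter_text` are assumptions, HTML entities are simply deleted, and the traditional-to-simplified conversion and emoticon removal are omitted because they need resources the source does not describe.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # HTML tags
ENTITY_RE = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")  # HTML entities
# Dash (U+2014), Chinese/English single and double quotation marks, Chinese/English colons.
PUNCT_RE = re.compile("[\u2014\u2018\u2019\u201c\u201d'\":\uff1a]+")


def filter_text(text: str) -> str:
    """Strip HTML residue and unify the listed symbols into a Chinese comma."""
    text = TAG_RE.sub("", text)
    text = ENTITY_RE.sub("", text)
    # A run of adjacent symbols collapses into a single Chinese comma (U+FF0C).
    return PUNCT_RE.sub("\uff0c", text)


# Two punctuation variants of the same sentence normalise to the same text.
print(filter_text("<p>A\uff1a\u201cx&nbsp;y\u201d</p>") == filter_text("A:'xy'"))
```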

In this embodiment, the screening device splits the filtered text into clauses based on a preset clause rule. The preset clause rule is to split at Chinese and English commas, Chinese and English periods, Chinese and English exclamation marks, Chinese and English question marks, Chinese semicolons, Chinese enumeration commas (、), spaces, the escape characters \n and \t, and the like. For example, the text "Project leader Zhang San said to the team members, next we must go one step further, stay true to the original aspiration, and forge ahead." is split at these symbols into: "Project leader Zhang San said to the team members" / "next we must go one step further" / "stay true to the original aspiration" / "forge ahead". Filtering the text to be screened and then splitting it in this way yields the first clauses.
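The clause rule maps naturally onto a single regular-expression split. The sketch below is an assumption-laden illustration: the delimiter set covers only the symbols listed in the text (the "and the like" is not expanded), and the name `split_clauses` is hypothetical.

```python
import re

# Chinese/English commas, periods, exclamation marks and question marks,
# Chinese semicolon, Chinese enumeration comma, space, newline and tab.
DELIMITERS = re.compile("[,，.。!！?？；、 \n\t]+")


def split_clauses(text: str) -> list[str]:
    """Split the filtered text at the delimiters and drop empty pieces."""
    return [clause for clause in DELIMITERS.split(text) if clause]


print(split_clauses("不忘初心，砥砺前行。"))
```

Because the space is itself a delimiter, the rule is suited to Chinese text, which does not separate words with spaces.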

Further, step S22 includes:

step a, carrying out clause on the filtered text based on a preset clause rule to obtain each clause, and sequentially determining whether the word number of each clause reaches the preset word number;

In this step, the screening device splits the filtered text into clauses based on the preset clause rule, which specifically includes splitting at Chinese and English commas, Chinese and English periods, Chinese and English question marks, Chinese semicolons, Chinese enumeration commas (、), spaces, the escape characters \n and \t, and the like, and then determines whether the word count of each clause reaches the preset word count, for example 10 words. Because the clauses are delimited by these symbols, they vary in length, and clauses with few words need to be merged to reduce the number of clauses and hence the number of comparisons. Moreover, a complete, meaningful sentence needs a subject, a predicate, an object and so on, and therefore has a certain minimum word count. For these reasons, after the filtered text has been split, it is necessary to determine whether the word count of each clause reaches the preset word count.

Step b, if the word count of the current clause reaches the preset word count, setting the current clause as a first clause;

In the process of determining whether the word count of each clause reaches the preset word count, if the word count of the current clause reaches the preset word count, for example 10 words, the current clause is set as a first clause.

Step c, if the word count of the current clause does not reach the preset word count, merging the current clause into the first clause that was set on the basis of the previous clause.

If the word count of the current clause does not reach the preset word count, the current clause is merged into the first clause that was set on the basis of the previous clause, that is, the current clause is merged with the previous first clause. The previous first clause may be a first clause formed by the one clause preceding the current clause, or a first clause formed by the two clauses preceding the current clause.

As in the example above, once the text has been split at its punctuation, merging the clauses with fewer than 10 words into the preceding first clause yields a first clause that begins with "the project leader said to the team members" and absorbs the shorter clauses that follow it.
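Steps a to c can be illustrated with a minimal Python sketch. It is not the implementation of this embodiment: the delimiter set is an assumption based on the punctuation the clause rule lists, the "word count" is taken to be the character count, and the names `split_clauses` and `merge_short_clauses` are hypothetical. The sketch also assumes that a short clause with no preceding first clause simply starts a new first clause.

```python
import re

# Delimiters assumed from the preset clause rule: Chinese and English commas,
# periods and question marks, the Chinese semicolon, the Chinese enumeration
# comma, and whitespace (space, \n, \t). Exclamation marks are an assumed
# reading of the rule.
DELIMITERS = re.compile(r"[,，.。!！?？；、\s]+")

MIN_CHARS = 10  # the "preset word count"; 10 is the example value in the text


def split_clauses(text: str) -> list[str]:
    """Split filtered text at the delimiters and drop empty pieces."""
    return [piece for piece in DELIMITERS.split(text) if piece]


def merge_short_clauses(clauses: list[str], min_chars: int = MIN_CHARS) -> list[str]:
    """Keep long clauses as first clauses; fold short ones into the previous one."""
    first_clauses: list[str] = []
    for clause in clauses:
        if len(clause) >= min_chars or not first_clauses:
            # Step b, or no previous first clause exists yet (assumption).
            first_clauses.append(clause)
        else:
            # Step c: merge into the most recent first clause.
            first_clauses[-1] += clause
    return first_clauses
```

Because a short clause is always appended to the most recent first clause, one first clause may end up absorbing one, two or more of the clauses that follow it.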

The preprocessing in this embodiment comprises filtering and clause splitting. Filtering removes factors that affect originality screening, namely various meaningless symbols. Clause splitting decomposes the text to be screened into clauses, so that the originality of the text is screened by determining the originality of each clause. This refines the screened object and improves the screening precision of the original text.

Further, based on the first and second embodiments of the original text screening method of the present invention, a third embodiment of the original text screening method of the present invention is provided.

The third embodiment of the original text screening method is different from the first and second embodiments of the original text screening method in that after the step of marking the clause corresponding to the third hash value as a non-original clause, the method further includes:

Step d, if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed a first preset value, calculating a first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and a second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, wherein the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 to infinity, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;

Step e, if the ratio of the first edit distance to the clause length of the (j-n)-th clause is smaller than a second preset value, marking the (i-n)-th first clause as a non-original clause; and if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is smaller than the second preset value, marking the (i+k+m)-th first clause as a non-original clause.

This embodiment addresses text to be screened in which subjects, pronouns and the like have been replaced. After the text to be screened is decomposed into clauses and the originality of each clause is determined, the edit distance of the neighbouring clauses is calculated to further confirm their originality.

The following will explain each step in detail:

Step d, if the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed a first preset value, calculating a first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and a second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, wherein the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 to infinity, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object.

In this embodiment, after marking each non-original clause, the screening device monitors in real time the continuity of the marked clauses, that is, of the non-original clauses. If the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, that is, the i-th to (i+k)-th first clauses have all been marked as non-original clauses, the device calculates the first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and the second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object. The (i-n)-th and (i+k+m)-th first clauses were not marked originally, but they may involve subject replacement, pronoun replacement and the like. Simple subject or pronoun replacement is not original, yet the Hamming distance cannot detect it. Therefore, once a run of consecutive first clauses of the text to be screened has been marked as non-original, the clauses before and after the run may still be plagiarized, and the edit distance is calculated for the (i-n)-th and (i+k+m)-th first clauses. Here the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 to infinity, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object.
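The continuity check described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the method of this embodiment: clause hashes are taken to be plain integers, each first clause is paired with the first target clause whose hash lies within the first preset value, indices are 0-based rather than 1-based, and the names `hamming_distance`, `match_clauses` and `find_runs` are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer hash values differ."""
    return bin(a ^ b).count("1")


def match_clauses(first_hashes, target_hashes, max_distance):
    """For each first clause, return the index of the first target clause whose
    hash is within max_distance, or None when no target clause is close enough."""
    matches = []
    for h in first_hashes:
        hit = next(
            (j for j, t in enumerate(target_hashes)
             if hamming_distance(h, t) <= max_distance),
            None,
        )
        matches.append(hit)
    return matches


def find_runs(matches, k):
    """Return 0-based (i, j) pairs such that clauses i..i+k of the text match
    clauses j..j+k of the target object, in the same order."""
    runs = []
    for i in range(len(matches) - k):
        j = matches[i]
        if j is None:
            continue
        if all(matches[i + d] == j + d for d in range(1, k + 1)):
            runs.append((i, j))
    return runs
```

Each pair returned by `find_runs` identifies a run of k+1 consecutive marked clauses, which is the condition that triggers the edit distance calculation for the clauses before and after the run.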

The edit distance between two strings is the minimum number of editing operations required to convert one string into the other. In this embodiment, it is the minimum number of editing operations required to convert a clause in the text to be screened into a clause in the target object.
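As an illustration of this definition, the following Python sketch computes the edit distance with the standard dynamic programming recurrence. The text does not state which editing operations are allowed; the sketch assumes the Levenshtein variant (single-character insertion, deletion and substitution, each costing 1), and the function name `edit_distance` is hypothetical.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn source into target."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string: i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

Keeping only two rows of the table limits memory use to the length of the target clause, which is adequate for clause-sized strings.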

It should be noted that, besides the edit distance, the Jaccard distance or the length of the longest common subsequence may also be used to determine whether an originally unmarked clause is plagiarized; the alternatives are not listed exhaustively here.
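The two alternative measures can be sketched in Python as follows. The text does not say how they would be computed or thresholded, so this is only an illustration under assumptions: the Jaccard distance is taken over the sets of characters of the two clauses, and the names `jaccard_distance` and `lcs_length` are hypothetical.

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 minus (size of the intersection / size of the union) of the two
    character sets. Returns 0.0 for two empty strings (assumption)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]
```

A small Jaccard distance or a longest common subsequence that covers most of the target clause would both indicate a likely copy; how either value is compared with a preset value is left open here.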

Step e, if the ratio of the first edit distance to the clause length of the (j-n)-th clause is smaller than a second preset value, marking the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is smaller than the second preset value, marking the (i+k+m)-th first clause as a non-original clause;

In this embodiment, after the first edit distance and the second edit distance are obtained, the first edit distance is divided by the clause length of the (j-n)-th clause of the target object, and the second edit distance is divided by the clause length of the (j+k+m)-th clause of the target object. If the ratio of the first edit distance to the clause length of the (j-n)-th clause is smaller than the second preset value, the (i-n)-th first clause in the text to be screened is marked as a non-original clause; if the ratio is equal to or larger than the second preset value, the (i-n)-th first clause is not marked. Similarly, if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is smaller than the second preset value, the (i+k+m)-th first clause in the text to be screened is marked as a non-original clause; if the ratio is equal to or larger than the second preset value, the (i+k+m)-th first clause is not marked. The second preset value is preferably 0.1.

Specifically, the edit distance between the (i-1)-th first clause in the text to be screened and the (j-1)-th clause of the target object is calculated first. The (i-1)-th first clause is marked when the ratio of this edit distance to the length of the (j-1)-th clause is smaller than the second preset value, and is not marked when the ratio is larger than or equal to the second preset value. The edit distance between the (i-2)-th first clause in the text to be screened and the (j-2)-th clause of the target object is then calculated, and so on. Similarly, the edit distance between the (i+k+1)-th first clause in the text to be screened and the (j+k+1)-th clause of the target object is calculated. The (i+k+1)-th first clause is marked when the ratio of this edit distance to the length of the (j+k+1)-th clause is smaller than the second preset value, and is not marked when the ratio is larger than or equal to the second preset value. The edit distance between the (i+k+2)-th first clause in the text to be screened and the (j+k+2)-th clause of the target object is then calculated, and so on.

When the (i-n)-th first clause in the text to be screened has already been marked, the edit distance does not need to be calculated for it; similarly, when the (i+k+m)-th first clause in the text to be screened has already been marked, the edit distance does not need to be calculated for the subsequent clauses. At this point, the number of words in the clauses marked as non-original is counted. These non-original clauses include both the clauses initially marked by means of the Hamming distance and the clauses subsequently marked by means of the edit distance.

It should be noted that if the (i+k+m)-th first clause of the text to be screened has no counterpart in the target object, that is, the end of the target object has been reached, the (i+k+m)-th first clause of the text to be screened is not marked.
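The outward walk of steps d and e can be sketched in Python as follows. This is a sketch under stated assumptions rather than the implementation of this embodiment: indices are 0-based, the clause length is the character count, the walk in each direction stops when it reaches an already marked clause or the start or end of either text, and the names `edit_distance` and `extend_marks` are hypothetical. The default threshold of 0.1 is the preferred second preset value.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance (insertion, deletion, substitution; cost 1 each)."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def extend_marks(text_clauses, target_clauses, marked, i, j, k, threshold=0.1):
    """Given a marked run text[i..i+k] matching target[j..j+k], mark the
    neighbouring clauses whose normalised edit distance is below threshold.
    marked is a list of booleans and is updated in place."""

    def is_near_copy(text_idx, target_idx):
        reference = target_clauses[target_idx]
        if not reference:  # guard against division by zero
            return False
        distance = edit_distance(text_clauses[text_idx], reference)
        return distance / len(reference) < threshold

    # Walk backwards: clauses i-1, i-2, ... against j-1, j-2, ...
    n = 1
    while i - n >= 0 and j - n >= 0 and not marked[i - n]:
        if is_near_copy(i - n, j - n):
            marked[i - n] = True
        n += 1

    # Walk forwards: clauses i+k+1, i+k+2, ... against j+k+1, j+k+2, ...
    m = 1
    while (i + k + m < len(text_clauses)
           and j + k + m < len(target_clauses)
           and not marked[i + k + m]):
        if is_near_copy(i + k + m, j + k + m):
            marked[i + k + m] = True
        m += 1

    return marked
```

Dividing by the length of the target clause means that, with a threshold of 0.1, a clause of 20 characters is marked only if it differs from its counterpart by a single character, which fits the scenario of a replaced subject or pronoun.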

When determining whether the text to be screened is an original text, this embodiment considers subject replacement and pronoun replacement in addition to the similarity between the text to be screened and the original objects. Compared with using the Hamming distance algorithm alone, it can effectively handle scenarios in which text is copied with a large number of subject and pronoun replacements, and further improves the screening precision of the original text.
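The final decision, in which the words of the marked clauses are counted and the text is judged against the preset plagiarism threshold, can be sketched in Python as follows. The sketch assumes that the ratio is the character count of the non-original clauses divided by the character count of all first clauses, and that an empty text counts as original; the function name `is_original` is hypothetical, and the threshold is left as a required parameter because no value is given for it.

```python
def is_original(first_clauses: list[str], marked: list[bool],
                plagiarism_threshold: float) -> bool:
    """Return True when the share of non-original text does not exceed
    the plagiarism threshold."""
    total = sum(len(clause) for clause in first_clauses)
    if total == 0:
        return True  # assumption: nothing to compare means nothing copied
    copied = sum(len(clause)
                 for clause, flag in zip(first_clauses, marked) if flag)
    return copied / total <= plagiarism_threshold
```

The `marked` list is expected to contain the marks set by the Hamming distance comparison as well as those added afterwards by the edit distance comparison.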

The invention also provides a device for discriminating the original text. The original text screening device comprises:

Further, the acquisition module is further configured to:

Further, the preprocessing module is further configured to:

Further, the first determining module is further configured to:

generating a first hash value corresponding to each first clause;

Further, the first determining module is further configured to:

The invention also provides a computer readable storage medium.

The computer readable storage medium of the present invention stores thereon an original text screening program which, when executed by a processor, implements the steps of the original text screening method as described above.

For the method implemented when the original text screening program is executed by the processor, reference may be made to the embodiments of the original text screening method of the present invention, which are not repeated here.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein, or any application, directly or indirectly, in the field of other related technology.

Claims

1. The original text screening method is characterized by comprising the following steps of:

preprocessing the text to be screened to obtain more than one first clause;

generating a first hash value corresponding to each first clause;

in the first clause, marking the clause corresponding to the third hash value as a non-original clause;

if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed a first preset value, calculating a first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and a second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, wherein the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 to infinity, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;

if the ratio of the first edit distance to the clause length of the (j-n)-th clause is smaller than a second preset value, marking the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is smaller than the second preset value, marking the (i+k+m)-th first clause as a non-original clause;

2. The method for screening original text according to claim 1, wherein the step of obtaining, after receiving the text to be screened, one or more objects to be compared corresponding to the text to be screened in a preset original database includes:

3. The method of screening original text according to claim 1, wherein the step of preprocessing the text to be screened to obtain more than one first clause comprises:

4. The method for screening original text according to claim 3, wherein the step of splitting the filtered text into clauses based on a preset clause rule to obtain more than one first clause comprises:

5. The original text screening method of claim 1, further comprising, after said determining the non-original clause present in the first clause:

6. An original text screening apparatus, the original text screening apparatus comprising:

the first determining module is used for generating a first hash value corresponding to each first clause; retrieving a hash value set corresponding to each object to be compared, wherein the hash value set comprises a plurality of second hash values; comparing the first hash values with the second hash values, and determining, among the first hash values, a third hash value whose Hamming distance from at least one second hash value is less than or equal to a first preset value; among the first clauses, marking the clause corresponding to the third hash value as a non-original clause; if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, calculating a first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and a second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, wherein the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 to infinity, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object; if the ratio of the first edit distance to the clause length of the (j-n)-th clause is smaller than a second preset value, marking the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is smaller than the second preset value, marking the (i+k+m)-th first clause as a non-original clause;

7. An original text screening apparatus, the original text screening apparatus comprising: a memory, a processor and an original text screening program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the original text screening method of any one of claims 1 to 5.

8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon an original text screening program, which when executed by a processor, implements the steps of the original text screening method according to any one of claims 1 to 5.