WO2021012958A1

WO2021012958A1 - Original text screening method, apparatus, device and computer-readable storage medium

Info

Publication number: WO2021012958A1
Application number: PCT/CN2020/101003
Authority: WO
Inventors: 蔡远航; 郑少杰; 付勇; 范增虎; 江旻
Original assignee: 深圳前海微众银行股份有限公司
Priority date: 2019-07-23
Filing date: 2020-07-09
Publication date: 2021-01-28
Also published as: CN110347806B; CN110347806A

Abstract

An original text screening method, an apparatus, a device and a computer-readable storage medium, the method comprising: after receiving text to be screened, acquiring from a preset original database more than one object to be compared that corresponds to the text to be screened (S10); pre-processing the text to be screened so as to obtain more than one first clause (S20); comparing each first clause to each object to be compared, and determining a non-original clause among the first clauses (S30); if the proportion of non-original clauses in the text to be screened is not larger than a preset plagiarism threshold, determining that the text to be screened is an original text (S40).

Description

Original text discrimination method, device, equipment and computer readable storage medium

This application claims priority to the Chinese patent application filed on July 23, 2019, with application number 201910669863.0 and entitled "Original text screening method, device, equipment and computer-readable storage medium", the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the technical field of financial technology (Fintech), in particular to original text discrimination methods, devices, equipment and computer-readable storage media.

Background technique

In recent years, with the continuous development of financial technology (Fintech), especially Internet finance, data screening technology has been introduced into the daily operations of banks and other financial institutions. In the daily publicity work of financial institutions such as banks, in order to ensure that publicity texts such as news, soft articles (advertorials) and advertisements do not plagiarize the work of others, the originality of the publicity texts needs to be reviewed before dissemination. Only original texts can avoid unnecessary copyright disputes and allow original works to receive the value they deserve. Therefore, screening texts for originality is essential when banks and other financial institutions publish them.

The current practice is that, before disseminating a publicity text to the outside world, the public relations department or another publicity department of a bank or other financial institution enters the publicity text into a computer, and the computer compares the publicity text with the texts in its original database, calculating similarity based on keywords to determine the originality of the publicity text.

However, the existing method can only judge whether the text to be screened is plagiarized; it cannot give a specific plagiarism rate. If the text to be screened is assembled from excerpts of multiple original texts, the existing method cannot reach a conclusion of plagiarism. In addition, for texts to be screened that contain a large number of subject substitutions and pronoun substitutions, it is difficult to screen their originality. The accuracy of existing screening methods is therefore low.

Technical solutions

The main purpose of this application is to propose an original text discrimination method, device, equipment and computer readable storage medium, aiming to improve the accuracy of original text discrimination.

In order to achieve the above-mentioned purpose, this application provides an original text screening method, and the original text screening method includes the following steps:

After receiving the text to be screened, obtain more than one object to be compared corresponding to the text to be screened in a preset original database;

Preprocessing the text to be screened to obtain more than one first clause;

Compare each of the first clauses with each of the objects to be compared to determine the non-original clauses in the first clause;

If the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, it is determined that the text to be screened is an original text.

In one embodiment, after receiving the text to be screened, the step of obtaining one or more objects to be compared corresponding to the text to be screened in a preset original database includes:

When receiving the text to be screened, determine the text length of the text to be screened, and cut the text to be screened into a number of character strings corresponding to the text length;

Obtain matching objects matching the character string in a preset original database, and select a preset number of objects to be compared from the matching objects.

In an embodiment, the step of preprocessing the text to be screened to obtain more than one first clause includes:

Filtering the text to be screened based on preset filtering rules to obtain filtered text;

Based on preset sentence rules, splitting the filtered text into clauses to obtain more than one first clause.

In one embodiment, the step of segmenting the filtered text based on a preset segmentation rule to obtain more than one first clause includes:

Based on preset clause rules, splitting the filtered text to obtain each clause, and sequentially determining whether the number of words of each clause reaches a preset number of words;

If the number of words in the current clause reaches the preset number of words, set the current clause as the first clause;

If the number of words in the current clause does not reach the preset number of words, the current clause is merged into the first clause that was set based on the previous clause.

In an embodiment, the step of comparing each of the first clauses with each of the objects to be compared, and determining the non-original clauses in the first clause includes:

Generating a first hash value corresponding to each of the first clauses;

Retrieve a set of hash values corresponding to each of the objects to be compared, where the set of hash values includes multiple second hash values;

Compare the first hash values with the second hash values and, among the first hash values, determine the third hash values whose Hamming distance to at least one of the second hash values is less than or equal to a first preset value;

In the first clause, the clause corresponding to the third hash value is marked as a non-original clause.

In an embodiment, after the step of marking the clause corresponding to the third hash value as a non-original clause, the method further includes:

If it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, calculate a first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and a second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 upward, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;

If the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, mark the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, mark the (i+k+m)-th first clause as a non-original clause.
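The ratio test on edit distances described above can be illustrated with a minimal Python sketch. It assumes the edit distance is the Levenshtein distance counted in characters; the function names and the example value 0.3 for the second preset value are hypothetical and do not come from this application.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two clauses (dynamic programming by rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete a character of a
                           cur[j - 1] + 1,             # insert a character of b
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def is_near_copy(first_clause: str, target_clause: str,
                 second_preset_value: float = 0.3) -> bool:
    """True if edit distance / length of the target clause is below the preset value.

    The default 0.3 is an assumed example value, not one given in the application.
    """
    if not target_clause:
        return False
    ratio = edit_distance(first_clause, target_clause) / len(target_clause)
    return ratio < second_preset_value
```

Under these assumptions, a clause adjacent to a run of hash-matched clauses would be marked as non-original when `is_near_copy` returns True for it and its counterpart in the target object.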

In an embodiment, after determining the non-original clauses existing in the first clause, the method further includes:

In the text to be screened, count the number of words in the non-original clauses, and determine the proportion of the non-original clauses in the text to be screened based on that number and the total word count of the text to be screened.

In addition, in order to achieve the above objective, this application also provides an original text screening device, the original text screening device includes:

The obtaining module is configured to obtain more than one object to be compared corresponding to the text to be screened in a preset original database after receiving the text to be screened;

The preprocessing module is used to preprocess the text to be screened to obtain more than one first clause;

The first determining module is configured to compare each of the first clauses with each of the objects to be compared, and determine the non-original clauses in the first clause;

The second determining module is configured to determine that the text to be screened is an original text if the proportion of the non-original clause in the text to be screened is not greater than a preset plagiarism threshold.

In an embodiment, the acquisition module is further used for:

In an embodiment, the preprocessing module is further used for:

In an embodiment, the first determining module is further configured to:

Generating a first hash value corresponding to each of the first clauses;

In an embodiment, the first determining module is further configured to:

In addition, in order to achieve the above objective, this application also provides original text screening equipment, the original text screening equipment comprising: a memory, a processor, and an original text screening program stored on the memory and executable on the processor, wherein when the original text screening program is executed by the processor, the steps of the original text screening method described above are implemented.

In addition, in order to achieve the above objective, the present application also provides a computer-readable storage medium storing an original text screening program, wherein when the original text screening program is executed by a processor, the steps of the original text screening method described above are implemented.

In the original text screening method proposed in this application, after the text to be screened is received, more than one object to be compared corresponding to the text to be screened is obtained from a preset original database; the text to be screened is preprocessed to obtain more than one first clause; each of the first clauses is compared with each of the objects to be compared to determine the non-original clauses among the first clauses; and if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, the text to be screened is determined to be an original text. The application also discloses an original text screening device, equipment and readable storage medium. This application splits the text to be screened into individual clauses, so that determining whether the text is original is decomposed into determining whether each clause is original; whether the text to be screened is an original text is then determined from the proportion of non-original clauses, which effectively improves the accuracy of original text screening.

Description of the drawings

FIG. 1 is a schematic diagram of a device structure of a hardware operating environment involved in a solution of an embodiment of the present application;

FIG. 2 is a schematic flowchart of a first embodiment of the method for identifying original texts of this application;

FIG. 3 is a schematic flowchart of a second embodiment of the method for identifying original texts of this application.

The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Embodiments of the invention

It should be understood that the specific embodiments described here are only used to explain the application, and are not used to limit the application.

As shown in FIG. 1, FIG. 1 is a schematic diagram of the device structure of the hardware operating environment involved in the solution of the embodiment of the present application.

The device in the embodiment of this application may be a PC or a server device.

As shown in FIG. 1, the device may include a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. Among them, the communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or a stable memory (non-volatile memory), such as a magnetic disk memory. Optionally, the memory 1005 may also be a storage device independent of the foregoing processor 1001.

Those skilled in the art can understand that the structure of the device shown in FIG. 1 does not constitute a limitation on the device, and may include more or fewer components than those shown in the figure, or a combination of certain components, or different component arrangements.

As shown in FIG. 1, the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and an original text discrimination program.

Among them, the operating system is a program that manages and controls the hardware and software resources of the original text discrimination equipment, and supports the operation of the network communication module, the user interface module, the original text discrimination program, and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.

In the original text screening device shown in FIG. 1, the original text screening device calls the original text screening program stored in the memory 1005 through the processor 1001, and executes the operations in each embodiment of the following original text screening method.

Based on the above hardware structure, an embodiment of the method for discriminating the original text of this application is proposed.

Referring to Fig. 2, Fig. 2 is a schematic flowchart of a first embodiment of a method for identifying original texts of this application. The method includes:

Step S10, after receiving the text to be screened, obtain more than one object to be compared corresponding to the text to be screened in a preset original database;

Step S20, preprocessing the text to be screened to obtain more than one first clause;

Step S30, comparing each of the first clauses with each of the objects to be compared to determine the non-original clauses in the first clause;

Step S40: If the proportion of the non-original clause in the text to be screened is not greater than a preset plagiarism threshold, it is determined that the text to be screened is an original text.

The original text screening method of this embodiment is applied to the original text screening equipment of financial institutions such as banks. For convenience of description, the original text screening equipment is hereinafter referred to as the screening equipment. The screening equipment is connected to the original database, which stores all original texts on the Internet, including original news texts, original advertisements, original soft articles (advertorials) and other works. In a specific implementation, due to hardware limitations, the original database generally stores only the original texts of the past 3 years. In addition, a search module is built into the screening equipment and is used to obtain, from the original database, the original objects corresponding to the current search term. The search module is preferably an ES search module (Elasticsearch, a distributed full-text search engine). The ES search module searches the original database based on the search terms and returns the search results; the higher a result ranks, the higher the text similarity between that result and the search term. The search principle of ES is existing technology and is not repeated here.

When the screening device of this embodiment receives the text to be screened, it first selects the objects to be compared that are related to the text to be screened from the original database, then splits the text to be screened into clauses, and then determines the proportion of non-original clauses among the clauses in order to determine whether the text to be screened is an original text.

Each step will be described in detail below:

In this embodiment, before publicizing or publishing a text, the publicity personnel of financial institutions such as banks input the text to be screened into the screening device to determine whether it is an original text, so as to avoid unnecessary copyright disputes.

After the screening device receives the text to be screened, it first obtains more than one object to be compared corresponding to the text to be screened from the original database. That is to say, this embodiment does not need to compare the current text to be screened with all the original texts in the original database one by one; instead, the objects to be compared that are related to the current text to be screened are first selected from the original database.

Specifically, step S10 includes:

In this step, when the screening device receives the text to be screened, it first determines the text length of the text to be screened, that is, it calculates the character length of the text, and cuts the text into a corresponding number of character strings. Specifically, the string length is preset, and the text to be screened is truncated according to the preset string length. For example, if the text length of the text to be screened is N and the preset string length is 100, then N/100 strings are obtained after truncation. The text needs to be truncated because the ES search module imposes an upper limit on the length of a search term; 100 characters is the preferred preset string length.

In this step, each character string is used as a search term, and the original objects matching the current character string are obtained from the preset original database. After all the character strings have been searched, the set of original objects obtained constitutes the matching objects. Since the original objects corresponding to different character strings may be the same, the retrieved original objects also need to be de-duplicated in this step, and a preset number of objects to be compared is then selected from the matching objects.

The specific selection method may be: in the retrieval result corresponding to each character string, the highest-ranked original objects are selected, and the number selected per string is: preset number / number of character strings. For example, if there are 10 strings and 5,000 original objects are to be obtained, then the top 5000/10 = 500 original objects are selected for each string, and the search results of the 10 strings are then combined to obtain 5,000 original objects.

It should be noted that the current string may have too few search results to satisfy the condition of fetching (preset number / number of strings) original objects each time. For example, if character string A has only 3 search results, which does not meet the condition of fetching the top 500 original objects each time, then only these 3 search results need to be obtained, and the missing 497 original objects are replaced with blank text.

Alternatively, when character string A has too few search results, more original objects are selected from the search results of the character string following character string A to make up for the shortfall.
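The candidate selection of step S10 can be sketched in Python as follows. This is only an illustration under stated assumptions: `search` is a hypothetical callable standing in for the ES search module (it takes a search term and a result limit and returns ranked identifiers of original objects), the function names are invented for the example, and a final string shorter than the preset length is simply kept.

```python
from typing import Callable, List


def split_into_strings(text: str, size: int = 100) -> List[str]:
    """Cut the text into consecutive strings of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def select_candidates(text: str,
                      search: Callable[[str, int], List[str]],
                      preset_count: int = 5000,
                      size: int = 100) -> List[str]:
    """Collect de-duplicated objects to be compared for the text to be screened.

    `search(term, limit)` is a hypothetical stand-in for the search module and
    must return at most `limit` identifiers, best match first.
    """
    strings = split_into_strings(text, size)
    if not strings:
        return []
    per_string = max(1, preset_count // len(strings))  # preset number / number of strings
    seen = set()
    candidates = []
    for term in strings:
        for doc_id in search(term, per_string):
            if doc_id not in seen:  # the same original object may match several strings
                seen.add(doc_id)
                candidates.append(doc_id)
    return candidates
```

In this sketch a string with too few search results simply contributes fewer candidates; padding with blank text or borrowing from the next string, as described above, is left out.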

Step S20, preprocessing the text to be screened to obtain more than one first clause.

In this embodiment, the screening device preprocesses the text to be screened to obtain more than one first clause, that is, the text to be screened is decomposed into individual clauses. Specifically, the text to be screened is split at punctuation marks, for example at periods.

Step S30, comparing each of the first clauses with each of the objects to be compared, and determining non-original clauses in the first clause.

In this embodiment, each first clause is compared with each object to be compared, so as to determine the originality of each first clause, and finally determine how many non-original clauses exist in the first clause.

Specifically, step S30 includes:

Generating a first hash value corresponding to each of the first clauses;

In this step, the screening device generates the first hash value corresponding to each first clause, where the first hash value is preferably a simhash value. Simhash is a locality-sensitive hash, a text hash mapping algorithm that maps a text to a bit string of length 64. The difference from ordinary hash algorithms is that the locality-sensitive hash values of two similar texts are also similar; in this embodiment, two texts are regarded as similar when the Hamming distance between their hash values is less than or equal to 3. Of course, the similarity calculation can also be performed when the first hash value is an ordinary hash value. This embodiment is described by taking simhash, the preferred choice, as an example.

Specifically, the text to be screened is split at punctuation marks, such as periods, into first clauses, and each first clause is segmented into words to obtain effective feature vectors. Each feature vector is then assigned one of five weight levels, from 1 to 5, where the weight of a feature vector can be the number of times the corresponding word appears in the text to be screened. For example, the current first clause is "Internet banks issue loans through face recognition technology and big data credit rating"; after word segmentation it becomes "Internet bank / through / face recognition technology / and / big data / credit rating / issue / loan", and each feature vector is then given a weight: Internet bank (4), through (1), face recognition technology (3), and (1), big data (4), credit rating (5), issue (1), loan (5), where the number in parentheses represents the importance of the word in the current clause; the larger the number, the more important the word.

Then, the hash value of each feature vector is calculated through a hash function; the hash value is a signature composed of the binary digits 0 and 1. For example, the hash value of "Internet bank", Hash(Internet bank), is 100101, and the hash value of "loan", Hash(loan), is 101011. At this point, the current clause has become a series of numbers.

Then, on the basis of the hash values, all the feature vectors are weighted, that is, W = Hash × weight: where a bit of the hash value is 1, the weight is taken with a positive sign, and where a bit is 0, the weight is taken with a negative sign. For example, weighting the hash value "100101" of "Internet bank" gives W(Internet bank) = 100101 × 4 = 4 -4 -4 4 -4 4, and weighting the hash value "101011" of "loan" gives W(loan) = 101011 × 5 = 5 -5 5 -5 5 5. The other feature vectors are handled in the same way.

Then, the weighted results of the above feature vectors are accumulated position by position into a single sequence. For example, adding "4 -4 -4 4 -4 4" of "Internet bank" and "5 -5 5 -5 5 5" of "loan" gives "4+5 -4+-5 -4+5 4+-5 -4+5 4+5", that is, "9 -9 1 -1 1 9".

Finally, each position of the accumulated result is set to 1 if it is greater than 0 and to 0 otherwise, thereby obtaining the simhash value of the current first clause. For example, the above result "9 -9 1 -1 1 9" finally becomes "1 0 1 0 1 1".
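The weighting, accumulation and thresholding steps above can be sketched in Python as follows. The `combine` function reproduces the two-word worked example with its 6-bit hash values; the use of MD5 in `token_hash_bits` is an arbitrary assumption for illustration, since the application does not name a hash function, and all function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def token_hash_bits(token: str, bits: int = 64) -> List[int]:
    """Hash a word to `bits` binary digits (MD5 is an assumed, arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (128 - bits)  # keep the top `bits` bits
    return [(value >> (bits - 1 - i)) & 1 for i in range(bits)]


def combine(weighted_bits: List[Tuple[List[int], int]]) -> List[int]:
    """Weight each hash (+w for a 1 bit, -w for a 0 bit), sum per position, threshold at 0."""
    if not weighted_bits:
        return []
    totals = [0] * len(weighted_bits[0][0])
    for bits, weight in weighted_bits:
        for i, bit in enumerate(bits):
            totals[i] += weight if bit == 1 else -weight
    return [1 if total > 0 else 0 for total in totals]


def simhash(weights: Dict[str, int], bits: int = 64) -> List[int]:
    """Simhash of a clause given its words and their weights."""
    return combine([(token_hash_bits(tok, bits), w) for tok, w in weights.items()])


# Worked example from the text: "Internet bank" (100101, weight 4), "loan" (101011, weight 5)
example = combine([([1, 0, 0, 1, 0, 1], 4), ([1, 0, 1, 0, 1, 1], 5)])
```

Here `example` holds the bits 1 0 1 0 1 1, matching the result derived by hand above.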

In this step, multiple original objects are stored in the original database, together with the list of clauses of each original object and the simhash value of each clause. Therefore, the screening equipment can retrieve from the original database the hash value set corresponding to each object to be compared, where the hash value set includes a plurality of second hash values, and the second hash value is preferably a second simhash value.

Specifically, each first simhash value is compared with the second simhash values of each object to be compared to determine their Hamming distance, where the Hamming distance is the number of positions at which the corresponding characters of two strings differ; in other words, it is the number of characters that need to be replaced to transform one string into the other. For example, the Hamming distance between 1011101 and 1001001 is 2.

The calculated Hamming distance is compared with the first preset value to determine, among the first hash values, the third hash values whose Hamming distance to at least one second hash value is less than or equal to the first preset value, where the first preset value is preferably 3 in a specific implementation. When the Hamming distance is less than or equal to 3, it is determined that the current first clause is plagiarized.
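The Hamming-distance comparison can be sketched in Python as follows, assuming the hash values are held as integers. The function names are hypothetical; the default of 3 is the preferred first preset value stated above.

```python
from typing import List


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two hash values differ."""
    return bin(a ^ b).count("1")


def mark_non_original(first_hashes: List[int],
                      second_hashes: List[int],
                      first_preset_value: int = 3) -> List[bool]:
    """Flag each first clause whose hash is within the preset distance of any second hash."""
    return [
        any(hamming_distance(h, s) <= first_preset_value for s in second_hashes)
        for h in first_hashes
    ]
```

For the example in the text, `hamming_distance(0b1011101, 0b1001001)` evaluates to 2, so the corresponding clause would be flagged with the default preset value.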

In this step, among the first clauses, each clause determined to be plagiarized is marked as a non-original clause, for example by displaying it in red.

In this embodiment, the proportion of non-original clauses in the text to be screened is counted. Specifically, the number of non-original clauses and the number of first clauses in the text to be screened can be counted, so as to calculate the proportion of non-original clauses in the text to be screened. It is then determined whether the proportion of non-original clauses in the text to be screened is not greater than the preset plagiarism threshold; if so, the text to be screened is determined to be an original text; if not, the text to be screened is determined to be a plagiarized text.

Further, after the determination of the non-original clause existing in the first clause, the method further includes:

In this step, the proportion of non-original clauses in the text to be screened can also be determined by counting the number of words in the non-original clauses and the total number of words in the text to be screened, and dividing the former by the latter. Based on this proportion, it is determined whether the text to be screened is an original text.

In this embodiment, a plagiarism threshold is preset, such as 80%. After the proportion of non-original clauses in the text to be screened has been calculated, it is determined whether the proportion is greater than the preset plagiarism threshold; if so, the text to be screened is determined to be a plagiarized text; otherwise, it is an original text.
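The word-count-based proportion and the threshold decision can be sketched in Python as follows. The sketch assumes that the number of words is approximated by the number of characters in each clause and that the total is the sum over all first clauses; the function names are hypothetical, and 0.8 corresponds to the 80% example threshold above.

```python
from typing import List


def plagiarism_ratio(clauses: List[str], non_original: List[bool]) -> float:
    """Share of the text (by character count) that lies in non-original clauses."""
    total = sum(len(c) for c in clauses)
    if total == 0:
        return 0.0
    copied = sum(len(c) for c, flagged in zip(clauses, non_original) if flagged)
    return copied / total


def is_original(clauses: List[str], non_original: List[bool],
                threshold: float = 0.8) -> bool:
    """The text counts as original if the ratio is not greater than the threshold."""
    return plagiarism_ratio(clauses, non_original) <= threshold
```

With clauses of 4, 2 and 4 characters of which only the first is flagged, the ratio is 0.4 and the text is treated as original under the 80% threshold.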

In this embodiment, after the text to be screened is received, more than one object to be compared corresponding to the text to be screened is obtained from the preset original database; the text to be screened is preprocessed to obtain more than one first clause; each of the first clauses is compared with each of the objects to be compared to determine the non-original clauses among the first clauses; and if the proportion of the non-original clauses in the text to be screened is not greater than the preset plagiarism threshold, the text to be screened is determined to be an original text. This embodiment splits the text to be screened into individual clauses, so that determining whether the text is original is decomposed into determining whether each clause is original; whether the text to be screened is an original text is then determined from the proportion of non-original clauses, which effectively improves the accuracy of original text screening.

Further, based on the first embodiment of the original text screening method of this application, a second embodiment of the original text screening method of this application is proposed.

The difference between the second embodiment of the original text screening method and the first embodiment of the original text screening method is that, referring to FIG. 3, the preprocessing includes filtering and sentence segmentation, and step S20 includes:

Step S21, filtering the text to be screened based on preset filtering rules to obtain filtered text;

Step S22, based on preset sentence rules, splitting the filtered text into clauses to obtain more than one first clause.

In this embodiment, when the text to be screened is preprocessed, filtering and clause splitting are used to decompose the text to be screened into first clauses.

Each step will be described in detail below:

Step S21: Filter the text to be screened based on preset filtering rules to obtain filtered text.

In this embodiment, the screening device filters the text to be screened based on preset filtering rules. The preset filtering rules are: filter out meaningless symbols in the text to be screened, including HTML tags, HTML character entities, emoticon symbols and the like; convert traditional Chinese in the text to be screened into simplified Chinese; and replace dashes, Chinese and English single quotation marks, Chinese and English double quotation marks, and Chinese and English colons in the text to be screened with Chinese commas. The purpose is to prevent differences in symbols from affecting the originality screening, for example:

Project leader Zhang San: never forget the original intention and forge ahead.

Project leader Zhang San - never forget the original intention and forge ahead.

Project leader Zhang San: "Never forget the original intention and forge ahead."

After filtering the text to be screened, the filtered text is obtained so that the filtered text can be subsequently segmented.
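A minimal Python sketch of part of the filtering rules follows. It covers only the removal of HTML tags and character entities and the replacement of dashes, quotation marks and colons with a Chinese comma; emoticon removal and traditional-to-simplified conversion are omitted because they would need additional resources. The regular expressions and names are assumptions made for this example, and treating the ASCII hyphen as a dash is likewise an assumption.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # HTML tags such as <p> or </div>
ENTITY_RE = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")  # e.g. &nbsp;

# Dashes, ASCII hyphen, half-width and full-width colons, straight and curly quotes
REPLACED = "\u2014\u2013-:\uff1a'\"\u2018\u2019\u201c\u201d"
PUNCT_RE = re.compile("[" + re.escape(REPLACED) + "]+")

CHINESE_COMMA = "\uff0c"


def filter_text(text: str) -> str:
    """Strip HTML residue and map the listed punctuation to a Chinese comma."""
    text = TAG_RE.sub("", text)
    text = ENTITY_RE.sub("", text)
    # A run of adjacent replaced symbols collapses into a single comma
    return PUNCT_RE.sub(CHINESE_COMMA, text)
```

After this normalization, variants of a sentence that differ only in a colon versus a dash map to the same string, which is the effect the filtering rules aim for.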

In this embodiment, the screening device splits the filtered text into clauses based on preset sentence rules. The preset sentence rules are: split at Chinese and English commas, Chinese and English periods, Chinese and English exclamation marks, Chinese and English question marks, Chinese semicolons, Chinese enumeration commas, spaces, and the escape characters \n and \t. For example, "Project leader Zhang San said to the team members, next we will go further, never forget the original intention, forge ahead" is split at the symbols into: "Project leader Zhang San said to the team members", "next we will go further", "never forget the original intention", "forge ahead". After the text to be screened has been filtered and split, each first clause is obtained.

Further, step S22 includes:

Step a, based on a preset clause rule, segment the filtered text to obtain each clause, and sequentially determine whether the number of words in each clause reaches the preset number of words;

In this step, the screening device segments the filtered text based on the preset clause rules to obtain each clause. Specifically, the preset clause rules split at Chinese and English commas, Chinese and English periods, Chinese and English exclamation marks, Chinese and English question marks, Chinese semicolons, Chinese enumeration commas, spaces, and the escape characters \n and \t. The screening device then determines whether the number of words in each clause reaches the preset number of words, such as 10 words. Because clauses are delimited by symbols, their lengths differ; to reduce the number of comparisons, clauses with few words are merged so that there are fewer clauses. In addition, a complete and meaningful sentence needs a subject, predicate, object, and so on, which implies a certain minimum word count. Therefore, after the filtered text is segmented, it is necessary to determine whether the word count of each clause reaches the preset word count.

Step b, if the number of words in the current clause reaches the preset number of words, set the current clause as the first clause;

In the process of determining whether the word count of each clause reaches the preset word count, if the word count of the current clause reaches the preset word count, such as 10 words, the current clause is set as the first clause.

Step c, if the number of words in the current clause does not reach the preset number of words, merge the current clause into the first clause that was set based on the previous clause.

If the preset word count is not reached, the current clause is merged into the first clause that was set based on the previous clause, that is, the current clause is merged with the preceding first clause. The preceding first clause may be the first clause formed from the clause immediately before the current clause, or the first clause formed from the two clauses before the current clause, and so on. If the current clause is the first clause of the text to be screened and its word count does not reach the preset word count, the current clause is merged with the first clause that is set based on the next clause.

In the above example, segmentation yields: "Project leader Zhang San said to the team members", "next we must go further", "never forget the original intention", "and forge ahead". After the clauses with fewer than 10 words are merged (word counts refer to the original Chinese text), the result is: "Project leader Zhang San said to the team members", "next we must go further, never forget the original intention, and forge ahead".
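Steps a to c can be sketched as follows. The name `merge_short_clauses` is hypothetical. Measuring the word count as the number of characters is an assumption suited to Chinese text, and the default of 10 follows the example value given above. Clauses are joined without a separator.

```python
def merge_short_clauses(clauses: list[str], min_len: int = 10) -> list[str]:
    """Merge clauses shorter than min_len into a neighbouring first clause."""
    first_clauses: list[str] = []
    pending = ""  # short clauses seen before any first clause exists
    for clause in clauses:
        if len(clause) >= min_len:
            # Step b: the clause is long enough and becomes a first clause,
            # absorbing any short clauses from the start of the text.
            first_clauses.append(pending + clause)
            pending = ""
        elif first_clauses:
            # Step c: merge into the first clause set from the previous clause.
            first_clauses[-1] += clause
        else:
            # Short clause at the start of the text: hold for the next clause.
            pending += clause
    if pending:
        # Every clause was short; keep the merged text as a single clause.
        first_clauses.append(pending)
    return first_clauses
```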

The preprocessing in this embodiment includes filtering and sentence segmentation. When the text to be screened is preprocessed, filtering removes factors that would affect the originality screening, namely the various meaningless symbols, and segmentation decomposes the text to be screened into clauses. The originality of the text to be screened is then determined from the originality of each clause. Refining the screening objects in this way improves the accuracy of original text discrimination.

Further, based on the first and second embodiments of the original text screening method of this application, a third embodiment of the original text screening method of this application is proposed.

The third embodiment of the original text screening method differs from the first and second embodiments in that, after the step of marking the clause corresponding to the third hash value as a non-original clause, the method further includes:

Step d, if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, calculate the first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and the second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, wherein the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 upward, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;

Step e, if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, mark the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, mark the (i+k+m)-th first clause as a non-original clause.

This embodiment addresses text to be screened in which subjects and pronouns have been replaced. After the text to be screened has been decomposed into clauses and the originality of each clause has been determined, the edit distance of individual clauses is also calculated to further confirm their originality.

Each step will be described in detail below:

Step d, if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, calculate the first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and the second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 upward, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object.

In this embodiment, after marking each non-original clause, the screening device monitors the continuity of the marked clauses, that is, the continuity of the non-original clauses, in real time. Suppose the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, that is, the i-th to (i+k)-th first clauses of the text to be screened are all marked as non-original clauses. The screening device then calculates the first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and the second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object. The (i-n)-th and (i+k+m)-th first clauses are not marked, but they may contain subject substitution or pronoun substitution. Simple subject or pronoun substitution does not make a clause original, yet the Hamming distance cannot detect it. Therefore, when k consecutive first clauses of the text to be screened are marked as non-original clauses, the clauses before and after them may also be plagiarized, and the edit distance is calculated for the (i-n)-th and (i+k+m)-th first clauses. Here the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 upward, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object.

Edit distance is the minimum number of editing operations required to convert one string into another. In this embodiment, it is the minimum number of editing operations required to convert a clause in the text to be screened into a clause in the target object.
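As an illustration, the edit distance can be computed with the standard dynamic-programming recurrence. The sketch below assumes the Levenshtein variant, in which insertion, deletion, and substitution of one character each cost 1; this application does not state which editing operations are counted. The name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for row, ch_a in enumerate(a, start=1):
        current = [row]  # distance from a[:row] to ""
        for col, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[col] + 1,         # delete ch_a
                current[col - 1] + 1,      # insert ch_b
                previous[col - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

For two clauses that differ only in a replaced subject or pronoun, this distance is small relative to the clause length.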

It should be noted that, in addition to the edit distance, the Jaccard distance or the length of the longest common subsequence can also be used to determine whether a previously unmarked clause is plagiarized.
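The two alternative measures can be sketched as follows. The function names are hypothetical, and computing the Jaccard distance over sets of characters is an assumption, since this application does not say which units are compared.

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the character sets of a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for col, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[col - 1] + 1)
            else:
                current.append(max(previous[col], current[col - 1]))
        previous = current
    return previous[-1]
```

A small Jaccard distance, or a longest common subsequence nearly as long as the target clause, would play the same role as a small edit distance.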

Step e, if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, mark the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, mark the (i+k+m)-th first clause as a non-original clause.

In this embodiment, after the first edit distance and the second edit distance are obtained, the first edit distance is divided by the clause length of the (j-n)-th clause of the target object, and the second edit distance is divided by the clause length of the (j+k+m)-th clause of the target object. If the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than the second preset value, the (i-n)-th first clause in the text to be screened is marked as a non-original clause; if the ratio is equal to or greater than the second preset value, the (i-n)-th first clause is not marked. Similarly, if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, the (i+k+m)-th first clause in the text to be screened is marked as a non-original clause; if the ratio is equal to or greater than the second preset value, the (i+k+m)-th first clause is not marked. The second preset value is preferably 0.1.

That is, first calculate the edit distance between the (i-1)-th first clause in the text to be screened and the (j-1)-th clause of the target object. When the ratio of this edit distance to the clause length of the (j-1)-th clause is less than the second preset value, the (i-1)-th first clause is marked; when the ratio is greater than or equal to the second preset value, the (i-1)-th first clause is not marked. Then continue by calculating the edit distance between the (i-2)-th first clause in the text to be screened and the (j-2)-th clause of the target object, and so on, until an (i-n)-th first clause is reached that has already been marked. In the same way, calculate the edit distance between the (i+k+1)-th first clause in the text to be screened and the (j+k+1)-th clause of the target object. When the ratio of this edit distance to the clause length of the (j+k+1)-th clause is less than the second preset value, the (i+k+1)-th first clause is marked; when the ratio is greater than or equal to the second preset value, the (i+k+1)-th first clause is not marked. Then continue by calculating the edit distance between the (i+k+2)-th first clause in the text to be screened and the (j+k+2)-th clause of the target object, and so on, until an (i+k+m)-th first clause is reached that has already been marked, or the current first clause is the last first clause of the text to be screened.

When the (i-n)-th first clause in the text to be screened has already been marked, it has already been determined to be plagiarized, and there is no need to calculate the edit distance for it. Similarly, when the (i+k+m)-th first clause in the text to be screened has already been marked, there is no need to calculate the edit distance for subsequent clauses. At this point, the number of words in the clauses marked as non-original is counted. The non-original clauses now include both those initially marked by Hamming distance and those subsequently marked by edit distance.

It should be noted that if the (i+k+m)-th first clause of the text to be screened has no counterpart in the target object, that is, the end of the target object has been reached, the (i+k+m)-th first clause of the text to be screened is not marked.
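Steps d and e, together with the stopping conditions described above, can be sketched as a backward and a forward scan around a matched run. All names here (`expand_marks`, `edit_distance`, `threshold`) are hypothetical, the edit distance is assumed to be the Levenshtein distance, and indices are 0-based rather than 1-based as in the description. The default threshold of 0.1 follows the preferred second preset value.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed variant of the edit distance)."""
    previous = list(range(len(b) + 1))
    for row, ch_a in enumerate(a, start=1):
        current = [row]
        for col, ch_b in enumerate(b, start=1):
            current.append(min(previous[col] + 1,
                               current[col - 1] + 1,
                               previous[col - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def expand_marks(text_clauses: list[str], target_clauses: list[str],
                 marked: set[int], i: int, j: int, k: int,
                 threshold: float = 0.1) -> set[int]:
    """Extend marks around the run text[i..i+k], aligned with target[j..j+k].

    `marked` holds the indices already marked by Hamming distance.
    """
    result = set(marked)

    def check(ti: int, tj: int) -> None:
        target = target_clauses[tj]
        # Step e: mark when edit distance / target clause length < threshold.
        if target and edit_distance(text_clauses[ti], target) / len(target) < threshold:
            result.add(ti)

    # Backward scan: stop at an already marked clause or at either beginning.
    ti, tj = i - 1, j - 1
    while ti >= 0 and tj >= 0 and ti not in marked:
        check(ti, tj)
        ti, tj = ti - 1, tj - 1

    # Forward scan: stop at an already marked clause, at the last first clause
    # of the text, or when the target object has no clause left to compare.
    ti, tj = i + k + 1, j + k + 1
    while ti < len(text_clauses) and tj < len(target_clauses) and ti not in marked:
        check(ti, tj)
        ti, tj = ti + 1, tj + 1
    return result
```

The scan keeps going past a clause that fails the ratio test, as the description requires, and only stops on the conditions listed in the comments.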

In this embodiment, when determining whether the text to be screened is an original text, subject substitution and pronoun substitution are considered in addition to the similarity between the text to be screened and the objects to be compared. Compared with using the Hamming distance algorithm alone, this effectively handles the many plagiarism scenarios that involve subject substitution and pronoun substitution, further improving the accuracy of original text discrimination.

This application also provides an original text screening device. The original text screening device of this application includes:

Further, the acquisition module is also used for:

Further, the preprocessing module is also used for:

Further, the first determining module is also used for:

Generating a first hash value corresponding to each of the first clauses;

Further, the first determining module is also used for:

The application also provides a computer-readable storage medium.

The computer-readable storage medium of the present application stores an original text screening program, and when the original text screening program is executed by a processor, the steps of the original text screening method as described above are realized.

For the method implemented when the original text screening program running on the processor is executed, please refer to the various embodiments of the original text screening method of the present application, which will not be repeated here.

It should be noted that in this document, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or system. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or system that includes the element.

The serial numbers of the foregoing embodiments of the present application are for description only, and do not represent the superiority of the embodiments.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of this application.

The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

1. An original text screening method, wherein the original text screening method includes the following steps:

After receiving the text to be screened, obtaining more than one object to be compared corresponding to the text to be screened from a preset original database;

Preprocessing the text to be screened to obtain more than one first clause;

Comparing each of the first clauses with each of the objects to be compared to determine the non-original clauses among the first clauses;

If the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, determining that the text to be screened is an original text.
2. The original text screening method according to claim 1, wherein the step of obtaining more than one object to be compared corresponding to the text to be screened from a preset original database after receiving the text to be screened comprises:

When receiving the text to be screened, determine the text length of the text to be screened, and cut the text to be screened into a number of character strings corresponding to the text length;

Obtain matching objects matching the character string in a preset original database, and select a preset number of objects to be compared from the matching objects.
3. The original text screening method according to claim 1, wherein the step of preprocessing the text to be screened to obtain more than one first clause comprises:

Filtering the text to be screened based on preset filtering rules to obtain filtered text;

Segmenting the filtered text based on preset sentence rules to obtain more than one first clause.
4. The original text screening method according to claim 3, wherein the step of segmenting the filtered text based on preset sentence rules to obtain more than one first clause comprises:

Segmenting the filtered text based on preset clause rules to obtain each clause, and sequentially determining whether the number of words of each clause reaches the preset number of words;

If the number of words in the current clause reaches the preset number of words, setting the current clause as a first clause;

If the number of words in the current clause does not reach the preset number of words, merging the current clause into the first clause that was set based on the previous clause.
5. The original text screening method according to any one of claims 1 to 4, wherein the step of comparing each of the first clauses with each of the objects to be compared to determine the non-original clauses among the first clauses comprises:

Generating a first hash value corresponding to each of the first clauses;

Retrieving a set of hash values corresponding to each of the objects to be compared, where the set of hash values includes multiple second hash values;

Comparing the first hash values with the second hash values, and determining, among the first hash values, a third hash value whose Hamming distance to at least one of the second hash values is less than or equal to a first preset value;

Among the first clauses, marking the clause corresponding to the third hash value as a non-original clause.
6. The original text screening method according to claim 5, wherein after the step of marking the clause corresponding to the third hash value as a non-original clause, the method further comprises:

If it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, calculating the first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and the second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, wherein the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 upward, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;

If the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, marking the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, marking the (i+k+m)-th first clause as a non-original clause.
7. The original text screening method according to claim 1, wherein after the step of determining the non-original clauses among the first clauses, the method further comprises:

In the text to be screened, counting the number of words of the non-original clauses, and determining the proportion of the non-original clauses in the text to be screened based on that number of words and the total word count of the text to be screened.
8. An original text screening device, wherein the original text screening device includes:

The obtaining module, configured to obtain more than one object to be compared corresponding to the text to be screened from a preset original database after receiving the text to be screened;

The preprocessing module, configured to preprocess the text to be screened to obtain more than one first clause;

The first determining module, configured to compare each of the first clauses with each of the objects to be compared and determine the non-original clauses among the first clauses;

The second determining module, configured to determine that the text to be screened is an original text if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold.
9. An original text screening device, wherein the original text screening device includes: a memory, a processor, and an original text screening program stored on the memory and executable on the processor, the original text screening program, when executed by the processor, implementing the steps of the original text screening method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein an original text screening program is stored on the computer-readable storage medium, and when the original text screening program is executed by a processor, the steps of the original text screening method according to any one of claims 1 to 7 are implemented.