WO2021012958A1 - Original text screening method, apparatus, device and computer-readable storage medium - Google Patents

Info

Publication number
WO2021012958A1
WO2021012958A1 PCT/CN2020/101003 CN2020101003W WO2021012958A1 WO 2021012958 A1 WO2021012958 A1 WO 2021012958A1 CN 2020101003 W CN2020101003 W CN 2020101003W WO 2021012958 A1 WO2021012958 A1 WO 2021012958A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
clause
original
screened
preset
Prior art date
Application number
PCT/CN2020/101003
Other languages
French (fr)
Chinese (zh)
Inventor
蔡远航
郑少杰
付勇
范增虎
江旻
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2021012958A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the technical field of financial technology (Fintech), in particular to original text discrimination methods, devices, equipment and computer-readable storage media.
  • The current practice is that the public relations department or other publicity departments of banks and other financial institutions, before disseminating a publicity text to the outside world, enter the publicity text into a computer, which compares it with the texts in the computer's original database and calculates the similarity based on keywords to determine the originality of the publicity text.
  • the existing method can only judge whether the text to be screened is plagiarized, but cannot give a specific index of plagiarism rate. If the text to be screened is extracted from multiple original texts in turn, the existing method cannot give a conclusion of plagiarism. And, for texts to be screened that have a large number of subject substitutions and pronoun substitutions, it is difficult to screen their originality. Obviously, the accuracy of existing screening methods is low.
  • the main purpose of this application is to propose an original text discrimination method, device, equipment and computer readable storage medium, aiming to improve the accuracy of original text discrimination.
  • this application provides an original text screening method, and the original text screening method includes the following steps:
  • if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, it is determined that the text to be screened is an original text.
  • the step of obtaining one or more objects to be compared corresponding to the text to be screened in a preset original database includes:
  • the step of preprocessing the text to be screened to obtain more than one first clause includes:
  • the filtered text is split into clauses to obtain more than one first clause.
  • the step of segmenting the filtered text based on a preset segmentation rule to obtain more than one first clause includes:
  • the current clause is merged into the first clause that was set based on the previous clause.
  • the step of comparing each of the first clauses with each of the objects to be compared, and determining the non-original clauses in the first clause includes:
  • the first hash values are compared with the second hash values, and a third hash value, whose Hamming distance to at least one of the second hash values is less than or equal to the first preset value, is determined among the first hash values.
  • the clause corresponding to the third hash value is marked as a non-original clause.
  • the method further includes:
  • if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, mark the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, mark the (i+k+m)-th first clause as a non-original clause.
  • the method further includes:
  • this application also provides an original text screening device, the original text screening device includes:
  • the obtaining module is configured to obtain more than one object to be compared corresponding to the text to be screened in a preset original database after receiving the text to be screened;
  • the preprocessing module is used to preprocess the text to be screened to obtain more than one first clause
  • the first determining module is configured to compare each of the first clauses with each of the objects to be compared, and determine the non-original clauses in the first clause;
  • the second determining module is configured to determine that the text to be screened is an original text if the proportion of the non-original clause in the text to be screened is not greater than a preset plagiarism threshold.
  • the acquisition module is further used for:
  • the preprocessing module is further used for:
  • the filtered text is split into clauses to obtain more than one first clause.
  • the preprocessing module is further used for:
  • the current clause is merged into the first clause that was set based on the previous clause.
  • the first determining module is further configured to:
  • the first hash values are compared with the second hash values, and a third hash value, whose Hamming distance to at least one of the second hash values is less than or equal to the first preset value, is determined among the first hash values.
  • the clause corresponding to the third hash value is marked as a non-original clause.
  • the first determining module is further configured to:
  • if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, mark the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, mark the (i+k+m)-th first clause as a non-original clause.
  • the first determining module is further configured to:
  • this application also provides original text screening equipment, comprising: a memory, a processor, and an original text screening program stored on the memory and runnable on the processor; when the original text screening program is executed by the processor, the steps of the original text screening method as described above are realized.
  • the present application also provides a computer-readable storage medium that stores an original text screening program; when the original text screening program is executed by a processor, the steps of the original text screening method described above are realized.
  • the text to be screened is preprocessed to obtain more than one first clause; each of the first clauses is compared with each of the objects to be compared to determine the non-original clauses among the first clauses; if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, it is determined that the text to be screened is an original text.
  • the application also discloses an original text screening device, equipment and readable storage medium.
  • This application splits the text to be screened into individual clauses and decomposes the question of whether the text to be screened is an original text into the question of whether each clause is an original clause, so that the originality of the text to be screened is determined from the proportion of original clauses, which effectively improves the accuracy of original text discrimination.
  • FIG. 1 is a schematic diagram of a device structure of a hardware operating environment involved in a solution of an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a first embodiment of a method for identifying original texts of an application
  • FIG. 3 is a schematic flowchart of a second embodiment of the method for identifying original texts of this application.
  • FIG. 1 is a schematic diagram of the device structure of the hardware operating environment involved in the solution of the embodiment of the present application.
  • the device in the embodiment of this application may be a PC or a server device.
  • the device may include a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory, or a stable memory (non-volatile memory), such as a magnetic disk memory.
  • the memory 1005 may also be a storage device independent of the foregoing processor 1001.
  • The device structure shown in FIG. 1 does not constitute a limitation on the device, which may include more or fewer components than those shown in the figure, a combination of certain components, or a different arrangement of components.
  • the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and an original text discrimination program.
  • the operating system is a program that manages and controls original text discrimination equipment and software resources, and supports the operation of network communication modules, user interface modules, original text discrimination programs, and other programs or software;
  • the network communication module is used to manage and control the network interface 1004;
  • the user interface module is used to manage and control the user interface 1003.
  • the original text screening device calls the original text screening program stored in the memory 1005 through the processor 1001, and executes the operations in each embodiment of the following original text screening method.
  • Fig. 2 is a schematic flowchart of a first embodiment of a method for identifying original texts of this application. The method includes:
  • Step S10 after receiving the text to be screened, obtain more than one object to be compared corresponding to the text to be screened in a preset original database;
  • Step S20 preprocessing the text to be screened to obtain more than one first clause
  • Step S30 comparing each of the first clauses with each of the objects to be compared to determine the non-original clauses in the first clause;
  • Step S40 If the proportion of the non-original clause in the text to be screened is not greater than a preset plagiarism threshold, it is determined that the text to be screened is an original text.
  • The original text screening method of this embodiment is applied to the original text screening equipment of financial institutions or banking systems.
  • the original text screening equipment is hereinafter referred to as screening equipment.
  • The screening equipment is connected to the original database, which stores all original texts on the Internet, including original news texts, original advertisements, original soft articles (advertorials) and other works.
  • the original database generally only stores the original texts of the past 3 years.
  • A search module is built into the screening equipment and is used to obtain, from the original database, the original objects corresponding to the current search term. The search module is preferably an ES search module (Elasticsearch, a distributed full-text search engine).
  • the ES search module searches the original database based on the search terms and returns the search results.
  • The search principle of ES is existing technology, so it is not repeated here.
  • On receiving the text to be screened, the screening device of this embodiment first selects from the original database the objects to be compared that are related to the text to be screened, then splits the text to be screened into clauses, and then determines the proportion of non-original clauses among those clauses to decide whether the text to be screened is an original text.
  • Step S10 after receiving the text to be screened, obtain more than one object to be compared corresponding to the text to be screened in a preset original database;
  • Before publicizing or publishing the text to be screened, the external publicity personnel of financial institutions such as banks input it into the screening device to check whether it is an original text, so as to avoid unnecessary copyright disputes.
  • After the screening device receives the text to be screened, it first obtains more than one object to be compared corresponding to the text to be screened in the original database. That is to say, this embodiment does not compare the current text to be screened with all the original texts in the original database one by one;
  • instead, the objects to be compared that are related to the current text to be screened are first selected from the original database.
  • step S10 includes:
  • When the screening device receives the text to be screened, it first determines the text length of the text to be screened, that is, calculates its length in characters, and cuts the text to be screened into a corresponding number of character strings.
  • The string length is preset, and the text to be screened is truncated according to the preset string length. For example, if the text length of the text to be screened is N and the preset string length is 100, then N/100 strings (rounded up when N is not a multiple of 100) are obtained after truncation. The truncation is needed because the ES search module has an upper limit on the length of a search term; 100 characters is the preferred preset string length.
  • each character string is searched as a search term, and the original object matching the current character string is obtained in the preset original database.
  • The set of original objects obtained is the set of matching objects. Since the original objects corresponding to different character strings may be the same, the retrieved original objects also need to be de-duplicated in this step, and a preset number of objects to be compared is then selected from the matching objects.
  • The search results for the current string may not satisfy the condition of fetching (preset number / number of strings) original objects each time. For example, if there are only 3 search results for string A while the condition is to fetch the top 500 original objects each time, only these 3 search results need to be obtained, and the missing 497 original objects are replaced with blank text.
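The truncation, per-string search and de-duplication of step S10 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `search` callable stands in for the ES search module and is hypothetical, as are the function names and the choice to return document identifiers; padding with blank text is omitted.

```python
from typing import Callable, List


def split_into_queries(text: str, chunk_len: int = 100) -> List[str]:
    """Cut the text into consecutive strings of at most chunk_len characters."""
    return [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]


def collect_candidates(
    text: str,
    search: Callable[[str, int], List[str]],  # hypothetical: (query, top_k) -> ids
    preset_total: int = 500,
    chunk_len: int = 100,
) -> List[str]:
    """Search each string, de-duplicate the hits, keep at most preset_total."""
    queries = split_into_queries(text, chunk_len)
    if not queries:
        return []
    # Each string fetches (preset number / number of strings) results.
    per_query = max(1, preset_total // len(queries))
    seen = set()
    candidates: List[str] = []
    for query in queries:
        for doc_id in search(query, per_query):
            if doc_id not in seen:  # the same original may match several strings
                seen.add(doc_id)
                candidates.append(doc_id)
    return candidates[:preset_total]
```

A string that yields fewer hits than requested simply contributes fewer candidates here, which has the same effect on later comparison as padding with blank text.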
  • Step S20 preprocessing the text to be screened to obtain more than one first clause.
  • The screening device preprocesses the text to be screened to obtain more than one first clause, that is, the text to be screened is decomposed into clauses. Specifically, the text to be screened is decomposed into clauses according to punctuation marks, for example periods.
  • Step S30 comparing each of the first clauses with each of the objects to be compared, and determining non-original clauses in the first clause.
  • each first clause is compared with each object to be compared, so as to determine the originality of each first clause, and finally determine how many non-original clauses exist in the first clause.
  • step S30 includes:
  • The screening device generates the first hash value corresponding to each first clause, where the first hash value is preferably a simhash value. Simhash is a locality-sensitive hash, a text hash mapping algorithm that maps a text to a bit string of length 64.
  • Unlike ordinary hash algorithms, the locality-sensitive hash results of two similar texts are also similar, with a Hamming distance less than or equal to 3.
  • The similarity calculation can also be performed when the first hash value is an ordinary hash value; this embodiment is described taking simhash as the preferred example.
  • The text to be screened is decomposed into first clauses according to punctuation marks, specifically periods; each first clause is segmented into words to obtain effective feature vectors, and each feature vector is then assigned one of 5 weight levels (1 to 5), where the weight of a feature vector can be the number of times the corresponding word appears in the text to be screened.
  • For example, the current first clause is "Internet banks issue loans through face recognition technology and big data credit rating";
  • after word segmentation it becomes "Internet banking / through / face recognition technology / and / big data / credit rating / issuance / loan";
  • a weight is then given to each feature vector: Internet banking (4), through (1), face recognition technology (3), and (1), big data (4), credit rating (5), issuance (1), loan (5), where the number in parentheses represents the importance of the word in the current clause; the larger the number, the more important the word.
  • The hash value of each feature vector is calculated through a hash function; the hash value is a signature composed of the binary digits 0 and 1.
  • For example, the hash value of "Internet banking", Hash(Internet banking), is 100101,
  • and the hash value of "loan", Hash(loan), is 101011; at this point, the current clause has become a series of numbers.
  • W = Hash × weight
  • other feature vectors are similar to this operation.
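The weighting and merging of feature hashes can be sketched as below. The text above only gives W = Hash × weight, so the remaining details are assumptions taken from the commonly described simhash construction: each bit of a word's hash contributes +weight if it is 1 and -weight if it is 0, and the final fingerprint bit is 1 where the column sum is positive. The use of MD5 as the underlying hash function and the function names are likewise illustrative choices.

```python
import hashlib
from typing import Dict


def token_hash(token: str, bits: int = 64) -> int:
    """Map a word to a fixed-width integer hash (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weights: Dict[str, int], bits: int = 64) -> int:
    """Combine weighted word hashes into one fingerprint of `bits` bits."""
    totals = [0] * bits
    for token, weight in weights.items():
        h = token_hash(token, bits)
        for pos in range(bits):
            # W = Hash x weight: a 1 bit adds the weight, a 0 bit subtracts it.
            if (h >> pos) & 1:
                totals[pos] += weight
            else:
                totals[pos] -= weight
    fingerprint = 0
    for pos, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << pos
    return fingerprint


# Weights from the example clause above.
clause_weights = {
    "Internet banking": 4, "through": 1, "face recognition technology": 3,
    "and": 1, "big data": 4, "credit rating": 5, "issuance": 1, "loan": 5,
}
first_hash = simhash(clause_weights)
```

Because high-weight words dominate each column sum, changing a low-weight word tends to flip only a few fingerprint bits, which is what makes the later Hamming-distance comparison meaningful.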
  • The screening equipment can retrieve from the original database the hash value set of each object to be compared, where the hash value set includes a plurality of second hash values, and the second hash value is preferably a second simhash value.
  • The first hash values are compared with the second hash values, and a third hash value, whose Hamming distance to at least one of the second hash values is less than or equal to the first preset value, is determined among the first hash values.
  • Each first simhash value is compared with the second simhash values of each object to be compared to determine their Hamming distance, where the Hamming distance is the number of positions at which the corresponding characters of two strings differ; in other words, it is the number of characters that need to be replaced to transform one string into the other.
  • the Hamming distance between 1011101 and 1001001 is 2.
  • the first preset value is preferably 3 during specific implementation.
  • If the Hamming distance is less than or equal to 3, it is determined that the current first clause is plagiarized.
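The Hamming-distance check can be illustrated with a short sketch. Treating the hash values as Python integers and the helper names are assumptions made for the example; the threshold default of 3 is the preferred first preset value stated above.

```python
from typing import Iterable


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two hash values differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(
    first_hash: int,
    second_hashes: Iterable[int],
    first_preset_value: int = 3,
) -> bool:
    """True if the clause hash is within the threshold of any stored hash."""
    return any(
        hamming_distance(first_hash, h) <= first_preset_value
        for h in second_hashes
    )


# The example from the text: 1011101 and 1001001 differ in two positions.
print(hamming_distance(0b1011101, 0b1001001))  # 2
```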
  • the clause corresponding to the third hash value is marked as a non-original clause.
  • The clause determined to be plagiarized is marked as a non-original clause, for example by displaying it in red.
  • Step S40 If the proportion of the non-original clause in the text to be screened is not greater than a preset plagiarism threshold, it is determined that the text to be screened is an original text.
  • The proportion of non-original clauses in the text to be screened is counted. Specifically, the number of non-original clauses and the number of first clauses in the text to be screened can be counted, so as to calculate the proportion of non-original clauses in the text to be screened.
  • the method further includes:
  • The proportion of non-original clauses in the text to be screened can also be determined by word count: count the number of words in the non-original clauses and the total number of words in the text to be screened, and divide the former by the latter to get the proportion of non-original clauses in the text to be screened. Based on this proportion, it is determined whether the text to be screened is an original text.
  • A plagiarism threshold is preset, such as 80%. After the proportion of non-original clauses in the text to be screened is calculated, it is determined whether the proportion is greater than the preset plagiarism threshold; if so, the text to be screened is determined to be plagiarized text, otherwise it is original text.
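The two ways of computing the proportion and the threshold decision of step S40 can be sketched as follows. The function names are hypothetical, and counting "words" as characters is an assumption that suits Chinese text; the 0.8 default mirrors the 80% example above.

```python
from typing import List


def plagiarism_ratio(
    clauses: List[str],
    non_original: List[bool],
    by_words: bool = False,
) -> float:
    """Proportion of non-original content, by clause count or by word count."""
    if not clauses:
        return 0.0
    if by_words:
        # Assumption: one character counts as one word.
        total = sum(len(c) for c in clauses)
        flagged = sum(len(c) for c, bad in zip(clauses, non_original) if bad)
    else:
        total = len(clauses)
        flagged = sum(1 for bad in non_original if bad)
    return flagged / total if total else 0.0


def is_original(ratio: float, plagiarism_threshold: float = 0.8) -> bool:
    """Original if the proportion is not greater than the preset threshold."""
    return ratio <= plagiarism_threshold
```

The two modes can disagree when the flagged clauses are unusually long or short, so the choice between them affects borderline texts.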
  • More than one object to be compared corresponding to the text to be screened is obtained from the preset original database; the text to be screened is preprocessed to obtain more than one first clause; each of the first clauses is compared with each of the objects to be compared to determine the non-original clauses among the first clauses; if the proportion of the non-original clauses in the text to be screened is not greater than the preset plagiarism threshold, it is determined that the text to be screened is an original text.
  • the application also discloses an original text screening device, equipment and readable storage medium.
  • This application splits the text to be screened into individual clauses and decomposes the question of whether the text to be screened is an original text into the question of whether each clause is an original clause, so that the originality of the text to be screened is determined from the proportion of original clauses, which effectively improves the accuracy of original text discrimination.
  • step S20 includes:
  • Step S21 filtering the text to be screened based on preset filtering rules to obtain filtered text
  • Step S22 based on a preset clause-splitting rule, the filtered text is split into clauses to obtain more than one first clause.
  • Filtering and clause splitting are used to decompose the text to be screened into first clauses.
  • Step S21 Filter the text to be screened based on preset filtering rules to obtain filtered text.
  • The screening device filters the text to be screened, where the preset filtering rules are: filter out meaningless symbols in the text to be screened, including HTML tags, HTML character entities and emoticon characters; convert traditional Chinese in the text to be screened into simplified Chinese; and replace dashes, Chinese and English single quotation marks, Chinese and English double quotation marks, and Chinese and English colons in the text to be screened with Chinese commas.
  • The purpose is to prevent differences in symbols from affecting the originality discrimination, such as:
  • the filtered text is obtained so that the filtered text can be subsequently segmented.
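Part of the filtering in step S21 can be sketched with regular expressions. This is a partial illustration under stated assumptions: the patterns and the function name are hypothetical, and the traditional-to-simplified conversion and emoticon removal are left out because they need a character mapping that the text does not specify.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")      # HTML tags such as <p> or </div>
ENTITY_RE = re.compile(r"&#?\w+;")   # HTML character entities such as &nbsp;
# Dash, single/double quotation marks and colons, in Chinese and English forms.
TO_COMMA_RE = re.compile("[\u2014'\"‘’“”:：]")


def filter_text(text: str) -> str:
    """Drop HTML tags and entities, then normalize punctuation to '，'."""
    text = TAG_RE.sub("", text)
    text = ENTITY_RE.sub("", text)
    return TO_COMMA_RE.sub("，", text)
```

Normalizing these marks to the Chinese comma means the later clause-splitting step treats them as clause boundaries rather than as differences between otherwise identical clauses.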
  • Step S22 based on a preset clause-splitting rule, the filtered text is split into clauses to obtain more than one first clause.
  • The screening device splits the filtered text based on preset clause-splitting rules, which are: split on Chinese and English commas, Chinese and English periods, Chinese and English exclamation marks, Chinese and English question marks, Chinese semicolons, Chinese enumeration commas (、), spaces, and the escape characters \n and \t. For example, "Project leader Zhang San said to the team members, next we will go further, not forget our original intention, and forge ahead" is split according to these symbols into "Project leader Zhang San said to the team members", "next we will go further", "not forget our original intention" and "forge ahead". After the text to be screened has been filtered and split, each first clause is obtained.
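The clause-splitting rule can be sketched as a single regular expression. The separator set follows the list given above; reading the second "Chinese comma" as the enumeration comma (、) is an assumption, as is the function name. Because spaces are separators, the rule is meant for Chinese text and would break English sentences into single words.

```python
import re
from typing import List

# Commas, periods, exclamation and question marks (Chinese and English),
# Chinese semicolon, enumeration comma, space, newline and tab.
SPLIT_RE = re.compile(r"[,，.。!！?？；、 \n\t]+")


def split_clauses(filtered_text: str) -> List[str]:
    """Split filtered text into clauses, dropping empty pieces."""
    return [piece for piece in SPLIT_RE.split(filtered_text) if piece]
```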
  • step S22 includes:
  • Step a based on a preset clause rule, segment the filtered text to obtain each clause, and sequentially determine whether the number of words in each clause reaches the preset number of words;
  • The screening device splits the filtered text based on preset clause rules to obtain each clause. Specifically, the preset rules split on Chinese and English commas, Chinese and English periods, Chinese and English exclamation marks, Chinese and English question marks, Chinese semicolons, Chinese enumeration commas (、), spaces, and the escape characters \n and \t to obtain each clause, and it is then determined whether the number of words in each clause reaches the preset number of words, such as 10 words.
  • Step b if the number of words in the current clause reaches the preset number of words, set the current clause as the first clause;
  • the current clause is set as the first clause.
  • Step c If the number of words of the current clause does not reach the preset number of words, merge the current clause into the first clause that was set based on the previous clause.
  • The current clause is merged with the previous first clause, which can be the first clause formed from the clause immediately before the current clause, or from the clause two positions before it, and so on. If the current clause is the first sentence of the text to be screened and its number of words does not reach the preset number of words, the current clause is combined with the first clause set from the next clause.
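Steps a to c can be sketched as a single pass over the split clauses. This is one possible reading, labeled as an assumption: short clauses at the very start are held back and prepended to the next clause that reaches the preset length, and if no clause ever reaches it the held text is kept as one clause. The function name and the use of character count as the word count are also illustrative.

```python
from typing import List


def merge_short_clauses(clauses: List[str], min_len: int = 10) -> List[str]:
    """Merge clauses shorter than min_len into a neighbouring first clause."""
    merged: List[str] = []
    pending = ""  # short clauses seen before any first clause exists
    for clause in clauses:
        if len(clause) >= min_len:
            # Step b: long enough, so it becomes a first clause.
            merged.append(pending + clause)
            pending = ""
        elif merged:
            # Step c: too short, so append it to the previous first clause.
            merged[-1] += clause
        else:
            # Too short and nothing precedes it: wait for the next clause.
            pending += clause
    if pending:
        merged.append(pending)
    return merged
```

Merging keeps every first clause at or above the minimum length, which avoids hashing very short fragments whose fingerprints would match many unrelated texts.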
  • the preprocessing in this embodiment includes filtering and sentence segmentation.
  • Filtering and clause splitting are used to remove the factors that affect the originality screening, that is, to filter out the various meaningless symbols.
  • The text to be screened is decomposed into clauses, and the originality of the text to be screened is then identified by determining the originality of each clause; because the screening objects are refined in this way, the accuracy of original text discrimination can be improved.
  • the difference between the third embodiment of the original text screening method and the first and second embodiments of the original text screening method is that after the step of marking the clause corresponding to the third hash value as a non-original clause, Also includes:
  • Step d if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, calculate the first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and the second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, wherein the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 upward, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;
  • Step e if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, mark the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, mark the (i+k+m)-th first clause as a non-original clause.
  • The edit distance of each clause is also calculated to further confirm the originality of each clause.
  • Step d if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, calculate the first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and the second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 upward, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object.
  • The screening device monitors the continuity of the marked clauses in real time, that is, the continuity of non-original clauses. If the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in the target object do not exceed the first preset value, that is, the i-th to (i+k)-th first clauses of the text to be screened are all marked as non-original clauses, then the first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object is calculated, as well as the second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object.
  • The (i-n)-th and (i+k+m)-th first clauses in the text to be screened are not marked, but these clauses may contain subject substitution and pronoun substitution. Simple subject substitution and pronoun substitution are not original, but the Hamming distance cannot detect them. Therefore, when k consecutive first clauses of the text to be screened are marked as non-original clauses, it can be determined that the clauses before and after them may also be plagiarized.
  • The edit distance is therefore calculated for the (i-n)-th first clause and the (i+k+m)-th first clause, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 upward, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object.
  • Edit distance refers to the minimum number of editing operations required to convert one of two strings into the other. In this embodiment, it is the minimum number of editing operations required to convert a clause in the text to be screened into a clause in the target object.
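Edit distance as defined here is commonly computed with the Levenshtein dynamic-programming recurrence. The sketch below assumes the three usual operations (insert, delete, substitute), each with cost 1; the text does not spell out the operation set, so that choice is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]
```

A clause that differs from a source clause only by a swapped subject or pronoun has a small edit distance relative to its length, even when its simhash fingerprint differs by more than the Hamming threshold.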
  • Step e, if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, mark the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, mark the (i+k+m)-th first clause as a non-original clause
  • the first edit distance is divided by the clause length of the (j-n)-th clause of the target object
  • the second edit distance is divided by the clause length of the (j+k+m)-th clause of the target object. If the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than the second preset value, the (i-n)-th first clause in the text to be screened is marked as a non-original clause; if it is equal to or greater than the second preset value, the (i-n)-th first clause in the text to be screened is not marked.
  • if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, the (i+k+m)-th first clause in the text to be screened is marked as a non-original clause; if it is equal to or greater than the second preset value, the (i+k+m)-th first clause in the text to be screened is not marked.
  • the second preset value is preferably 0.1.
  • the (i-1)-th first clause is marked.
  • the (i-1)-th first clause is not marked, and the edit distance between the (i-2)-th first clause in the text to be screened and the (j-2)-th clause of the target object is then calculated ... until the (i-n)-th first clause has been marked.
  • if the (i+k+m)-th first clause of the text to be screened has no comparison object in the target object, that is, the target object has reached its end, then the (i+k+m)-th first clause of the text to be screened is not marked.
  • the factors of subject substitution and pronoun substitution are also taken into account. Compared with using the Hamming distance algorithm alone, this effectively handles plagiarism scenarios involving a large number of subject substitutions and pronoun substitutions, which further improves the accuracy of original text screening.
  • the original text screening device of this application includes:
  • the obtaining module is configured to obtain, after receiving the text to be screened, one or more objects to be compared corresponding to the text to be screened from a preset original database;
  • the preprocessing module is used to preprocess the text to be screened to obtain one or more first clauses
  • the first determining module is configured to compare each of the first clauses with each of the objects to be compared, and determine the non-original clauses in the first clause;
  • the second determining module is configured to determine that the text to be screened is an original text if the proportion of the non-original clause in the text to be screened is not greater than a preset plagiarism threshold.
  • the acquisition module is also used for:
  • preprocessing module is also used for:
  • the filtered text is split into clauses to obtain one or more first clauses.
  • preprocessing module is also used for:
  • the current clause is merged into the first clause set based on the previous clause.
  • the first determining module is also used for:
  • the first hash values are compared with the second hash values, and among the first hash values, third hash values whose Hamming distance to at least one of the second hash values is less than or equal to the first preset value are determined.
  • the clause corresponding to the third hash value is marked as a non-original clause.
  • the first determining module is also used for:
  • if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, the (i-n)-th first clause is marked as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, the (i+k+m)-th first clause is marked as a non-original clause.
  • the first determining module is also used for:
  • the application also provides a computer-readable storage medium.
  • the computer-readable storage medium of the present application stores an original text screening program, and when the original text screening program is executed by a processor, the steps of the original text screening method as described above are realized.

Abstract

An original text screening method, an apparatus, a device and a computer-readable storage medium, the method comprising: after receiving text to be screened, acquiring from a preset original database more than one object to be compared that corresponds to the text to be screened (S10); pre-processing the text to be screened so as to obtain more than one first clause (S20); comparing each first clause to each object to be compared, and determining a non-original clause among the first clauses (S30); if the proportion of non-original clauses in the text to be screened is not larger than a preset plagiarism threshold, determining that the text to be screened is an original text (S40).

Description

Original text screening method, apparatus, device and computer-readable storage medium
This application claims priority to Chinese patent application No. 201910669863.0, filed on July 23, 2019 and entitled "Original text screening method, apparatus, device and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of financial technology (Fintech), and in particular to an original text screening method, apparatus, device and computer-readable storage medium.
Background
In recent years, with the continuous development of financial technology (Fintech), especially Internet finance, data screening technology has been introduced into the daily business of banks and other financial institutions. In their routine publicity work, such institutions must ensure that promotional texts, such as news items, advertorials and advertisements, are not plagiarized from the works of others, so the originality of a promotional text has to be reviewed before it is disseminated. Only by ensuring that the promotional text is original can unnecessary copyright disputes be avoided and original works receive the value they deserve. Screening the originality of texts is therefore a necessary task for banks and other financial institutions when they publicize externally.
The existing practice is that the public relations department, or another department responsible for external publicity, of a bank or other financial institution enters the promotional text into a computer before disseminating it. The computer compares the promotional text with the texts in an original database and determines the originality of the promotional text by computing a keyword-based similarity.
However, the existing practice can only judge whether the text to be screened involves plagiarism and cannot give a specific plagiarism-rate indicator. If the text to be screened copies one passage from each of several original texts in turn, the existing practice cannot reach a conclusion of plagiarism. Moreover, for a text to be screened that contains a large number of subject substitutions, pronoun substitutions and the like, it is difficult to screen its originality. The accuracy of existing screening methods is therefore low.
Technical Solution
The main purpose of this application is to provide an original text screening method, apparatus, device and computer-readable storage medium, aiming to improve the accuracy of original text screening.
To achieve the above purpose, this application provides an original text screening method, which includes the following steps:
after receiving a text to be screened, obtaining, from a preset original database, one or more objects to be compared that correspond to the text to be screened;
preprocessing the text to be screened to obtain one or more first clauses;
comparing each of the first clauses with each of the objects to be compared, and determining the non-original clauses among the first clauses;
if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, determining that the text to be screened is an original text.
In an embodiment, the step of obtaining, after receiving the text to be screened, one or more objects to be compared that correspond to the text to be screened from the preset original database includes:
upon receiving the text to be screened, determining the text length of the text to be screened, and cutting the text to be screened into a number of character strings corresponding to the text length;
obtaining, from the preset original database, matching objects that match the character strings, and selecting a preset number of objects to be compared from the matching objects.
In an embodiment, the step of preprocessing the text to be screened to obtain one or more first clauses includes:
filtering the text to be screened based on a preset filtering rule to obtain a filtered text;
splitting the filtered text into clauses based on a preset clause-splitting rule to obtain one or more first clauses.
In an embodiment, the step of splitting the filtered text into clauses based on the preset clause-splitting rule to obtain one or more first clauses includes:
splitting the filtered text based on the preset clause-splitting rule to obtain sub-clauses, and determining in turn whether the number of characters of each sub-clause reaches a preset number of characters;
if the number of characters of the current sub-clause reaches the preset number of characters, setting the current sub-clause as a first clause;
if the number of characters of the current sub-clause does not reach the preset number of characters, merging the current sub-clause into the first clause that was set based on the previous sub-clause.
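The split-then-merge rule above can be sketched as follows. This is a minimal illustration only: the delimiter set, the `min_chars` value and the function name are assumptions, since the text does not specify the preset clause-splitting rule or the preset number of characters. The handling of a short sub-clause that has no predecessor (it simply starts a first clause) is likewise an assumption.

```python
import re

def split_into_first_clauses(filtered_text, min_chars=10, delimiters="，。！？；,.!?;"):
    """Split on punctuation, then merge short sub-clauses into the previous first clause.

    min_chars and delimiters are hypothetical stand-ins for the preset values.
    """
    pattern = "[" + re.escape(delimiters) + "]"
    sub_clauses = [p.strip() for p in re.split(pattern, filtered_text) if p.strip()]

    first_clauses = []
    for sub in sub_clauses:
        if len(sub) >= min_chars or not first_clauses:
            # Long enough (or nothing to merge into): becomes a first clause.
            first_clauses.append(sub)
        else:
            # Too short: merge into the first clause set from the previous sub-clause.
            first_clauses[-1] += sub
    return first_clauses
```

For example, with `min_chars=5` the input `"abcdefghijkl,mn,opqrstuvwxyz"` yields two first clauses, because the short sub-clause `"mn"` is absorbed by the one before it.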
In an embodiment, the step of comparing each of the first clauses with each of the objects to be compared and determining the non-original clauses among the first clauses includes:
generating a first hash value corresponding to each of the first clauses;
retrieving a hash value set corresponding to each of the objects to be compared, where each hash value set contains multiple second hash values;
comparing the first hash values with the second hash values, and determining, among the first hash values, third hash values whose Hamming distance to at least one of the second hash values is less than or equal to a first preset value;
marking, among the first clauses, the clauses corresponding to the third hash values as non-original clauses.
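The hash comparison above can be sketched as follows. The text only speaks of a "hash value" compared by Hamming distance; the SimHash-style fingerprint over character bigrams used here is an assumption chosen because it keeps similar clauses close in Hamming distance, and the threshold of 3 is a hypothetical first preset value.

```python
import hashlib

def simhash(text, bits=64, ngram=2):
    """Toy SimHash over character n-grams (assumed fingerprint, not stated in the text)."""
    weights = [0] * bits
    grams = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")

def mark_non_original(first_hashes, second_hash_sets, first_preset_value=3):
    """Return indices of first clauses whose hash is close to any second hash."""
    marked = set()
    for idx, h1 in enumerate(first_hashes):
        if any(hamming_distance(h1, h2) <= first_preset_value
               for hashes in second_hash_sets for h2 in hashes):
            marked.add(idx)
    return marked
```

Here `first_hashes` holds one hash per first clause and `second_hash_sets` holds one collection of hashes per object to be compared; the returned indices correspond to the clauses that would be marked as non-original.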
In an embodiment, after the step of marking the clauses corresponding to the third hash values as non-original clauses, the method further includes:
if it is determined that none of the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in a target object exceeds the first preset value, calculating a first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and a second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n is a set from 1 to i, m is a set from 1 to infinity, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;
if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, marking the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, marking the (i+k+m)-th first clause as a non-original clause.
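The extension step can be sketched as follows, assuming a standard Levenshtein edit distance and 0-based list indices (the text counts clauses from 1). The function names are hypothetical. The default of 0.1 follows the preferred second preset value given elsewhere in the description. This sketch checks every neighbouring offset and never stops early, which is an assumption, since the stopping rule is not fully legible in the text.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]

def extend_marks(screened, target, i, j, k, marked, second_preset_value=0.1):
    """Check clauses before and after a matched run screened[i..i+k] ~ target[j..j+k]."""
    def is_copy(s_idx, t_idx):
        t_len = len(target[t_idx])
        if t_len == 0:
            return False
        return edit_distance(screened[s_idx], target[t_idx]) / t_len < second_preset_value

    n = 1
    while i - n >= 0 and j - n >= 0:            # clauses before the run
        if is_copy(i - n, j - n):
            marked.add(i - n)
        n += 1
    m = 1
    while i + k + m < len(screened) and j + k + m < len(target):  # clauses after the run
        if is_copy(i + k + m, j + k + m):
            marked.add(i + k + m)
        m += 1
    return marked
```

A clause that differs from its counterpart by a single substituted character in a 20-character clause has a ratio of 0.05 and would be marked, which is the subject or pronoun substitution case that the Hamming comparison alone can miss.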
In an embodiment, after determining the non-original clauses among the first clauses, the method further includes:
counting the number of characters of the non-original clauses in the text to be screened, and determining the proportion of the non-original clauses in the text to be screened based on that number and the total number of characters of the text to be screened.
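The proportion and the final decision can be sketched as follows. Two points are assumptions: the total length is taken as the sum of the first-clause lengths rather than the raw text length, and the plagiarism threshold of 0.3 is a hypothetical value, since the text only calls it "preset".

```python
def non_original_ratio(first_clauses, marked):
    """Share of characters that belong to clauses marked as non-original."""
    total = sum(len(c) for c in first_clauses)
    copied = sum(len(first_clauses[i]) for i in marked)
    return copied / total if total else 0.0

def is_original(first_clauses, marked, plagiarism_threshold=0.3):
    """The text is original if the proportion is not greater than the threshold."""
    return non_original_ratio(first_clauses, marked) <= plagiarism_threshold
```

Because the ratio is weighted by character count, one long copied clause counts for more than several short ones.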
In addition, to achieve the above purpose, this application further provides an original text screening apparatus, which includes:
an obtaining module, configured to obtain, after receiving a text to be screened, one or more objects to be compared that correspond to the text to be screened from a preset original database;
a preprocessing module, configured to preprocess the text to be screened to obtain one or more first clauses;
a first determining module, configured to compare each of the first clauses with each of the objects to be compared and determine the non-original clauses among the first clauses;
a second determining module, configured to determine that the text to be screened is an original text if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold.
In an embodiment, the obtaining module is further configured to:
upon receiving the text to be screened, determine the text length of the text to be screened, and cut the text to be screened into a number of character strings corresponding to the text length;
obtain, from the preset original database, matching objects that match the character strings, and select a preset number of objects to be compared from the matching objects.
In an embodiment, the preprocessing module is further configured to:
filter the text to be screened based on a preset filtering rule to obtain a filtered text;
split the filtered text into clauses based on a preset clause-splitting rule to obtain one or more first clauses.
In an embodiment, the preprocessing module is further configured to:
split the filtered text based on the preset clause-splitting rule to obtain sub-clauses, and determine in turn whether the number of characters of each sub-clause reaches a preset number of characters;
if the number of characters of the current sub-clause reaches the preset number of characters, set the current sub-clause as a first clause;
if the number of characters of the current sub-clause does not reach the preset number of characters, merge the current sub-clause into the first clause that was set based on the previous sub-clause.
In an embodiment, the first determining module is further configured to:
generate a first hash value corresponding to each of the first clauses;
retrieve a hash value set corresponding to each of the objects to be compared, where each hash value set contains multiple second hash values;
compare the first hash values with the second hash values, and determine, among the first hash values, third hash values whose Hamming distance to at least one of the second hash values is less than or equal to a first preset value;
mark, among the first clauses, the clauses corresponding to the third hash values as non-original clauses.
In an embodiment, the first determining module is further configured to:
if it is determined that none of the Hamming distances between the i-th to (i+k)-th first clauses in the text to be screened and the j-th to (j+k)-th clauses in a target object exceeds the first preset value, calculate a first edit distance between the (i-n)-th first clause in the text to be screened and the (j-n)-th clause in the target object, and a second edit distance between the (i+k+m)-th first clause in the text to be screened and the (j+k+m)-th clause in the target object, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n is a set from 1 to i, m is a set from 1 to infinity, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;
if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, mark the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, mark the (i+k+m)-th first clause as a non-original clause.
In an embodiment, the first determining module is further configured to:
count the number of characters of the non-original clauses in the text to be screened, and determine the proportion of the non-original clauses in the text to be screened based on that number and the total number of characters of the text to be screened.
In addition, to achieve the above purpose, this application further provides an original text screening device, which includes a memory, a processor, and an original text screening program stored in the memory and executable on the processor, where the original text screening program, when executed by the processor, implements the steps of the original text screening method described above.
In addition, to achieve the above purpose, this application further provides a computer-readable storage medium on which an original text screening program is stored, where the original text screening program, when executed by a processor, implements the steps of the original text screening method described above.
In the original text screening method proposed in this application, after a text to be screened is received, one or more objects to be compared that correspond to the text to be screened are obtained from a preset original database; the text to be screened is preprocessed to obtain one or more first clauses; each of the first clauses is compared with each of the objects to be compared to determine the non-original clauses among the first clauses; and if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, the text to be screened is determined to be an original text. This application also discloses an original text screening apparatus, a device and a readable storage medium. This application splits the text to be screened into clauses and reduces the question of whether the text is original to the question of whether each clause is original, so that the proportion of non-original clauses determines whether the text to be screened is an original text, which effectively improves the accuracy of original text screening.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the device structure of the hardware operating environment involved in the embodiments of this application;
FIG. 2 is a schematic flowchart of a first embodiment of the original text screening method of this application;
FIG. 3 is a schematic flowchart of a second embodiment of the original text screening method of this application.
The realization of the purpose, the functional characteristics and the advantages of this application are further described below in conjunction with the embodiments and with reference to the accompanying drawings.
Embodiments of the Invention
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
As shown in FIG. 1, FIG. 1 is a schematic diagram of the device structure of the hardware operating environment involved in the embodiments of this application.
The device in the embodiments of this application may be a PC or a server device.
As shown in FIG. 1, the device may include a processor 1001 (for example a CPU), a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a magnetic disk memory. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.
Those skilled in the art will understand that the device structure shown in FIG. 1 does not limit the device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in FIG. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and an original text screening program.
The operating system is a program that manages and controls the original text screening device and its software resources, and supports the running of the network communication module, the user interface module, the original text screening program and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the original text screening device shown in FIG. 1, the device calls, through the processor 1001, the original text screening program stored in the memory 1005 and performs the operations in the embodiments of the original text screening method described below.
Based on the above hardware structure, embodiments of the original text screening method of this application are proposed.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of the first embodiment of the original text screening method of this application. The method includes:
Step S10, after receiving a text to be screened, obtaining, from a preset original database, one or more objects to be compared that correspond to the text to be screened;
Step S20, preprocessing the text to be screened to obtain one or more first clauses;
Step S30, comparing each of the first clauses with each of the objects to be compared, and determining the non-original clauses among the first clauses;
Step S40, if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, determining that the text to be screened is an original text.
The original text screening method of this embodiment is applied in an original text screening device of a financial institution such as a wealth management institution or a banking system. For ease of description, the original text screening device is hereinafter referred to as the screening device. The screening device is connected to an original database, which stores all original texts on the Internet, including works such as original news texts, original advertisements and original advertorials. In a specific implementation, owing to hardware limitations, the original database generally stores only the original texts of the last three years. In addition, a retrieval module is built into the screening device to obtain, from the original database, the original objects that correspond to the current search term. In this embodiment the retrieval module is preferably an ES retrieval module (Elastic Search, a distributed full-text search engine). The ES retrieval module searches the original database according to the search term and returns the search results; the higher a result is ranked, the greater its textual similarity to the search term. The retrieval principle of ES is prior art and is not described further here.
Upon receiving a text to be screened, the screening device of this embodiment first selects, from the original database, the objects to be compared that are related to the current text to be screened, then splits the text to be screened into clauses, and then determines the proportion of non-original clauses among them, thereby determining whether the text to be screened is an original text.
Each step is described in detail below:
Step S10, after receiving a text to be screened, obtaining, from a preset original database, one or more objects to be compared that correspond to the text to be screened;
In this embodiment, before publicizing or publishing a text, the publicity staff of a financial institution such as a wealth management institution or a bank first enter the text to be screened into the screening device to determine whether it is an original text, so as to avoid unnecessary copyright disputes.
After receiving the text to be screened, the screening device first obtains, from the original database, one or more objects to be compared that correspond to the text to be screened. In other words, this embodiment does not compare the current text to be screened with every original text in the original database one by one; instead, it first selects from the original database the objects to be compared that are related to the current text to be screened.
Specifically, step S10 includes:
upon receiving the text to be screened, determining the text length of the text to be screened, and cutting the text to be screened into a number of character strings corresponding to the text length;
In this step, upon receiving the text to be screened, the screening device first determines its text length, that is, counts its characters, and cuts it into the corresponding number of character strings. Specifically, a string length is preset and the text to be screened is truncated at that length. For example, if the text length of the current text to be screened is N and the preset string length is 100, truncation yields N/100 character strings (rounded up when N is not a multiple of 100). Truncation is needed because the search term of the ES retrieval module has an upper length limit; 100 characters is the preferred preset string length.
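The truncation described above can be sketched in a few lines. The function name is hypothetical; the default length of 100 follows the preferred value stated in the text.

```python
def cut_into_strings(text, preset_length=100):
    """Cut the text into consecutive strings of at most preset_length characters."""
    return [text[i:i + preset_length] for i in range(0, len(text), preset_length)]
```

A text of 250 characters is cut into three strings of 100, 100 and 50 characters, each of which is then used as a separate search term.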
obtaining, from a preset original database, matching objects that match the character strings, and selecting a preset number of objects to be compared from the matching objects.
In this step, each character string is used in turn as a search term, and the original objects matching the current character string are obtained from the preset original database. Once all character strings have been searched, the set of original objects obtained constitutes the matching objects. Because different character strings may retrieve the same original object, the retrieved original objects also need to be deduplicated in this step before the preset number of objects to be compared is selected from the matching objects.
One possible selection method is to take the top-ranked original objects from the search results of each character string, the number taken per string being the preset number divided by the number of character strings. For example, if there are 10 character strings and 5000 original objects are wanted, the top 5000/10 = 500 original objects are taken for each character string, and the search results of the 10 character strings are then merged to obtain 5000 original objects.
It should be noted that a character string may return too few search results to supply its share (the preset number divided by the number of character strings). For example, if character string A has only 3 search results, it cannot supply the top 500 original objects; in that case only those 3 results are taken, and the missing 497 original objects are replaced with blank text.
Alternatively, when character string A returns few search results, more original objects are taken from the search results of the character string following A to make up the shortfall.
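The per-string quota, the deduplication and the compensation variant can be sketched together as below. The names `select_candidates` and `fake_search` are hypothetical; `search(query, limit)` stands in for the retrieval module, whose real interface the text does not describe. Deduplicating while merging, rather than in a separate pass, is also an assumption.

```python
from typing import Callable


def select_candidates(
    chunks: list[str],
    search: Callable[[str, int], list[str]],
    preset_total: int,
) -> list[str]:
    """Collect distinct original-object ids across all chunks.

    Each chunk is asked for preset_total // len(chunks) results; if a chunk
    returns fewer, the shortfall is carried over to the next chunk
    (the compensation variant described above).
    """
    if not chunks:
        return []
    quota = preset_total // len(chunks)  # e.g. 5000 // 10 = 500
    carry = 0
    seen: set[str] = set()
    selected: list[str] = []
    for chunk in chunks:
        want = quota + carry
        hits = search(chunk, want)[:want]
        carry = want - len(hits)
        for doc_id in hits:
            if doc_id not in seen:  # the same original may match several chunks
                seen.add(doc_id)
                selected.append(doc_id)
    return selected


# Toy stand-in for the retrieval module (hypothetical data).
_INDEX = {"a": ["d1", "d2"], "b": ["d2", "d3", "d4", "d5"]}


def fake_search(query: str, limit: int) -> list[str]:
    return _INDEX.get(query, [])[:limit]


if __name__ == "__main__":
    print(select_candidates(["a", "b"], fake_search, 6))
    # ['d1', 'd2', 'd3', 'd4', 'd5']
```

In the toy run the quota is 3 per chunk; chunk "a" supplies only 2 results, so chunk "b" is asked for 4, and the duplicate "d2" is kept once.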
Step S20: preprocessing the text to be screened to obtain one or more first clauses.
In this embodiment, the screening device preprocesses the text to be screened to obtain one or more first clauses, that is, it decomposes the text to be screened into clauses. Specifically, the text is split at punctuation marks, for example at periods.
Step S30: comparing each first clause with each object to be compared, and determining the non-original clauses among the first clauses.
In this embodiment, each first clause is compared with each object to be compared in order to judge the originality of each first clause and ultimately determine how many of the first clauses are non-original.
Specifically, step S30 includes:
generating a first hash value corresponding to each first clause;
In this step, the screening device generates a first hash value for each first clause. The first hash value is preferably a simhash value. Simhash is a locality-sensitive hash, a text hashing algorithm that maps a text to a 64-bit string. Unlike an ordinary hash algorithm, it maps two similar texts to similar hash values, with a Hamming distance of 3 or less between them. The similarity can of course also be calculated when the first hash value is an ordinary hash value; this embodiment is described using simhash as the preferred example.
Specifically, the text to be screened is split at punctuation marks (for example, periods) into first clauses, and each first clause is segmented into words to obtain valid feature vectors. Each feature vector is then assigned a weight on one of five levels, 1 to 5; the weight may be the number of times the corresponding word appears in the text to be screened. For example, the first clause "互联网银行通过人脸识别技术和大数据信用评级发放贷款" (an Internet bank issues loans through face recognition technology and big data credit rating) is segmented into "互联网银行 / 通过 / 人脸识别技术 / 和 / 大数据 / 信用评级 / 发放 / 贷款", and the feature vectors are weighted as follows: 互联网银行 (Internet bank) (4), 通过 (through) (1), 人脸识别技术 (face recognition technology) (3), 和 (and) (1), 大数据 (big data) (4), 信用评级 (credit rating) (5), 发放 (issue) (1), 贷款 (loan) (5). The number in parentheses indicates how important the word is in the current clause; the larger the number, the more important the word.
Next, a hash function is used to calculate the hash value of each feature vector; the hash value is a signature made up of the binary digits 0 and 1. For example, Hash(互联网银行) is 100101 and Hash(贷款) is 101011. At this point the current clause has been turned into a series of numbers.
Then, on the basis of the hash values, every feature vector is weighted, that is, W = Hash × weight: where a bit is 1 the weight is taken as positive, and where a bit is 0 the weight is taken as negative. For example, weighting the hash value 100101 of 互联网银行 gives W(互联网银行) = 100101 × 4 = 4 -4 -4 4 -4 4, and weighting the hash value 101011 of 贷款 gives W(贷款) = 101011 × 5 = 5 -5 5 -5 5 5. The other feature vectors are handled in the same way.
Next, the weighted results of the feature vectors are summed position by position into a single sequence. For example, adding "4 -4 -4 4 -4 4" for 互联网银行 and "5 -5 5 -5 5 5" for 贷款 gives "4+5, -4+(-5), -4+5, 4+(-5), -4+5, 4+5", that is, "9 -9 1 -1 1 9".
Finally, each position of the accumulated signature is set to 1 if it is greater than 0 and to 0 otherwise, which yields the simhash value of the current first clause. For the result "9 -9 1 -1 1 9" above, this gives "1 0 1 0 1 1".
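The weighting, accumulation and thresholding steps can be reproduced with the two toy features from the worked example. This sketch takes the 6-bit hashes and weights as given inputs; the function names are hypothetical, and word segmentation and the real 64-bit hash function are outside its scope.

```python
def weighted_totals(features: list[tuple[str, int]]) -> list[int]:
    """Sum +weight for each 1 bit and -weight for each 0 bit, per position.

    features is a list of (bit_string, weight) pairs of equal bit length.
    """
    if not features:
        return []
    width = len(features[0][0])
    totals = [0] * width
    for bits, weight in features:
        if len(bits) != width:
            raise ValueError("all hashes must have the same length")
        for pos, bit in enumerate(bits):
            totals[pos] += weight if bit == "1" else -weight
    return totals


def simhash_bits(features: list[tuple[str, int]]) -> str:
    """Set each position to 1 if its total is greater than 0, else 0."""
    return "".join("1" if t > 0 else "0" for t in weighted_totals(features))


if __name__ == "__main__":
    # Hash(互联网银行) = 100101 with weight 4, Hash(贷款) = 101011 with weight 5
    toy = [("100101", 4), ("101011", 5)]
    print(weighted_totals(toy))  # [9, -9, 1, -1, 1, 9]
    print(simhash_bits(toy))     # 101011
```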
retrieving a hash value set corresponding to each object to be compared, the hash value set containing a plurality of second hash values;
In this step, the original database stores a plurality of original objects, together with the clause list of each original object and the simhash value of each clause. The screening device can therefore retrieve the hash value set of each object to be compared from the original database. The hash value set contains a plurality of second hash values, and the second hash values are preferably second simhash values.
comparing the first hash values with the second hash values and determining, among the first hash values, third hash values whose Hamming distance to at least one second hash value is less than or equal to a first preset value;
Specifically, each first simhash value is compared with the second simhash values of each object to be compared to determine their Hamming distance. The Hamming distance is the number of positions at which two strings of equal length differ, in other words the number of characters that must be replaced to turn one string into the other. For example, the Hamming distance between 1011101 and 1001001 is 2.
The calculated Hamming distance is compared with the first preset value to determine the third hash values, that is, the first hash values whose Hamming distance to at least one second hash value is less than or equal to the first preset value. In a specific implementation the first preset value is preferably 3; when the Hamming distance is less than or equal to 3, the current first clause is determined to be plagiarized.
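A minimal sketch of the comparison, assuming hash values are held as equal-length bit strings (an integer variant is included for 64-bit values). The function names are hypothetical; the default threshold of 3 is the preferred first preset value stated above.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_distance_int(a: int, b: int) -> int:
    """Same distance for hashes stored as non-negative integers."""
    return bin(a ^ b).count("1")


def is_non_original(first_hash: str, second_hashes: list[str],
                    first_preset: int = 3) -> bool:
    """True if first_hash is within first_preset of at least one second hash."""
    return any(hamming_distance(first_hash, h) <= first_preset
               for h in second_hashes)


if __name__ == "__main__":
    print(hamming_distance("1011101", "1001001"))            # 2
    print(hamming_distance_int(0b1011101, 0b1001001))        # 2
    print(is_non_original("1011101", ["0100010", "1001001"]))  # True
```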
among the first clauses, marking the clauses corresponding to the third hash values as non-original clauses.
In this step, the first clauses determined to be plagiarized are marked as non-original clauses, for example by displaying them in red.
Step S40: if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, determining that the text to be screened is an original text.
In this embodiment, the proportion of non-original clauses in the text to be screened is calculated. Specifically, the number of non-original clauses and the number of first clauses in the text to be screened can be counted, and the proportion computed from them. It is then determined whether this proportion is not greater than the preset plagiarism threshold: if so, the text to be screened is determined to be an original text; if not, the text to be screened is determined to be a plagiarized text.
Further, after the non-original clauses among the first clauses are determined, the method further includes:
counting the number of characters in the non-original clauses in the text to be screened, and determining the proportion of the non-original clauses in the text to be screened based on that number and the total number of characters in the text to be screened.
In this step, the proportion of non-original clauses in the text to be screened can alternatively be determined by counting the characters in the non-original clauses and the total characters in the text to be screened, and dividing the former by the latter. This proportion is then used to determine whether the text to be screened is an original text.
In this embodiment a plagiarism threshold is preset, for example 80%. After the proportion of non-original clauses in the text to be screened has been calculated, it is determined whether the proportion is greater than the preset plagiarism threshold: if so, the text to be screened is determined to be a plagiarized text; otherwise it is an original text.
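Both ways of computing the proportion (by clause count and by character count) and the threshold decision can be sketched as follows. The function names are hypothetical. One assumption to note: the character-based variant uses the summed length of the first clauses as the total, whereas the text refers to the total character count of the text to be screened, which may also include punctuation.

```python
def non_original_ratio(clauses: list[str], flagged: list[bool],
                       by_chars: bool = False) -> float:
    """Proportion of non-original content, by clause count or by characters."""
    if len(clauses) != len(flagged):
        raise ValueError("clauses and flagged must have the same length")
    if by_chars:
        total = sum(len(c) for c in clauses)
        hit = sum(len(c) for c, f in zip(clauses, flagged) if f)
    else:
        total = len(clauses)
        hit = sum(flagged)
    return hit / total if total else 0.0


def is_original(ratio: float, threshold: float = 0.8) -> bool:
    """Original if the proportion is not greater than the threshold."""
    return ratio <= threshold


if __name__ == "__main__":
    clauses = ["abcdefghij", "klmno", "pqrst"]
    flagged = [True, False, False]
    print(non_original_ratio(clauses, flagged, by_chars=True))  # 0.5
    print(is_original(0.5), is_original(0.9))                   # True False
```

The two variants can disagree: here one clause out of three is flagged, but it accounts for half of the characters.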
In this embodiment, after the text to be screened is received, one or more objects to be compared corresponding to it are obtained from the preset original database; the text to be screened is preprocessed to obtain one or more first clauses; each first clause is compared with each object to be compared to determine the non-original clauses among the first clauses; and if the proportion of non-original clauses in the text to be screened is not greater than the preset plagiarism threshold, the text to be screened is determined to be an original text. This application also discloses an original text screening apparatus, device and readable storage medium. By splitting the text to be screened into clauses, this application reduces the question of whether the text is original to the question of whether each clause is original, and then uses the proportion of original clauses to decide whether the text is original, which effectively improves the accuracy of original text screening.
Further, a second embodiment of the original text screening method of this application is proposed on the basis of the first embodiment.
The second embodiment of the original text screening method differs from the first embodiment in that, referring to FIG. 3, the preprocessing includes filtering and clause splitting, and step S20 includes:
Step S21: filtering the text to be screened based on a preset filtering rule to obtain a filtered text;
Step S22: splitting the filtered text into clauses based on a preset clause-splitting rule to obtain one or more first clauses.
In this embodiment, the text to be screened is preprocessed by filtering and clause splitting, which decomposes it into first clauses.
Each step is described in detail below:
Step S21: filtering the text to be screened based on a preset filtering rule to obtain a filtered text.
In this embodiment, the screening device filters the text to be screened based on the preset filtering rule. The preset filtering rule is as follows: meaningless symbols, including HTML tags, HTML character entities and emoticons (kaomoji), are removed from the text to be screened; traditional Chinese characters are converted to simplified Chinese; and dashes, Chinese and English single quotation marks, Chinese and English double quotation marks, Chinese and English colons and similar symbols are uniformly replaced with Chinese commas. The purpose is to prevent differences in punctuation alone from affecting the originality screening, as in the following three sentences, which differ only in punctuation:
项目负责人张三:不忘初心,砥砺前行。 (Project leader Zhang San: stay true to the original aspiration and forge ahead.)
项目负责人张三——不忘初心,砥砺前行。 (The same sentence, with a dash in place of the colon.)
项目负责人张三:“不忘初心,砥砺前行。” (The same sentence, with the quoted words enclosed in double quotation marks.)
After the text to be screened has been filtered, the filtered text is obtained and can then be split into clauses.
Step S22: splitting the filtered text into clauses based on a preset clause-splitting rule to obtain one or more first clauses.
In this embodiment, the screening device splits the filtered text into clauses based on the preset clause-splitting rule. The rule is to split at Chinese and English commas, Chinese and English periods, Chinese and English exclamation marks, Chinese and English question marks, Chinese semicolons, Chinese enumeration commas, spaces, and the escape characters \n and \t. For example, "项目负责人张三对团队成员说,接下来我们要更进一步,不忘初心,砥砺前行" (Project leader Zhang San said to the team members, next we must go a step further, stay true to the original aspiration, and forge ahead) is split at the punctuation into "项目负责人张三对团队成员说", "接下来我们要更进一步", "不忘初心" and "砥砺前行". Filtering and splitting the text to be screened in this way yields the first clauses.
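Steps S21 and S22 can be sketched with regular expressions as below. This is an illustration under stated assumptions, not the applicant's rule set: the pattern and function names are hypothetical, the traditional-to-simplified conversion and emoticon removal are omitted because they need a conversion table or library, and collapsing a run of adjacent unified symbols into a single Chinese comma is an added simplification.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # HTML tags
ENTITY_RE = re.compile(r"&(?:#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")
# Dashes, colons, single and double quotation marks (Chinese and English forms).
UNIFY_RE = re.compile("[\u2014:\uff1a'\u2018\u2019\"\u201c\u201d]+")
# Commas, periods, exclamation and question marks (Chinese and English),
# Chinese semicolon, Chinese enumeration comma, space, newline, tab.
SPLIT_RE = re.compile("[,\uff0c.\u3002!\uff01?\uff1f\uff1b\u3001 \n\t]+")
CHINESE_COMMA = "\uff0c"


def filter_text(text: str) -> str:
    """Step S21 (partial): drop tags and entities, unify punctuation."""
    text = TAG_RE.sub("", text)
    text = ENTITY_RE.sub("", text)
    return UNIFY_RE.sub(CHINESE_COMMA, text)


def split_clauses(filtered: str) -> list[str]:
    """Step S22: split at the listed separators and drop empty pieces."""
    return [c for c in SPLIT_RE.split(filtered) if c]


if __name__ == "__main__":
    body = "不忘初心\uff0c砥砺前行\u3002"
    variants = [
        "项目负责人张三\uff1a" + body,                 # colon
        "项目负责人张三\u2014\u2014" + body,           # dash
        "项目负责人张三\uff1a\u201c" + body + "\u201d",  # colon and quotes
    ]
    for v in variants:
        print(split_clauses(filter_text(v)))
    # each line prints ['项目负责人张三', '不忘初心', '砥砺前行']
```

After filtering and splitting, the three punctuation variants of the example sentence produce the same clause list, which is the effect the filtering rule is meant to achieve.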
Further, step S22 includes:
Step a: splitting the filtered text into clauses based on the preset clause-splitting rule, and determining in turn whether the number of characters in each clause reaches a preset number of characters;
In this step, the screening device splits the filtered text into clauses based on the preset clause-splitting rule, that is, at Chinese and English commas, Chinese and English periods, Chinese and English exclamation marks, Chinese and English question marks, Chinese semicolons, Chinese enumeration commas, spaces, and the escape characters \n and \t, and determines whether the number of characters in each clause reaches a preset number, for example 10. Because of the way punctuation falls, the clauses vary in length; to reduce the number of comparisons, clauses with few characters need to be merged so that there are fewer clauses overall. In addition, a complete, meaningful sentence needs a subject, a predicate, an object and so on, and therefore has a certain minimum length. For these reasons, after the filtered text has been split, it is necessary to determine whether the number of characters in each clause reaches the preset number.
Step b: if the number of characters in the current clause reaches the preset number, setting the current clause as a first clause;
While determining whether each clause reaches the preset number of characters, if the current clause reaches the preset number, for example 10 characters, the current clause is set as a first clause.
Step c: if the number of characters in the current clause does not reach the preset number, merging the current clause into the first clause set based on the preceding clause.
If it does not, the current clause is merged into the first clause set based on the preceding clause, that is, the current clause is merged with the preceding first clause. The preceding first clause may be a first clause formed from the clause immediately before the current clause, or one formed from the two clauses before it, and so on. If the current clause is the first clause of the text to be screened and does not reach the preset number of characters, it is merged with the first clause set based on the following clause.
In the example above, splitting yields "项目负责人张三对团队成员说", "接下来我们要更进一步", "不忘初心" and "砥砺前行". Merging the clauses with fewer than 10 characters gives "项目负责人张三对团队成员说" and "接下来我们要更进一步不忘初心砥砺前行".
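Steps a to c can be sketched as a single pass over the clauses. The function name `merge_short_clauses` is hypothetical. How to handle several short clauses in a row at the very start of the text is not spelled out in the description; this sketch holds them back and prepends them to the first clause that reaches the preset length, which is an assumption.

```python
def merge_short_clauses(clauses: list[str], min_len: int = 10) -> list[str]:
    """Merge clauses shorter than min_len into the preceding first clause.

    Short clauses at the very start have no preceding first clause, so they
    are prepended to the next clause that reaches min_len.
    """
    merged: list[str] = []
    pending = ""  # short leading clauses waiting for a first clause
    for clause in clauses:
        if len(clause) >= min_len:
            merged.append(pending + clause)
            pending = ""
        elif merged:
            merged[-1] += clause
        else:
            pending += clause
    if pending:  # the text contained only short clauses
        merged.append(pending)
    return merged


if __name__ == "__main__":
    parts = ["项目负责人张三对团队成员说", "接下来我们要更进一步", "不忘初心", "砥砺前行"]
    print(merge_short_clauses(parts))
    # ['项目负责人张三对团队成员说', '接下来我们要更进一步不忘初心砥砺前行']
```

With the preset length of 10, the first clause (13 characters) and the second (10 characters) each become a first clause, and the two 4-character clauses are appended to the second.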
In this embodiment the preprocessing includes filtering and clause splitting. Filtering removes factors that interfere with originality screening, namely the various meaningless symbols; clause splitting decomposes the text to be screened into clauses, so that the originality of the text is screened by determining the originality of each clause. Refining the objects of screening in this way improves the accuracy of original text screening.
Further, a third embodiment of the original text screening method of this application is proposed on the basis of the first and second embodiments.
The third embodiment of the original text screening method differs from the first and second embodiments in that, after the step of marking the clauses corresponding to the third hash values as non-original clauses, the method further includes:
Step d: if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses of the text to be screened and the j-th to (j+k)-th clauses of a target object all do not exceed the first preset value, calculating a first edit distance between the (i-n)-th first clause of the text to be screened and the (j-n)-th clause of the target object, and a second edit distance between the (i+k+m)-th first clause of the text to be screened and the (j+k+m)-th clause of the target object, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 to infinity, i+k+m is less than or equal to the number of clauses in the text to be screened, and j+k+m is less than or equal to the number of clauses in the target object;
Step e: if the ratio of the first edit distance to the length of the (j-n)-th clause is less than a second preset value, marking the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the length of the (j+k+m)-th clause is less than the second preset value, marking the (i+k+m)-th first clause as a non-original clause.
This embodiment addresses texts to be screened in which subjects, pronouns and the like have been substituted. After the text to be screened has been decomposed into clauses and the originality of each clause determined, edit distances are additionally calculated for clauses to further confirm their originality.
Each step is described in detail below:
Step d: if it is determined that the Hamming distances between the i-th to (i+k)-th first clauses of the text to be screened and the j-th to (j+k)-th clauses of a target object all do not exceed the first preset value, calculating a first edit distance between the (i-n)-th first clause of the text to be screened and the (j-n)-th clause of the target object, and a second edit distance between the (i+k+m)-th first clause of the text to be screened and the (j+k+m)-th clause of the target object, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 to infinity, i+k+m is less than or equal to the number of clauses in the text to be screened, and j+k+m is less than or equal to the number of clauses in the target object.
In this embodiment, after marking the non-original clauses, the screening device monitors in real time whether the marked clauses are consecutive. If the Hamming distances between the i-th to (i+k)-th first clauses of the text to be screened and the j-th to (j+k)-th clauses of the target object all do not exceed the first preset value, that is, the i-th to (i+k)-th first clauses have all been marked as non-original clauses, the device calculates the first edit distance between the (i-n)-th first clause of the text to be screened and the (j-n)-th clause of the target object, and the second edit distance between the (i+k+m)-th first clause of the text to be screened and the (j+k+m)-th clause of the target object. The (i-n)-th and (i+k+m)-th first clauses were not marked originally, but they may involve subject substitution, pronoun substitution and the like. Simple substitutions of this kind do not make a clause original, yet the Hamming distance cannot detect them. Therefore, once k consecutive first clauses of the text to be screened have been marked as non-original clauses, the clauses before and after that run may still be plagiarized, and the edit distance is calculated for the (i-n)-th and (i+k+m)-th first clauses. Here the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n takes values from 1 to i, m takes values from 1 to infinity, i+k+m is less than or equal to the number of clauses in the text to be screened, and j+k+m is less than or equal to the number of clauses in the target object.
The edit distance is the minimum number of editing operations required to turn one string into another. In this embodiment it is the minimum number of editing operations required to turn a clause of the text to be screened into a clause of the target object.
It should be noted that, besides the edit distance, the Jaccard distance or the length of the longest common subsequence may also be used to determine whether a previously unmarked clause is plagiarized; the alternatives are not listed exhaustively here.
Step e: if the ratio of the first edit distance to the length of the (j-n)-th clause is less than a second preset value, marking the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the length of the (j+k+m)-th clause is less than the second preset value, marking the (i+k+m)-th first clause as a non-original clause;
In this embodiment, after the first and second edit distances are obtained, the first edit distance is divided by the length of the (j-n)-th clause of the target object, and the second edit distance is divided by the length of the (j+k+m)-th clause of the target object. If the ratio of the first edit distance to the length of the (j-n)-th clause is less than the second preset value, the (i-n)-th first clause of the text to be screened is marked as a non-original clause; if the ratio is equal to or greater than the second preset value, the (i-n)-th first clause is not marked. Likewise, if the ratio of the second edit distance to the length of the (j+k+m)-th clause is less than the second preset value, the (i+k+m)-th first clause of the text to be screened is marked as a non-original clause; if the ratio is equal to or greater than the second preset value, the (i+k+m)-th first clause is not marked. The second preset value is preferably 0.1.
That is, the edit distance between the (i-1)-th first clause of the text to be screened and the (j-1)-th clause of the target object is calculated first. If the ratio of this edit distance to the length of the (j-1)-th clause is less than the second preset value, the (i-1)-th first clause is marked; if the ratio is greater than or equal to the second preset value, it is not marked. The edit distance between the (i-2)-th first clause and the (j-2)-th clause of the target object is then calculated, and so on, until an (i-n)-th first clause is reached that has already been marked. In the same way, the edit distance between the (i+k+1)-th first clause of the text to be screened and the (j+k+1)-th clause of the target object is calculated. If the ratio of this edit distance to the length of the (j+k+1)-th clause is less than the second preset value, the (i+k+1)-th first clause is marked; if the ratio is greater than or equal to the second preset value, it is not marked. The edit distance between the (i+k+2)-th first clause and the (j+k+2)-th clause of the target object is then calculated, and so on, until an (i+k+m)-th first clause is reached that has already been marked, or the current first clause is the last first clause of the text to be screened.
When it is determined that the (i-n)-th first clause of the text to be screened has already been marked, plagiarism has already been established for that clause, and its edit distance need not be calculated. Likewise, when the (i+k+m)-th first clause of the text to be screened has already been marked, no edit distance needs to be calculated for the subsequent clauses. At this point, the number of words in the clauses marked as non-original is counted. These clauses include both the clauses initially marked by Hamming distance and the clauses subsequently marked by edit distance.
It should be noted that if the (i+k+m)-th first clause of the text to be screened has no counterpart in the target object, that is, the end of the target object has been reached, the (i+k+m)-th first clause of the text to be screened is likewise not marked.
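The expansion described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `edit_distance` and `expand_marks` are hypothetical, indices are 0-based rather than 1-based, and a plain Levenshtein distance is assumed for the edit distance.

```python
from typing import List, Set

SECOND_PRESET = 0.1  # preferred second preset value stated in this embodiment


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for x, ca in enumerate(a, start=1):
        curr = [x]
        for y, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[y] + 1, curr[y - 1] + 1, prev[y - 1] + cost))
        prev = curr
    return prev[-1]


def expand_marks(first: List[str], target: List[str],
                 i: int, j: int, k: int, marked: Set[int]) -> Set[int]:
    """Extend a Hamming-matched run first[i..i+k] ~ target[j..j+k] in both
    directions, marking neighbours whose normalised edit distance is small.
    Indices are 0-based (an assumption of this sketch)."""
    result = set(marked)

    def check(fi: int, tj: int) -> None:
        # Ratio of edit distance to the length of the target clause.
        if target[tj] and edit_distance(first[fi], target[tj]) / len(target[tj]) < SECOND_PRESET:
            result.add(fi)

    # Backward: stop at the start of either text or at an already-marked clause.
    n = 1
    while i - n >= 0 and j - n >= 0 and (i - n) not in marked:
        check(i - n, j - n)
        n += 1

    # Forward: stop at the end of either text or at an already-marked clause.
    # A clause with no counterpart in the target is never marked.
    m = 1
    while i + k + m < len(first) and j + k + m < len(target) and (i + k + m) not in marked:
        check(i + k + m, j + k + m)
        m += 1

    return result
```

Note that a neighbouring clause that fails the ratio test does not stop the expansion; only an already-marked clause or the end of either text does, which follows the wording of this embodiment.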
When determining whether the text to be screened is an original text, this embodiment considers not only the similarity between the text to be screened and the original objects but also subject substitution and pronoun substitution. Compared with using the Hamming distance algorithm alone, this effectively handles plagiarism scenarios involving extensive subject substitution and pronoun substitution, and further improves the accuracy of original text screening.
This application also provides an original text screening apparatus. The original text screening apparatus of this application includes:
an obtaining module, configured to obtain, after the text to be screened is received, one or more objects to be compared that correspond to the text to be screened from a preset original database;
a preprocessing module, configured to preprocess the text to be screened to obtain one or more first clauses;
a first determining module, configured to compare each first clause with each object to be compared and determine the non-original clauses among the first clauses;
a second determining module, configured to determine that the text to be screened is an original text if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold.
Further, the obtaining module is also configured to:
when the text to be screened is received, determine the text length of the text to be screened, and cut the text to be screened into a number of character strings corresponding to the text length;
obtain, from the preset original database, matching objects that match the character strings, and select a preset number of objects to be compared from the matching objects.
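As a rough illustration of this retrieval step, the sketch below cuts the text into fixed-size strings (so a longer text yields more strings) and ranks database entries by how many strings they contain. Everything specific here is an assumption: the segment size, the in-memory dictionary standing in for the original database, substring containment as the matching rule, and the names `cut_into_strings` and `select_candidates`.

```python
from typing import Dict, List


def cut_into_strings(text: str, chars_per_string: int = 20) -> List[str]:
    """Cut the text into consecutive strings; the count grows with text length."""
    return [text[p:p + chars_per_string] for p in range(0, len(text), chars_per_string)]


def select_candidates(text: str, database: Dict[str, str], preset_number: int = 3) -> List[str]:
    """Return the ids of up to preset_number database objects matching the most strings."""
    strings = cut_into_strings(text)
    hits: Dict[str, int] = {}
    for doc_id, body in database.items():
        count = sum(1 for s in strings if s in body)
        if count > 0:
            hits[doc_id] = count
    # Most matches first; ties broken by id so the result is deterministic.
    ranked = sorted(hits, key=lambda d: (-hits[d], d))
    return ranked[:preset_number]
```

A production system would more likely query a full-text index than scan every stored document; the scan is used here only to keep the sketch self-contained.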
Further, the preprocessing module is also configured to:
filter the text to be screened based on a preset filtering rule to obtain filtered text;
split the filtered text into clauses based on a preset clause-splitting rule to obtain one or more first clauses.
Further, the preprocessing module is also configured to:
split the filtered text based on the preset clause-splitting rule to obtain sub-clauses, and determine in turn whether the word count of each sub-clause reaches a preset word count;
if the word count of the current sub-clause reaches the preset word count, set the current sub-clause as a first clause;
if the word count of the current sub-clause does not reach the preset word count, merge the current sub-clause into the first clause that was set based on the previous sub-clause.
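The splitting and merging rule can be sketched as below. The delimiter set, the preset word count of 5, and the function name `split_first_clauses` are assumptions for illustration; this section does not fix them. Word count is taken as character count, which suits Chinese text. A short sub-clause with no predecessor is kept as its own clause, which is also an assumption.

```python
import re
from typing import List

# Hypothetical clause-splitting rule: split on common Chinese and Western punctuation.
DELIMITERS = r"[，。！？；,.!?;]"
PRESET_WORD_COUNT = 5  # hypothetical preset word count


def split_first_clauses(filtered_text: str, preset: int = PRESET_WORD_COUNT) -> List[str]:
    """Split filtered text into sub-clauses and merge short ones into the preceding clause."""
    sub_clauses = [s.strip() for s in re.split(DELIMITERS, filtered_text) if s.strip()]
    first_clauses: List[str] = []
    for sub in sub_clauses:
        if len(sub) >= preset or not first_clauses:
            # Long enough (or nothing to merge into): becomes a first clause.
            first_clauses.append(sub)
        else:
            # Too short: append to the first clause built from the previous sub-clause.
            first_clauses[-1] += sub
    return first_clauses
```

Merging short sub-clauses keeps very short fragments from being hashed on their own, where a near-match would carry little evidence of copying.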
Further, the first determining module is also configured to:
generate a first hash value corresponding to each first clause;
retrieve the hash value set corresponding to each object to be compared, where each hash value set contains multiple second hash values;
compare the first hash values with the second hash values and determine, among the first hash values, the third hash values whose Hamming distance to at least one second hash value is less than or equal to a first preset value;
mark, among the first clauses, the clauses corresponding to the third hash values as non-original clauses.
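The hash comparison can be sketched as follows. This section does not name the hash function, so the sketch assumes a 64-bit SimHash over character bigrams built on MD5, and a first preset value of 3; both are illustrative choices, as are the function names.

```python
import hashlib
from typing import Iterable, List, Set

HASH_BITS = 64
FIRST_PRESET = 3  # hypothetical first preset value


def simhash(clause: str) -> int:
    """64-bit SimHash over character bigrams (an assumed hashing scheme)."""
    features = [clause[p:p + 2] for p in range(max(len(clause) - 1, 1))]
    weights = [0] * HASH_BITS
    for feature in features:
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest()[:8], "big")
        for bit in range(HASH_BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(HASH_BITS) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two hash values differ."""
    return bin(a ^ b).count("1")


def mark_non_original(first_clauses: List[str],
                      hash_sets: Iterable[Set[int]],
                      first_preset: int = FIRST_PRESET) -> Set[int]:
    """Return indices of first clauses whose hash is within first_preset bits
    of at least one second hash value in any candidate's hash value set."""
    second_hashes = [h2 for hash_set in hash_sets for h2 in hash_set]
    marked: Set[int] = set()
    for idx, clause in enumerate(first_clauses):
        h1 = simhash(clause)
        if any(hamming_distance(h1, h2) <= first_preset for h2 in second_hashes):
            marked.add(idx)
    return marked
```

Because the hash value sets of the stored objects can be computed once and reused, only the first hash values need to be generated when a new text arrives.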
Further, the first determining module is also configured to:
if it is determined that none of the Hamming distances between the i-th to (i+k)-th first clauses of the text to be screened and the j-th to (j+k)-th clauses of a target object exceeds the first preset value, calculate a first edit distance between the (i-n)-th first clause of the text to be screened and the (j-n)-th clause of the target object, and a second edit distance between the (i+k+m)-th first clause of the text to be screened and the (j+k+m)-th clause of the target object, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n ranges over the set from 1 to i, m ranges over the set from 1 to infinity, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;
if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, mark the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, mark the (i+k+m)-th first clause as a non-original clause.
Further, the first determining module is also configured to:
count the number of words in the non-original clauses of the text to be screened, and determine the proportion of the non-original clauses in the text to be screened based on that number and the total number of words in the text to be screened.
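The proportion and the final decision made by the second determining module can be sketched as follows. The threshold of 0.3 is a hypothetical value, the function names are illustrative, and the total word count is approximated by the summed length of the first clauses, which is an assumption of this sketch.

```python
from typing import List, Set


def non_original_ratio(first_clauses: List[str], marked: Set[int]) -> float:
    """Share of characters that belong to clauses marked as non-original."""
    total = sum(len(clause) for clause in first_clauses)
    if total == 0:
        return 0.0
    copied = sum(len(first_clauses[idx]) for idx in marked)
    return copied / total


def is_original(first_clauses: List[str], marked: Set[int],
                plagiarism_threshold: float = 0.3) -> bool:
    """The text counts as original when the ratio is not greater than the threshold."""
    return non_original_ratio(first_clauses, marked) <= plagiarism_threshold
```

Weighting by word count rather than by clause count means a single long copied clause contributes more to the proportion than several short ones.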
This application also provides a computer-readable storage medium.
An original text screening program is stored on the computer-readable storage medium of this application, and when the original text screening program is executed by a processor, the steps of the original text screening method described above are implemented.
For the method implemented when the original text screening program running on the processor is executed, reference may be made to the embodiments of the original text screening method of this application, which are not repeated here.
It should be noted that, in this document, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or system that includes that element.
The serial numbers of the foregoing embodiments of this application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims (10)

  1. An original text screening method, wherein the original text screening method includes the following steps:
    after the text to be screened is received, obtaining one or more objects to be compared that correspond to the text to be screened from a preset original database;
    preprocessing the text to be screened to obtain one or more first clauses;
    comparing each first clause with each object to be compared to determine the non-original clauses among the first clauses;
    if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold, determining that the text to be screened is an original text.
  2. The original text screening method according to claim 1, wherein the step of obtaining, after the text to be screened is received, one or more objects to be compared that correspond to the text to be screened from a preset original database includes:
    when the text to be screened is received, determining the text length of the text to be screened, and cutting the text to be screened into a number of character strings corresponding to the text length;
    obtaining, from the preset original database, matching objects that match the character strings, and selecting a preset number of objects to be compared from the matching objects.
  3. The original text screening method according to claim 1, wherein the step of preprocessing the text to be screened to obtain one or more first clauses includes:
    filtering the text to be screened based on a preset filtering rule to obtain filtered text;
    splitting the filtered text into clauses based on a preset clause-splitting rule to obtain one or more first clauses.
  4. The original text screening method according to claim 3, wherein the step of splitting the filtered text into clauses based on a preset clause-splitting rule to obtain one or more first clauses includes:
    splitting the filtered text based on the preset clause-splitting rule to obtain sub-clauses, and determining in turn whether the word count of each sub-clause reaches a preset word count;
    if the word count of the current sub-clause reaches the preset word count, setting the current sub-clause as a first clause;
    if the word count of the current sub-clause does not reach the preset word count, merging the current sub-clause into the first clause that was set based on the previous sub-clause.
  5. The original text screening method according to any one of claims 1 to 4, wherein the step of comparing each first clause with each object to be compared to determine the non-original clauses among the first clauses includes:
    generating a first hash value corresponding to each first clause;
    retrieving the hash value set corresponding to each object to be compared, where each hash value set contains multiple second hash values;
    comparing the first hash values with the second hash values and determining, among the first hash values, the third hash values whose Hamming distance to at least one second hash value is less than or equal to a first preset value;
    marking, among the first clauses, the clauses corresponding to the third hash values as non-original clauses.
  6. The original text screening method according to claim 5, wherein after the step of marking the clauses corresponding to the third hash values as non-original clauses, the method further includes:
    if it is determined that none of the Hamming distances between the i-th to (i+k)-th first clauses of the text to be screened and the j-th to (j+k)-th clauses of a target object exceeds the first preset value, calculating a first edit distance between the (i-n)-th first clause of the text to be screened and the (j-n)-th clause of the target object, and a second edit distance between the (i+k+m)-th first clause of the text to be screened and the (j+k+m)-th clause of the target object, where the target object is one of the objects to be compared, i and j are constants greater than 0, k is a preset constant, n ranges over the set from 1 to i, m ranges over the set from 1 to infinity, i+k+m is less than or equal to the number of clauses of the text to be screened, and j+k+m is less than or equal to the number of clauses of the target object;
    if the ratio of the first edit distance to the clause length of the (j-n)-th clause is less than a second preset value, marking the (i-n)-th first clause as a non-original clause; if the ratio of the second edit distance to the clause length of the (j+k+m)-th clause is less than the second preset value, marking the (i+k+m)-th first clause as a non-original clause.
  7. The original text screening method according to claim 1, wherein after the determining of the non-original clauses among the first clauses, the method further includes:
    counting the number of words in the non-original clauses of the text to be screened, and determining the proportion of the non-original clauses in the text to be screened based on that number and the total number of words in the text to be screened.
  8. An original text screening apparatus, wherein the original text screening apparatus includes:
    an obtaining module, configured to obtain, after the text to be screened is received, one or more objects to be compared that correspond to the text to be screened from a preset original database;
    a preprocessing module, configured to preprocess the text to be screened to obtain one or more first clauses;
    a first determining module, configured to compare each first clause with each object to be compared and determine the non-original clauses among the first clauses;
    a second determining module, configured to determine that the text to be screened is an original text if the proportion of the non-original clauses in the text to be screened is not greater than a preset plagiarism threshold.
  9. An original text screening device, wherein the original text screening device includes a memory, a processor, and an original text screening program that is stored on the memory and can run on the processor, and when the original text screening program is executed by the processor, the steps of the original text screening method according to any one of claims 1 to 7 are implemented.
  10. A computer-readable storage medium, wherein an original text screening program is stored on the computer-readable storage medium, and when the original text screening program is executed by a processor, the steps of the original text screening method according to any one of claims 1 to 7 are implemented.
PCT/CN2020/101003 2019-07-23 2020-07-09 Original text screening method, apparatus, device and computer-readable storage medium WO2021012958A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910669863.0A CN110347806B (en) 2019-07-23 2019-07-23 Original text screening method, original text screening device, original text screening equipment and computer readable storage medium
CN201910669863.0 2019-07-23

Publications (1)

Publication Number Publication Date
WO2021012958A1 true WO2021012958A1 (en) 2021-01-28

Family

ID=68179981

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/101003 WO2021012958A1 (en) 2019-07-23 2020-07-09 Original text screening method, apparatus, device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110347806B (en)
WO (1) WO2021012958A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347806B (en) * 2019-07-23 2024-02-06 深圳前海微众银行股份有限公司 Original text screening method, original text screening device, original text screening equipment and computer readable storage medium
CN113836892B (en) * 2021-09-08 2023-08-08 灵犀量子(北京)医疗科技有限公司 Sample size data extraction method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160196342A1 (en) * 2015-01-06 2016-07-07 Inha-Industry Partnership Plagiarism Document Detection System Based on Synonym Dictionary and Automatic Reference Citation Mark Attaching System
CN106326197A (en) * 2016-08-23 2017-01-11 达而观信息科技(上海)有限公司 Method for fast detecting repeated copying texts
CN107992470A (en) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 A kind of text duplicate checking method and system based on similarity
CN108009599A (en) * 2017-12-27 2018-05-08 福建中金在线信息科技有限公司 A kind of original document determination methods, device, electronic equipment and storage medium
CN109918670A (en) * 2019-03-12 2019-06-21 重庆誉存大数据科技有限公司 A kind of article duplicate checking method and system
CN110347806A (en) * 2019-07-23 2019-10-18 深圳前海微众银行股份有限公司 Original text discriminating method, device, equipment and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100406671B1 (en) * 2000-07-24 2003-11-21 주식회사 유니마이다스 Method of searching for piracy and steal on a piece of writing
KR101565367B1 (en) * 2015-02-17 2015-11-03 주식회사 무하유 Method for calculating plagiarism rate of documents by number normalization
KR101663454B1 (en) * 2016-08-03 2016-10-07 주식회사 비욘드테크 Apparatus of sentence similarity calculation using keyword weight and method thereof

Also Published As

Publication number Publication date
CN110347806B (en) 2024-02-06
CN110347806A (en) 2019-10-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20844962

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20844962

Country of ref document: EP

Kind code of ref document: A1