WO2021239114A1

WO2021239114A1 - Method for synonym editing and determining creator of text

Info

Publication number: WO2021239114A1
Application number: PCT/CN2021/096771
Authority: WO
Inventors: 黄凯明; 杨磊; 潘覃
Original assignee: 支付宝(杭州)信息技术有限公司
Priority date: 2020-05-29
Filing date: 2021-05-28
Publication date: 2021-12-02
Also published as: CN111381191A; CN111381191B

Abstract

Disclosed is a method for synonym editing and determining the creator of a text. For an original text created by a creator, at least some of the key words in the original text are replaced on the basis of a digital serial number of the creator and a fixed replacement rule. Thus, for plagiarised texts produced by article spinners, the digital serial number can be restored on the basis of the key words in the plagiarised text and the fixed replacement rule to prove the identity of the creator of the original text corresponding to the plagiarised text.

Description

A method for synonymous modification of text and determination of text creator

Technical field

The embodiments of this specification relate to the field of information technology, and in particular to a method for synonymously modifying a text and determining the creator of the text.

Background art

For text creators, how to effectively protect their copyright is a crucial issue.

In order to prevent a creator's text from being plagiarized, the usual approach is to insert a number of interfering characters into the text as the creator's mark. If a plagiarist does not know which characters in the text are interfering characters, then even if the wording of the text is adjusted (commonly known as "washing", or article spinning), the washed text will often still retain the creator's mark.

However, adding interfering characters to the text often impairs its readability and easily creates reading and comprehension barriers for readers.

Summary of the invention

In order to solve the problem that the existing method of adding interfering characters to a text reduces the readability of the text, the embodiments of this specification provide a method for synonymously modifying a text and a method for determining the creator of a text. The technical solutions are as follows:

According to the first aspect of the embodiments of this specification, a method for synonymously modifying a text is provided, including: obtaining a text to be modified, and extracting a keyword set of the text to be modified; for each keyword, determining the synonym set corresponding to the keyword, and forming a candidate word set from the keyword and the corresponding synonym set; for each candidate word set, sorting the words in the candidate word set according to a first sorting rule; and sorting the candidate word sets according to a second sorting rule; obtaining the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, adding the N_i-th word of the i-th candidate word set to a hit word set, where i = (1, 2, ..., S) and S is the number of digits of the digital number; for each keyword, if the keyword does not belong to the hit word set, replacing the keyword in the text to be modified with the hit word that is synonymous with the keyword.

According to the second aspect of the embodiments of this specification, a method for determining a text creator is provided, including: obtaining a text to be determined, and extracting a keyword set of the text to be determined; for each keyword, determining the synonym set corresponding to the keyword, and forming a candidate word set from the keyword and the corresponding synonym set; for each candidate word set, sorting the words in the candidate word set according to the first sorting rule; and sorting the candidate word sets according to the second sorting rule; for the i-th candidate word set, determining the ordinal position N_i of the keyword within that candidate word set, where i = (1, 2, ..., S) and S is the number of digits of the digital number; determining the digital number, wherein the i-th digit of the digital number is N_i; and determining the user corresponding to the determined digital number as the creator of the text to be determined.

According to the third aspect of the embodiments of this specification, another method for synonymously modifying a text is provided, including: obtaining a text to be modified, and extracting a keyword set of the text to be modified; determining a set of key paragraphs from the text to be modified, where each key paragraph contains more keywords than a specified number; for each key paragraph, performing the following steps: for each keyword in the key paragraph, determining the synonym set corresponding to the keyword, and forming a candidate word set from the keyword and the corresponding synonym set; for each candidate word set, sorting the words in the candidate word set according to the first sorting rule; and sorting the candidate word sets according to the second sorting rule; obtaining the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, adding the N_i-th word of the i-th candidate word set to the hit word set, where i = (1, 2, ..., S) and S is the number of digits of the digital number; for each keyword in the key paragraph, if the keyword does not belong to the hit word set, replacing the keyword in the key paragraph with the hit word that is synonymous with the keyword.

According to the fourth aspect of the embodiments of this specification, another method for determining the creator of a text is provided, including: obtaining a text to be determined, and extracting a keyword set of the text to be determined; determining, from the text to be determined, the paragraphs that contain more keywords than a specified number, to obtain a set of key paragraphs; for each key paragraph, performing the following steps: for each keyword in the key paragraph, determining the synonym set corresponding to the keyword, and forming a candidate word set from the keyword and the corresponding synonym set; for each candidate word set, sorting the words in the candidate word set according to the first sorting rule; and sorting the candidate word sets according to the second sorting rule; determining the digital number, wherein the i-th digit of the digital number is N_i, i = (1, 2, ..., S), and S is the number of digits of the digital number; after the steps have been performed for every key paragraph, determining the creator of the text to be determined according to the digital numbers determined from the key paragraphs.

In the technical solutions provided in the embodiments of this specification, for an original text created by a creator, at least some of the keywords in the original text are replaced according to the creator's digital number (which acts as an identity mark) and a fixed replacement rule, yielding a modified text that is then made public. In this way, for a plagiarized text produced by an article spinner from the published modified text, the digital number can be restored according to the keywords in the plagiarized text and the fixed replacement rule, to prove the identity of the creator of the original text corresponding to the plagiarized text.

In the embodiments of this specification, replacing keywords with synonyms does not impair the readability of the text. At the same time, because a fixed replacement rule is used, the creator's digital number can be restored when analyzing a plagiarized text without comparing it with the original text, which is more convenient.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and cannot limit the embodiments of this specification.

In addition, any one of the embodiments of the present specification does not need to achieve all the above-mentioned effects.

Description of the drawings

In order to describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some of the embodiments described in this specification. For those of ordinary skill in the art, other drawings can also be obtained based on these drawings.

FIG. 1 is a schematic flowchart of a method for synonymously modifying a text provided by an embodiment of this specification;

FIG. 2 is a schematic flowchart of a method for determining a text creator provided by an embodiment of this specification;

FIG. 3 is a schematic flowchart of another method for synonymously modifying a text provided by an embodiment of this specification;

FIG. 4 is a schematic flowchart of another method for determining a text creator provided by an embodiment of this specification;

FIG. 5 is a schematic structural diagram of a device for synonymously modifying text provided by an embodiment of this specification;

FIG. 6 is a schematic structural diagram of a device for determining a text creator provided by an embodiment of this specification;

FIG. 7 is a schematic structural diagram of a device for synonymously modifying text provided by an embodiment of this specification;

FIG. 8 is a schematic structural diagram of a device for determining a text creator provided by an embodiment of this specification;

FIG. 9 is a schematic structural diagram of a device for configuring the method of the embodiment of this specification.

Detailed description

Generally speaking, synonymously modifying a creator's original text (that is, replacing some words in the original text with synonyms) to obtain a modified text, and then making the modified text public, can protect the creator's text against plagiarism to a certain extent. When a plagiarist plagiarizes the published modified text, as long as the substituted synonyms are not lost in the plagiarized text, the creator can use them as evidence that the plagiarized text infringes the copyright of the original text.

However, the above method also has certain drawbacks. On the one hand, if the plagiarist, having grasped the main point of the modified text, makes substantial changes to it (such as deleting large sections, adding large sections, or heavily changing the wording), the resulting plagiarized text can easily lose the substituted synonyms, making it impossible to prove that the plagiarized text infringes the copyright of the original text. On the other hand, when a plagiarized text is found, it must be compared with the original text to discover which words were replaced, which is cumbersome.

For this reason, in the embodiments of this specification, on the one hand, only some or all of the keywords in the original text are replaced by synonyms to obtain the modified text. Since the keywords of the original text are usually closely related to its subject matter, even if a plagiarist makes substantial changes to the modified text, the resulting plagiarized text is unlikely to lose the synonyms of the original keywords. On the other hand, at least some of the keywords in the original text are replaced by synonyms according to the digital number of the creator of the original text (which uniquely identifies the creator) and a fixed replacement rule. In this way, when a plagiarized text is found, the digital number can be restored from the keywords in the plagiarized text and the fixed rule, without needing the original text, to prove that the plagiarized text infringes the copyright of the original text.

In addition, it should be noted that in the following text, a "set" usually includes at least one object.

In order to enable those skilled in the art to better understand the technical solutions in the embodiments of this specification, these technical solutions are described in detail below in conjunction with the drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification should fall within the scope of protection.

The technical solutions provided by the embodiments of this specification will be described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a method for synonymously modifying text provided by an embodiment of the present specification, including the following steps:

S100: Obtain a text to be modified, and extract a keyword set of the text to be modified.

The text to be modified refers to the original text created by the creator. In order to protect the copyright of the original text of the creator, the original text can be modified synonymously based on the method shown in FIG. 1.

In the embodiments of this specification, the term frequency-inverse document frequency (TF-IDF) algorithm may be used to extract the keyword set from the text to be modified. In the TF-IDF algorithm, the term frequency (TF) reflects how often a word occurs in the text, since the keywords of a text tend to be words that appear in it frequently; the inverse document frequency (IDF) reflects whether a word is a common word. A common word is not a keyword even if it occurs frequently in the text, so common words are given lower weights and uncommon words higher weights. An uncommon word that appears frequently in the text is therefore a keyword.

In addition, the keyword set of the text to be modified may also be extracted based on the BM25 algorithm (an algorithm for measuring the relevance between words and a text). The higher the relevance of a word to the text to be modified, the more likely the word is to be determined as a keyword.
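The TF-IDF weighting described above can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment: the function name `tfidf_keywords`, the smoothed IDF formula, the whitespace tokenization, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus_tokens, top_k):
    """Return the top_k tokens of doc_tokens ranked by TF-IDF (illustrative)."""
    n_docs = len(corpus_tokens)
    # Document frequency: in how many corpus documents each token occurs.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    scores = {}
    for word, count in tf.items():
        # Smoothed IDF (an assumption): common words get a low weight,
        # and words absent from the corpus do not cause division by zero.
        idf = math.log((1 + n_docs) / (1 + df[word])) + 1.0
        scores[word] = (count / len(doc_tokens)) * idf
    # Highest score first; ties are broken alphabetically so the result is stable.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Toy background corpus and document (hypothetical data).
corpus = [
    "the ship sailed on the ocean".split(),
    "the sun set late on the sea".split(),
    "the war began in the summer".split(),
]
doc = "the war brought heat and the war brought fear".split()
keywords = tfidf_keywords(doc, corpus, 2)
```

In this toy run the common word "the" receives the lowest score even though it occurs as often as "war", which is the behaviour the paragraph above describes.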

S102: For each keyword, determine the synonym set corresponding to the keyword, and combine the keyword and the corresponding synonym set to form a candidate word set.

In the embodiments of this specification, the synonym set corresponding to each keyword can be determined by querying a synonym table. Alternatively, the word vector of each keyword can be determined based on the word2vec algorithm; then, for each keyword, the distance between the word vector of the keyword and the word vector of each word in a corpus is calculated, and the words in the corpus whose distance is less than a specified distance are determined as synonyms of the keyword.
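The distance-threshold selection of synonyms can be sketched as below. The vectors are hand-written toy values rather than the output of a trained word2vec model, and the use of cosine distance is an assumption, since the text only speaks of a "distance"; the names `cosine_distance` and `synonyms_by_distance` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumed here as the distance measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def synonyms_by_distance(keyword, vectors, max_distance):
    """Words whose vector lies closer to the keyword than max_distance."""
    target = vectors[keyword]
    return sorted(
        word
        for word, vec in vectors.items()
        if word != keyword and cosine_distance(target, vec) < max_distance
    )


# Toy 3-dimensional "word vectors" (hypothetical values, not trained).
toy_vectors = {
    "war": [0.9, 0.1, 0.0],
    "battle": [0.8, 0.2, 0.1],
    "conflict": [0.85, 0.15, 0.05],
    "breeze": [0.0, 0.2, 0.9],
}
war_synonyms = synonyms_by_distance("war", toy_vectors, 0.1)
```

With a threshold of 0.1, "battle" and "conflict" are selected while "breeze" is rejected. In practice the threshold also controls how many synonyms each keyword receives, which matters because each candidate word set needs enough words to cover the value range of a digit.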

S104: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.

In the embodiments of this specification, the first sorting rule refers to a rule for sorting the words in each candidate word set, and the second sorting rule refers to a rule for sorting among the candidate word sets.

It is worth emphasizing that once the candidate word sets are fixed, the result of sorting the words within each candidate word set according to the first sorting rule is also fixed, and so is the result of sorting the candidate word sets according to the second sorting rule.

S106: Obtain the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, add the N_i-th word of the i-th candidate word set to the hit word set.

S108: For each keyword, if the keyword does not belong to the hit word set, replace the keyword in the text to be modified with a hit word synonymous with the keyword.

In the embodiments of this specification, the user's digital number refers to a number that uniquely identifies the user. The user's ID number, mobile phone number, or the unique number obtained by the user after registering an account in a certain business system can be used as the user's digital number; alternatively, the user's unique account name registered in the business system can be mapped to a digital number according to a certain mapping rule.

Herein, the number of digits of the digital number is denoted S, and the digital number is usually decimal. It can be understood that in the method shown in FIG. 1, at least S keywords need to be determined from the text to be modified to form the keyword set.

In addition, it should be noted that in practical applications, the value range of each digit of the digital number needs to be considered when setting the number of words in each candidate word set.

For example, if the digital number has S digits and each digit ranges over (1, 9), that is, there are 9 possible values for each digit, then the number of words in each candidate word set can be set to 9, which means that 8 synonyms need to be determined for each keyword.

Of course, the number of synonyms to determine for each keyword can also be based on the digital numbers of all users stored in the system. For example, if the system stipulates that the digital number has S digits and each digit ranges over (1, 5), that is, there are 5 possible values for each digit, then the number of words in each candidate word set can be set to 5, which means that at least 4 synonyms must be determined for each keyword.

In the embodiments of this specification, i = (1, 2, ..., S) is defined, and the i-th digit of the digital number is N_i.

The hit word set refers to the set of words that should finally appear at the keyword positions in the modified text. It is worth emphasizing that, for a given text to be modified, the keywords are fixed, the first sorting rule and the second sorting rule are fixed, and the creator's digital number is fixed, so the resulting hit words are also fixed. After the keywords in the text to be modified are replaced according to the fixed hit word set (some keywords are themselves hit words and need no replacement), the modified text is obtained. After the modified text is "washed" by a plagiarist, the plagiarized text is obtained. A plagiarized text usually does not lose the keywords of the modified text. Therefore, the digital number can be restored according to the keywords in the plagiarized text and the fixed replacement rule.
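Steps S106 and S108 can be sketched as follows. This is a simplified illustration under stated assumptions: the candidate word sets are taken as already sorted, the text is a list of tokens, and the function names `build_hit_words` and `apply_hit_words` as well as the English toy words are hypothetical.

```python
def build_hit_words(sorted_candidate_sets, digital_number):
    """Pick the N_i-th word (1-based) of the i-th sorted candidate word set."""
    digits = [int(ch) for ch in str(digital_number)]
    if len(sorted_candidate_sets) < len(digits):
        raise ValueError("need at least one candidate word set per digit")
    hits = []
    for i, n_i in enumerate(digits):
        candidates = sorted_candidate_sets[i]
        if not 1 <= n_i <= len(candidates):
            raise ValueError(f"digit {n_i} has no word in candidate set {i + 1}")
        hits.append(candidates[n_i - 1])
    return hits


def apply_hit_words(tokens, sorted_candidate_sets, hits):
    """Replace every word of a candidate set with that set's hit word."""
    replacement = {}
    for candidates, hit in zip(sorted_candidate_sets, hits):
        for word in candidates:
            replacement[word] = hit  # the hit word maps to itself
    return [replacement.get(token, token) for token in tokens]


# Toy candidate word sets, already sorted (hypothetical data).
candidate_sets = [
    ["battle", "war"],
    ["hurried", "rushed", "sped"],
    ["harsh", "severe"],
]
hits = build_hit_words(candidate_sets, 121)
original = "the war was severe so they hurried".split()
modified = apply_hit_words(original, candidate_sets, hits)
```

Because every input (keywords, sorting rules, digital number) is fixed, repeated runs always produce the same modified text, which is what allows the number to be read back later without the original.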

Through the method shown in Figure 1, for an original text created by a creator, at least some of the keywords in the original text are replaced according to the creator's digital number (which acts as an identity mark) and a fixed replacement rule, to obtain a modified text that is then made public. In this way, for a plagiarized text produced by an article spinner from the published modified text, the digital number can be restored according to the keywords in the plagiarized text and the fixed replacement rule, to prove the identity of the creator of the original text corresponding to the plagiarized text. Replacing keywords with synonyms does not impair the readability of the text. At the same time, because a fixed replacement rule is used, the creator's digital number can be restored when analyzing a plagiarized text without comparing it with the original text, which is more convenient.

In the method shown in Figure 1, every occurrence of the keywords in the text to be modified can be replaced by synonyms. Because keywords are usually not confined to one or a few paragraphs, even if a plagiarist deletes some paragraphs of the modified text, the keywords will not necessarily be completely removed from the plagiarized text.

In addition, in the embodiments of this specification, the first sorting rule and the second sorting rule can be set flexibly, as long as they yield a fixed order. For example, the first sorting rule may be: if the text to be modified is a Chinese text, sort the words in each candidate word set in front-to-back order of the pinyin initial of the first character of each word. The second sorting rule may be: if the text to be modified is a Chinese text, sort the candidate word sets in front-to-back order of the pinyin initial of the first character of the first word in each candidate word set.

It should be noted that if two candidate words have the same first character, or their first characters have the same pinyin initial, their order is decided by the pinyin initial of the second character, from front to back.

Of course, it can also be sorted according to other rules such as the strokes of Chinese characters. In addition, if the text to be modified is an English text, the first letter of each word in the candidate word set can be used as the basis, and the words in the candidate word set can be sorted in the order of the first letter from front to back.
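For an English text, the two sorting rules can be sketched as below. This is one possible choice of rules, not the only one: plain lexicographic comparison of lowercased words is assumed, which compares first letters and falls back to later letters on a tie; the function name `sort_candidate_sets` is hypothetical.

```python
def sort_candidate_sets(candidate_sets):
    """Apply a first rule (inside each set) and a second rule (between sets)."""
    # First sorting rule: alphabetical order inside each candidate word set.
    # Lowercasing and de-duplicating first keeps the order independent of input.
    inner = [sorted({word.lower() for word in words}) for words in candidate_sets]
    # Second sorting rule: order the sets by their (now first) word.
    return sorted(inner, key=lambda words: words[0])


unsorted_sets = [
    ["war", "battle"],
    ["rushed", "hurried", "sped"],
    ["severe", "harsh"],
]
sorted_sets = sort_candidate_sets(unsorted_sets)
```

The only property the scheme relies on is determinism: the same candidate word sets must always come out in the same order, no matter how they were collected.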

In the embodiments of this specification, the modified text can be submitted to a blockchain for storage. Since data stored on the blockchain cannot be tampered with, this can serve as credible proof that "the user with the digital number is the creator of the modified text". Of course, the modified text can also be submitted to a high-security storage device for storage.

Fig. 2 is a schematic flowchart of a method for determining a text creator provided by an embodiment of the present specification, including the following steps:

S200: Acquire a text to be determined, and extract a keyword set of the text to be determined.

The text to be determined refers to a text that is suspected of plagiarism. In practical applications, a creator may find that a certain text is possibly a plagiarized version of the creator's published modified text; this can be proved by the method shown in Figure 2.

S202: For each keyword, determine a synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set.

S204: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.

Regarding the implementation of the steps before step S206, reference may be made to the foregoing.

S206: For the i-th candidate word set, determine the ordinal position N_i of the keyword within that candidate word set.

S208: Determine the digital number.

In the embodiments of this specification, the ordinal positions of the keywords in the first to the S-th candidate word sets can be combined in sequence into the digital number, where the i-th digit of the digital number is N_i.

S210: Identify the user corresponding to the determined digital number as the creator of the text to be determined.

If the text to be determined is a plagiarized text, it generally does not lose the keywords of the modified text (otherwise the key information of the text would be lost, which would affect the expression of its subject matter). Therefore, the user corresponding to the restored digital number is the creator of the modified text.
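Steps S206 and S208 can be sketched as follows. This is a simplified illustration: in the method the candidate word sets are rebuilt from the keywords of the suspected text, whereas here they are passed in directly; returning `None` for a missing or ambiguous position is an assumption, and the function name `recover_digital_number` is hypothetical.

```python
def recover_digital_number(text_tokens, sorted_candidate_sets, num_digits):
    """Read the digit N_i from the position of the surviving word in set i."""
    if len(sorted_candidate_sets) < num_digits:
        return None
    present = set(text_tokens)
    digits = []
    for candidates in sorted_candidate_sets[:num_digits]:
        # 1-based positions of the candidate words that occur in the text.
        positions = [
            position
            for position, word in enumerate(candidates, start=1)
            if word in present
        ]
        if len(positions) != 1:
            return None  # mark lost or ambiguous (assumed handling)
        digits.append(str(positions[0]))
    return "".join(digits)


candidate_sets = [
    ["battle", "war"],
    ["hurried", "rushed", "sped"],
    ["harsh", "severe"],
]
suspect = "after a battle so harsh they rushed off".split()
recovered = recover_digital_number(suspect, candidate_sets, 3)
```

The suspected text here is worded differently from any original, yet the three surviving hit words are enough to read back the number, with no access to the original text.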

Fig. 3 is a schematic flowchart of another method for synonym modification of text provided by an embodiment of the present specification, including the following steps:

S300: Obtain the text to be modified, and extract the keyword set of the text to be modified.

S302: Determine a set of key paragraphs from the text to be modified; each key paragraph contains more keywords than a specified number.

S304: Perform steps S3041-S3044 for each key paragraph.

S3041: For each keyword in the key paragraph, determine a synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set.

S3042: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.

S3043: Obtain the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, add the N_i-th word of the i-th candidate word set to the hit word set.

S3044: For each keyword in the key paragraph, if the keyword does not belong to the hit word set, replace the keyword in the key paragraph with a hit word synonymous with the keyword.

The method shown in Figure 3 is a variant of the method shown in Figure 1. In practice, replacing the keywords at every keyword position in the text with synonyms would change too much of the text. Therefore, keywords can be replaced with synonyms only in the key paragraphs of the text.
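The selection of key paragraphs in step S302 can be sketched as below. Counting distinct keywords (rather than occurrences) and splitting on whitespace are assumptions made for the example; the function name `select_key_paragraphs` is hypothetical.

```python
def select_key_paragraphs(paragraphs, keywords, specified_number):
    """Keep paragraphs containing more distinct keywords than specified_number."""
    keyword_set = set(keywords)
    key_paragraphs = []
    for paragraph in paragraphs:
        # Naive tokenization (assumption): lowercase and split on whitespace.
        found = keyword_set.intersection(paragraph.lower().split())
        if len(found) > specified_number:
            key_paragraphs.append(paragraph)
    return key_paragraphs


paragraphs = [
    "the war was severe and they rushed",
    "a calm breeze",
    "war again",
]
key_paragraphs = select_key_paragraphs(paragraphs, ["war", "severe", "rushed"], 1)
```

Only the first paragraph is selected: it contains three keywords, while the last contains exactly one, which is not greater than the specified number.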

Fig. 4 is a schematic flowchart of another method for determining a text creator provided by an embodiment of the present specification, including the following steps:

S400: Acquire a text to be determined, and extract a keyword set of the text to be determined.

S402: Determine, from the to-be-determined text, paragraphs that contain more keywords than a specified number, and obtain a set of key paragraphs.

S404: For each key paragraph, perform the following steps S4041-S4044.

S4041: For each keyword in the key paragraph, determine a synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set.

S4042: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort each candidate word set according to the second sorting rule.

S4043: Determine the digital number.

S406: After the execution of the steps for each key paragraph is completed, the creator of the text to be determined is determined according to the digital number determined based on each key paragraph.

The method shown in FIG. 4 is based on the method shown in FIG. 3.

In practical applications, a plagiarist may delete some key paragraphs of the modified text to obtain the plagiarized text.

If the text to be determined is a plagiarized text that retains only one key paragraph of the modified text, then the user corresponding to the digital number determined from that key paragraph can be determined as the creator of the text to be determined.

If the text to be determined is a plagiarized text that retains more than one key paragraph of the modified text, the digital numbers determined from different key paragraphs may be inconsistent. For this reason, in the method shown in Figure 3, a check digit P can be calculated according to the digital number and a preset calculation rule, and the P-th word in the (S+1)-th candidate word set is then added to the hit word set. This is equivalent to adding, in addition to the creator's mark, a check mark to the text to be modified, which is used to verify whether the creator's mark has been damaged or tampered with. In this case, the number of candidate word sets is at least S+1.

The preset calculation rule can be set according to actual needs, as long as it maps the digital number to a check digit deterministically.

For example, the preset calculation rule can be: take [formula missing] as the check digit P.

For another example, the preset calculation rule can be: convert [formula missing] to binary and take the last bit of the resulting binary number; if the last bit is 0, then P is 1, and if the last bit is 1, then P is 2.
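The second rule can be sketched as follows. The quantity that is converted to binary is not recoverable from the text, so the digit sum used here is purely an assumption chosen for illustration; it happens to agree with the worked example later in this description, where the number 121 yields a last bit of 0 and P = 1. The function names are hypothetical.

```python
def digit_sum(digital_number):
    """ASSUMED input quantity: the sum of the digits of the digital number."""
    return sum(int(ch) for ch in str(digital_number))


def check_digit(digital_number):
    """Last binary bit 0 -> P = 1, last binary bit 1 -> P = 2."""
    last_bit = bin(digit_sum(digital_number))[-1]
    return 1 if last_bit == "0" else 2


def check_hit_word(sorted_candidate_sets, digital_number):
    """The P-th word (1-based) of the (S+1)-th candidate word set."""
    s = len(str(digital_number))
    p = check_digit(digital_number)
    return sorted_candidate_sets[s][p - 1]


# Four toy candidate word sets for a 3-digit number (hypothetical data).
candidate_sets = [
    ["battle", "war"],
    ["hurried", "rushed", "sped"],
    ["harsh", "severe"],
    ["seized", "took"],
]
p = check_digit(121)
extra_hit = check_hit_word(candidate_sets, 121)
```

Any other deterministic mapping from the digital number to a position in the (S+1)-th set would serve equally well, provided the same rule is used when modifying the text and when analyzing a suspected copy.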

In the method shown in Figure 4, for each key paragraph in the text to be determined (some key paragraphs of the modified text may have been lost), a check digit Q can be calculated according to the determined digital number and the preset calculation rule. It is then determined whether the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph. If so, the determined digital number is added to the number set corresponding to the key paragraph; if not, the determined digital number is corrected to obtain at least one corrected digital number, which is added to the number set corresponding to the key paragraph. Finally, according to the number sets corresponding to the key paragraphs, the user corresponding to the digital number that occurs most frequently is determined as the creator of the text to be determined.

Each corrected digital number satisfies the following: the Q recalculated from it is such that the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph. Further, each corrected digital number also satisfies that a change-degree characterization value, which characterizes the degree of change from the determined digital number to the corrected digital number, is smaller than a specified value. The change-degree characterization value is positively correlated with the degree of change. Understandably, the assumption here is that even if the plagiarist makes significant changes to the modified text, the plagiarist will keep to its subject matter as far as possible. Therefore, among the corrected digital numbers that pass the check, the smaller the degree of change, the more likely the number is the digital number of the actual creator.
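The correction and voting procedure can be sketched as below. Several choices here are assumptions rather than part of the description: the check rule is the digit-sum parity rule (the actual formula is not recoverable from the text), the change-degree characterization value is taken to be the number of differing digits, and ties in the vote are broken by taking the smallest number. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def check_digit(number):
    """ASSUMED rule: parity of the digit sum; even -> 1, odd -> 2."""
    return 1 if sum(int(ch) for ch in number) % 2 == 0 else 2


def numbers_for_paragraph(decoded, observed_check, digit_values, change_limit):
    """Number set for one key paragraph, given the observed check position."""
    if check_digit(decoded) == observed_check:
        return [decoded]
    corrected = []
    for digits in product(digit_values, repeat=len(decoded)):
        candidate = "".join(digits)
        # Change degree (assumption): how many digits differ from the decoded number.
        changes = sum(a != b for a, b in zip(candidate, decoded))
        if 0 < changes < change_limit and check_digit(candidate) == observed_check:
            corrected.append(candidate)
    return corrected


def vote(number_sets):
    """Most frequent number over all paragraphs; smallest wins a tie."""
    counts = Counter(number for numbers in number_sets for number in numbers)
    if not counts:
        return None
    best = max(counts.values())
    return min(number for number, count in counts.items() if count == best)


# Three key paragraphs; the second was decoded as "122" but its check word
# sits at position 1, so "122" fails the check and is corrected.
number_sets = [
    numbers_for_paragraph("121", 1, "12", 2),
    numbers_for_paragraph("122", 1, "12", 2),
    numbers_for_paragraph("121", 1, "12", 2),
]
winner = vote(number_sets)
```

The damaged paragraph contributes three plausible corrections, but only "121" is also supported by the intact paragraphs, so it wins the vote.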

In order to better clarify this scheme, the following examples are given.

Assume that the user's digital number has 3 digits (S=3) and that each digit ranges over (1, 2). It is then necessary to extract S+1 (that is, 4) keywords from each key paragraph, and to determine at least one synonym for each keyword.

Suppose a key paragraph of the text to be modified (original text) is:

The Red Sea passed early, and the ship sailed on the Indian Ocean. But the sun still set unpleasantly late and rose early, encroaching on most of the night. The night seemed to be soaked with oil and turned into a translucent body; it hugged the sun and couldn't tell it, maybe it was intoxicated by the sun, so the night after the sunset was fading with red. When the red faded and the intoxication wore off, the sleeper in the cabin woke up with sweat, took a shower and rushed to the deck to blow the sea breeze. It was the beginning of another day. This is the hottest time of the year in late July, at the peak of the Chinese calendar. In China, the heat was even worse than usual. Afterwards, everyone said it was a sign of fighting, because this was the 26th year of the Republic of China.

The keywords in the above key paragraph are: "encroaching", "rushed", "worse" and "fighting".

For these four keywords, synonyms can be determined respectively:

(1) Synonyms of "encroaching": "grabbing", "eroding", "embezzling";

(2) Synonyms of "rushed": "hurried", "hastened";

(3) Synonym of "worse": "more serious";

(4) Synonym of "fighting": "war".

In this way, the following 4 candidate word sets are obtained:

(1) encroaching, grabbing, eroding, embezzling;

(2) rushed, hurried, hastened;

(3) worse, more serious;

(4) fighting, war.

The first sorting rule and the second sorting rule are then applied (here, front-to-back order by the first character, both within and between sets; the order follows the original-language words, so it is not alphabetical in translation), which gives:

(1) fighting, war;

(2) rushed, hurried, hastened;

(3) worse, more serious;

(4) grabbing, eroding, embezzling, encroaching.

Assuming that the digital number of the creator of the text to be modified is 121, then for the first three candidate word sets, the first word ("fighting"), the second word ("hurried") and the first word ("worse") are hit in sequence. For [formula missing], converting to binary and taking the last bit gives 0, so the check digit P is 1. The first word ("grabbing") in the fourth candidate word set is therefore also added to the hit word set.

Through the above, the hit word set corresponding to the above key paragraph is obtained as: "fighting", "hurried", "worse" and "grabbing". The keywords in the key paragraph are replaced according to the hit word set (if a keyword is itself a hit word, no replacement is needed), and the key paragraph in the modified text obtained after modification is:

The Red Sea passed early, and the ship sailed on the Indian Ocean. But the sun still set innocently, settling late and rising early, "grabbing" most of the night. The night seemed to be soaked with oil and turned into a translucent body; it hugged the sun and couldn't tell it, maybe it was intoxicated by the sun, so the night after the sunset was fading with red. When Hongxiao woke up, the sleeper in the cabin woke up with sweat, took a shower and "rushed" to the sea breeze on the deck. It was the beginning of another day. This is the hottest time of the year in late July, at the peak of the Chinese calendar. In China, it was more hot than usual. Afterwards, everyone said it was a phenomenon of "soldier fighting", because this was the 26th year of the Republic of China.

In actual applications, the above operations are performed for each key paragraph of the text to be modified.

When a plagiarist spins a publicly available modified text, the resulting plagiarized text usually keeps the main points of the key paragraph but changes the wording, for example as follows:

Ships traveling in the Indian Ocean have already sailed through the Red Sea. However, the sun still set slowly and rose early and reluctantly, "robbing" the beautiful night. The night is translucent, embracing the sun, the sun may be intoxicated. After the people in the cabin woke up, they "rushed" to the deck to blow the sea breeze and start a new day. This is the hottest time of the year in the doom of the Chinese Lunar Calendar. China's heat is even more "interesting" than in previous years, and it feels "war". After all, it is the 26th year of the Republic of China.

Although the wording of this paragraph of the plagiarized text has changed considerably, the paragraph can still be identified as a key paragraph, and its keywords are determined to be: invasion, rush away, interest, and bingge.

Based on the keywords of the key paragraph of the plagiarized text, the 4 candidate word sets are determined and sorted, which yields the same 4 sorted candidate word sets as in the modification stage:

(1) Soldiers and wars;

(2) Arrive, rush, rush;

(3) Interests and serious;

(4) Encroachment, erosion, embezzlement, embezzlement.

Bingge, which appears in the plagiarized text, is the first word in the first candidate word set, so the first digit of the number is 1. Rush away, which appears in the plagiarized text, is the second word in the second candidate word set, so the second digit of the number is 2. Interest, which appears in the plagiarized text, is the first word in the third candidate word set, so the third digit of the number is 1. Invasion, which appears in the plagiarized text, is the first word in the fourth candidate word set, so the check number P is 1, which indicates that the binary form of the sum of the three digits should end in 0. Indeed, the sum of the three digits of the number 121 is 4, whose binary form is 100; the last bit is 0, so the verification passes.
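The recovery and verification just described can be sketched in Python. This is an illustrative sketch only: the placeholder tokens stand in for the actual Chinese words, the function names are hypothetical, and the check rule is assumed to be the digit-sum rule used in this example.

```python
from typing import List, Set


def recover_number(sorted_sets: List[List[str]], found: Set[str], s: int) -> str:
    """Read digit N_i as the 1-based position of the keyword found in set i."""
    digits = []
    for i in range(s):
        position = next(
            (k + 1 for k, word in enumerate(sorted_sets[i]) if word in found),
            None,
        )
        if position is None:
            raise ValueError(f"no keyword from candidate set {i + 1} was found")
        digits.append(str(position))
    return "".join(digits)


def check_passes(sorted_sets: List[List[str]], found: Set[str], number: str) -> bool:
    """Verify that the Q-th word of the (S+1)-th set occurs among the keywords."""
    q = sum(int(ch) for ch in number) % 2 + 1  # assumed check rule
    check_set = sorted_sets[len(number)]
    return q <= len(check_set) and check_set[q - 1] in found


sorted_sets = [
    ["bingge", "war"],
    ["rush_a", "rush_b", "rush_c"],
    ["interest", "serious"],
    ["encroach_a", "encroach_b", "encroach_c", "encroach_d"],
]
found = {"bingge", "rush_b", "interest", "encroach_a"}
number = recover_number(sorted_sets, found, s=3)
print(number, check_passes(sorted_sets, found, number))
```

If the plagiarist had changed the third keyword to its synonym, the recovered number would differ and the check would no longer match the fourth-set word that is present.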

However, in actual applications, a plagiarized text may delete or change some of the keywords of the modified text, for example as follows:

Ships traveling in the Indian Ocean have already sailed through the Red Sea. However, the sun still set slowly and rose early and reluctantly, "robbing" the beautiful night. The night is translucent, embracing the sun, the sun may be intoxicated. After the people in the cabin woke up, they "rushed" to the deck to blow the sea breeze and start a new day. This is the hottest time of the year in the doom of the Chinese Lunar Calendar. China's heat is more "serious" than in previous years, and it feels like "war". After all, it is the 26th year of the Republic of China.

From the key paragraph of this plagiarized text, the restored digital number would be 122. The sum of its three digits is 5, whose binary form is 101; the last bit is 1, so the corresponding check number should be 2. The check number actually determined from this plagiarized text is 1, so the verification fails.

In fact, a plagiarized text often contains more than one key paragraph, and the digital numbers determined from different key paragraphs may not be consistent: the digital numbers of some key paragraphs may pass verification, while those of other key paragraphs may fail.

In this case, taking the key paragraph of the plagiarized text in the above example, if the determined digital number fails verification, the digital number is corrected with the smallest possible modification so that it passes verification. Clearly, correcting 122 to 121 passes verification. The corrected number 121 is then added to the number set corresponding to the key paragraph.

Looking at the plagiarized text as a whole, the number set corresponding to each key paragraph contains at least one digital number that passes verification. The digital number that occurs most frequently across the number sets of all key paragraphs is then identified. It is most likely the digital number of the actual creator, so the user corresponding to the most frequent digital number can be determined as the creator.
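One possible reading of "minimum modification" is a change to a single digit. The following Python sketch enumerates such corrections; it is an assumption-laden illustration, not the specification's definition: the function names are hypothetical, `set_sizes` (the number of words in each of the first S candidate word sets) is an input introduced for illustration, and the check rule is the digit-sum rule from the example.

```python
from typing import List


def check_number(number: str) -> int:
    """Assumed check rule: last binary bit of the digit sum, plus 1."""
    return sum(int(ch) for ch in number) % 2 + 1


def single_digit_corrections(number: str, set_sizes: List[int],
                             observed_check: int) -> List[str]:
    """List every number that differs from `number` in exactly one digit,
    keeps each digit within its candidate set, and matches the check
    position actually observed in the text."""
    corrected = []
    for i, ch in enumerate(number):
        for candidate in range(1, set_sizes[i] + 1):
            if candidate == int(ch):
                continue
            fixed = number[:i] + str(candidate) + number[i + 1:]
            if check_number(fixed) == observed_check:
                corrected.append(fixed)
    return corrected


# Number 122 was recovered, but the first word of the check set was observed.
print(single_digit_corrections("122", set_sizes=[2, 3, 2], observed_check=1))
```

Several corrected numbers can qualify at the same distance, which is consistent with adding at least one corrected number to the number set and letting the later frequency count across key paragraphs decide.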
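The frequency count across key paragraphs can be sketched with Python's `collections.Counter`. The function name and the sample number sets are hypothetical, and the handling of ties is not addressed by the text, so this sketch simply returns whichever of the most frequent numbers was counted first.

```python
from collections import Counter
from typing import Iterable, List, Optional


def most_frequent_number(number_sets: Iterable[List[str]]) -> Optional[str]:
    """Return the number that occurs in the most per-paragraph number sets."""
    counts: Counter = Counter()
    for numbers in number_sets:
        counts.update(set(numbers))  # count each number once per key paragraph
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical number sets for three key paragraphs of one plagiarized text.
number_sets = [
    ["121"],                         # passed verification directly
    ["222", "112", "132", "121"],    # corrected candidates for a failed check
    ["345"],                         # an inconsistent paragraph
]
print(most_frequent_number(number_sets))
```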

FIG. 5 is a schematic structural diagram of an apparatus for synonymously modifying text provided by an embodiment of this specification, including: an acquisition module 501, which acquires the text to be modified and extracts the keyword set of the text to be modified; a determination module 502, which, for each keyword, determines the synonym set corresponding to the keyword and forms a candidate word set from the keyword and the corresponding synonym set; a sorting module 503, which, for each candidate word set, sorts the words in the candidate word set according to the first sorting rule, and sorts the candidate word sets according to the second sorting rule; an adding module 504, which obtains the digital number of the user who created the text to be modified and, according to the i-th digit N_i of the digital number, adds the N_i-th word in the i-th candidate word set to the hit word set, where i = (1, 2, ..., S) and S is the number of digits of the digital number; and a modification module 505, which, for each keyword, if the keyword does not belong to the hit word set, replaces the keyword in the text to be modified with the hit word synonymous with the keyword.

If the text to be modified is a Chinese character text, the sorting module 503 takes the first character of each word in the candidate word set as the reference and sorts the words in the candidate word set by the first letter of the pinyin, from front to back.

If the text to be modified is a Chinese character text, the sorting module 503 takes the first character of the first word in each candidate word set as the reference and sorts the candidate word sets by the first letter of the pinyin, from front to back.
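The two sorting rules can be sketched as a two-level sort in Python. This is a sketch under assumptions: the pinyin initials are supplied by the caller through a hypothetical `initial_of` mapping rather than computed by a pinyin library, the sample Chinese words are illustrative reconstructions rather than quotations from the specification, and ties between words with the same initial are left in input order because the text does not say how to break them.

```python
from typing import Dict, List


def sort_candidate_sets(candidate_sets: List[List[str]],
                        initial_of: Dict[str, str]) -> List[List[str]]:
    """Apply the first rule inside each set, then the second rule across sets."""
    # First sorting rule: order the words of each set by the pinyin initial
    # of each word's first character.
    within = [sorted(words, key=lambda word: initial_of[word[0]])
              for words in candidate_sets]
    # Second sorting rule: order the sets by the pinyin initial of the first
    # character of each set's first word.
    return sorted(within, key=lambda words: initial_of[words[0][0]])


# Illustrative data (assumed): first character -> first letter of its pinyin.
initial_of = {"利": "l", "严": "y", "战": "z", "兵": "b", "侵": "q"}
candidate_sets = [["利害", "严重"], ["战争", "兵戈"], ["侵占"]]
ordered = sort_candidate_sets(candidate_sets, initial_of)
print(ordered)
```

Because both the modification stage and the determination stage apply the same fixed rules, the same keyword and synonym sets always end up in the same order, which is what allows the digits to be read back later.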

The device also includes: an attestation module 506, which submits the modified text to the blockchain for attestation.

FIG. 6 is a schematic structural diagram of an apparatus for determining a text creator provided by an embodiment of this specification, including: an obtaining module 601, which obtains the text to be determined and extracts the keyword set of the text to be determined; a first determining module 602, which, for each keyword, determines the synonym set corresponding to the keyword and forms a candidate word set from the keyword and the corresponding synonym set; a sorting module 603, which, for each candidate word set, sorts the words in the candidate word set according to the first sorting rule, and sorts the candidate word sets according to the second sorting rule; a second determining module 604, which, for the i-th candidate word set, determines the rank N_i, within that candidate word set, of the word that belongs to the keyword set, where i = (1, 2, ..., S) and S is the number of digits of the digital number; a third determining module 605, which determines the digital number, where the i-th digit of the digital number is N_i; and a fourth determining module 606, which determines the user corresponding to the determined digital number as the creator of the text to be determined.

FIG. 7 is a schematic structural diagram of an apparatus for synonymously modifying text provided by an embodiment of this specification, including: an acquisition module 701, which acquires the text to be modified and extracts the keyword set of the text to be modified; a determination module 702, which determines a set of key paragraphs from the text to be modified, where the number of keywords contained in each key paragraph is greater than a specified number; and an execution module 703, which, for each key paragraph, executes the following steps: for each keyword in the key paragraph, determining the synonym set corresponding to the keyword and forming a candidate word set from the keyword and the corresponding synonym set; for each candidate word set, sorting the words in the candidate word set according to the first sorting rule, and sorting the candidate word sets according to the second sorting rule; obtaining the digital number of the user who created the text to be modified and, according to the i-th digit N_i of the digital number, adding the N_i-th word in the i-th candidate word set to the hit word set, where i = (1, 2, ..., S) and S is the number of digits of the digital number; and, for each keyword in the key paragraph, if the keyword does not belong to the hit word set, replacing the keyword in the key paragraph with the hit word synonymous with the keyword.

The execution module 703 calculates the check digit P according to the digital number and a preset calculation rule, and adds the P-th word in the (S+1)-th candidate word set to the hit word set.

FIG. 8 is a schematic structural diagram of an apparatus for determining a text creator provided by an embodiment of this specification, including: an obtaining module 801, which obtains the text to be determined and extracts the keyword set of the text to be determined; a first determining module 802, which determines, from the text to be determined, the paragraphs that contain more than a specified number of keywords, to obtain a set of key paragraphs; an execution module 803, which, for each key paragraph, executes the following steps: for each keyword in the key paragraph, determining the synonym set corresponding to the keyword and forming a candidate word set from the keyword and the corresponding synonym set; for each candidate word set, sorting the words in the candidate word set according to the first sorting rule, and sorting the candidate word sets according to the second sorting rule; and determining the digital number, where the i-th digit of the digital number is N_i, i = (1, 2, ..., S), and S is the number of digits of the digital number; and a second determining module 804, which, after the steps have been executed for each key paragraph, determines the creator of the text to be determined according to the digital numbers determined from the key paragraphs.

The second determining module 804, for each key paragraph, calculates the check digit Q according to the determined digital number and the preset calculation rule, and determines whether the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph; if so, it adds the determined digital number to the number set corresponding to the key paragraph; if not, it corrects the determined digital number to obtain at least one corrected digital number and adds it to the number set corresponding to the key paragraph, where, for each corrected digital number, the Q recalculated from that digital number satisfies: the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph. Then, according to the number sets corresponding to the key paragraphs, the user corresponding to the digital number with the highest frequency is determined as the creator of the text to be determined.

The embodiments of this specification also provide a computer device, which includes at least a memory, a processor, and a computer program stored in the memory and capable of running on the processor. When executing the program, the processor implements the method executed by the client device or the server device in this specification.

FIG. 9 shows a more specific hardware structure diagram of a computing device provided by an embodiment of this specification. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 realize the communication connection between each other in the device through the bus 1050.

The processor 1010 may be implemented as a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes related programs to implement the technical solutions provided in the embodiments of this specification.

The memory 1020 may be implemented in the form of ROM (read-only memory), RAM (random access memory), a static storage device, a dynamic storage device, etc. The memory 1020 may store an operating system and other application programs. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the related program code is stored in the memory 1020 and is called and executed by the processor 1010.

The input/output interface 1030 is used to connect an input/output module to realize information input and output. The input/output module can be configured in the device as a component (not shown in the figure), or can be connected to the device externally to provide the corresponding functions. The input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, and the like.

The communication interface 1040 is used to connect a communication module (not shown in the figure) to realize the communication interaction between the device and other devices. The communication module can realize communication through wired means (such as USB, network cable, etc.), or through wireless means (such as mobile network, WIFI, Bluetooth, etc.).

The bus 1050 includes a path to transmit information between various components of the device (for example, the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).

It should be noted that although only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050 are shown for the above device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art can understand that the above device may also include only the components necessary to implement the solutions of the embodiments of this specification, and need not include all the components shown in the figures.

The embodiments of this specification also provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the method executed by the client device or the server device in this specification is implemented.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

From the description of the foregoing implementations, those skilled in the art can clearly understand that the embodiments of this specification can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification can be embodied in the form of a software product, which can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in some parts of the embodiments, of this specification.

The systems, methods, modules, or units illustrated in the above embodiments may be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. The computer can take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email sending and receiving device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

The embodiments in this specification are described in a progressive manner; the same or similar parts of the embodiments can be cross-referenced, and each embodiment focuses on its differences from the others. In particular, since the device embodiments are basically similar to the method embodiments, they are described relatively briefly, and the relevant parts of the description of the method embodiments apply. The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and when the solutions of the embodiments of this specification are implemented, the functions of the modules can be implemented in the same one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement this without creative effort.

The above are only specific implementations of the embodiments of this specification. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the embodiments of this specification, and these improvements and refinements should also be regarded as falling within the protection scope of the embodiments of this specification.

Claims

A method of synonymously modifying text, including:

Obtain the text to be modified, and extract the keyword set of the text to be modified;

For each keyword, determine the synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set;

For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule;

Obtain the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, add the N_i-th word in the i-th candidate word set to the hit word set; i = (1, 2, …, S), S is the number of digits of the digital number;

For each keyword, if the keyword does not belong to the hit word set, then the keyword in the text to be modified is replaced with a hit word synonymous with the keyword.
The method according to claim 1, wherein, according to the first sorting rule, sorting the words in the candidate word set includes:

If the text to be modified is a Chinese character text, the first character of each word in the candidate word set is used as a reference, and the words in the candidate word set are sorted according to the pinyin first letter from front to back.
The method according to claim 1, wherein sorting the candidate word sets according to the second sorting rule includes:

If the text to be modified is a Chinese character text, the first character of the first word in each candidate word set is used as a reference, and each candidate word set is sorted according to the pinyin first letter from front to back.
The method of claim 1, further comprising:

Submit the modified text to the blockchain for attestation.
A method of identifying the creator of a text, including:

Acquiring the text to be determined, and extracting the keyword set of the text to be determined;

For each keyword, determine the synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set;

For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule;

For the i-th candidate word set, determine the rank N_i, within the candidate word set, of the word that belongs to the keyword set; i = (1, 2, …, S), S is the number of digits of the digital number;

Determine the digital number, wherein the i-th digit of the digital number is N_i;

The user corresponding to the determined digital number is identified as the creator of the text to be determined.
A method of synonymously modifying text, including:

Obtain the text to be modified, and extract the keyword set of the text to be modified;

Determine a set of key paragraphs from the text to be modified; the number of keywords contained in each key paragraph is greater than a specified number;

For each key paragraph, perform the following steps:

For each keyword in the key paragraph, determine the synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set;

For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule;

Obtain the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, add the N_i-th word in the i-th candidate word set to the hit word set; i = (1, 2, …, S), S is the number of digits of the digital number;

For each keyword in the key paragraph, if the keyword does not belong to the hit word set, the keyword in the key paragraph is replaced with a hit word synonymous with the keyword.
The method according to claim 6, for each key paragraph, the following steps are further executed:

According to the digital number and the preset calculation rule, the check digit P is calculated;

The P-th word in the (S+1)-th candidate word set is added to the hit word set.
A method of identifying the creator of a text, including:

Acquiring the text to be determined, and extracting the keyword set of the text to be determined;

From the text to be determined, determine the paragraphs that contain more than a specified number of keywords, to obtain a set of key paragraphs;

For each key paragraph, perform the following steps:

For each keyword in the key paragraph, determine the synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set;

For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule;

Determine the digital number, wherein the i-th digit of the digital number is N_i; i = (1, 2, …, S), S is the number of digits of the digital number;

After the steps are executed for each key paragraph, the creator of the text to be determined is determined according to the digital numbers determined based on the key paragraphs.
The method according to claim 8, wherein determining the creator of the text to be determined according to the digital numbers determined based on the key paragraphs specifically includes:

For each key paragraph, calculate the check digit Q according to the determined digital number and the preset calculation rule;

Determine whether the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph;

If yes, add the determined digital number to the number set corresponding to the key paragraph;

If not, correct the determined digital number to obtain at least one corrected digital number and add it to the number set corresponding to the key paragraph; for each corrected digital number, the Q recalculated based on that digital number satisfies: the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph;

According to the number sets corresponding to the key paragraphs, the user corresponding to the digital number with the highest frequency is determined as the creator of the text to be determined.
A device for synonymous modification of text, including:

An acquiring module, acquiring the text to be modified, and extracting the keyword set of the text to be modified;

The determining module, for each keyword, determines the synonym set corresponding to the keyword, and forms a candidate word set with the keyword and the corresponding synonym set;

The sorting module, for each candidate word set, sorts the words in the candidate word set according to the first sorting rule; and sorts the candidate word sets according to the second sorting rule;

The adding module obtains the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, adds the N_i-th word in the i-th candidate word set to the hit word set; i = (1, 2, …, S), S is the number of digits of the digital number;

The modification module, for each keyword, if the keyword does not belong to the hit word set, replace the keyword in the text to be modified with a hit word synonymous with the keyword.
The device according to claim 10, wherein the sorting module, if the text to be modified is a Chinese character text, takes the first character of each word in the candidate word set as the reference and sorts the words in the candidate word set by the first letter of the pinyin, from front to back.
The device according to claim 10, wherein the sorting module, if the text to be modified is a Chinese character text, takes the first character of the first word in each candidate word set as the reference and sorts the candidate word sets by the first letter of the pinyin, from front to back.
The device according to claim 10, further comprising:

The attestation module submits the modified text to the blockchain for attestation.
A device for identifying the creator of a text, including:

An obtaining module, which obtains the text to be determined, and extracts the keyword set of the text to be determined;

The first determining module, for each keyword, determines the synonym set corresponding to the keyword, and forms a candidate word set with the keyword and the corresponding synonym set;

The sorting module, for each candidate word set, sorts the words in the candidate word set according to the first sorting rule; and sorts the candidate word sets according to the second sorting rule;

The second determining module, for the i-th candidate word set, determines the rank N_i, within the candidate word set, of the word that belongs to the keyword set; i = (1, 2, …, S), S is the number of digits of the digital number;

The third determining module determines the digital number, wherein the i-th digit of the digital number is N_i;

The fourth determining module determines the user corresponding to the determined digital number as the creator of the text to be determined.
A device for synonymous modification of text, including:

An acquiring module, acquiring the text to be modified, and extracting the keyword set of the text to be modified;

The determining module determines a set of key paragraphs from the text to be modified; the number of keywords contained in each key paragraph is greater than a specified number;

The execution module, for each key paragraph, executes the following steps: for each keyword in the key paragraph, determine the synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set; for each candidate word set, sort the words in the candidate word set according to the first sorting rule, and sort the candidate word sets according to the second sorting rule; obtain the digital number of the user who created the text to be modified, and, according to the i-th digit N_i of the digital number, add the N_i-th word in the i-th candidate word set to the hit word set; i = (1, 2, …, S), S is the number of digits of the digital number; for each keyword in the key paragraph, if the keyword does not belong to the hit word set, replace the keyword in the key paragraph with the hit word synonymous with the keyword.
The device according to claim 15, wherein the execution module calculates the check digit P according to the digital number and a preset calculation rule;

The P-th word in the (S+1)-th candidate word set is added to the hit word set.
A device for identifying the creator of a text, including:

An obtaining module, which obtains the text to be determined, and extracts the keyword set of the text to be determined;

The first determining module determines paragraphs that contain more keywords than a specified number from the text to be determined, and obtains a set of key paragraphs;

The execution module, for each key paragraph, executes the following steps: for each keyword in the key paragraph, determine the synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set; for each candidate word set, sort the words in the candidate word set according to the first sorting rule, and sort the candidate word sets according to the second sorting rule; determine the digital number, wherein the i-th digit of the digital number is N_i; i = (1, 2, …, S), S is the number of digits of the digital number;

The second determining module determines the creator of the text to be determined according to the digital numbers determined based on the key paragraphs, after the steps have been executed for each key paragraph.
The device according to claim 17, wherein the second determining module, for each key paragraph, calculates the check digit Q according to the determined digital number and the preset calculation rule; determines whether the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph; if yes, adds the determined digital number to the number set corresponding to the key paragraph; if not, corrects the determined digital number to obtain at least one corrected digital number and adds it to the number set corresponding to the key paragraph, wherein, for each corrected digital number, the Q recalculated based on that digital number satisfies: the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph; and, according to the number sets corresponding to the key paragraphs, determines the user corresponding to the digital number with the highest frequency as the creator of the text to be determined.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the method according to any one of claims 1-9 when executing the program.