WO2021239114A1 - Method for synonym editing and determining creator of text - Google Patents

Info

Publication number
WO2021239114A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
keyword
candidate word
determined
words
Prior art date
Application number
PCT/CN2021/096771
Other languages
French (fr)
Chinese (zh)
Inventor
黄凯明
杨磊
潘覃
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2021239114A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01R MEASURING ELECTRIC VARIABLES; MEASURING MAGNETIC VARIABLES
    • G01R31/00 Arrangements for testing electric properties; Arrangements for locating electric faults; Arrangements for electrical testing characterised by what is being tested not provided for elsewhere
    • G01R31/50 Testing of electric apparatus, lines, cables or components for short-circuits, continuity, leakage current or incorrect line connections
    • G01R31/62 Testing of transformers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the embodiments of this specification relate to the field of information technology, and in particular to a method for synonymously modifying a text and determining the creator of the text.
  • the usual idea is to add a number of disturbing characters between the lines of the text as the creator's mark. If a copyist does not know which characters in the text are interfering characters, even if the expression of the text is adjusted (commonly known as washing), the washed text will often retain the creator's mark.
  • the technical solutions provided in the embodiments of this specification start from the original text created by the creator: at least some of the keywords in the original text are replaced according to the creator's digital number (which acts as an identification mark) and fixed replacement rules, and the resulting modified text is made public.
  • the digital number can be restored according to the keywords in the plagiarized text and the fixed replacement rule to prove the identity of the creator of the original text corresponding to the plagiarized text.
  • the way of replacing keywords with synonyms will not affect the readability of the text.
  • the use of fixed replacement rules makes it possible to restore the creator's digital number when analyzing a plagiarized text without comparing it with the original text, which is more convenient.
  • any one of the embodiments of the present specification does not need to achieve all the above-mentioned effects.
  • FIG. 1 is a schematic flowchart of a method for synonymously modifying a text provided by an embodiment of this specification.
  • FIG. 2 is a schematic flowchart of a method for determining a text creator provided by an embodiment of this specification.
  • FIG. 3 is a schematic flowchart of another method for synonymously modifying a text provided by an embodiment of this specification.
  • FIG. 4 is a schematic flowchart of another method for determining a text creator provided by an embodiment of this specification.
  • FIG. 5 is a schematic structural diagram of a device for synonymously modifying text provided by an embodiment of this specification.
  • FIG. 6 is a schematic structural diagram of a device for determining a text creator provided by an embodiment of this specification.
  • FIG. 7 is a schematic structural diagram of a device for synonymously modifying text provided by an embodiment of this specification.
  • FIG. 8 is a schematic structural diagram of a device for determining a text creator provided by an embodiment of this specification.
  • FIG. 9 is a schematic structural diagram of a device for configuring the method of the embodiment of this specification.
  • synonymous modification of the creator's original text, that is, synonymous replacement of some words in the original text
  • when a plagiarist plagiarizes a publicly released modified text, as long as the substituted synonyms are not lost in the plagiarized text, they can serve as evidence that the plagiarized text infringes the copyright of the original text.
  • the above-mentioned method also has certain drawbacks. On the one hand, if the plagiarist makes substantial changes after grasping the main point of the modified text (such as deleting or adding large sections of content, or heavily rewording it), the substituted synonyms are easily lost in the resulting plagiarized text, which makes it impossible to prove that the plagiarized text infringes the copyright of the original text. On the other hand, when a plagiarized text is found, it must be compared with the original text to discover which of its words have been replaced, which is cumbersome.
  • selection usually includes at least one object.
  • Fig. 1 is a schematic flowchart of a method for synonymously modifying text provided by an embodiment of the present specification, including the following steps: S100: Obtain a text to be modified, and extract a keyword set of the text to be modified.
  • the text to be modified refers to the original text created by the creator.
  • the original text can be modified synonymously based on the method shown in FIG. 1.
  • a term frequency-inverse document frequency (TF-IDF) algorithm may be used to extract a set of keywords from the text to be modified.
  • Term frequency (TF) measures how often a word occurs in the text; the keywords of a text are often words that appear frequently in it. Inverse document frequency (IDF) measures whether a word is common: a common word is not a keyword even if it appears frequently in the text, so common words receive lower weights and uncommon words higher weights. An uncommon word that appears frequently in the text is a keyword.
  • S102: For each keyword, determine the synonym set corresponding to the keyword, and combine the keyword and the corresponding synonym set to form a candidate word set.
  • the synonym set corresponding to each keyword can be determined by querying a synonym table. Alternatively, the word vector of each keyword can be determined with the word2vec algorithm; then, for each keyword, the distance between its word vector and the word vector of each word in a corpus is calculated, and the words in the corpus whose distance is less than a specified distance are determined to be synonyms of the keyword.
  • S104: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.
  • the first sorting rule refers to a rule for sorting the words in each candidate word set
  • the second sorting rule refers to a rule for sorting among the candidate word sets.
  • S106: Obtain the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, add the N_i-th word in the i-th candidate word set to the hit word set.
  • the user's digital number refers to a number that uniquely identifies the user's identity.
  • the user's ID number, mobile phone number, or the unique number obtained after registering an account in a business system can be used as the user's digital number; alternatively, the user's unique account name registered in the business system can be mapped to a digital number according to a mapping rule.
  • This specification denotes the number of digits of the digital number as S; the digital number is usually decimal. It can be understood that, in the method shown in FIG. 1, at least S keywords can be determined from the text to be modified to form the keyword set.
  • if the digital number has S digits and each digit takes a value in the range 1 to 9, there are 9 possible values per digit; the number of words in each candidate word set can then be set to 9, which means that 8 synonyms need to be determined for each keyword.
  • the number of synonyms per keyword can also be determined from the digital numbers of all users stored in the system. For example, if the system stipulates that a digital number has S digits and each digit takes a value in the range 1 to 5, there are 5 possible values per digit; the number of words in each candidate word set can then be set to 5, which means that at least 4 synonyms must be determined for each keyword.
  • the hit word set refers to the set of words that should eventually appear at each keyword position in the modified text. It is worth emphasizing here that for a text to be modified, the keywords are fixed, the first sorting rule and the second sorting rule are also fixed, and the creator’s digital number is fixed, so the final hit words are also stable. After replacing the keywords in the text to be modified according to the fixed set of hit words (some keywords are the hit words themselves, no need to replace), the modified text is obtained. After the revised text is washed by the copyist, the plagiarized text is obtained. Plagiarized text usually does not lose the keywords in the text to be modified. Therefore, according to the keywords in the plagiarized text and fixed replacement rules, the digital number can be restored.
  • the first sorting rule and the second sorting rule can be flexibly set, as long as the sorting can be fixed.
  • the first sorting rule may be: if the text to be modified is a Chinese text, take the first character of each word in the candidate word set as the reference and sort the words in the candidate word set in front-to-back order of the first letter of that character's pinyin;
  • the second sorting rule may be: if the text to be modified is a Chinese text, take the first character of the first word in each candidate word set as the reference and sort the candidate word sets in front-to-back order of the first letter of that character's pinyin.
  • if the first characters of two candidate words are the same, or the pinyin initials of their first characters are the same, the order is decided by comparing their second characters, again from front to back.
  • if the text to be modified is an English text, the first letter of each word in the candidate word set can be used as the reference, and the words in the candidate word set can be sorted in front-to-back order of that first letter.
  • the modified text can be submitted to a blockchain for attestation; because data on the blockchain cannot be tampered with, this can serve as credible proof that the user of the digital number is the creator of the modified text.
  • the modified text can also be submitted to a high-security storage device for storage.
  • Fig. 2 is a schematic flowchart of a method for determining a text creator provided by an embodiment of the present specification, including the following steps: S200: Acquire a text to be determined, and extract a keyword set of the text to be determined.
  • the text to be determined refers to a text that is suspected of plagiarism.
  • if the creator finds that a certain text may be a plagiarized version of the publicly released modified text, this can be proved by the method shown in FIG. 2.
  • S202: For each keyword, determine a synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set.
  • S204: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.
  • the ordinal positions of the keywords in the first through S-th candidate word sets can be combined in sequence into a digital number, whose i-th digit is N_i.
  • S210: Identify the user corresponding to the determined digital number as the creator of the text to be determined.
  • even if the text to be determined is a plagiarized text, it generally does not lose the keywords of the modified text (otherwise key information would be lost, which would affect the expression of the text's theme). Therefore, the user corresponding to the restored digital number is the creator of the modified text.
  • Fig. 3 is a schematic flowchart of another method for synonym modification of text provided by an embodiment of the present specification, including the following steps: S300: Obtain the text to be modified, and extract the keyword set of the text to be modified.
  • S302: Determine a set of key paragraphs from the text to be modified; the number of keywords included in the set of key paragraphs is greater than a specified number.
  • S3042: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.
  • S3043: Obtain the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, add the N_i-th word in the i-th candidate word set to the hit word set.
  • Fig. 4 is a schematic flowchart of another method for determining a text creator provided by an embodiment of the present specification, including the following steps: S400: Acquire a text to be determined, and extract a keyword set of the text to be determined.
  • S402: Determine, from the to-be-determined text, paragraphs that contain more keywords than a specified number, and obtain a set of key paragraphs.
  • S4041: For each keyword in the key paragraph, determine a synonym set corresponding to the keyword, and form a candidate word set with the keyword and the corresponding synonym set.
  • S4042: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort each candidate word set according to the second sorting rule.
  • the method shown in FIG. 4 is based on the method shown in FIG. 3.
  • the plagiarist may delete some key paragraphs of the modified text to obtain the plagiarized text.
  • if the text to be determined is a plagiarized text, then even if it retains only one key paragraph of the modified text, the user corresponding to the digital number determined from that key paragraph can be determined as the creator of the text to be determined.
  • the check digit P can be calculated from the digital number according to a preset calculation rule, and the P-th word in the (S+1)-th candidate word set is then added to the hit word set. This is equivalent to adding a check mark to the text to be modified, in addition to the creator mark, to verify whether the creator mark has been damaged or tampered with.
  • the number of candidate word sets is at least S+1.
  • the preset calculation rule can be set according to actual needs, as long as the digital number can be stably mapped into a check digit.
  • the preset calculation rule can be: take … as the check digit P.
  • the preset calculation rule can be: convert the sum of the digits of the digital number to binary and take the last bit of the resulting binary number; if the last bit is 0, then P is 1, and if the last bit is 1, then P is 2.
  • for each key paragraph, the check digit Q can be calculated from the determined digital number according to the preset calculation rule; it is then determined whether the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph. If so, the determined digital number is added to the number set corresponding to the key paragraph; if not, the determined digital number is corrected to obtain at least one corrected digital number, which is added to the number set corresponding to the key paragraph. Based on the number sets of all key paragraphs, the user corresponding to the most frequent digital number is determined as the creator of the text to be determined.
  • for each corrected digital number, the Q recalculated from it satisfies: the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph. Further, each corrected digital number also satisfies that the change-degree value, which characterizes the degree of change from the determined digital number to the corrected digital number, is smaller than a specified value; the change-degree value is positively correlated with the degree of change. The assumption here is that even if the plagiarist makes significant changes to the modified text, they will stay as close as possible to its subject matter. Therefore, among corrected digital numbers that pass verification, the smaller the degree of change, the more likely the number is the actual creator's digital number.
  • the key words in the above key paragraphs include: occupation, rush, interest, and war.
  • the first rule and the second rule are used for sorting (the first character is sorted from front to back within and between sets), and we get:
  • the first word (bingge), the second word (rush away), and the first word (interest) are hit in sequence. Converting the sum of the digits to binary and taking the last bit gives 0, so the check digit P is 1.
  • the first word (invasion) in the fourth candidate word set is also added to the hit word set.
  • the set of hit words corresponding to the above-mentioned key paragraphs can be obtained as: soldiering, rushing, taking advantage, and invading.
  • the hit word set replace the keywords in the key paragraph (if the keyword itself is a hit word, no replacement is needed), the key paragraph in the modified text obtained after modification is:
  • the word "bingge" appearing in the plagiarized text is the first word in the first candidate word set, so the first digit of the digital number is 1; "rush away" is the second word in the second candidate word set, so the second digit is 2; "interest" is the first word in the third candidate word set, so the third digit is 1.
  • the word "invasion" in the plagiarized text is the first word in the fourth candidate word set, so the check digit P is 1, indicating that the last bit of the binary form of the sum of the three digits of the digital number should be 0. In fact, the sum of the three digits of the digital number 121 is 4, whose binary form is 100 with last bit 0, so the check passes.
  • a plagiarized text may delete or modify certain keywords of the modified text, for example, as follows:
  • in this case the restored digital number may be 122; the sum of its three digits is 5, whose binary form is 101 with last bit 1, so the corresponding check digit should be 2. However, the check digit determined from this plagiarized text is 1, so the check fails.
  • at least one number in the number set corresponding to the key paragraph can pass verification. Then, across the number sets of all key paragraphs, the digital number with the highest frequency is counted; it is most likely the actual creator's digital number, and the corresponding user can be determined as the creator.
  • the sorting module 503, if the text to be modified is a Chinese text, takes the first character of each word in the candidate word set as the reference and sorts the words in the candidate word set in front-to-back order of the first letter of the pinyin.
  • the sorting module 503, if the text to be modified is a Chinese text, takes the first character of the first word in each candidate word set as the reference and sorts the candidate word sets in front-to-back order of the first letter of the pinyin.
  • the device also includes: an attestation module 506, which submits the modified text to the blockchain for attestation.
  • a third determination module 605 determines the digital number, where the i-th digit of the digital number is N_i; a fourth determination module 606 identifies the user corresponding to the determined digital number as the creator of the text to be determined.
  • the execution module 703 calculates the check digit P from the digital number according to the preset calculation rule, and adds the P-th word in the (S+1)-th candidate word set to the hit word set.
  • the second determining module 804, for each key paragraph, calculates the check digit Q from the determined digital number according to the preset calculation rule, and determines whether the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph; if so, it adds the determined digital number to the number set corresponding to the key paragraph; if not, it corrects the determined digital number to obtain at least one corrected digital number and adds it to the number set corresponding to the key paragraph, where for each corrected digital number the recalculated Q satisfies: the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph. Based on the number sets of all key paragraphs, the user corresponding to the most frequent digital number is determined as the creator of the text to be determined.
  • the embodiments of this specification also provide a computer device, which includes at least a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • when the processor executes the program, the method executed by the client device or the server device in this specification is implemented.
  • FIG. 9 shows a more specific hardware structure diagram of a computing device provided by an embodiment of this specification.
  • the device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050.
  • the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 realize the communication connection between each other in the device through the bus 1050.
  • the processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes related programs to realize the technical solutions provided in the embodiments of this specification.
  • the memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, etc.
  • the memory 1020 may store an operating system and other application programs. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, related program codes are stored in the memory 1020 and called and executed by the processor 1010.
  • the input/output interface 1030 is used to connect an input/output module to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide the corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and an output device may include a display, a speaker, a vibrator, an indicator light, and the like.
  • the communication interface 1040 is used to connect a communication module (not shown in the figure) to realize the communication interaction between the device and other devices.
  • the communication module can realize communication through wired means (such as USB, network cable, etc.), or through wireless means (such as mobile network, WIFI, Bluetooth, etc.).
  • the bus 1050 includes a path to transmit information between various components of the device (for example, the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).
  • although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050, in a specific implementation the device may also include other components necessary for normal operation.
  • the above-mentioned device may also include only the components necessary to implement the solutions of the embodiments of the present specification, and not necessarily include all the components shown in the figures.
  • the embodiments of this specification also provide a computer-readable storage medium on which a computer program is stored.
  • when the program is executed by a processor, the method executed by the client device or the server device in this specification is implemented.
  • Computer-readable media includes permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
  • a typical implementation device is a computer.
  • the specific form of the computer can be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email sending and receiving device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • the various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments.
  • the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
  • the device embodiments described above are merely illustrative, and the modules described as separate components may or may not be physically separated.
  • when implementing the solutions of the embodiments of this specification, the functions of the modules can be implemented in the same one or more pieces of software and/or hardware. It is also possible to select some or all of the modules according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement this without creative work.
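
The check-digit rule and the 121 / 122 worked example in the definitions above can be illustrated with a minimal Python sketch. It assumes the parity-style rule described there (sum the digits, convert to binary, last bit 0 gives P = 1, last bit 1 gives P = 2). The function names, the `max_digit` parameter, and the choice of "exactly one digit changed" as the change-degree limit are illustrative assumptions, not part of the specification.

```python
def check_digit(digital_number: str) -> int:
    """Parity-style check digit: last binary bit of the digit sum, mapped 0 -> 1 and 1 -> 2."""
    digit_sum = sum(int(d) for d in digital_number)
    last_bit = bin(digit_sum)[-1]  # e.g. bin(4) == '0b100', so the last bit is '0'
    return 1 if last_bit == "0" else 2


def passes_check(digital_number: str, observed_check: int) -> bool:
    """True if the check digit recomputed from the number matches the one read from the text."""
    return check_digit(digital_number) == observed_check


def single_digit_corrections(digital_number: str, observed_check: int, max_digit: int = 9) -> list[str]:
    """Assumed correction strategy: try every number that differs in exactly one digit
    (a small 'degree of change') and keep those that pass the check."""
    corrected = []
    for i, d in enumerate(digital_number):
        for v in range(1, max_digit + 1):
            if str(v) == d:
                continue
            candidate = digital_number[:i] + str(v) + digital_number[i + 1:]
            if passes_check(candidate, observed_check):
                corrected.append(candidate)
    return corrected


print(check_digit("121"))       # 1 (digit sum 4, binary 100)
print(passes_check("122", 1))   # False (digit sum 5, binary 101, expected check digit 2)
print(single_digit_corrections("122", 1, max_digit=2))  # ['222', '112', '121']
```

With several key paragraphs, each paragraph would contribute its own list of passing numbers, and the most frequent number across all lists would be taken as the creator's.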

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Power Engineering (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a method for synonym editing and determining the creator of a text. For an original text created by a creator, at least some of the key words in the original text are replaced on the basis of a digital serial number of the creator and a fixed replacement rule. Thus, for plagiarised texts produced by article spinners, the digital serial number can be restored on the basis of the key words in the plagiarised text and the fixed replacement rule to prove the identity of the creator of the original text corresponding to the plagiarised text.

Description

A method for synonymous modification of text and determination of the text creator
Technical field
The embodiments of this specification relate to the field of information technology, and in particular to a method for synonymously modifying a text and determining the creator of the text.
Background
For creators of text, how to effectively protect their copyright is a crucial issue.
To prevent a creator's text from being plagiarized, the usual approach is to insert a number of interference characters between the lines of the text as a creator mark. If a plagiarist does not know which characters in the text are interference characters, then even if the wording of the text is adjusted (commonly known as article spinning), the spun text will often retain the creator mark.
However, adding interference characters to the text in this way tends to impair its readability and can easily hinder the reader's comprehension.
Summary of the invention
To solve the problem that the existing approach of adding interference characters to text reduces its readability, the embodiments of this specification provide a method for synonymously modifying text and determining the text creator. The technical solution is as follows: According to a first aspect of the embodiments of this specification, a method for synonymously modifying text is provided, including: obtaining a text to be modified, and extracting a keyword set of the text to be modified; for each keyword, determining the synonym set corresponding to the keyword, and forming a candidate word set from the keyword and its corresponding synonym set; for each candidate word set, sorting the words in the candidate word set according to a first sorting rule; and sorting the candidate word sets according to a second sorting rule; obtaining the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, adding the N_i-th word in the i-th candidate word set to a hit word set, where i = (1, 2, …, S) and S is the number of digits of the digital number; and, for each keyword, if the keyword does not belong to the hit word set, replacing the keyword in the text to be modified with the hit word that is synonymous with the keyword.
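
The steps of the first aspect can be sketched in Python as follows. This is a minimal illustration for English text, assuming alphabetical order as both the first sorting rule (within a set) and the second sorting rule (between sets, by each set's first word), and assuming the keyword-to-synonym mapping is supplied by the caller. The function names and the sample data are hypothetical.

```python
import re


def build_candidate_sets(keyword_synonyms: dict[str, list[str]]) -> list[tuple[str, list[str]]]:
    """Return (keyword, sorted candidate words) pairs, ordered by each set's first word."""
    sets = [(kw, sorted({kw, *syns})) for kw, syns in keyword_synonyms.items()]
    sets.sort(key=lambda pair: pair[1][0])
    return sets


def choose_replacements(keyword_synonyms: dict[str, list[str]], digital_number: str) -> dict[str, str]:
    """Map each keyword to its hit word; keywords that are already the hit word are omitted."""
    sets = build_candidate_sets(keyword_synonyms)
    if len(sets) < len(digital_number):
        raise ValueError("need at least one candidate word set per digit")
    replacements = {}
    for digit, (keyword, words) in zip(digital_number, sets):
        n = int(digit)
        if not 1 <= n <= len(words):
            raise ValueError(f"digit {n} has no matching word in {words}")
        hit = words[n - 1]  # the N_i-th word of the i-th candidate word set
        if hit != keyword:
            replacements[keyword] = hit
    return replacements


def apply_replacements(text: str, replacements: dict[str, str]) -> str:
    """Replace whole-word occurrences of each keyword with its hit word."""
    if not replacements:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, replacements)) + r")\b")
    return pattern.sub(lambda m: replacements[m.group(1)], text)


synonyms = {"quick": ["fast", "rapid"], "big": ["large", "huge"], "begin": ["start", "commence"]}
hits = choose_replacements(synonyms, "231")
print(hits)  # {'begin': 'commence', 'big': 'large', 'quick': 'fast'}
print(apply_replacements("The quick fox saw a big river begin to rise.", hits))
# The fast fox saw a large river commence to rise.
```

Because both orderings are deterministic, the same keyword set and digital number always yield the same hit words, which is what allows the number to be read back later.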
根据本说明书实施例的第2方面,提供一种确定文本创作者的方法,包括:获取待确定文本,并提取所述待确定文本的关键词集合;针对每个关键词,确定该关键词对应的同义词集合,并将该关键词与对应的同义词集合组成备选词集合;针对每个备选词集合,根据第一排序规则,将该备选词集合中的词进行排序;以及,根据第二排序规则,将各备选词集合进行排序;针对第i个备选词集合,确定该备选词集合中关键词的序位N i;i=(1,2,…,S),S为数字编号位数;确定数字编号;其中,所述数字编号的第i位数字为N i;将确定的数字编号对应的用户认定为所述待确定文本的创作者。 According to the second aspect of the embodiments of this specification, a method for determining a text creator is provided, including: obtaining a text to be determined, and extracting a keyword set of the text to be determined; for each keyword, determining the corresponding keyword The synonym set of, and the keyword and the corresponding synonym set to form a candidate word set; for each candidate word set, according to the first sorting rule, the words in the candidate word set are sorted; and, according to the first two collation, each ordered set of alternative words; for the i-th set of alternative words, it is determined that the alternative word keywords in sequence set N i; i = (1,2, ..., S), S is numbered digits; determining numbered; wherein, the i-th digit of said digital number N i; numbered corresponding to the determined user to identify the creator of the text to be determined.
根据本说明书实施例的第3方面,提供另一种对文本进行同义修改的方法,包括:获取待修改文本,并提取所述待修改文本的关键词集合;从所述待修改文本中确定出关 键段落集合;所述关键段落集合包含的关键词的数量大于指定数量;针对每个关键段落,执行以下步骤:针对该关键段落中的每个关键词,确定该关键词对应的同义词集合,并将该关键词与对应的同义词集合组成备选词集合;针对每个备选词集合,根据第一排序规则,将该备选词集合中的词进行排序;以及,根据第二排序规则,将各备选词集合进行排序;获取创作所述待修改文本的用户的数字编号;以及,根据所述数字编号的第i位N i,将第i个备选词集合中的第N i个词添加到命中词集合;i=(1,2,…,S),S为数字编号位数;针对该关键段落中的每个关键词,若该关键词不属于所述命中词集合,则将该关键段落中的该关键词替换成与该关键词同义的命中词。 According to the third aspect of the embodiments of this specification, another method for synonymously modifying text is provided, including: obtaining the text to be modified, and extracting the keyword set of the text to be modified; determining from the text to be modified A set of key paragraphs; the number of keywords contained in the set of key paragraphs is greater than the specified number; for each key paragraph, the following steps are performed: for each keyword in the key paragraph, determine the synonym set corresponding to the keyword, And the keyword and the corresponding synonym set form a candidate word set; for each candidate word set, the words in the candidate word set are sorted according to the first sorting rule; and, according to the second sorting rule, each ordered set of alternative words; creation obtaining the user to modify the text to be numbered; and, according to the digital number N i bit i, the i-th set of alternative words N i th The word is added to the hit word set; i=(1, 2,..., S), S is the number of digits; for each keyword in the key paragraph, if the keyword does not belong to the hit word set, then Replace the keyword in the key paragraph with a hit word that is synonymous with the keyword.
根据本说明书实施例的第4方面,提供另一种确定文本创作者的方法,包括:获取待确定文本,并提取所述待确定文本的关键词集合;从所述待确定文本中确定出包含的关键词的数量大于指定数量的段落,得到关键段落集合;针对每个关键段落,执行以下步骤:针对该关键段落中的每个关键词,确定该关键词对应的同义词集合,并将该关键词与对应的同义词集合组成备选词集合;针对每个备选词集合,根据第一排序规则,将该备选词集合中的词进行排序;以及,根据第二排序规则,将各备选词集合进行排序;确定数字编号;其中,所述数字编号的第i位数字为N i;i=(1,2,…,S),S为数字编号位数;在针对每个关键段落执行步骤完毕后,根据基于每个关键段落确定的数字编号,确定所述待确定文本的创作者。 According to the fourth aspect of the embodiments of this specification, another method for determining the creator of a text is provided, including: obtaining a text to be determined, and extracting a keyword set of the text to be determined; determining that the text contains If the number of keywords in is greater than the specified number of paragraphs, a set of key paragraphs is obtained; for each key paragraph, the following steps are performed: For each keyword in the key paragraph, determine the synonym set corresponding to the keyword, and add the key Words and corresponding synonym sets form a candidate word set; for each candidate word set, the words in the candidate word set are sorted according to the first sorting rule; and, according to the second sorting rule, each candidate is sorted sorting the set of words; determining numbered; wherein, the i-th digit of said digital number N i; i = (1,2, ..., S), S is the number of digital bits; in paragraph performed for each key After the steps are completed, the creator of the text to be determined is determined according to the number number determined based on each key paragraph.
In the technical solutions provided by the embodiments of this specification, at least some of the keywords in the original text created by a creator are replaced according to the creator's digital number (which serves as an identity identifier) and a fixed replacement rule, and the resulting modified text is published. For a plagiarized text that a plagiarist produces from the published modified text, the digital number can then be recovered from the keywords in the plagiarized text and the fixed replacement rule, which proves the identity of the creator of the original text underlying the plagiarized text.
With the embodiments of this specification, replacing keywords with synonyms does not impair the readability of the text. In addition, because the replacement rule is fixed, the creator's digital number can be recovered when analyzing a plagiarized text without comparing it against the original text, which is more convenient.
It should be understood that the general description above and the detailed description below are merely exemplary and explanatory, and do not limit the embodiments of this specification.
In addition, no single embodiment of this specification needs to achieve all of the effects described above.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in this specification; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a schematic flowchart of a method for synonymously modifying a text provided by an embodiment of this specification;
Fig. 2 is a schematic flowchart of a method for determining the creator of a text provided by an embodiment of this specification;
Fig. 3 is a schematic flowchart of another method for synonymously modifying a text provided by an embodiment of this specification;
Fig. 4 is a schematic flowchart of another method for determining the creator of a text provided by an embodiment of this specification;
Fig. 5 is a schematic structural diagram of an apparatus for synonymously modifying a text provided by an embodiment of this specification;
Fig. 6 is a schematic structural diagram of an apparatus for determining the creator of a text provided by an embodiment of this specification;
Fig. 7 is a schematic structural diagram of an apparatus for synonymously modifying a text provided by an embodiment of this specification;
Fig. 8 is a schematic structural diagram of an apparatus for determining the creator of a text provided by an embodiment of this specification;
Fig. 9 is a schematic structural diagram of a device for configuring the methods of the embodiments of this specification.
Detailed Description
In general, synonymously modifying a creator's original text (that is, replacing some words in the original text with synonyms) and publishing the resulting modified text can protect the creator's text against plagiarism to some extent. When a plagiarist copies the published modified text, as long as the substituted synonyms survive in the plagiarized text, they can serve as evidence that the plagiarized text infringes the copyright of the original text.
However, this approach has drawbacks. On the one hand, if the plagiarist grasps the main point of the modified text and then alters it substantially (for example, by deleting large passages, adding large passages, or heavily rewording it), the substituted synonyms are easily lost from the plagiarized text, and it can no longer be proved that the plagiarized text infringes the copyright of the original text. On the other hand, once a plagiarized text is found, it must be compared against the original text to discover which words in it were substituted, which is cumbersome.
For this reason, in the embodiments of this specification, on the one hand, only some or all of the keywords in the original text are replaced with synonyms to obtain the modified text. Because the keywords of the original text are usually closely tied to its main point, the synonyms of those keywords are unlikely to be lost from the plagiarized text even if the plagiarist alters the modified text substantially. On the other hand, at least some of the keywords in the original text are replaced with synonyms according to the digital number of the creator of the original text (which uniquely identifies the creator) and a fixed replacement rule. When a plagiarized text is found, the digital number can then be recovered from the fixed rule and the keywords in the plagiarized text, without the original text, to prove that the plagiarized text infringes the copyright of the original text.
Note also that, in what follows, a "set" generally contains at least one object.
To help those skilled in the art better understand the technical solutions in the embodiments of this specification, these solutions are described in detail below with reference to the drawings. The described embodiments are only some of the embodiments of this specification, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this specification shall fall within the scope of protection.
The technical solutions provided by the embodiments of this specification are described in detail below with reference to the drawings.
Fig. 1 is a schematic flowchart of a method for synonymously modifying a text provided by an embodiment of this specification. The method includes the following steps:
S100: Obtain a text to be modified, and extract a keyword set of the text to be modified.
The text to be modified is the original text created by the creator. To protect the copyright of the creator's original text, the original text can be synonymously modified using the method shown in Fig. 1.
In the embodiments of this specification, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm may be used to extract the keyword set from the text to be modified. In the TF-IDF algorithm, the term frequency (TF) measures how often a word occurs in the text; the keywords of a text are usually words that occur in it frequently. The inverse document frequency (IDF) measures whether a word is a common word. A common word is not a keyword even if it occurs frequently in the text, so common words receive a low weight and uncommon words a high weight. An uncommon word that occurs frequently in the text is a keyword.
Alternatively, the keyword set of the text to be modified may be extracted using the bm25 algorithm (an algorithm for measuring the relevance of a word to a text): the more relevant a word is to the text to be modified, the more likely it is to be determined as a keyword.
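As an illustration of the TF-IDF weighting described above, the following is a minimal sketch rather than the implementation used in the embodiments. The function name `tfidf_keywords`, the smoothed IDF formula, and the assumption that the text has already been split into tokens and that a background corpus is available are all illustrative choices not specified in the source.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus_tokens, top_k):
    """Return the top_k tokens of doc_tokens ranked by TF-IDF.

    doc_tokens: tokens of the text to be modified (tokenization is assumed).
    corpus_tokens: list of token lists used as a background corpus (assumed).
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus_tokens)
    scores = {}
    for word, count in tf.items():
        # Document frequency: number of background documents containing the word.
        df = sum(1 for doc in corpus_tokens if word in doc)
        # Smoothed IDF (an illustrative choice) so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / len(doc_tokens)) * idf
    # Rank by score, then alphabetically, so the result is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

A word that is frequent in the text but rare in the background corpus ranks highest. The deterministic tie-break matters here because the same keyword set must be recoverable later from a suspect text.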
S102: For each keyword, determine a synonym set corresponding to the keyword, and form a candidate word set from the keyword and its synonym set.
In the embodiments of this specification, the synonym set of each keyword may be determined by looking it up in a synonym table. Alternatively, a word vector may be determined for each keyword using the word2vec algorithm; then, for each keyword, the distance between its word vector and the word vector of every word in a corpus is computed, and the corpus words whose distance is less than a specified distance are determined as synonyms of the keyword.
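The vector-distance variant can be sketched as follows. This is a hedged illustration: the source does not say which distance is used, so cosine distance is an assumption, and the vectors passed in would in practice come from a trained word2vec model rather than being written by hand. The names `cosine_distance` and `synonyms_by_distance` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two vectors (assumed distance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def synonyms_by_distance(keyword, vectors, max_distance):
    """Return corpus words whose vector is closer than max_distance to the keyword's.

    vectors: dict mapping each corpus word to its word vector (assumed to come
    from a word2vec-style model).
    """
    target = vectors[keyword]
    return sorted(
        word
        for word, vec in vectors.items()
        if word != keyword and cosine_distance(target, vec) < max_distance
    )
```

For example, with toy vectors `{"ship": [1.0, 0.0], "vessel": [0.9, 0.1], "sun": [0.0, 1.0]}` and a threshold of 0.1, only "vessel" is close enough to "ship" to count as a synonym.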
S104: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.
In the embodiments of this specification, the first sorting rule orders the words within each candidate word set, and the second sorting rule orders the candidate word sets relative to one another.
It is worth emphasizing that, once the candidate word sets are fixed, the order of the words within each candidate word set produced by the first sorting rule is also fixed, and so is the order of the candidate word sets produced by the second sorting rule.
S106: Obtain the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, add the N_i-th word in the i-th candidate word set to the hit word set.
S108: For each keyword, if the keyword does not belong to the hit word set, replace the keyword in the text to be modified with the hit word that is synonymous with the keyword.
In the embodiments of this specification, a user's digital number is a number that uniquely identifies the user. The user's identity card number, mobile phone number, or the unique number assigned when the user registers an account in a business system may serve as the digital number; alternatively, the unique account name registered by the user in a business system may be mapped to a digital number according to a mapping rule.
In this document, the number of digits of the digital number is denoted S, and the digital number is usually decimal. In the method shown in Fig. 1, at least S keywords can therefore be determined from the text to be modified to form the keyword set.
Note also that, in practice, the number of words in each candidate word set must be chosen with the value range of each digit of the digital number in mind.
For example, if the digital number has S digits and each digit ranges from 1 to 9, so that each digit can take 9 values, then each candidate word set can be given 9 words, which means that 8 synonyms must be determined for each keyword.
The synonyms of each keyword can of course also be determined according to the digital numbers of all users stored in the system. For example, if the system specifies that the digital number has S digits and each digit ranges from 1 to 5, so that each digit can take 5 values, then each candidate word set can be given 5 words, which means that at least 4 synonyms must be determined for each keyword.
In the embodiments of this specification, i = 1, 2, …, S, and the i-th digit of the digital number is N_i.
The hit word set is the set of words that should finally appear at the keyword positions in the modified text. It is worth emphasizing that, for a given text to be modified, the keywords are fixed, the first and second sorting rules are fixed, and the creator's digital number is fixed, so the resulting hit words are fixed as well. The modified text is obtained by replacing the keywords in the text to be modified according to the fixed hit word set (some keywords are themselves hit words and need no replacement). After a plagiarist launders the modified text, a plagiarized text is obtained. A plagiarized text usually does not lose the keywords of the text to be modified, so the digital number can be recovered from the keywords in the plagiarized text and the fixed replacement rule.
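Steps S106 and S108 can be sketched as follows, as an illustration under stated assumptions rather than the embodiments' implementation. The sketch assumes the text is already tokenized, that the candidate word sets have already been sorted by both rules, and that digits start at 1; the names `build_hit_words` and `embed_number` are hypothetical.

```python
def build_hit_words(sorted_sets, digits):
    """Pick the N_i-th word (1-based) of the i-th sorted candidate word set."""
    return [sorted_sets[i][n - 1] for i, n in enumerate(digits)]


def embed_number(tokens, sorted_sets, digits):
    """Replace each keyword occurrence with the hit word of its candidate set.

    tokens: the text to be modified, already tokenized (assumed).
    sorted_sets: candidate word sets, sorted by the first and second rules.
    digits: the creator's digital number as a list of digits N_1..N_S.
    """
    hits = build_hit_words(sorted_sets, digits)
    replacement = {}
    # zip stops after S sets, so any extra candidate sets are left untouched.
    for words, hit in zip(sorted_sets, hits):
        for word in words:
            replacement[word] = hit
    # A token that is already the hit word maps to itself and stays unchanged.
    return [replacement.get(tok, tok) for tok in tokens]
```

With the candidate sets `[["boat", "ship", "vessel"], ["ocean", "sea"]]` and the digital number 3, 2, the token "ship" becomes "vessel" (the 3rd word of the first set) and "ocean" becomes "sea" (the 2nd word of the second set).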
With the method shown in Fig. 1, at least some of the keywords in the original text created by a creator are replaced according to the creator's digital number (which serves as an identity identifier) and a fixed replacement rule, and the resulting modified text is published. For a plagiarized text that a plagiarist produces from the published modified text, the digital number can then be recovered from the keywords in the plagiarized text and the fixed replacement rule, which proves the identity of the creator of the original text underlying the plagiarized text. Replacing keywords with synonyms does not impair the readability of the text, and because the replacement rule is fixed, the creator's digital number can be recovered when analyzing a plagiarized text without comparing it against the original text, which is more convenient.
In the method shown in Fig. 1, synonym replacement can be applied at every position where a keyword appears in the text to be modified. Because keywords are often not confined to one or a few paragraphs, even if a plagiarist deletes some paragraphs of the modified text, the keywords will not necessarily be completely removed from the plagiarized text.
In addition, in the embodiments of this specification, the first and second sorting rules can be set flexibly, as long as they produce a fixed order. For example, the first sorting rule may be: if the text to be modified is a Chinese text, sort the words in the candidate word set by the pinyin initial of the first character of each word, from front to back. The second sorting rule may be: if the text to be modified is a Chinese text, sort the candidate word sets by the pinyin initial of the first character of the first word in each candidate word set, from front to back.
Note that if two candidate words have the same first character, or first characters with the same pinyin initial, their relative order is decided by the pinyin initial of the second character, from front to back.
The words can of course also be sorted by other rules, such as the stroke count of the Chinese characters. In addition, if the text to be modified is an English text, the words in the candidate word set can be sorted by the first letter of each word, from front to back.
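For the English-text case, the two sorting rules can be sketched as follows. This is a hedged illustration: it uses full lexicographic order on lowercase words, which extends the first-letter rule described above to break ties on later letters, and the function name `sort_candidate_sets` is hypothetical. A pinyin-based version for Chinese text would need a pinyin conversion step that is not shown here.

```python
def sort_candidate_sets(candidate_sets):
    """Apply both sorting rules to a collection of candidate word sets.

    First rule: order the words inside each set alphabetically.
    Second rule: order the sets by their first (already sorted) word.
    Assumes lowercase English words and non-empty sets.
    """
    inner_sorted = [sorted(words) for words in candidate_sets]
    return sorted(inner_sorted, key=lambda words: words[0])
```

Because both steps depend only on the words themselves, the result is the same regardless of the order in which the sets or their members are supplied, which is the "fixed order" property the method relies on.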
In the embodiments of this specification, the modified text may be submitted to a blockchain for evidence storage. Because data on a blockchain cannot be tampered with, the stored record can serve as trustworthy proof that the user with the digital number is the creator of the modified text. The modified text may of course also be submitted to a storage device with a high security level.
Fig. 2 is a schematic flowchart of a method for determining the creator of a text provided by an embodiment of this specification. The method includes the following steps:
S200: Obtain a text to be determined, and extract a keyword set of the text to be determined.
The text to be determined is a text suspected of plagiarism. In practice, when a creator finds a text that may have been plagiarized from the creator's published modified text, this can be proved using the method shown in Fig. 2.
S202: For each keyword, determine a synonym set corresponding to the keyword, and form a candidate word set from the keyword and its synonym set.
S204: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.
For the implementation of the steps preceding step S206, refer to the description above.
S206: For the i-th candidate word set, determine the rank N_i of the keyword within that candidate word set.
S208: Determine the digital number.
In the embodiments of this specification, the ranks of the keywords in the 1st through S-th candidate word sets can be combined in order into a digital number whose i-th digit is N_i.
S210: Identify the user corresponding to the determined digital number as the creator of the text to be determined.
If the text to be determined is a plagiarized text, it generally does not lose the keywords of the modified text (otherwise it would lose the key information of the text and fail to convey its main point). The user corresponding to the recovered digital number is therefore the creator of the modified text.
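Steps S206 and S208 can be sketched as follows. This is an illustration only: it assumes the suspect text is already tokenized and that rebuilding and sorting the candidate word sets from the suspect text yields the same sets as at embedding time (which requires the synonym relation to be symmetric). The name `recover_number` is hypothetical.

```python
def recover_number(tokens, sorted_sets, num_digits):
    """Read back the digit encoded by each of the first num_digits candidate sets.

    tokens: the text to be determined, already tokenized (assumed).
    sorted_sets: candidate word sets sorted by the first and second rules.
    """
    present = set(tokens)
    digits = []
    for words in sorted_sets[:num_digits]:
        # 1-based rank of the candidate word that actually appears in the text.
        ranks = [rank for rank, word in enumerate(words, start=1) if word in present]
        if len(ranks) != 1:
            raise ValueError(f"cannot read a unique digit from {words}")
        digits.append(ranks[0])
    return digits
```

Applied to the tokens "the vessel crossed the sea" with the candidate sets `[["boat", "ship", "vessel"], ["ocean", "sea"]]`, the sketch reads rank 3 from the first set and rank 2 from the second, giving the digits 3, 2 without any reference to the original text.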
Fig. 3 is a schematic flowchart of another method for synonymously modifying a text provided by an embodiment of this specification. The method includes the following steps:
S300: Obtain a text to be modified, and extract a keyword set of the text to be modified.
S302: Determine a key paragraph set from the text to be modified, where the number of keywords contained in the key paragraph set is greater than a specified number.
S304: For each key paragraph, perform steps S3041-S3044.
S3041: For each keyword in the key paragraph, determine a synonym set corresponding to the keyword, and form a candidate word set from the keyword and its synonym set.
S3042: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.
S3043: Obtain the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, add the N_i-th word in the i-th candidate word set to the hit word set.
S3044: For each keyword in the key paragraph, if the keyword does not belong to the hit word set, replace the keyword in the key paragraph with the hit word that is synonymous with the keyword.
The method shown in Fig. 3 is a variation of the method shown in Fig. 1. In practice, replacing the keywords with synonyms at every keyword position in the text would change the text too much, so synonym replacement of keywords can instead be applied only to the key paragraphs of the text.
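The key paragraph selection in step S302 can be sketched as follows. The sketch follows the per-paragraph reading given for the corresponding detection step (a paragraph qualifies if it contains more keywords than a specified number) and counts distinct keywords, which is an assumption; the name `select_key_paragraphs` is hypothetical.

```python
def select_key_paragraphs(paragraphs, keywords, min_count):
    """Return the paragraphs that contain more than min_count distinct keywords.

    paragraphs: list of token lists, one per paragraph (tokenization assumed).
    keywords: the keyword set extracted from the whole text.
    """
    keyword_set = set(keywords)
    return [p for p in paragraphs if len(keyword_set & set(p)) > min_count]
```

When the digital number has S digits, the threshold would typically be chosen so that every selected paragraph holds enough keywords to carry all S digits.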
Fig. 4 is a schematic flowchart of another method for determining the creator of a text provided by an embodiment of this specification. The method includes the following steps:
S400: Obtain a text to be determined, and extract a keyword set of the text to be determined.
S402: Determine, from the text to be determined, the paragraphs that contain more keywords than a specified number, to obtain a key paragraph set.
S404: For each key paragraph, perform the following steps S4041-S4044.
S4041: For each keyword in the key paragraph, determine a synonym set corresponding to the keyword, and form a candidate word set from the keyword and its synonym set.
S4042: For each candidate word set, sort the words in the candidate word set according to the first sorting rule; and sort the candidate word sets according to the second sorting rule.
S4043: Determine the digital number.
S406: After the steps have been performed for every key paragraph, determine the creator of the text to be determined according to the digital numbers determined from the individual key paragraphs.
The method shown in Fig. 4 is based on the method shown in Fig. 3.
In practice, a plagiarist may delete some key paragraphs of the modified text to obtain the plagiarized text.
If the text to be determined is a plagiarized text that retains only one key paragraph of the modified text, the user corresponding to the digital number determined from that key paragraph can be identified as the creator of the text to be determined.
If the text to be determined is a plagiarized text that retains more than one key paragraph of the modified text, the digital numbers determined from different key paragraphs may be inconsistent. To handle this, in the method shown in Fig. 3, a check digit P can be computed from the digital number and a preset calculation rule, and the P-th word in the (S+1)-th candidate word set is then added to the hit word set. In other words, in addition to the creator mark, a check mark is embedded in the text to be modified, which is used to verify whether the creator mark has been damaged or tampered with. In this case the number of candidate word sets is at least S+1.
The preset calculation rule can be set as needed, as long as it maps the digital number to a check digit in a stable manner.
For example, the preset calculation rule may be to compute [Figure PCTCN2021096771-appb-000001] and take [Figure PCTCN2021096771-appb-000002] as the check digit P.
As another example, the preset calculation rule may be to compute [Figure PCTCN2021096771-appb-000003], convert [Figure PCTCN2021096771-appb-000004] to binary, and take the last bit of the resulting binary number: if the last bit is 0, P is 1; if the last bit is 1, P is 2.
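The formulas in the two examples above are available only as figure references in this text, so they are not reproduced here. The sketch below shows the general shape of a check digit rule using a purely hypothetical stand-in: it takes the digit sum of the digital number, reads the last bit of its binary form, and maps 0 to P = 1 and 1 to P = 2, as in the second example. The digit sum and the names `check_digit` and `hit_words_with_check` are assumptions, not the rule from the source.

```python
def check_digit(digits):
    """Map a digital number to a check digit P in {1, 2}.

    Assumption: the quantity fed into the rule is the digit sum. The source
    gives its own formula, which is not reproduced in this text.
    """
    total = sum(digits)
    last_bit = total & 1  # last bit of the binary representation
    return 1 if last_bit == 0 else 2


def hit_words_with_check(sorted_sets, digits):
    """Hit words for the S digits plus the check digit taken from set S+1."""
    extended = list(digits) + [check_digit(digits)]
    if len(sorted_sets) < len(extended):
        raise ValueError("need at least S+1 candidate word sets")
    return [sorted_sets[i][n - 1] for i, n in enumerate(extended)]
```

Any rule works as long as it is deterministic, so that the same digital number always yields the same check digit at embedding time and at detection time.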
In the method shown in Fig. 4, for each key paragraph in the text to be determined (which may have lost some key paragraphs of the modified text), a check digit Q can be computed from the determined digital number and the preset calculation rule, and it is then judged whether the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph. If so, the determined digital number is added to the number set corresponding to the key paragraph. If not, the determined digital number is corrected to obtain at least one corrected digital number, which is added to the number set corresponding to the key paragraph. Based on the number sets corresponding to the individual key paragraphs, the user corresponding to the digital number that occurs most frequently is identified as the creator of the text to be determined.
Each corrected digital number satisfies the following: the Q recomputed from that digital number is such that the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph. Furthermore, each corrected digital number also satisfies: the change-degree value, which characterizes how much the determined digital number was changed to obtain the corrected digital number, is less than a specified value. The change-degree value is positively correlated with the degree of change. The assumption here is that even if a plagiarist alters the modified text substantially, the plagiarist will keep to its main point as far as possible; therefore, among corrected digital numbers that pass the check, the smaller the change, the more likely the number is the actual creator's digital number.
To better illustrate the scheme, an example follows.
Assume the user's numeric ID has 3 digits (S=3) and each digit takes a value in {1, 2}. Then S+1 (that is, 4) keywords must be extracted from each key paragraph, and at least one synonym must be determined for each keyword.
Assume a key paragraph of the text to be modified (the original text) is:
红海早过了,船在印度洋面上开驶着。但是太阳依然不饶人地迟落早起,侵占去大部分的夜。夜仿佛纸浸了油,变成半透明体;它给太阳拥抱住了,分不出身来,也许是给太阳陶醉了,所以夕阳霞隐褪后的夜色也带着酡红。到红消醉醒,船舱里的睡人也一身腻汗地醒来,洗了澡赶到甲板上吹海风,又是一天开始。这是七月下旬,合中国旧历的三伏,一年最热的时候。在中国热得更比常年利害,事后大家都说是兵戈之象,因为这就是民国二十六年。 (English rendering: The Red Sea was long behind them, and the ship was steaming across the Indian Ocean. But the sun still relentlessly set late and rose early, encroaching on most of the night. The night, like paper soaked in oil, had become translucent; held in the sun's embrace, it could not break free, or perhaps it was intoxicated by the sun, so that even after the sunset glow had faded the night kept a drunken flush. By the time the flush faded and the stupor lifted, the sleepers in the cabins woke too, sticky with sweat, washed, and hurried to the deck for the sea breeze, and another day began. It was late July, the dog days of the old Chinese calendar, the hottest time of the year. In China the heat was even fiercer than in ordinary years; afterwards everyone said it was an omen of war, for this was the twenty-sixth year of the Republic.)
The keywords in the above key paragraph are: 侵占 (encroach on), 赶到 (hurry to), 利害 (fierce), and 兵戈 (war).
Synonyms can be determined for each of these four keywords:
(1) synonyms of 侵占: 侵夺 (seize), 侵蚀 (erode), 侵吞 (swallow up);
(2) synonyms of 赶到: 赶往 (hurry toward), 赶去 (hurry over);
(3) synonym of 利害: 严重 (severe);
(4) synonym of 兵戈: 战乱 (turmoil of war).
This yields the following 4 candidate word sets:
(1) 侵占, 侵夺, 侵蚀, 侵吞;
(2) 赶到, 赶往, 赶去;
(3) 利害, 严重;
(4) 兵戈, 战乱.
Sorting with the first rule and the second rule (both within each set and between sets, in ascending order by first character) gives:
(1) 兵戈 (bīnggē), 战乱 (zhànluàn);
(2) 赶到 (gǎndào), 赶去 (gǎnqù), 赶往 (gǎnwǎng);
(3) 利害 (lìhài), 严重 (yánzhòng);
(4) 侵夺 (qīnduó), 侵蚀 (qīnshí), 侵吞 (qīntūn), 侵占 (qīnzhàn).
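The two sorting rules can be sketched as follows. This is an illustrative sketch only: the `PINYIN` table is a hand-written lookup covering just the example words, and comparing full toneless pinyin strings (rather than only the initial of the first character) is an assumption made so that words sharing a first character, such as 赶到, 赶去 and 赶往, still get a deterministic order.

```python
# Hypothetical pinyin lookup for the example words only (tone marks omitted).
PINYIN = {
    "侵占": "qinzhan", "侵夺": "qinduo", "侵蚀": "qinshi", "侵吞": "qintun",
    "赶到": "gandao", "赶往": "ganwang", "赶去": "ganqu",
    "利害": "lihai", "严重": "yanzhong",
    "兵戈": "bingge", "战乱": "zhanluan",
}


def sort_candidate_sets(candidate_sets):
    # First rule: order the words inside each candidate word set.
    inner = [sorted(words, key=PINYIN.__getitem__) for words in candidate_sets]
    # Second rule: order the sets themselves by their (sorted) first word.
    return sorted(inner, key=lambda words: PINYIN[words[0]])


unsorted_sets = [
    ["侵占", "侵夺", "侵蚀", "侵吞"],
    ["赶到", "赶往", "赶去"],
    ["利害", "严重"],
    ["兵戈", "战乱"],
]

for index, words in enumerate(sort_candidate_sets(unsorted_sets), start=1):
    print(index, words)
```

Because the order depends only on the candidate words and not on which synonym appears in the text, the modification stage and the identification stage arrive at the same ordering.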
Assume the numeric ID of the creator of the text to be modified is 121. For the first three candidate word sets, the 1st word (兵戈), the 2nd word (赶去), and the 1st word (利害) are hit, respectively. Converting [formula image PCTCN2021096771-appb-000005] (the digit sum 1+2+1=4, binary 100) to binary and taking the last bit gives 0, so the check digit P is 1. The 1st word of the fourth candidate word set (侵夺) is therefore also added to the hit word set.
From the above, the hit word set for this key paragraph is: 兵戈, 赶去, 利害, 侵夺. The keywords in the key paragraph are replaced according to this hit word set (a keyword that is itself a hit word needs no replacement). The key paragraph in the resulting modified text is:
红海早过了,船在印度洋面上开驶着。但是太阳依然不饶人地迟落早起,“侵夺”去大部分的夜。夜仿佛纸浸了油,变成半透明体;它给太阳拥抱住了,分不出身来,也许是给太阳陶醉了,所以夕阳霞隐褪后的夜色也带着酡红。到红消醉醒,船舱里的睡人也一身腻汗地醒来,洗了澡“赶去”甲板上吹海风,又是一天开始。这是七月下旬,合中国旧历的三伏,一年最热的时候。在中国热得更比常年“利害”,事后大家都说是“兵戈”之象,因为这就是民国二十六年。 (English rendering: The Red Sea was long behind them, and the ship was steaming across the Indian Ocean. But the sun still relentlessly set late and rose early, "seizing" (侵夺) most of the night. The night, like paper soaked in oil, had become translucent; held in the sun's embrace, it could not break free, or perhaps it was intoxicated by the sun, so that even after the sunset glow had faded the night kept a drunken flush. By the time the flush faded and the stupor lifted, the sleepers in the cabins woke too, sticky with sweat, washed, and "hurried over" (赶去) to the deck for the sea breeze, and another day began. It was late July, the dog days of the old Chinese calendar, the hottest time of the year. In China the heat was even "fiercer" (利害) than in ordinary years; afterwards everyone said it was an omen of "war" (兵戈), for this was the twenty-sixth year of the Republic.)
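The embedding step of this example can be sketched as below. It is a simplified illustration, not the patented implementation: the helper names are hypothetical, the check digit uses the digit-sum parity rule from the example, the sorted candidate sets are hard-coded, and `snippet` is an abridged fragment of the paragraph rather than the full text.

```python
SORTED_SETS = [
    ["兵戈", "战乱"],
    ["赶到", "赶去", "赶往"],
    ["利害", "严重"],
    ["侵夺", "侵蚀", "侵吞", "侵占"],
]


def check_digit(digits):
    # Assumed rule: even digit sum -> 1, odd digit sum -> 2.
    return 1 if sum(digits) % 2 == 0 else 2


def select_hit_words(sorted_sets, digits):
    # Digit N_i picks the N_i-th word of set i; P picks from set S+1.
    picks = list(digits) + [check_digit(digits)]
    return [words[n - 1] for words, n in zip(sorted_sets, picks)]


def embed(text, keywords, sorted_sets, digits):
    hits = select_hit_words(sorted_sets, digits)
    for keyword in keywords:
        for words, hit in zip(sorted_sets, hits):
            # Replace a keyword only if it is not already the hit word.
            if keyword in words and keyword != hit:
                text = text.replace(keyword, hit)
    return text


hits = select_hit_words(SORTED_SETS, [1, 2, 1])
snippet = "侵占去大部分的夜……洗了澡赶到甲板上吹海风"
marked = embed(snippet, ["侵占", "赶到", "利害", "兵戈"], SORTED_SETS, [1, 2, 1])
print(hits)
print(marked)
```

Plain string replacement is enough for this fragment; a real implementation would presumably replace at the positions found during keyword extraction, so that unrelated occurrences of the same characters are left alone.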
In practice, the above operations are performed for every key paragraph of the text to be modified.
When a plagiarist copies the published modified text, the resulting plagiarized text usually retains the gist of the key paragraph but changes the wording, for example as follows:
在印度洋上行使的船已经开过红海。然而太阳仍然不依不饶地迟迟落下、早早升起,“侵夺”了美好的夜。夜呈现半透明,将太阳拥抱住了,太阳也许陶醉了。船舱里的人们醒来后“赶去”甲板上吹海风,开始新的一天。这是中国旧历的三伏,一年最热的时候。中国热得比往年更“利害”,有“兵戈”的感觉,毕竟是民国二十六年。 (English rendering: The ship sailing on the Indian Ocean had already passed the Red Sea. Yet the sun still relentlessly set late and rose early, "seizing" (侵夺) the lovely night. The night turned translucent and embraced the sun; perhaps the sun was intoxicated. When the people in the cabins woke, they "hurried over" (赶去) to the deck for the sea breeze and began a new day. These were the dog days of the old Chinese calendar, the hottest time of the year. China was "fiercer" (利害) in its heat than in previous years, and there was a feeling of "war" (兵戈); after all, it was the twenty-sixth year of the Republic.)
Although the wording of this paragraph of the plagiarized text has changed considerably, the paragraph can still be identified as a key paragraph, and its keywords are determined to be, in order: 侵夺, 赶去, 利害, 兵戈.
Based on the keywords of this key paragraph of the plagiarized text, 4 candidate word sets can be determined and sorted, giving the same sorted candidate word sets as in the modification stage:
(1) 兵戈, 战乱;
(2) 赶到, 赶去, 赶往;
(3) 利害, 严重;
(4) 侵夺, 侵蚀, 侵吞, 侵占.
Here, 兵戈, which appears in the plagiarized text, is the 1st word of the 1st candidate word set, so the 1st digit of the numeric ID is 1; 赶去 is the 2nd word of the 2nd candidate word set, so the 2nd digit is 2; 利害 is the 1st word of the 3rd candidate word set, so the 3rd digit is 1. 侵夺, which also appears in the plagiarized text, is the 1st word of the 4th candidate word set, so the check digit P is 1, which means that the last bit of the binary form of the sum of the three digits should be 0. Indeed, the digits of the numeric ID 121 sum to 4, whose binary form is 100; the last bit is 0, so the check passes.
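The identification step just described can be sketched as follows. The sketch assumes the keywords of the paragraph have already been extracted and the candidate sets already sorted; `decode` and `check_digit` are hypothetical names, and the parity rule is the one used in the example.

```python
SORTED_SETS = [
    ["兵戈", "战乱"],
    ["赶到", "赶去", "赶往"],
    ["利害", "严重"],
    ["侵夺", "侵蚀", "侵吞", "侵占"],
]


def check_digit(digits):
    # Assumed rule: even digit sum -> 1, odd digit sum -> 2.
    return 1 if sum(digits) % 2 == 0 else 2


def decode(sorted_sets, found_keywords, s):
    """Recover the numeric ID and verify it against the embedded check digit.

    Raises StopIteration if some candidate set has no word in the paragraph.
    """
    positions = []
    for words in sorted_sets:
        # 1-based position of the word from this set that the paragraph uses.
        pos = next(i for i, w in enumerate(words, start=1) if w in found_keywords)
        positions.append(pos)
    digits, observed_check = positions[:s], positions[s]
    return digits, observed_check, check_digit(digits) == observed_check


# First plagiarized paragraph: all four hit words survived.
print(decode(SORTED_SETS, {"侵夺", "赶去", "利害", "兵戈"}, 3))
# A variant in which 利害 was replaced by its synonym 严重.
print(decode(SORTED_SETS, {"侵夺", "赶去", "严重", "兵戈"}, 3))
```

The second call illustrates how the check mark exposes a keyword that the plagiarist changed: the recovered digits no longer agree with the embedded check digit.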
In practice, however, the plagiarized text may delete or change some of the keywords of the modified text, for example as follows:
在印度洋上行使的船已经开过红海。然而太阳仍然不依不饶地迟迟落下、早早升起,“侵夺”了美好的夜。夜呈现半透明,将太阳拥抱住了,太阳也许陶醉了。船舱里的人们醒来后“赶去”甲板上吹海风,开始新的一天。这是中国旧历的三伏,一年最热的时候。中国热得比往年更“严重”,有“兵戈”的感觉,毕竟是民国二十六年。 (English rendering: The ship sailing on the Indian Ocean had already passed the Red Sea. Yet the sun still relentlessly set late and rose early, "seizing" (侵夺) the lovely night. The night turned translucent and embraced the sun; perhaps the sun was intoxicated. When the people in the cabins woke, they "hurried over" (赶去) to the deck for the sea breeze and began a new day. These were the dog days of the old Chinese calendar, the hottest time of the year. China's heat was more "severe" (严重) than in previous years, and there was a feeling of "war" (兵戈); after all, it was the twenty-sixth year of the Republic.)
From this key paragraph of the plagiarized text, the recovered numeric ID would be 122. Its three digits sum to 5, whose binary form is 101; the last bit is 1, so the corresponding check digit should be 2. The check digit determined from the plagiarized text, however, is 1, so the check fails.
In fact, a plagiarized text often contains more than one key paragraph. The numeric IDs determined from the individual key paragraphs may not all agree, and the numeric IDs of some key paragraphs may pass the check while those of others fail.
In that case, taking this key paragraph of the plagiarized text as an example, since the determined numeric ID fails the check, it is corrected with the smallest possible degree of change so that it passes the check. Clearly, correcting 122 to 121 passes the check, so the corrected numeric ID 121 is added to the number set corresponding to this key paragraph.
Across the whole plagiarized text, the number set of every key paragraph contains at least one numeric ID that passes the check. The numeric ID that occurs most frequently across the number sets of all key paragraphs is then very likely the actual creator's numeric ID, and the user corresponding to it can be determined to be the creator.
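The correction and voting steps can be sketched as below. Two points are assumptions rather than statements from the description: the degree of change is measured as the number of differing digits (Hamming distance), and the specified threshold allows a single changed digit. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def check_digit(digits):
    # Assumed rule: even digit sum -> 1, odd digit sum -> 2.
    return 1 if sum(digits) % 2 == 0 else 2


def corrections(digits, observed_check, values=(1, 2), max_changes=1):
    """Numeric IDs within max_changes digit edits that pass the check."""
    result = []
    for cand in product(values, repeat=len(digits)):
        changes = sum(a != b for a, b in zip(cand, digits))
        if 0 < changes <= max_changes and check_digit(cand) == observed_check:
            result.append(cand)
    return result


def identify(paragraph_results):
    """paragraph_results: one (digits, observed_check) pair per key paragraph."""
    votes = Counter()
    for digits, observed in paragraph_results:
        digits = tuple(digits)
        if check_digit(digits) == observed:
            votes[digits] += 1  # check passed: keep the ID as determined
        else:
            votes.update(corrections(digits, observed))  # add corrected IDs
    return votes.most_common(1)[0][0]


print(corrections((1, 2, 2), 1))
print(identify([((1, 2, 1), 1), ((1, 2, 2), 1), ((1, 2, 1), 1)]))
```

Under these assumptions 122 has more than one single-digit correction that passes the check (121 is one of them), which is why the final decision is left to the frequency count across all key paragraphs rather than to any one paragraph.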
FIG. 5 is a schematic structural diagram of an apparatus for synonymously modifying text provided by an embodiment of this specification. The apparatus includes: an acquisition module 501, which acquires the text to be modified and extracts a keyword set from it; a determination module 502, which, for each keyword, determines the synonym set corresponding to the keyword and combines the keyword and its synonym set into a candidate word set; a sorting module 503, which, for each candidate word set, sorts the words in the set according to a first sorting rule, and sorts the candidate word sets according to a second sorting rule; an adding module 504, which obtains the numeric ID of the user who created the text to be modified and, according to the i-th digit N_i of the numeric ID, adds the N_i-th word of the i-th candidate word set to a hit word set, where i = 1, 2, …, S and S is the number of digits of the numeric ID; and a modification module 505, which, for each keyword that does not belong to the hit word set, replaces that keyword in the text to be modified with the hit word synonymous with it.
If the text to be modified is Chinese text, the sorting module 503 sorts the words in each candidate word set in alphabetical order of the pinyin initial of each word's first character.
If the text to be modified is Chinese text, the sorting module 503 sorts the candidate word sets in alphabetical order of the pinyin initial of the first character of the first word in each set.
The apparatus further includes an attestation module 506, which submits the modified text to a blockchain for evidence storage.
FIG. 6 is a schematic structural diagram of an apparatus for determining the creator of a text provided by an embodiment of this specification. The apparatus includes: an acquisition module 601, which acquires the text to be determined and extracts a keyword set from it; a first determination module 602, which, for each keyword, determines the synonym set corresponding to the keyword and combines the keyword and its synonym set into a candidate word set; a sorting module 603, which, for each candidate word set, sorts the words in the set according to the first sorting rule, and sorts the candidate word sets according to the second sorting rule; a second determination module 604, which, for the i-th candidate word set, determines the position N_i of the keyword within that set, where i = 1, 2, …, S and S is the number of digits of the numeric ID;
a third determination module 605, which determines the numeric ID, the i-th digit of which is N_i; and a fourth determination module 606, which identifies the user corresponding to the determined numeric ID as the creator of the text to be determined.
FIG. 7 is a schematic structural diagram of an apparatus for synonymously modifying text provided by an embodiment of this specification. The apparatus includes: an acquisition module 701, which acquires the text to be modified and extracts a keyword set from it; a determination module 702, which determines a set of key paragraphs from the text to be modified, each key paragraph containing more than a specified number of keywords; and an execution module 703, which performs the following steps for each key paragraph: for each keyword in the key paragraph, determining the synonym set corresponding to the keyword and combining the keyword and its synonym set into a candidate word set; for each candidate word set, sorting the words in the set according to the first sorting rule, and sorting the candidate word sets according to the second sorting rule; obtaining the numeric ID of the user who created the text to be modified and, according to the i-th digit N_i of the numeric ID, adding the N_i-th word of the i-th candidate word set to a hit word set, where i = 1, 2, …, S and S is the number of digits of the numeric ID; and, for each keyword in the key paragraph that does not belong to the hit word set, replacing that keyword in the key paragraph with the hit word synonymous with it.
The execution module 703 also calculates a check digit P from the numeric ID according to the preset calculation rule, and adds the P-th word of the (S+1)-th candidate word set to the hit word set.
FIG. 8 is a schematic structural diagram of an apparatus for determining the creator of a text provided by an embodiment of this specification. The apparatus includes: an acquisition module 801, which acquires the text to be determined and extracts a keyword set from it; a first determination module 802, which determines, from the text to be determined, the paragraphs that contain more than a specified number of keywords, to obtain a set of key paragraphs; an execution module 803, which performs the following steps for each key paragraph: for each keyword in the key paragraph, determining the synonym set corresponding to the keyword and combining the keyword and its synonym set into a candidate word set; for each candidate word set, sorting the words in the set according to the first sorting rule, and sorting the candidate word sets according to the second sorting rule; and determining a numeric ID whose i-th digit is N_i, where i = 1, 2, …, S and S is the number of digits of the numeric ID; and a second determination module 804, which, after the steps have been performed for every key paragraph, determines the creator of the text to be determined according to the numeric IDs determined from the key paragraphs.
The second determination module 804, for each key paragraph, calculates a check digit Q from the determined numeric ID according to the preset calculation rule and judges whether the Q-th word of the (S+1)-th candidate word set is a keyword in the key paragraph. If so, it adds the determined numeric ID to the number set corresponding to the key paragraph. If not, it corrects the determined numeric ID to obtain at least one corrected numeric ID and adds it to the number set corresponding to the key paragraph, where each corrected numeric ID satisfies: the Q recalculated from it is such that the Q-th word of the (S+1)-th candidate word set is a keyword in the key paragraph. Across the number sets of all key paragraphs, the module determines the user corresponding to the most frequently occurring numeric ID to be the creator of the text to be determined.
An embodiment of this specification further provides a computer device that includes at least a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method performed by the client device or the server device in this specification.
FIG. 9 is a more detailed schematic diagram of the hardware structure of a computing device provided by an embodiment of this specification. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively connected to one another inside the device via the bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided by the embodiments of this specification.
The memory 1020 may be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
The input/output interface 1030 is used to connect input/output modules for information input and output. The input/output modules may be built into the device as components (not shown in the figure) or attached to the device externally to provide the corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, and various sensors; output devices may include a display, a speaker, a vibrator, and indicator lights.
The communication interface 1040 is used to connect a communication module (not shown in the figure) so that the device can communicate with other devices. The communication module may communicate over a wired connection (for example USB or a network cable) or wirelessly (for example a mobile network, Wi-Fi, or Bluetooth).
The bus 1050 includes a path that carries information between the components of the device (for example the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).
It should be noted that although only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050 are shown for the above device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the device may include only the components necessary to implement the solutions of the embodiments of this specification, and need not include all the components shown in the figure.
An embodiment of this specification further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method performed by the client device or the server device in this specification.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, and any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
From the above description of the implementations, those skilled in the art will clearly understand that the embodiments of this specification can be implemented by software plus a necessary general-purpose hardware platform. On this understanding, the technical solutions of the embodiments of this specification, in essence or in the part that contributes to the prior art, may be embodied as a software product. The computer software product may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of this specification or in parts of them.
The systems, methods, modules, or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementing device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments in this specification are described progressively; for parts that are the same or similar across embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiments are essentially similar to the method embodiments, they are described briefly, and the relevant parts of the method embodiments may be consulted. The apparatus embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and when the solutions of the embodiments of this specification are implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware. Some or all of the modules may also be selected as actually needed to achieve the purpose of a given embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The above are only specific implementations of the embodiments of this specification. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principles of the embodiments of this specification, and such improvements and refinements shall also fall within the scope of protection of the embodiments of this specification.

Claims (19)

  1. A method for synonymously modifying text, comprising:
    acquiring a text to be modified, and extracting a keyword set of the text to be modified;
    for each keyword, determining a synonym set corresponding to the keyword, and combining the keyword and the corresponding synonym set into a candidate word set;
    for each candidate word set, sorting the words in the candidate word set according to a first sorting rule; and sorting the candidate word sets according to a second sorting rule;
    obtaining a numeric ID of the user who created the text to be modified; and, according to the i-th digit N_i of the numeric ID, adding the N_i-th word of the i-th candidate word set to a hit word set, where i = 1, 2, …, S and S is the number of digits of the numeric ID;
    for each keyword, if the keyword does not belong to the hit word set, replacing the keyword in the text to be modified with the hit word synonymous with the keyword.
  2. The method according to claim 1, wherein sorting the words in the candidate word set according to the first sorting rule comprises:
    if the text to be modified is Chinese text, sorting the words in the candidate word set in alphabetical order of the pinyin initial of the first character of each word in the candidate word set.
  3. The method according to claim 1, wherein sorting the candidate word sets according to the second sorting rule comprises:
    if the text to be modified is Chinese text, sorting the candidate word sets in alphabetical order of the pinyin initial of the first character of the first word in each candidate word set.
  4. The method according to claim 1, further comprising:
    submitting the modified text to a blockchain for evidence storage.
  5. A method for determining the creator of a text, including:
    Obtain the text to be determined, and extract the keyword set of the text to be determined;
    For each keyword, determine the synonym set corresponding to the keyword, and combine the keyword with its synonym set to form a candidate word set;
    For each candidate word set, sort the words in the candidate word set according to a first sorting rule; and sort the candidate word sets according to a second sorting rule;
    For the i-th candidate word set, determine the position N_i of the keyword within that candidate word set; i = (1, 2, …, S), where S is the number of digits in the digital number;
    Determine a digital number, wherein the i-th digit of the digital number is N_i;
    Identify the user corresponding to the determined digital number as the creator of the text to be determined.
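The identification steps of claim 5 reverse the embedding: the position of the word actually used from each sorted candidate word set yields one digit. A minimal sketch follows, assuming 0-based positions, pre-sorted candidate word sets, and simple substring matching; the function name `recover_digital_number` and the `users` registry are hypothetical.

```python
from typing import Dict, List


def recover_digital_number(text: str, candidate_sets: List[List[str]], num_digits: int) -> str:
    """Read digit i from the position of the word that `text` uses from set i."""
    digits = []
    for words in candidate_sets[:num_digits]:
        positions = [idx for idx, word in enumerate(words) if word in text]
        if len(positions) != 1:
            # Zero or several words of the same set appear: the digit is ambiguous.
            raise ValueError(f"cannot read a unique digit from {words}")
        digits.append(str(positions[0]))
    return "".join(digits)


sets = [["big", "huge", "large"], ["fast", "quick", "rapid"]]
users: Dict[str, str] = {"21": "alice"}  # hypothetical registry of digital numbers

number = recover_digital_number("a large dog runs quick", sets, 2)
print(number, users.get(number))
```

The sketch raises an error when a digit cannot be read unambiguously; how such cases are handled in practice is not specified by this claim.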
  6. A method for synonymously modifying a text, including:
    Obtain the text to be modified, and extract the keyword set of the text to be modified;
    Determine a key paragraph set from the text to be modified; the number of keywords contained in the key paragraph set is greater than a specified number;
    For each key paragraph, perform the following steps:
    For each keyword in the key paragraph, determine the synonym set corresponding to the keyword, and combine the keyword with its synonym set to form a candidate word set;
    For each candidate word set, sort the words in the candidate word set according to a first sorting rule; and sort the candidate word sets according to a second sorting rule;
    Obtain the digital number of the user who created the text to be modified; and, according to the i-th digit N_i of the digital number, add the N_i-th word in the i-th candidate word set to a hit word set; i = (1, 2, …, S), where S is the number of digits in the digital number;
    For each keyword in the key paragraph, if the keyword does not belong to the hit word set, replace the keyword in the key paragraph with the hit word that is synonymous with the keyword.
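Claim 6 differs from claim 1 mainly in that the embedding is repeated per key paragraph. The paragraph selection step might look like the sketch below, which assumes paragraphs are separated by blank lines and that a keyword counts once per paragraph if it appears at all; both choices, and the name `select_key_paragraphs`, are illustrative assumptions.

```python
from typing import List


def select_key_paragraphs(text: str, keywords: List[str], specified_number: int) -> List[str]:
    """Return the paragraphs that contain more than `specified_number` keywords."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    key_paragraphs = []
    for paragraph in paragraphs:
        # Count distinct keywords present in this paragraph (substring match).
        count = sum(1 for keyword in keywords if keyword in paragraph)
        if count > specified_number:
            key_paragraphs.append(paragraph)
    return key_paragraphs


document = "big dog runs fast\n\nsmall cat"
print(select_key_paragraphs(document, ["big", "fast", "cat"], 1))
```

Embedding the same digital number independently in several paragraphs gives the identifying side redundant copies to compare.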
  7. The method according to claim 6, wherein, for each key paragraph, the following steps are further performed:
    Calculate a check digit P according to the digital number and a preset calculation rule;
    Add the P-th word in the (S+1)-th candidate word set to the hit word set.
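The claim leaves the "preset calculation rule" open. As a purely illustrative assumption, the sketch below uses the sum of the digits modulo 10 as the check digit P and selects the corresponding word from the (S+1)-th candidate word set, again with 0-based positions. The names `check_digit` and `check_word` are hypothetical.

```python
from typing import List


def check_digit(digital_number: str) -> int:
    """Assumed rule for illustration only: sum of the digits modulo 10."""
    return sum(int(ch) for ch in digital_number) % 10


def check_word(digital_number: str, candidate_sets: List[List[str]]) -> str:
    """Pick the P-th word of the (S+1)-th candidate set, S being the digit count."""
    s = len(digital_number)
    if len(candidate_sets) <= s:
        raise ValueError("an (S+1)-th candidate word set is required")
    extra_set = candidate_sets[s]
    p = check_digit(digital_number)
    if p >= len(extra_set):
        raise ValueError("the (S+1)-th candidate set is too small for this check digit")
    return extra_set[p]


sets = [
    ["big", "huge", "large"],
    ["fast", "quick", "rapid"],
    ["begin", "commence", "initiate", "start"],
]
print(check_digit("21"), check_word("21", sets))
```

Any rule that both sides agree on would serve; the modulo sum is chosen here only because it is short to state.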
  8. A method for determining the creator of a text, including:
    Obtain the text to be determined, and extract the keyword set of the text to be determined;
    Determine, from the text to be determined, the paragraphs that contain more keywords than a specified number, to obtain a key paragraph set;
    For each key paragraph, perform the following steps:
    For each keyword in the key paragraph, determine the synonym set corresponding to the keyword, and combine the keyword with its synonym set to form a candidate word set;
    For each candidate word set, sort the words in the candidate word set according to a first sorting rule; and sort the candidate word sets according to a second sorting rule;
    Determine a digital number, wherein the i-th digit of the digital number is N_i; i = (1, 2, …, S), where S is the number of digits in the digital number;
    After the steps have been performed for every key paragraph, determine the creator of the text to be determined according to the digital numbers determined from the key paragraphs.
  9. The method according to claim 8, wherein determining the creator of the text to be determined according to the digital numbers determined from the key paragraphs specifically includes:
    For each key paragraph, calculate a check digit Q according to the determined digital number and a preset calculation rule;
    Determine whether the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph;
    If so, add the determined digital number to the number set corresponding to the key paragraph;
    If not, correct the determined digital number to obtain at least one corrected digital number and add it to the number set corresponding to the key paragraph; for each corrected digital number, the check digit Q recalculated from that digital number satisfies: the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph;
    According to the number sets corresponding to the key paragraphs, determine the user corresponding to the digital number that occurs most frequently as the creator of the text to be determined.
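The verification, correction, and voting steps of claim 9 can be sketched as below. Several points are assumptions made only for illustration: the check rule is the sum of digits modulo 10, the observed check position is passed in directly as `observed_check`, and "correction" is limited to changing a single digit. The claim itself does not prescribe any of these.

```python
from collections import Counter
from typing import List, Set


def check_digit(digital_number: str) -> int:
    """Assumed rule for illustration only: sum of the digits modulo 10."""
    return sum(int(ch) for ch in digital_number) % 10


def number_set_for_paragraph(decoded: str, observed_check: int) -> Set[str]:
    """Return the decoded number if it passes the check, else single-digit corrections."""
    if check_digit(decoded) == observed_check:
        return {decoded}
    corrected: Set[str] = set()
    for pos in range(len(decoded)):
        for digit in "0123456789":
            trial = decoded[:pos] + digit + decoded[pos + 1:]
            if check_digit(trial) == observed_check:
                corrected.add(trial)
    return corrected


def most_frequent_number(number_sets: List[Set[str]]) -> str:
    """Vote across the key paragraphs; at least one number must be present."""
    counts = Counter(number for numbers in number_sets for number in numbers)
    return counts.most_common(1)[0][0]


# Three key paragraphs; the second was tampered with and decodes to "51".
number_sets = [
    number_set_for_paragraph("21", 3),
    number_set_for_paragraph("51", 3),
    number_set_for_paragraph("21", 3),
]
print(most_frequent_number(number_sets))
```

In this example the tampered paragraph yields two plausible corrections, and the vote across paragraphs settles on the number that the untouched paragraphs agree on.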
  10. A device for synonymously modifying a text, including:
    An acquisition module, which obtains the text to be modified and extracts the keyword set of the text to be modified;
    A determining module, which, for each keyword, determines the synonym set corresponding to the keyword and combines the keyword with its synonym set to form a candidate word set;
    A sorting module, which, for each candidate word set, sorts the words in the candidate word set according to a first sorting rule, and sorts the candidate word sets according to a second sorting rule;
    An adding module, which obtains the digital number of the user who created the text to be modified, and, according to the i-th digit N_i of the digital number, adds the N_i-th word in the i-th candidate word set to a hit word set; i = (1, 2, …, S), where S is the number of digits in the digital number;
    A modification module, which, for each keyword, if the keyword does not belong to the hit word set, replaces the keyword in the text to be modified with the hit word that is synonymous with the keyword.
  11. The device according to claim 10, wherein the sorting module, if the text to be modified is Chinese-character text, sorts the words in the candidate word set in alphabetical order of pinyin initial letter, taking the first character of each word as the reference.
  12. The device according to claim 10, wherein the sorting module, if the text to be modified is Chinese-character text, sorts the candidate word sets in alphabetical order of pinyin initial letter, taking the first character of the first word in each candidate word set as the reference.
  13. The device according to claim 10, further comprising:
    An evidence storage module, which submits the modified text to a blockchain for evidence storage.
  14. A device for determining the creator of a text, including:
    An acquisition module, which obtains the text to be determined and extracts the keyword set of the text to be determined;
    A first determining module, which, for each keyword, determines the synonym set corresponding to the keyword and combines the keyword with its synonym set to form a candidate word set;
    A sorting module, which, for each candidate word set, sorts the words in the candidate word set according to a first sorting rule, and sorts the candidate word sets according to a second sorting rule;
    A second determining module, which, for the i-th candidate word set, determines the position N_i of the keyword within that candidate word set; i = (1, 2, …, S), where S is the number of digits in the digital number;
    A third determining module, which determines a digital number, wherein the i-th digit of the digital number is N_i;
    A fourth determining module, which identifies the user corresponding to the determined digital number as the creator of the text to be determined.
  15. A device for synonymously modifying a text, including:
    An acquisition module, which obtains the text to be modified and extracts the keyword set of the text to be modified;
    A determining module, which determines a key paragraph set from the text to be modified; the number of keywords contained in the key paragraph set is greater than a specified number;
    An execution module, which, for each key paragraph, performs the following steps: for each keyword in the key paragraph, determine the synonym set corresponding to the keyword, and combine the keyword with its synonym set to form a candidate word set; for each candidate word set, sort the words in the candidate word set according to a first sorting rule, and sort the candidate word sets according to a second sorting rule; obtain the digital number of the user who created the text to be modified, and, according to the i-th digit N_i of the digital number, add the N_i-th word in the i-th candidate word set to a hit word set, where i = (1, 2, …, S) and S is the number of digits in the digital number; for each keyword in the key paragraph, if the keyword does not belong to the hit word set, replace the keyword in the key paragraph with the hit word that is synonymous with the keyword.
  16. The device according to claim 15, wherein the execution module calculates a check digit P according to the digital number and a preset calculation rule;
    and adds the P-th word in the (S+1)-th candidate word set to the hit word set.
  17. A device for determining the creator of a text, including:
    An acquisition module, which obtains the text to be determined and extracts the keyword set of the text to be determined;
    A first determining module, which determines, from the text to be determined, the paragraphs that contain more keywords than a specified number, to obtain a key paragraph set;
    An execution module, which, for each key paragraph, performs the following steps: for each keyword in the key paragraph, determine the synonym set corresponding to the keyword, and combine the keyword with its synonym set to form a candidate word set; for each candidate word set, sort the words in the candidate word set according to a first sorting rule, and sort the candidate word sets according to a second sorting rule; determine a digital number, wherein the i-th digit of the digital number is N_i, i = (1, 2, …, S), and S is the number of digits in the digital number;
    A second determining module, which, after the steps have been performed for every key paragraph, determines the creator of the text to be determined according to the digital numbers determined from the key paragraphs.
  18. The device according to claim 17, wherein the second determining module, for each key paragraph, calculates a check digit Q according to the determined digital number and a preset calculation rule; determines whether the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph; if so, adds the determined digital number to the number set corresponding to the key paragraph; if not, corrects the determined digital number to obtain at least one corrected digital number and adds it to the number set corresponding to the key paragraph, wherein, for each corrected digital number, the check digit Q recalculated from that digital number satisfies: the Q-th word in the (S+1)-th candidate word set is a keyword in the key paragraph; and, according to the number sets corresponding to the key paragraphs, determines the user corresponding to the digital number that occurs most frequently as the creator of the text to be determined.
  19. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1 to 9.
PCT/CN2021/096771 2020-05-29 2021-05-28 Method for synonym editing and determining creator of text WO2021239114A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010478444.1 2020-05-29
CN202010478444.1A CN111381191B (en) 2020-05-29 2020-05-29 Method for synonymy modifying text and determining text creator

Publications (1)

Publication Number Publication Date
WO2021239114A1 (en) 2021-12-02

Family

ID=71220415

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096771 WO2021239114A1 (en) 2020-05-29 2021-05-28 Method for synonym editing and determining creator of text

Country Status (2)

Country Link
CN (1) CN111381191B (en)
WO (1) WO2021239114A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111381191B (en) * 2020-05-29 2020-09-01 支付宝(杭州)信息技术有限公司 Method for synonymy modifying text and determining text creator

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901325A (en) * 2010-07-21 2010-12-01 赵步 Copyright protection method
CN102650986A (en) * 2011-02-27 2012-08-29 孙星明 Synonym expansion method and device both used for text duplication detection
KR101663454B1 (en) * 2016-08-03 2016-10-07 주식회사 비욘드테크 Apparatus of sentence similarity calculation using keyword weight and method thereof
CN110990532A (en) * 2019-11-28 2020-04-10 中国银行股份有限公司 Method and device for processing text
CN111381191A (en) * 2020-05-29 2020-07-07 支付宝(杭州)信息技术有限公司 Method for synonymy modifying text and determining text creator

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833579B (en) * 2010-05-11 2012-09-05 同方知网(北京)技术有限公司 Method and system for automatically detecting academic misconduct literature
CN206451175U (en) * 2016-08-31 2017-08-29 青海民族大学 A kind of Tibetan language paper copy detection system based on Tibetan language sentence level
CN109446301A (en) * 2018-09-18 2019-03-08 沈文策 A kind of lookup method and device of similar article
CN109783806B (en) * 2018-12-21 2023-05-02 众安信息技术服务有限公司 Text matching method utilizing semantic parsing structure
CN110134925A (en) * 2019-05-15 2019-08-16 北京信息科技大学 A kind of Chinese patent text similarity calculating method
CN110321925B (en) * 2019-05-24 2022-11-18 中国工程物理研究院计算机应用研究所 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN110489745B (en) * 2019-07-31 2020-12-22 北京大学 Paper text similarity detection method based on citation network

Also Published As

Publication number Publication date
CN111381191A (en) 2020-07-07
CN111381191B (en) 2020-09-01

Similar Documents

Publication Publication Date Title
Fu et al. Toward efficient multi-keyword fuzzy search over encrypted outsourced data with accuracy improvement
US9594806B1 (en) Detecting name-triggering queries
US10430610B2 (en) Adaptive data obfuscation
US20190019058A1 (en) System and method for detecting homoglyph attacks with a siamese convolutional neural network
TWI659358B (en) Method and device for calculating string distance
CN108170650B (en) Text comparison method and text comparison device
WO2014201047A1 (en) Fast, scalable dictionary construction and maintenance
CN110162752B (en) Article judging and re-processing method and device and electronic equipment
CN109165382A (en) A kind of similar defect report recommended method that weighted words vector sum latent semantic analysis combines
WO2021239114A1 (en) Method for synonym editing and determining creator of text
CN104281275A (en) Method and device for inputting English
JP2019020794A (en) Document management device, document management system, and program
CN111832264A (en) PDF file based signature position determination method, device and equipment
CN115314236A (en) System and method for detecting phishing domains in a Domain Name System (DNS) record set
CN107329964A (en) A kind of text handling method and device
CN110427496B (en) Knowledge graph expansion method and device for text processing
CN105354506B (en) The method and apparatus of hidden file
CN111190235A (en) Block chain information receiving and recording platform
CN112182448A (en) Page information processing method, device and equipment
US20220092157A1 (en) Digital watermarking for textual data
CN106326209B (en) Tibetan character error detection method and system and Tibetan character string error detection method and system
CN115310436A (en) Document outline extraction method and device, electronic equipment and storage medium
JP2019020795A (en) Document management device, document management system, and program
US8548800B2 (en) Substitution, insertion, and deletion (SID) distance and voice impressions detector (VID) distance
US20240127381A1 (en) Machine-learning based techniques for predicting trademark similarity

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21813299; Country of ref document: EP; Kind code: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21813299; Country of ref document: EP; Kind code: A1)