WO2021017440A1 - Blockchain-based text similarity detection method and apparatus, and electronic device - Google Patents

Blockchain-based text similarity detection method and apparatus, and electronic device

Info

Publication number
WO2021017440A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
target
similarity
unit
smart contract
Prior art date
Application number
PCT/CN2020/072148
Other languages
English (en)
French (fr)
Inventor
黄凯明
杨磊
Original Assignee
创新先进技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 创新先进技术有限公司
Priority to US16/782,938 (US10909317B2)
Priority to US17/164,741 (US11100284B2)
Publication of WO2021017440A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Definitions

  • One or more embodiments of this specification relate to the field of blockchain technology, and in particular to a method and device for detecting text similarity based on blockchain, and electronic equipment.
  • Blockchain technology, also known as distributed ledger technology, is an emerging technology in which several computing devices participate in "bookkeeping" and jointly maintain a complete distributed database. Because blockchain technology is decentralized, open, and transparent, lets every computing device participate in database records, and allows data to be synchronized rapidly between computing devices, it has been widely applied in many fields.
  • One or more embodiments of this specification provide a blockchain-based text similarity detection method, device, computer equipment, and computer-readable storage medium.
  • One or more embodiments of this specification provide a blockchain-based text similarity detection method, applied to a blockchain network in which a smart contract for detecting similarity to a target original text is deployed.
  • the method is executed by the node device of the blockchain network and includes:
  • the first text includes at least one first text unit with a preset length;
  • the smart contract stores a number of target text vectors, each generated from a target text unit of the preset length included in the target original text;
  • executing the text similarity detection logic declared by the smart contract includes:
  • the similarity detection result includes the similarity detection result of the at least one first text unit.
  • the smart contract generates a target text vector index for the plurality of target text vectors.
  • the method further includes:
  • the similarity between the second text and the target original text is calculated.
  • calculating the similarity between the second text and the target original text based on the second text unit whose similarity detection result is similar includes:
  • when the similarity between the second text and the target original text is greater than a preset similarity threshold, the method further includes:
  • the deposit certificate transaction including the second text and source information of the second text.
  • This specification also provides a blockchain-based text similarity detection device, applied to a blockchain network in which a smart contract for detecting similarity to a target original text is deployed; the device is applied on the node device side of the blockchain network and includes:
  • a receiving unit, which receives a first transaction that includes a first text, the first text being a text whose similarity to the target original text is to be detected;
  • an execution unit, which calls the smart contract, executes the text similarity detection logic declared by the smart contract, and obtains the similarity detection result between the first text and the target original text.
  • the first text includes at least one first text unit with a preset length;
  • the smart contract stores a number of target text vectors, each generated from a target text unit of the preset length included in the target original text;
  • executing the text similarity detection logic declared by the smart contract includes:
  • the similarity detection result includes the similarity detection result of the at least one first text unit.
  • the smart contract generates a target text vector index for the plurality of target text vectors.
  • the device further includes:
  • a dividing unit, which divides the second text into a plurality of second text units of the preset length;
  • a sending unit, which sends a second transaction including the second text units to the blockchain to call the smart contract, execute the text similarity detection logic declared by the smart contract, and obtain the similarity detection result between each second text unit and the target original text;
  • a calculation unit, which calculates the similarity between the second text and the target original text based on the second text units whose similarity detection result is similar.
  • the calculation unit is further configured to:
  • when the similarity between the second text and the target original text is greater than a preset similarity threshold, the sending unit is further configured to:
  • the deposit certificate transaction including the second text and source information of the second text.
  • This specification also provides a computer device, including a memory and a processor; the memory stores a computer program that can be run by the processor; when the processor runs the computer program, it executes the blockchain-based text similarity detection method described in the above embodiments.
  • This specification also provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, it executes the blockchain-based text similarity detection method described in the above embodiments.
  • Fig. 1 is a schematic diagram of creating a smart contract provided by an exemplary embodiment
  • Fig. 2 is a schematic diagram of invoking a smart contract provided by an exemplary embodiment
  • Fig. 3 is a schematic diagram of creating a smart contract and invoking a smart contract provided by an exemplary embodiment
  • FIG. 4 is a schematic flowchart of a method for detecting text similarity based on blockchain according to an exemplary embodiment
  • Fig. 5 is a schematic diagram of a text similarity detection device based on blockchain provided by an exemplary embodiment
  • Fig. 6 is a hardware structure diagram for running the embodiment of the blockchain-based text similarity detection device provided in this specification.
  • the steps of the corresponding method may not be executed in the order shown and described in this specification.
  • the method includes more or fewer steps than described in this specification.
  • a single step described in this specification may be decomposed into multiple steps in other embodiments, and multiple steps described in this specification may also be combined into a single step in other embodiments.
  • the detected infringing text is usually documented or electronically deposited by a notary office.
  • the time window from detection to evidence deposit is long, which gives a possible infringer the opportunity to deny the infringement or destroy the evidence.
  • This specification provides a blockchain-based text similarity detection method, applied to a blockchain network in which a smart contract for detecting similarity to a target original text is deployed.
  • The blockchain network described in one or more embodiments of this specification can specifically refer to a P2P network system with a distributed data storage structure, reached by the node devices through a consensus mechanism.
  • the data in the blockchain is distributed across "blocks" that are linked in time; each block can contain the data digest of the previous block, and, depending on the specific consensus mechanism (such as POW, POS, DPOS, or PBFT), a full backup of the data is achieved on all or some of the nodes.
  • Real data generated in the physical world can be constructed into a standard transaction format supported by the blockchain and then published to the blockchain; the node devices in the blockchain perform consensus processing on the received transactions, and after consensus is reached, the node device acting as the accounting node packages the transaction into a block for persistent storage in the blockchain.
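The linking of blocks through the digest of the previous block can be illustrated with a minimal sketch. The names `Block`, `append_block`, and `chain_is_consistent`, the use of SHA-256, and the JSON encoding are all assumptions made for illustration; the text does not specify any of them, and consensus is not modelled here.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A toy block: a list of transactions plus the digest of the previous block."""
    index: int
    prev_digest: str
    transactions: List[dict] = field(default_factory=list)

    def digest(self) -> str:
        # Hash a canonical JSON encoding of the block contents (assumed format).
        payload = json.dumps(
            {"index": self.index, "prev": self.prev_digest, "txs": self.transactions},
            sort_keys=True,
        ).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_block(chain: List[Block], transactions: List[dict]) -> Block:
    """Package transactions into a new block linked to the current chain tip."""
    prev = chain[-1].digest() if chain else "0" * 64
    block = Block(index=len(chain), prev_digest=prev, transactions=transactions)
    chain.append(block)
    return block


def chain_is_consistent(chain: List[Block]) -> bool:
    """Check that every block stores the digest of its predecessor."""
    return all(
        chain[i].prev_digest == chain[i - 1].digest() for i in range(1, len(chain))
    )
```

Because each block stores its predecessor's digest, changing a transaction in an earlier block makes the stored digest in the next block stop matching, which is what the consistency check detects.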
  • the consensus algorithms supported in the blockchain can include:
  • The first type of consensus algorithm is one in which node devices must compete for the accounting right in each accounting round; for example, Proof of Work (POW), Proof of Stake (POS), and Delegated Proof of Stake (DPOS);
  • The second type of consensus algorithm pre-selects the accounting node for each accounting round (without competing for accounting rights); for example, Practical Byzantine Fault Tolerance (PBFT).
  • all node devices that compete for the right to bookkeeping can execute the transaction after receiving the transaction.
  • one node device may win this round of contention for the right to bookkeeping and become the bookkeeping node.
  • the accounting node can package the received transaction with other transactions to generate the latest block, and send the generated latest block or the block header of the latest block to other node devices for consensus.
  • the node device with the accounting right has been agreed upon before this round of accounting. Therefore, after a node device receives a transaction, if it is not the accounting node of the current round, it can send the transaction to the accounting node. The accounting node of the current round can execute the transaction during or before packaging it with other transactions to generate the latest block. After the accounting node generates the latest block, it can send the latest block or its block header to other node devices for consensus.
  • the accounting node of the current round can package the received transaction to generate the latest block, and send the generated latest block or its block header to other node devices for consensus verification. If the other node devices receive the latest block or its block header and verification finds no problem, the latest block can be appended to the end of the original blockchain, completing the blockchain accounting process. While verifying the new block or block header sent by the accounting node, other nodes can also execute the transactions contained in the block.
  • Because the blockchain network system operates under the corresponding consensus mechanism, data that has been included in the blockchain database is difficult for any node to tamper with.
  • For example, on a blockchain using POW consensus, tampering with existing data requires an attack with at least 51% of the whole network's computing power. Therefore, the blockchain system ensures data security and resists attack and tampering to a degree that other centralized database systems cannot match. Data included in the distributed database of the blockchain is thus very difficult to attack or tamper with, which ensures the authenticity of the data information stored in the distributed database of the blockchain.
  • Example types of blockchain networks may include public blockchain networks, private blockchain networks, and consortium blockchain networks.
  • Although the term blockchain is usually associated with the Bitcoin cryptocurrency network, blockchain as used herein can refer generally to a DLS (Distributed Ledger System) without reference to any specific use case.
  • the consensus process is controlled by the nodes of the consensus network.
  • hundreds, thousands, or even millions of entities can collaborate in a public blockchain network, and each entity operates at least one node in the public blockchain network. Therefore, the public blockchain network can be considered as a public network relative to the participating entities.
  • Example public blockchain networks include the Bitcoin network, which is a peer-to-peer payment network.
  • the Bitcoin network utilizes a distributed ledger, called a blockchain.
  • the term blockchain is often used to refer to distributed ledgers in general, not specifically to the Bitcoin network.
  • public blockchain networks support public transactions.
  • Public transactions are shared with all nodes in the public blockchain network and stored in the global blockchain.
  • the global blockchain is a blockchain replicated across all nodes. In other words, for the global blockchain, all nodes are in a completely consistent state.
  • consensus protocols include, but are not limited to, proof-of-work (POW) implemented in the Bitcoin network.
  • private blockchain networks are provided to specific entities, and specific entities centrally control read and write permissions. This entity controls which nodes can participate in the blockchain network. Therefore, private blockchain networks are often referred to as permissioned networks, which impose restrictions on who is allowed to participate in the network and the level of participation (for example, only in certain transactions). Various types of access control mechanisms can be used (for example, existing participants vote to add new entities, and regulators can control access).
  • the consortium blockchain network is private among participating entities.
  • the consensus process is controlled by an authorized group of nodes (consortium member nodes), and one or more nodes are operated by corresponding entities (for example, enterprises).
  • a consortium consisting of ten (10) entities e.g., enterprises
  • each entity operates at least one node in the consortium blockchain network. Therefore, in terms of participating entities, the consortium blockchain network can be considered a private network.
  • each entity (node) must sign each block in order to make the block valid and add the valid block to the blockchain.
  • at least a subset of entities (nodes) e.g., at least 7 entities
  • Smart contracts on the blockchain are contracts that can be triggered and executed by transactions on the blockchain. Smart contracts can be defined in the form of codes.
  • Taking Ethereum as an example, it lets users create and call complex logic in the Ethereum network.
  • Ethereum is a programmable blockchain, and its core is the Ethereum Virtual Machine (EVM).
  • Each Ethereum node can run the EVM.
  • EVM is a Turing-complete virtual machine, through which various complex logic can be realized.
  • Smart contracts that users publish and call in Ethereum run on the EVM.
  • EVM directly runs virtual machine code (virtual machine bytecode, hereinafter referred to as "bytecode"), so the smart contract deployed on the blockchain can be bytecode.
  • each node can execute the transaction in the EVM.
  • the From field of the transaction in the figure is used to record the address of the account that initiated the creation of the smart contract.
  • the Data field of the transaction can store the bytecode.
  • the To field of the transaction is a null (empty) account.
  • a contract account corresponding to the smart contract appears on the blockchain and has a specific address; for example, "0x68e12cf284" in each node in Figure 1 represents the address of the created contract account; the contract code (Code) and account storage (Storage) will be kept in the contract account.
  • the behavior of the smart contract is controlled by the contract code, and the account storage of the smart contract saves the state of the contract.
  • smart contracts enable virtual accounts containing contract codes and account storage to be generated on the blockchain.
  • the Data field of the transaction that creates the smart contract can store the bytecode of the smart contract.
  • the bytecode consists of a series of bytes, and each byte can identify an operation.
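To make the idea of bytes identifying operations concrete, the sketch below decodes a short byte sequence with a small, deliberately partial opcode table. The opcode values shown are standard EVM opcodes; the function name `disassemble` is hypothetical, and only `PUSH1` is treated as carrying a one-byte operand, so this is an illustration rather than a full disassembler.

```python
from typing import List

# Partial table of EVM opcodes; any other byte is reported as unknown.
OPCODES = {0x00: "STOP", 0x01: "ADD", 0x52: "MSTORE", 0x60: "PUSH1", 0xF3: "RETURN"}


def disassemble(bytecode: bytes) -> List[str]:
    """Translate bytecode into readable operation names, one entry per operation."""
    ops = []
    i = 0
    while i < len(bytecode):
        byte = bytecode[i]
        name = OPCODES.get(byte, f"UNKNOWN_0x{byte:02x}")
        if name == "PUSH1":
            # PUSH1 is followed by one byte of data, which is not itself an operation.
            operand = bytecode[i + 1 : i + 2].hex()
            ops.append(f"PUSH1 0x{operand}")
            i += 2
        else:
            ops.append(name)
            i += 1
    return ops
```

For example, `disassemble(bytes.fromhex("6080604052"))` yields `["PUSH1 0x80", "PUSH1 0x40", "MSTORE"]`.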
  • developers can choose a high-level language to write smart contract code instead of directly writing bytecode.
  • high-level languages such as Solidity, Serpent, LLL, etc. can be used.
  • smart contract code written in a high-level language can be compiled by a compiler to generate bytecode that can be deployed on the blockchain.
  • the contract code written with it is very similar to the class in the object-oriented programming language.
  • a variety of members can be declared in a contract, including state variables, functions, function modifiers, and events.
  • State variables are values permanently stored in the storage (Storage) field of the smart contract and are used to save the state of the contract.
  • each node can execute the transaction in the EVM.
  • the From field of the transaction in the figure is used to record the address of the account that initiated the invocation of the smart contract
  • the To field is used to record the address of the smart contract being called
  • the Data field of the transaction is used to record the method and parameters of invoking the smart contract.
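The roles of the From, To, and Data fields in contract creation and contract invocation can be summarized in a small sketch. The `Transaction` class and helper functions are hypothetical, and JSON is used only as a stand-in for the real call encoding, which this sketch does not reproduce.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    sender: str          # "From": address of the account that initiated the transaction
    to: Optional[str]    # "To": None for contract creation, contract address for a call
    data: bytes          # "Data": bytecode (creation) or method and parameters (call)


def create_contract_tx(sender: str, bytecode: bytes) -> Transaction:
    """Build a creation transaction: empty To field, bytecode in the Data field."""
    return Transaction(sender=sender, to=None, data=bytecode)


def call_contract_tx(sender: str, contract: str, method: str, params: dict) -> Transaction:
    """Build an invocation transaction addressed to an existing contract."""
    # JSON is an assumed placeholder encoding for the method name and parameters.
    payload = json.dumps({"method": method, "params": params}, sort_keys=True)
    return Transaction(sender=sender, to=contract, data=payload.encode("utf-8"))


def is_contract_creation(tx: Transaction) -> bool:
    """A transaction with an empty To field creates a contract."""
    return tx.to is None
```

A node can therefore tell the two kinds of transaction apart from the To field alone before deciding how to interpret the Data field.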
  • the account status of the contract account may change. Later, a client can view the account status of the contract account through the connected blockchain node.
  • Smart contracts can be independently executed on each node in the blockchain network in a prescribed manner. All execution records and data are stored on the blockchain, so after such transactions are executed, transaction records that cannot be tampered with or lost are saved on the blockchain.
  • A schematic diagram of creating and calling smart contracts is shown in Fig. 3.
  • Creating a smart contract in Ethereum involves writing the smart contract, compiling it into bytecode, and deploying it to the blockchain.
  • Invoking a smart contract in Ethereum is to initiate a transaction pointing to a smart contract address.
  • the EVM of each node can execute the transaction separately, and the smart contract code can be distributed in the virtual machine of each node in the Ethereum network.
  • the smart contract provided in one or more embodiments in this specification is used to detect the similarity between any text and the target original text.
  • the account of the above smart contract can store the full text content of the target original text, or various forms of index content derived by processing the full text content, to facilitate comparison.
  • Fig. 4 illustrates the process steps of a method for detecting text similarity based on a blockchain provided by an exemplary embodiment of the present specification.
  • the above method steps can be executed by any node device of the blockchain, including:
  • Step 402 Receive a first transaction that includes a first text, the first text being a text whose similarity to the target original text is to be detected.
  • the above-mentioned first transaction is a smart contract call transaction.
  • In addition to the first text whose similarity to the target original text is to be detected, the first transaction may also include the address of the called smart contract, the name of the called function, parameters, and so on.
  • This embodiment does not limit the identity of the sender of the above-mentioned first transaction, and any node device in the blockchain with the above-mentioned smart contract call authority can send the above-mentioned first transaction to the blockchain.
  • The above-mentioned first transaction is not only used to call the smart contract; based on the tamper-resistance mechanism of the blockchain, it can also serve as an evidence deposit for the content of the first text it contains.
  • Step 404 Invoke the smart contract, execute the text similarity detection logic declared by the smart contract, and obtain a similarity detection result between the first text unit and the target original text.
  • the smart contract declares a series of executable program code that can be executed on the EVM of the blockchain node device. Since the smart contract can be invoked and executed at any time after being deployed to the blockchain, the efficiency of detecting the similarity between the first text and the target original text is greatly improved; moreover, any change to the smart contract leaves a trace on the blockchain, so the contract carries a lower risk of human intervention and has the characteristics of decentralized authority.
  • the node devices of the blockchain network can execute the contract accurately and reach consensus on the execution result. Compared with a centralized detection program, which is open to human intervention, calling a smart contract to perform text similarity detection can produce fairer and more accurate execution results.
  • This embodiment does not limit the specific logic steps included in the text similarity detection logic declared by the smart contract. Those skilled in the art can design suitable text similarity detection logic for the target original text based on actual business needs.
  • the above-mentioned text similarity detection logic may adopt a similarity algorithm such as simhash: through word segmentation, hashing, weighting, merging, and dimensionality reduction, a simhash signature is generated separately for the target original text and for the first text. In the simhash algorithm, the closer the contents of two texts are, the closer their simhash values are, so the similarity of the two texts can be obtained by comparing the two simhash values.
  • This text similarity detection logic works well against wholesale plagiarism and small content modifications, but performs worse in scenarios such as extensive synonym substitution and splicing of article paragraphs.
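The simhash steps named above (word segmentation, hashing, weighting, merging, dimensionality reduction) can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for word segmentation, term frequency is used as the weight, and the first 8 bytes of an MD5 digest serve as the 64-bit token hash. None of these choices comes from the text.

```python
import hashlib
from collections import Counter

BITS = 64  # signature width; matches the 8 hash bytes used below


def simhash(text: str) -> int:
    """Compute a 64-bit simhash signature for a text."""
    weights = Counter(text.lower().split())      # segmentation + weighting
    totals = [0] * BITS
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")           # hashing
        for i in range(BITS):
            # Merging: add the weight for a 1 bit, subtract it for a 0 bit.
            totals[i] += weight if (token_hash >> i) & 1 else -weight
    signature = 0
    for i, total in enumerate(totals):           # dimensionality reduction
        if total > 0:
            signature |= 1 << i
    return signature


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two signatures; smaller means more similar."""
    return bin(a ^ b).count("1")
```

Two signatures are compared with `hamming_distance`; identical texts give a distance of 0, and a deployment would pick a threshold on this distance to decide whether two texts count as similar.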
  • the above-mentioned first text includes at least one first text unit with a preset length; the smart contract stores a number of target text vectors, each generated from a target text unit of the preset length included in the target original text.
  • This embodiment does not limit the specific expression form of the aforementioned preset length. It can be the length of a preset text unit such as a natural paragraph or a natural sentence, or a preset capacity such as 100K of text, and so on.
  • the first text used to compare the similarity with the target original text includes at least one first text unit with a preset length.
  • the first text should include at least one natural paragraph
  • the target original text can be divided into several target text units of the preset length; for example, the target original text can be divided into several target text units by natural paragraph, and then corresponding target text vectors are generated for those target text units based on a text vector generation algorithm (such as the doc2vec algorithm).
  • executing the text similarity detection logic declared by the smart contract includes (the following logical steps 4042 to 4046 are not shown in Figure 4):
  • Step 4042 Generate at least one first text vector for the at least one first text unit.
  • the same algorithm (such as the doc2vec algorithm) used for generating the aforementioned target text vector is used to generate at least one first text vector for the aforementioned at least one first text unit.
  • Step 4044 Calculate the distance between the at least one first text vector and each target text vector.
  • the methods for calculating the distance between the first text vector and the target text vector include, but are not limited to, the cosine distance, the Pearson distance, the Euclidean distance, the city-block distance, and so on.
  • Step 4046 Compare the distance with a preset distance threshold; when the distance is less than the preset distance threshold, the similarity detection result of the at least one first text unit is similar.
  • the similarity detection result between the first text and the target original text includes the similarity detection result of the at least one first text unit.
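Steps 4042 to 4046 can be sketched in a few lines. The sketch assumes the vector generation algorithm is supplied as a callable `embed` (a hypothetical stand-in for something like doc2vec, which is not reproduced here), and it treats a first text unit as similar when its distance to at least one target text vector is below the threshold. The function names and the exact definitions of the Pearson and city-block distances are assumptions for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def euclidean_distance(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def block_distance(u: Vector, v: Vector) -> float:
    """City-block (Manhattan) distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


def pearson_distance(u: Vector, v: Vector) -> float:
    """1 minus the Pearson correlation, i.e. cosine distance of the centered vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine_distance([a - mean_u for a in u], [b - mean_v for b in v])


def detect_similar_units(
    first_units: List[str],
    target_vectors: List[Vector],
    embed: Callable[[str], Vector],
    distance: Callable[[Vector, Vector], float],
    distance_threshold: float,
) -> List[bool]:
    """Return, for each first text unit, whether it is similar to any target unit."""
    results = []
    for unit in first_units:
        first_vector = embed(unit)                                        # step 4042
        distances = [distance(first_vector, t) for t in target_vectors]   # step 4044
        results.append(any(d < distance_threshold for d in distances))    # step 4046
    return results
```

Passing the distance function as a parameter keeps the detection loop independent of which of the listed distance measures a deployment chooses.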
  • this embodiment exemplarily shows the similarity detection process of a first text unit to explain the specific process of the above steps 4042 to 4046.
  • the target original text contains a natural paragraph: "The roommate looks very unwell today and seems about to fall over on standing up; after taking a measurement, the body temperature is quite normal. The weather is too hot, and it is indeed easy to get heat stroke now. Drinking mung bean soup is useful for treating heat stroke."
  • the first text contains a natural paragraph: "The boyfriend looks very unwell this afternoon, cannot stand steadily, and seems about to fall over; the body temperature is still normal, 36 degrees. The weather is very hot now, so heat stroke may come more easily. Drinking mung bean soup and sour plum soup is useful for treating heat stroke."
  • the natural paragraph contained in the above first text may have been produced by "manuscript washing" (rewording) the above natural paragraph of the target original text, but direct comparison or the simhash algorithm cannot reflect the similarity of the two natural paragraphs well. In this example, the text vector distance is calculated to obtain the similarity of the two natural paragraphs.
  • the smart contract provided in this example can, before being released and deployed to the blockchain, already use the length of a natural paragraph as the preset length and generate a target text vector for each natural paragraph of the target original text containing the above natural paragraph; alternatively, after the smart contract code is released to the blockchain, the target text vector for each natural paragraph can be generated by calling a target text vector generation function on the target original text. This specification does not limit this.
  • The target text vector generation process may include:
  • preprocessing the target original text; the preprocessing can include removing punctuation, auxiliary words, and stop words, and other word segmentation processing;
  • dividing the target original text, or the preprocessed target original text, into target text units, for example with natural paragraphs as units;
  • using the vector generation algorithm to generate a target text vector from each target text unit.
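The generation process above can be sketched as a small pipeline. This sketch divides the raw text into natural paragraphs first and preprocesses each unit afterwards, which is one of the orderings the text allows. The stop-word list is a tiny hypothetical sample, blank lines are assumed to separate natural paragraphs, and `embed` is a caller-supplied stand-in for a vector generation algorithm such as doc2vec.

```python
import re
from typing import Callable, List, Sequence

# Hypothetical sample stop-word list; a real deployment would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of"}


def split_into_paragraphs(text: str) -> List[str]:
    """Divide a text into natural paragraphs (assumed to be separated by blank lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def preprocess(unit: str) -> List[str]:
    """Lower-case, drop punctuation, and remove stop words; returns the remaining tokens."""
    tokens = re.findall(r"\w+", unit.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_target_vectors(
    original_text: str,
    embed: Callable[[List[str]], Sequence[float]],
) -> List[Sequence[float]]:
    """Generate one target text vector per natural paragraph of the original text."""
    return [embed(preprocess(p)) for p in split_into_paragraphs(original_text)]
```

The same functions can be applied to a first text, so that first text vectors and target text vectors are produced by an identical procedure.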
  • the target text unit corresponding to the natural paragraph contained in the target original text in the above example can be "The roommate looks particularly bad today, and I feel that the whole person stands up and will fall down and measure the body temperature. It is normal. The weather is too normal. Heat is indeed easy to treat heat stroke. Drinking mung bean soup is useful.” Based on the doc2vec algorithm, the target text vector generated by the target text unit can be as shown in the second row of data in Table 1.
  • the smart contract’s processing process for the first text containing the natural paragraph is similar to the above process.
  • the first text unit corresponding to the natural paragraph contained in the first text in the example above can be "boyfriend watching this afternoon My face is very bad when I stand up and I don’t feel like I’m going to fall down. Body temperature is still normal 36 degrees. Now the weather is very hot. It may be easier to treat heatstroke. Drinking mung bean soup and sour plum soup is useful.”
  • the first text vector generated from the above first text unit can be as shown in the third row of data in Table 1.
  • the distance calculation is performed on the text vectors described in the second and third rows of the table, and the distance between the first text vector and the target text vector can be obtained: 0.9270391810208355.
  • the algorithm used to generate text vectors for text units of the preset length is usually a deep learning algorithm, such as the doc2vec algorithm, so the text vector generation process is largely unaffected by synonym substitution within a text unit and generates similar text vectors for similar text units; when the distance between the target text vector and the first text vector is less than the preset threshold, the target text unit corresponding to the target text vector and the first text unit corresponding to the first text vector are similar texts. Therefore, this embodiment can achieve a more accurate similarity detection effect against text infringement carried out through synonym substitution or manuscript washing.
  • this embodiment does not limit whether the first text units contained in the first text are derived from the same text or from different texts; this embodiment can use the first text unit as the detection unit for text similarity detection.
  • any node device of the blockchain can call the smart contract based on one or more first text units to detect the similarity between the one or more first text units and the multiple target text units included in the target original text; for ease of management, the above smart contract can create indexes for the multiple target text units included in the target original text, so that the target text unit is used as the detection unit, which further improves the efficiency and accuracy of text similarity detection.
  • in the above embodiment, the first text used to detect similarity to the target original text may contain only one or a few first text units of the preset length that are similar to the target original text, whereas the complete article to which the first text belongs may contain more text units that can be regarded as similar to the target original text.
  • therefore, after a first text unit similar to the target original text (that is, to one of its target text units) is obtained, network-wide monitoring and crawling can be performed based on that similar first text unit to obtain a second text containing at least one similar first text unit, a similar first text unit being a first text unit whose similarity detection result in the above embodiment is similar.
  • the similarity between the second text units of the preset length contained in the second text and the target original text is then detected, to determine whether the second text is similar enough to the target original text to affect the originality of the second text.
  • therefore, as shown in Figure 4, the text similarity detection method provided by another illustrated embodiment further includes:
  • Step 406 Obtain a second text that includes at least one similar first text unit, where the similar first text unit is a first text unit whose similarity detection result is similar.
  • Step 408 Divide the second text into a plurality of second text units of the preset length.
  • the division manner of the second text unit of the preset length may be consistent with the division manner of the first text unit or the target text unit described above.
  • Step 410 Send a second transaction that includes the second text units to the blockchain to invoke the smart contract, execute the text similarity detection logic declared by the smart contract, and obtain the similarity detection result between each second text unit and the target original text.
  • the above-mentioned second transaction can be a single transaction or multiple transactions. Following the process of steps 402 to 404 shown in the above embodiment, the similarity detection result between each second text unit and the target original text is obtained.
  • Step 412 Calculate the similarity between the second text and the target original text based on the second text unit whose similarity detection result is similar.
  • This embodiment does not limit the specific method of calculating the similarity between the second text and the target original text from the second text units whose similarity detection result is similar. Those skilled in the art can design a suitable text similarity calculation method for the target original text based on specific influencing factors such as the field of the text content, the characteristics of the text, and the definition of text infringement in that field.
  • in one illustrated embodiment, calculating the similarity between the second text and the target original text based on the second text units whose similarity detection result is similar includes: calculating the ratio of the total content of the second text units whose similarity detection result is similar to the entire content of the second text, and using it as the similarity between the second text and the target original text. For example, if the number of second text units that are similar to the target original text is N, and the total number of paragraphs of the second text is M1, the similarity between the second text and the target original text may be N/M1.
  • alternatively, the ratio of the total content of the second text units whose similarity detection result is similar to the entire content of the target original text is calculated as the similarity between the second text and the target original text. For example, if the number of second text units that are similar to the target original text is N, and the total number of paragraphs of the target original text is M2, the similarity between the second text and the target original text may be N/M2.
  • alternatively, the larger or the smaller of the two values N/M1 and N/M2 can be used as the similarity between the second text and the target original text.
  • in another illustrated embodiment, an average similarity can be calculated from the similarities between all the similar second text units contained in the second text and their corresponding target text units (or the maximum of those similarities can be taken) and used as the similarity between the second text and the target original text.
  • the similarity between the second text unit with similarity and the corresponding target text unit may be calculated based on the ratio or difference between the distance between the second text vector and the corresponding target text vector and a preset distance threshold, which is not limited herein.
  • when the similarity between the second text and the target original text is greater than a preset similarity threshold, the second text can be determined to be an infringing text.
  • when the similarity between the second text units contained in the second text and the target text units is detected, the smart contract is called through the second transaction on the blockchain; after consensus verification by the blockchain node devices, the content of the second text units contained in the second transaction, and the similarity of that content to the target original text, are already recorded on the blockchain as evidence.
  • this effectively overcomes a shortcoming of existing text infringement detection or text similarity detection, in which detected infringing text is usually deposited as textual or electronic evidence through a notary office, the time window from detection to deposit is long, and a possible infringer can easily deny the infringement or eliminate the evidence.
  • furthermore, in another illustrated embodiment, when the similarity between the second text and the target original text is greater than the preset similarity threshold, the method further includes: sending to the blockchain a deposit certificate transaction containing the second text. The deposit certificate transaction may also include source information of the second text, such as the URL where it was published, so that the infringing nature of the second text is further recorded as blockchain evidence based on the anti-tampering mechanism of the blockchain.
  • based on the process of obtaining the similarity between a second text and the target original text, multiple different second texts can be ranked by their similarity to the target original text, so that corresponding infringement countermeasures can be adopted according to the ranking, such as notifying the infringing party to stop the infringement immediately, claiming compensation for infringement, or sending a proposal to share rights.
  • the embodiment of the present specification also provides a text similarity detection device 50 based on a blockchain.
  • the device 50 may be implemented by software, or by hardware or a combination of software and hardware. Taking software implementation as an example, as a logical device, it is formed when the CPU (Central Processing Unit) of the equipment where it is located reads the corresponding computer program instructions into memory and runs them. From a hardware perspective, in addition to the CPU, memory, and storage shown in Figure 6, the equipment where the device is located usually includes other hardware such as chips for wireless signal transmission and reception, and/or boards for implementing network communication functions.
  • as shown in Figure 5, this specification also provides a blockchain-based text similarity detection device 50, which is applied to a blockchain network deployed with a smart contract for detecting similarity to the target original text.
  • the device 50 is applied on the node device side of the blockchain network and includes:
  • the receiving unit 502 receives a first transaction that includes a first text, the first text being a text whose similarity to the target original text is to be detected;
  • the execution unit 504 calls the smart contract, executes the text similarity detection logic declared by the smart contract, and obtains the similarity detection result between the first text and the target original text.
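  • the first text includes at least one first text unit of a preset length;
  • the smart contract stores a number of target text vectors, each generated from a target text unit of the preset length contained in the target original text;
  • executing the text similarity detection logic declared by the smart contract includes: generating at least one first text vector for the at least one first text unit; calculating the distance between the at least one first text vector and each target text vector; and comparing the distance with a preset distance threshold, where, when the distance is less than the preset distance threshold, the similarity detection result of the at least one first text unit is similar;
  • the similarity detection result between the first text and the target original text includes the similarity detection result of the at least one first text unit.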
  • the smart contract generates a target text vector index for the plurality of target text vectors.
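  • the device 50 further includes:
  • an acquisition unit, which obtains a second text containing at least one similar first text unit, a similar first text unit being a first text unit whose similarity detection result is similar;
  • a dividing unit, which divides the second text into a plurality of second text units of the preset length;
  • a sending unit, which sends a second transaction including the second text units to the blockchain to call the smart contract, execute the text similarity detection logic declared by the smart contract, and obtain the similarity detection result between each second text unit and the target original text;
  • a calculation unit, which calculates the similarity between the second text and the target original text based on the second text units whose similarity detection result is similar.
  • the calculation unit is further configured to: calculate, as the similarity between the second text and the target original text, the ratio of the total content of the second text units whose similarity detection result is similar to the entire content of the second text, or the ratio of that total content to the entire content of the target original text.
  • when the similarity between the second text and the target original text is greater than a preset similarity threshold, the sending unit is further configured to: send a deposit certificate transaction to the blockchain, the deposit certificate transaction including the second text and source information of the second text.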
  • the device embodiments described above are merely illustrative.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the units or modules can be selected according to actual needs to achieve the purpose of the solution in this specification. Those of ordinary skill in the art can understand and implement it without creative work.
  • a typical implementation device is a computer.
  • the specific form of the computer can be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email sending and receiving device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • the embodiment of this specification also provides a computer device.
  • the computer device includes a memory and a processor.
  • the memory stores a computer program that can be run by the processor; when running the stored computer program, the processor executes the steps of the blockchain-based text similarity detection method performed by the blockchain node device in the embodiments of this specification. For a detailed description of those steps, refer to the preceding content; it is not repeated here.
  • the embodiments of this specification also provide a computer-readable storage medium on which computer programs are stored. When run by a processor, these computer programs execute the steps of the blockchain-based text similarity detection method performed by the blockchain node device in the embodiments of this specification. For a detailed description of those steps, refer to the preceding content; it is not repeated here.
  • the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in computer readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer readable media.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined in this document, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • the embodiments of this specification can be provided as methods, systems or computer program products. Therefore, the embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of this specification can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Health & Medical Sciences (AREA)
  • Technology Law (AREA)
  • Computing Systems (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

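This specification provides a blockchain-based text similarity detection method and apparatus, applied to a blockchain network on which a smart contract for detecting similarity to a target original text is deployed. The method is executed by a node device of the blockchain network and includes: receiving a first transaction containing a first text, the first text being a text whose similarity to the target original text is to be detected; and calling the smart contract, executing the text similarity detection logic declared by the smart contract, and obtaining a similarity detection result between the first text and the target original text.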

Description

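Blockchain-based text similarity detection method and apparatus, and electronic device
Technical Field
One or more embodiments of this specification relate to the field of blockchain technology, and in particular to a blockchain-based text similarity detection method and apparatus, and an electronic device.
Background
Blockchain technology, also known as distributed ledger technology, is an emerging technology in which several computing devices jointly participate in "bookkeeping" and jointly maintain a complete distributed database. Because blockchain technology is decentralized, open and transparent, lets every computing device participate in recording to the database, and allows data to be synchronized quickly among the computing devices, it has been widely applied in many fields.
Summary
In view of this, one or more embodiments of this specification provide a blockchain-based text similarity detection method, apparatus, computer device and computer-readable storage medium.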
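To achieve the above objective, one or more embodiments of this specification provide a blockchain-based text similarity detection method, applied to a blockchain network on which a smart contract for detecting similarity to a target original text is deployed. The method is executed by a node device of the blockchain network and includes:
receiving a first transaction containing a first text, the first text being a text whose similarity to the target original text is to be detected;
calling the smart contract, executing the text similarity detection logic declared by the smart contract, and obtaining a similarity detection result between the first text and the target original text.
In another illustrated implementation, the first text includes at least one first text unit of a preset length; the smart contract stores several target text vectors, each generated from a target text unit of the preset length contained in the target original text; and executing the text similarity detection logic declared by the smart contract includes:
generating at least one first text vector for the at least one first text unit;
calculating the distance between the at least one first text vector and each target text vector;
comparing the distance with a preset distance threshold; when the distance is less than the preset distance threshold, the similarity detection result of the at least one first text unit is similar; the similarity detection result between the first text and the target original text includes the similarity detection result of the at least one first text unit.
In another illustrated implementation, the smart contract has generated a target text vector index for the several target text vectors.
In another illustrated implementation, the method further includes:
obtaining a second text containing at least one similar first text unit, a similar first text unit being a first text unit whose similarity detection result is similar;
dividing the second text into a plurality of second text units of the preset length;
sending a second transaction containing the second text units to the blockchain to call the smart contract, execute the text similarity detection logic declared by the smart contract, and obtain a similarity detection result between each second text unit and the target original text;
calculating the similarity between the second text and the target original text based on the second text units whose similarity detection result is similar.
In another illustrated implementation, calculating the similarity between the second text and the target original text based on the second text units whose similarity detection result is similar includes:
calculating, as the similarity between the second text and the target original text, the ratio of the total content of the second text units whose similarity detection result is similar to the entire content of the second text, or the ratio of that total content to the entire content of the target original text.
In another illustrated implementation, when the similarity between the second text and the target original text is greater than a preset similarity threshold, the method further includes:
sending a deposit certificate transaction to the blockchain, the deposit certificate transaction including the second text and source information of the second text.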
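Correspondingly, this specification also provides a blockchain-based text similarity detection apparatus, applied to a blockchain network on which a smart contract for detecting similarity to a target original text is deployed. The apparatus is applied on the node device side of the blockchain network and includes:
a receiving unit, which receives a first transaction containing a first text, the first text being a text whose similarity to the target original text is to be detected;
an execution unit, which calls the smart contract, executes the text similarity detection logic declared by the smart contract, and obtains a similarity detection result between the first text and the target original text.
In another illustrated implementation, the first text includes at least one first text unit of a preset length; the smart contract stores several target text vectors, each generated from a target text unit of the preset length contained in the target original text; and executing the text similarity detection logic declared by the smart contract includes:
generating at least one first text vector for the at least one first text unit;
calculating the distance between the at least one first text vector and each target text vector;
comparing the distance with a preset distance threshold; when the distance is less than the preset distance threshold, the similarity detection result of the at least one first text unit is similar; the similarity detection result between the first text and the target original text includes the similarity detection result of the at least one first text unit.
In another illustrated implementation, the smart contract has generated a target text vector index for the several target text vectors.
In another illustrated implementation, the apparatus further includes:
an acquisition unit, which obtains a second text containing at least one similar first text unit, a similar first text unit being a first text unit whose similarity detection result is similar;
a dividing unit, which divides the second text into a plurality of second text units of the preset length;
a sending unit, which sends a second transaction containing the second text units to the blockchain to call the smart contract, execute the text similarity detection logic declared by the smart contract, and obtain a similarity detection result between each second text unit and the target original text;
a calculation unit, which calculates the similarity between the second text and the target original text based on the second text units whose similarity detection result is similar.
In another illustrated implementation, the calculation unit is further configured to:
calculate, as the similarity between the second text and the target original text, the ratio of the total content of the second text units whose similarity detection result is similar to the entire content of the second text, or the ratio of that total content to the entire content of the target original text.
In another illustrated implementation, when the similarity between the second text and the target original text is greater than a preset similarity threshold, the sending unit is further configured to:
send a deposit certificate transaction to the blockchain, the deposit certificate transaction including the second text and source information of the second text.
Correspondingly, this specification also provides a computer device, including a memory and a processor; the memory stores a computer program that can be run by the processor; when running the computer program, the processor executes the blockchain-based text similarity detection method described in the above implementations.
Correspondingly, this specification also provides a computer-readable storage medium on which a computer program is stored; when run by a processor, the computer program executes the blockchain-based text similarity detection method described in the above implementations.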
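Brief Description of the Drawings
Figure 1 is a schematic diagram of creating a smart contract according to an exemplary embodiment;
Figure 2 is a schematic diagram of calling a smart contract according to an exemplary embodiment;
Figure 3 is a schematic diagram of creating and calling a smart contract according to an exemplary embodiment;
Figure 4 is a flowchart of a blockchain-based text similarity detection method according to an exemplary embodiment;
Figure 5 is a schematic diagram of a blockchain-based text similarity detection apparatus according to an exemplary embodiment;
Figure 6 is a hardware structure diagram for running an embodiment of the blockchain-based text similarity detection apparatus provided in this specification.
Detailed Description
Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of one or more embodiments of this specification as detailed in the appended claims.
It should be noted that in other embodiments the steps of the corresponding method are not necessarily executed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. In addition, a single step described in this specification may be split into multiple steps in other embodiments, and multiple steps described in this specification may be merged into a single step in other embodiments.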
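With the spread of the Internet and the falling cost of copying content, more and more original Internet content is affected by unlawful infringement, which not only causes creators substantial economic losses but also weakens society's incentive to innovate. In addition, plagiarists use synonym substitution or article-rewriting tools, which makes infringement detection harder.
For example, infringement detection for Internet text content generally compares text content directly, measuring the overlap of words between two articles. The drawback of direct comparison is obvious: a slightly modified copy cannot be detected. For instance, if the hash of the text content of original article A is md5_A, changing a single character in the copied article yields a hash completely different from md5_A, so the overlap between the copied article and the original text is hard to detect.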
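The weakness of hash-based comparison described above can be illustrated with a minimal sketch. The sample sentences and the helper name `md5_hex` are illustrative assumptions and are not taken from the specification.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of a text as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical original text and a copy that differs in one character.
original = "Drinking mung bean soup helps treat heatstroke."
copied = "Drinking mung bean soup helps treat heatstroke!"

digest_a = md5_hex(original)
digest_b = md5_hex(copied)

# The digests only tell us "identical" or "not identical"; they carry no
# information about how close the two texts are.
texts_identical = digest_a == digest_b
print(digest_a, digest_b, texts_identical)
```

An exact-match digest is therefore suited to proving that a text is unchanged, not to measuring how much of it was copied.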
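Moreover, in existing text infringement detection or text similarity detection, detected infringing text is usually deposited as textual or electronic evidence through a notary office. The time window from detection to deposit is long, so a possible infringer can easily deny the infringement or eliminate the evidence.
In view of this, this specification provides a blockchain-based text similarity detection method, applied to a blockchain network on which a smart contract for detecting similarity to a target original text is deployed.
The blockchain network described in one or more embodiments of this specification may specifically refer to a P2P network system with a distributed data storage structure that node devices form through a consensus mechanism. Data in the blockchain is distributed across "blocks" that are linked in time, each block may contain a data digest of the previous block, and full data backup on all or some nodes is achieved depending on the specific consensus mechanism (such as PoW, PoS, DPoS or PBFT).
Real data generated in the physical world can be constructed into a standard transaction format supported by the blockchain and then published to the blockchain. The node devices in the blockchain perform consensus processing on the received transaction; after consensus is reached, the node device acting as the bookkeeping node packs the transaction into a block, where it is persisted on the blockchain as evidence.
The consensus algorithms supported in the blockchain may include: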
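a first type of consensus algorithm, in which node devices must compete for the bookkeeping right in each bookkeeping round; for example, Proof of Work (PoW), Proof of Stake (PoS) and Delegated Proof of Stake (DPoS);
a second type of consensus algorithm, in which the bookkeeping node for each bookkeeping round is elected in advance (no competition for the bookkeeping right is needed); for example, Practical Byzantine Fault Tolerance (PBFT).
In a blockchain network using the first type of consensus algorithm, every node device competing for the bookkeeping right can execute a transaction after receiving it. One of the competing node devices may win the current round and become the bookkeeping node. The bookkeeping node can pack the received transaction together with other transactions to generate the latest block, and send the latest block or its block header to the other node devices for consensus.
In a blockchain network using the second type of consensus algorithm, the node device holding the bookkeeping right is agreed on before the current round. Therefore, after receiving a transaction, a node device that is not the bookkeeping node of the current round can forward the transaction to the bookkeeping node. The bookkeeping node of the current round can execute the transaction during or before packing it together with other transactions to generate the latest block. After generating the latest block, the bookkeeping node can send the latest block or its block header to the other node devices for consensus.
As described above, whichever of the consensus algorithms shown above the blockchain uses, the bookkeeping node of the current round can pack the received transactions to generate the latest block and send the latest block or its block header to the other node devices for consensus verification. If the other node devices verify the latest block or its block header without finding a problem, they can append the latest block to the end of the existing blockchain, completing the bookkeeping process. The other nodes may also execute the transactions contained in the block while verifying the new block or block header sent by the bookkeeping node.
As those skilled in the art know, because a blockchain network system runs under the corresponding consensus mechanism, data already recorded in the blockchain database is hard for any single node to tamper with. For example, on a blockchain using PoW consensus, existing data can only be tampered with by an attack controlling at least 51% of the computing power of the whole network. A blockchain system therefore offers data security and resistance to attack and tampering that centralized database systems cannot match. It follows that data recorded in the distributed database of the blockchain is resistant to attack and tampering, which ensures the authenticity and reliability of the data deposited there as evidence.
Example types of blockchain network may include public blockchain networks, private blockchain networks and consortium blockchain networks. Although the term blockchain is commonly associated with the Bitcoin cryptocurrency network, blockchain as used here may refer to a DLS (distributed ledger system) without reference to any particular use case.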
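In a public blockchain network, the consensus process is controlled by the nodes of the consensus network. For example, hundreds, thousands or even millions of entities can cooperate in a public blockchain network, each operating at least one node in it. A public blockchain network can therefore be regarded as a public network with respect to the participating entities. Example public blockchain networks include the Bitcoin network, which is a peer-to-peer payment network. The Bitcoin network uses a distributed ledger called a blockchain. As noted above, however, the term blockchain is generally used to refer to a distributed ledger without particular reference to the Bitcoin network.
Generally, a public blockchain network supports public transactions. A public transaction is shared with all nodes in the public blockchain network and stored in a global blockchain. A global blockchain is a blockchain replicated across all nodes; that is, all nodes are in a fully consistent state with respect to the global blockchain. To reach consensus (for example, to agree to add a block to the blockchain), a consensus protocol is implemented in the public blockchain network. Example consensus protocols include, but are not limited to, proof-of-work (PoW) as implemented in the Bitcoin network.
Generally, a private blockchain network is provided for a particular entity, which centrally controls read and write permissions. The entity controls which nodes can participate in the blockchain network. A private blockchain network is therefore often called a permissioned network, which restricts who is allowed to participate in the network and at what level (for example, only in certain transactions). Various types of access control mechanism can be used (for example, existing participants vote on adding a new entity, or a regulator controls admission).
Generally, a consortium blockchain network is private among the participating entities. In a consortium blockchain network, the consensus process is controlled by an authorized set of nodes (consortium member nodes), with one or more nodes operated by a respective entity (for example, an enterprise). For example, a consortium of ten (10) entities (for example, enterprises) can operate a consortium blockchain network, each operating at least one node in it. With respect to the participating entities, a consortium blockchain network can therefore be regarded as a private network. In some examples, every entity (node) must sign each block for the block to be valid and added to the blockchain. In some examples, at least a subset of the entities (nodes) (for example, at least 7 entities) must sign each block for the block to be valid and added to the blockchain.
It is contemplated that the implementations provided in this specification can be realized in any suitable type of blockchain network.
In practice, public, private and consortium chains may all provide smart contract functionality. A smart contract on a blockchain is a contract whose execution can be triggered by a transaction on the blockchain. A smart contract can be defined in the form of code.
Taking Ethereum as an example, users can create and call complex logic in the Ethereum network. Ethereum is a programmable blockchain whose core is the Ethereum Virtual Machine (EVM); every Ethereum node can run the EVM. The EVM is a Turing-complete virtual machine through which all kinds of complex logic can be implemented. The smart contracts that users publish and call in Ethereum run on the EVM. What the EVM actually runs directly is virtual machine code (virtual machine bytecode, hereinafter "bytecode"), so the smart contract deployed on the blockchain can be bytecode.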
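As shown in Figure 1, after Bob sends a transaction containing the information for creating a smart contract to the Ethereum network, each node can execute the transaction in the EVM. The From field of the transaction in the figure records the address of the account that initiated the creation of the smart contract, the contract code stored as the value of the Data field can be bytecode, and the value of the To field is a null (empty) account. Once the nodes reach agreement through the consensus mechanism, the smart contract is successfully created and users can subsequently call it.
After the smart contract is created, a contract account corresponding to it appears on the blockchain and has a specific address; for example, "0x68e12cf284…" on each node in Figure 1 represents the address of the contract account that was created. The contract code (Code) and account storage (Storage) are kept in the account storage of that contract account. The behavior of the smart contract is controlled by the contract code, while the account storage of the smart contract holds the state of the contract. In other words, a smart contract causes a virtual account containing contract code and account storage to be created on the blockchain.
As mentioned above, the Data field of a transaction that creates a smart contract can hold the bytecode of the smart contract. Bytecode consists of a sequence of bytes, each of which can identify an operation. For reasons of development efficiency, readability and so on, developers may choose not to write bytecode directly but to write the smart contract code in a high-level language such as Solidity, Serpent or LLL. Smart contract code written in a high-level language can be compiled by a compiler into bytecode that can be deployed on the blockchain.
Taking Solidity as an example, contract code written in it resembles a class in an object-oriented programming language. A contract can declare several kinds of member, including state variables, functions, function modifiers and events. A state variable is a value permanently stored in the account storage (Storage) field of the smart contract and is used to hold the state of the contract.
As shown in Figure 2, still taking Ethereum as an example, after Bob sends a transaction containing the information for calling a smart contract to the Ethereum network, each node can execute the transaction in the EVM. The From field of the transaction in the figure records the address of the account that initiated the call, the To field records the address of the smart contract being called, and the Data field records the method and parameters of the call. After the smart contract is called, the account state of the contract account may change. Subsequently, a client can view the account state of the contract account through the blockchain node it is connected to.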
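The difference between the contract-creation transaction of Figure 1 and the contract-call transaction of Figure 2 can be sketched as a small data structure. This is only an illustrative model of the From, To and Data fields described above, not the Ethereum wire format; the class name, the placeholder addresses and the placeholder payloads are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    sender: str         # "From" field: account that initiates the transaction
    to: Optional[str]   # "To" field: None (null) when a contract is created
    data: bytes         # "Data" field: bytecode, or the called method and parameters

    def is_contract_creation(self) -> bool:
        """A transaction with an empty To field creates a contract."""
        return self.to is None


# Placeholder values only; real addresses and bytecode are not given in the text.
create_tx = Transaction(sender="0xBobAccount", to=None, data=b"<contract bytecode>")
call_tx = Transaction(
    sender="0xBobAccount",
    to="0xContractAccount",
    data=b"<method selector and encoded parameters>",
)
```

In the method of this specification, the first transaction and the second transaction are both of the second kind: they point at the address of the deployed similarity-detection contract.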
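A smart contract can be executed independently, in the prescribed way, by every node in the blockchain network, and all execution records and data are stored on the blockchain. Once such a transaction has been executed, the blockchain therefore holds a transaction record that cannot be tampered with and will not be lost.
Figure 3 is a schematic diagram of creating and calling a smart contract. Creating a smart contract in Ethereum involves writing the smart contract, compiling it into bytecode and deploying it to the blockchain. Calling a smart contract in Ethereum means initiating a transaction that points to the address of the smart contract; the EVM of each node can execute the transaction separately, so that the smart contract code runs in a distributed manner in the virtual machine of every node in the Ethereum network.
The smart contract provided in one or more embodiments of this specification is used to detect the similarity between any text and the target original text. The account of the smart contract may store the full content of the target original text, or processed index content in various forms corresponding to the full content, to facilitate comparison.
Figure 4 shows the steps of the blockchain-based text similarity detection method provided by an exemplary embodiment of this specification. The steps can be executed by any node device of the blockchain and include:
Step 402: receive a first transaction containing a first text, the first text being a text whose similarity to the target original text is to be detected.
In this implementation, the first transaction is a smart contract call transaction. As described above, in addition to the first text whose similarity to the target original text is to be detected, it may include the address of the smart contract being called, the name of the called function, parameters and so on. This implementation does not limit the identity of the sender of the first transaction; any node device in the blockchain that has permission to call the smart contract can send the first transaction to the blockchain.
The first transaction does more than call the smart contract. As those skilled in the art will know, once the first transaction has passed consensus verification and been recorded in the distributed database of the blockchain, the anti-tampering mechanism of the blockchain serves as evidence for the content of the first text contained in the first transaction.
Step 404: call the smart contract, execute the text similarity detection logic declared by the smart contract, and obtain the similarity detection result between the first text and the target original text.
The smart contract declares a series of executable program code that can be executed on the EVM of a blockchain node device. Because the smart contract can be called and executed at any time once it is deployed on the blockchain, the efficiency of detecting the similarity between the first text and the target original text is greatly improved. Moreover, any change to the smart contract is traceable on the blockchain, so the risk of human interference is low and authority is decentralized; every node device of the blockchain network can execute the contract accurately and reach consensus on the execution result. Compared with a centralized detection program that may be subject to human interference, performing text similarity detection by calling a smart contract yields a fairer, more impartial and more accurate result.
This implementation does not limit the specific logic steps included in the text similarity detection logic declared by the smart contract; those skilled in the art can design suitable detection logic for the target original text based on actual business needs.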
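In one illustrated implementation, the text similarity detection logic may use a similarity algorithm such as simhash: simhash signatures are generated for the target original text and for the first text through word segmentation, hashing, weighting, merging and dimensionality reduction. Because in the simhash algorithm texts that are closer character by character have higher similarity, the two simhash values can be compared to obtain the similarity of the two texts. This detection logic still works well for large-scale copying with small modifications, but it performs worse in scenarios such as heavy synonym substitution or splicing together paragraphs from different articles.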
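The simhash steps named above (segmentation, hashing, weighting, merging, dimensionality reduction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the contract's actual code: tokens are obtained by whitespace splitting (Chinese text would need a word segmenter), the weight of a token is its frequency, MD5 truncated to 64 bits serves as the per-token hash, and the function names are hypothetical.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64  # assumed fingerprint width


def simhash(text: str) -> int:
    """Compute a 64-bit simhash fingerprint of a text."""
    # Segmentation and weighting: token frequency is used as the weight.
    weights = Counter(text.lower().split())
    accumulator = [0] * FINGERPRINT_BITS
    for token, weight in weights.items():
        # Hashing: take the first 8 bytes of the MD5 digest as a 64-bit value.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Merging: add the weight where the bit is 1, subtract it where it is 0.
        for bit in range(FINGERPRINT_BITS):
            if (token_hash >> bit) & 1:
                accumulator[bit] += weight
            else:
                accumulator[bit] -= weight
    # Dimensionality reduction: keep only the sign of each position.
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        if accumulator[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two fingerprints are then compared by Hamming distance, with a smaller distance indicating closer texts. Because the fingerprint depends on the tokens themselves, replacing many words with synonyms changes many token hashes, which is consistent with the limitation noted above.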
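In another illustrated implementation, the first text includes at least one first text unit of a preset length; the smart contract stores several target text vectors, each generated from a target text unit of the preset length contained in the target original text.
This implementation does not limit the specific form of the preset length: it may be the length of a preset text unit such as a natural paragraph or a natural sentence, or the length of a preset capacity such as 100K of text, and so on. The first text used for comparison with the target original text contains at least one first text unit of the preset length; for example, when the preset length is the length of a natural paragraph, the first text should contain at least one first text unit of natural-paragraph length, that is, at least one natural paragraph. Under the same preset-length rule, the target original text can be divided into several target text units of the preset length; for example, it can be divided by its natural paragraphs into several target text units, and a text vector generation algorithm (such as doc2vec) then generates the corresponding target text vector for each of them.
In this implementation, executing the text similarity detection logic declared by the smart contract includes (logic steps 4042 to 4046 below are not shown in Figure 4):
Step 4042: generate at least one first text vector for the at least one first text unit.
In this step, the same algorithm used to generate the target text vectors (such as doc2vec) is used to generate at least one first text vector for the at least one first text unit.
Step 4044: calculate the distance between the at least one first text vector and each target text vector.
Methods for calculating the distance between a first text vector and a target text vector include, but are not limited to, cosine distance, Pearson distance, Euclidean distance and city block distance.
Step 4046: compare the distance with a preset distance threshold; when the distance is less than the preset distance threshold, the similarity detection result of the at least one first text unit is similar.
Accordingly, the similarity detection result between the first text and the target original text includes the similarity detection result of the at least one first text unit.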
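Steps 4044 and 4046 can be sketched in plain Python as follows. The sketch assumes the vectors have already been generated (step 4042) and are non-zero and of equal length; the function names and the convention "distance = 1 minus cosine similarity" are assumptions made for illustration, since the specification does not fix a formula.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 minus cosine similarity; zero vectors are not handled in this sketch."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def euclidean_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block_distance(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_distance(a: Vector, b: Vector) -> float:
    """1 minus Pearson correlation: cosine distance of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])


def detect_units(
    first_vectors: List[Vector],
    target_vectors: List[Vector],
    threshold: float,
    distance: Callable[[Vector, Vector], float] = cosine_distance,
) -> List[bool]:
    """Step 4044: compare each first vector with every target vector.
    Step 4046: a unit is similar if any distance is below the threshold."""
    return [
        any(distance(vector, target) < threshold for target in target_vectors)
        for vector in first_vectors
    ]
```

The returned list holds one similarity detection result per first text unit, which together form the detection result for the first text.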
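Below, this implementation shows, by way of example, the similarity detection process for one first text unit, to explain steps 4042 to 4046 in detail.
In this example, the target original text contains the natural paragraph: "My roommate looked particularly unwell today and seemed about to collapse on standing up; a temperature check showed a normal reading. The weather is so hot that heatstroke really is easy to get now. Mung bean soup helps treat heatstroke." The first text contains the natural paragraph: "My boyfriend looked particularly unwell this afternoon, could barely stand and seemed about to collapse; his temperature was fairly normal, 36 degrees. The weather is very hot now, so heatstroke may come easily. Mung bean soup and sour plum soup help treat heatstroke."
It can be seen that the natural paragraph contained in the first text may have been obtained by rewriting the natural paragraph contained in the target original text, but neither direct comparison nor the simhash algorithm reflects the similarity of the two paragraphs well. In this example, the similarity of the two paragraphs is obtained by calculating the distance between text vectors.
The smart contract provided in this example may, before being published and deployed to the blockchain, already have generated a target text vector for each natural paragraph of the target original text, using natural-paragraph length as the preset length. Alternatively, after the smart contract code is published to the blockchain, the target text vector for each natural paragraph may be generated by calling a target text vector generation function with the target original text containing the above paragraph. This specification does not limit this.
The process of generating the target text vectors may include:
preprocessing the target original text, where the preprocessing may include word segmentation steps such as removing punctuation and removing particles and stop words;
dividing the target original text, or the preprocessed target original text, into text of the preset length to generate target text units, for example dividing it by natural paragraph;
using a vector generation algorithm to generate a target text vector from each target text unit.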
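The first two stages of the generation process listed above (preprocessing and division into units) can be sketched as follows. The stop-word list, the whitespace tokenization and the function names are illustrative assumptions; the vector generation stage (for example doc2vec) is outside this sketch.

```python
import re
from typing import FrozenSet, List

# Illustrative stop words only; a real list depends on the language and domain.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "a", "is", "and", "to"})


def split_into_units(text: str) -> List[str]:
    """Divide a text into units, here one unit per non-empty line (paragraph)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def preprocess(unit: str, stop_words: FrozenSet[str] = STOP_WORDS) -> str:
    """Remove punctuation and stop words from one text unit."""
    # Replace every character that is neither a word character nor whitespace.
    without_punctuation = re.sub(r"[^\w\s]", " ", unit)
    words = [w for w in without_punctuation.lower().split() if w not in stop_words]
    return " ".join(words)


def to_text_units(text: str) -> List[str]:
    """Split by paragraph first, then clean each unit."""
    return [preprocess(unit) for unit in split_into_units(text)]
```

Applying the same `to_text_units` routine to the target original text, the first text and any second text keeps the units comparable, which is the consistency requirement stated for the division of second text units.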
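After text preprocessing, the target text unit corresponding to the natural paragraph contained in the target original text in the above example may be the same paragraph with its punctuation stripped: "My roommate looked particularly unwell today and seemed about to collapse on standing up a temperature check showed a normal reading the weather is so hot that heatstroke really is easy to get now mung bean soup helps treat heatstroke". Based on the doc2vec algorithm, the target text vector generated from this target text unit may be as shown in the second row of data in Table 1.
The smart contract processes the first text containing the above natural paragraph in a similar way. After text preprocessing, the first text unit corresponding to the natural paragraph contained in the first text may be: "My boyfriend looked particularly unwell this afternoon could barely stand and seemed about to collapse his temperature was fairly normal 36 degrees the weather is very hot now so heatstroke may come easily mung bean soup and sour plum soup help treat heatstroke". Based on the doc2vec algorithm, the first text vector generated from this first text unit may be as shown in the third row of data in Table 1.
In this example, based on the cosine distance algorithm, a distance calculation on the text vectors in the second and third rows of Table 1 gives the distance between the first text vector and the target text vector: 0.9270391810208355.
When the distance between the vectors is less than the preset distance threshold, the similarity detection result of the first text unit is similar.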
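For reference, the cosine measure mentioned in this example is commonly defined as follows for a first text vector $u$ and a target text vector $v$ of dimension $n$. The symbols $u$, $v$, $n$ and $d_{\cos}$ are notation introduced here and are not taken from the specification.

```latex
\cos(u, v) = \frac{\sum_{i=1}^{n} u_i v_i}
                  {\sqrt{\sum_{i=1}^{n} u_i^{2}}\;\sqrt{\sum_{i=1}^{n} v_i^{2}}},
\qquad
d_{\cos}(u, v) = 1 - \cos(u, v)
```

Under this convention a smaller $d_{\cos}$ means closer text units, which matches the rule that a distance below the threshold yields the result "similar". The specification does not state which convention the figure 0.9270391810208355 follows; whichever is used, the preset distance threshold has to be expressed in the same convention as the computed value.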
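In this implementation, the algorithm used to generate vectors for text units of the preset length (both target text units and first text units) is usually a deep learning algorithm such as doc2vec, so that vector generation is not affected by synonym substitution within a text unit and similar text units yield similar text vectors. When the distance between the target text vector and the first text vector is less than the preset threshold, the target text unit corresponding to the target text vector and the first text unit corresponding to the first text vector are similar texts. This implementation can therefore detect similarity more accurately when an infringer uses synonym substitution or article rewriting to infringe a text.
It is worth noting that this implementation does not limit whether the first text units contained in the first text come from the same text or from different texts; text similarity can be detected with the first text unit as the unit of detection.
In this implementation, any node device of the blockchain can call the smart contract with one or more first text units to detect the similarity between those first text units and the multiple target text units included in the target original text. For ease of management, the smart contract can create an index over the target text units included in the target original text, so that detection proceeds per target text unit, which further improves the efficiency and accuracy of text similarity detection.
In the above implementation, the first text used to detect similarity to the target original text may contain only one or a few first text units of the preset length that are similar to the target original text, whereas the complete article to which the first text belongs may contain more text units that can be regarded as similar to the target original text. Therefore, after a first text unit similar to the target original text (that is, to one of its target text units) is obtained, network-wide monitoring and crawling can be performed based on that similar first text unit to obtain a second text containing at least one similar first text unit, a similar first text unit being a first text unit whose similarity detection result in the above implementation is similar. The similarity between the second text units of the preset length contained in the second text and the target original text is then detected, to determine whether the second text is similar enough to the target original text to affect the originality of the second text.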
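Therefore, as shown in Figure 4, the text similarity detection method provided by another illustrated implementation further includes:
Step 406: obtain a second text containing at least one similar first text unit, a similar first text unit being a first text unit whose similarity detection result is similar.
Step 408: divide the second text into a plurality of second text units of the preset length.
To keep the similarity comparison consistent, the second text units of the preset length can be divided in the same way as the first text units or the target text units described above.
Step 410: send a second transaction containing the second text units to the blockchain to call the smart contract, execute the text similarity detection logic declared by the smart contract, and obtain the similarity detection result between each second text unit and the target original text.
As the above implementations show, the second transaction can be a single transaction or multiple transactions. Following the process of steps 402 to 404 shown in the above implementation, the similarity detection result between each second text unit and the target original text is obtained.
Step 412: calculate the similarity between the second text and the target original text based on the second text units whose similarity detection result is similar.
This implementation does not limit the specific way of calculating the similarity between the second text and the target original text from the second text units whose similarity detection result is similar. Those skilled in the art can design a suitable text similarity calculation method for the target original text based on specific influencing factors such as the field of the text content, the characteristics of the text, and the definition of text infringement in that field.
In one illustrated implementation, calculating the similarity between the second text and the target original text based on the second text units whose similarity detection result is similar includes: calculating the ratio of the total content of the second text units whose similarity detection result is similar to the entire content of the second text, and using it as the similarity between the second text and the target original text. For example, if the number of second text units similar to the target original text is N and the total number of paragraphs of the second text is M1, the similarity between the second text and the target original text may be N/M1.
Alternatively, the ratio of the total content of the second text units whose similarity detection result is similar to the entire content of the target original text is calculated as the similarity between the second text and the target original text. For example, if the number of second text units similar to the target original text is N and the total number of paragraphs of the target original text is M2, the similarity between the second text and the target original text may be N/M2.
Alternatively, the larger or the smaller of the two values N/M1 and N/M2 can be used as the similarity between the second text and the target original text.
In another illustrated implementation, an average similarity can be calculated from the similarities between all the similar second text units contained in the second text and their corresponding target text units (or the maximum of those similarities can be taken) and used as the similarity between the second text and the target original text. The similarity between a similar second text unit and its corresponding target text unit can be calculated from the ratio or difference between the distance of the second text vector to the corresponding target text vector and the preset distance threshold; this is not limited here.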
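The document-level similarity options described above (N/M1, N/M2, the larger or smaller of the two, and the mean or maximum of per-unit similarities) can be sketched as follows. The function names and the `mode` parameter are hypothetical, and counting units stands in for "total content"; a length-weighted variant would be equally consistent with the text.

```python
from typing import Sequence


def ratio_similarity(
    similar_flags: Sequence[bool],
    target_unit_count: int,
    mode: str = "second",
) -> float:
    """Similarity of a second text to the target original text.

    similar_flags: one detection result per second text unit (M1 entries).
    target_unit_count: number of units in the target original text (M2).
    mode: "second" -> N/M1, "target" -> N/M2, "max" or "min" of the two.
    """
    n = sum(1 for flag in similar_flags if flag)
    by_second = n / len(similar_flags)
    by_target = n / target_unit_count
    if mode == "second":
        return by_second
    if mode == "target":
        return by_target
    if mode == "max":
        return max(by_second, by_target)
    if mode == "min":
        return min(by_second, by_target)
    raise ValueError(f"unknown mode: {mode}")


def aggregate_unit_similarities(scores: Sequence[float], use_max: bool = False) -> float:
    """Mean (or maximum) of the per-unit similarities of the similar units."""
    return max(scores) if use_max else sum(scores) / len(scores)
```

Note that N/M2 can exceed 1 if several second text units match the same target text unit, so an implementation may want to cap the value or count distinct matched target units instead.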
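When the similarity between the second text and the target original text is greater than a preset similarity threshold, the second text can be determined to be an infringing text. When the similarity between the second text units contained in the second text and the target text units is detected, the smart contract is called through the second transaction on the blockchain; after consensus verification by the blockchain node devices, the content of the second text units contained in the second transaction, and the similarity of that content to the target original text, are already recorded on the blockchain as evidence. This effectively overcomes a shortcoming of existing text infringement detection or text similarity detection, in which detected infringing text is usually deposited as textual or electronic evidence through a notary office, the time window from detection to deposit is long, and a possible infringer can easily deny the infringement or eliminate the evidence.
Furthermore, in another illustrated implementation, when the similarity between the second text and the target original text is greater than the preset similarity threshold, the method further includes: sending to the blockchain a deposit certificate transaction containing the second text. The deposit certificate transaction may also include source information of the second text, such as the URL where it was published, so that the infringing nature of the second text is further recorded as blockchain evidence based on the anti-tampering mechanism of the blockchain.
Based on the process, provided by the implementations of this specification, of obtaining the similarity between a second text and the target original text, multiple different second texts can be ranked by their similarity to the target original text, so that corresponding infringement countermeasures can be adopted according to the ranking, such as notifying the infringing party to stop the infringement immediately, claiming compensation for infringement, or sending a proposal to share rights.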
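The threshold check, the ranking of several second texts and the content of a deposit certificate transaction can be sketched as follows. The class name, field names, JSON layout and example URLs are all assumptions made for illustration; the specification only requires that the deposit certificate transaction carry the second text and its source information.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredText:
    source_url: str    # source information, e.g. the URL where the text was published
    content: str       # the second text
    similarity: float  # similarity to the target original text


def rank_infringing(texts: List[ScoredText], threshold: float) -> List[ScoredText]:
    """Keep texts whose similarity exceeds the threshold, most similar first."""
    flagged = [text for text in texts if text.similarity > threshold]
    return sorted(flagged, key=lambda text: text.similarity, reverse=True)


def deposit_payload(text: ScoredText) -> str:
    """Serialize the data a deposit certificate transaction would carry."""
    return json.dumps(
        {"second_text": text.content, "source": text.source_url},
        ensure_ascii=False,
        sort_keys=True,
    )


# Hypothetical example data.
candidates = [
    ScoredText("https://example.com/a", "copied article A", 0.9),
    ScoredText("https://example.com/b", "unrelated article B", 0.2),
    ScoredText("https://example.com/c", "partly copied article C", 0.6),
]
ranked = rank_infringing(candidates, threshold=0.5)
```

The position of a text in `ranked` could then drive the choice of countermeasure, for example a stop notice for lower-ranked texts and a compensation claim for the highest-ranked ones.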
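Corresponding to the above process, the embodiments of this specification also provide a blockchain-based text similarity detection apparatus 50. The apparatus 50 can be implemented in software, in hardware, or in a combination of software and hardware. Taking software implementation as an example, as an apparatus in the logical sense it is formed when the CPU (Central Processing Unit) of the device where it is located reads the corresponding computer program instructions into memory and runs them. At the hardware level, in addition to the CPU, memory and storage shown in Figure 6, the device where the apparatus is located usually includes other hardware such as chips for sending and receiving wireless signals, and/or other hardware such as boards for implementing network communication functions.
As shown in Figure 5, this specification also provides a blockchain-based text similarity detection apparatus 50, applied to a blockchain network on which a smart contract for detecting similarity to a target original text is deployed. The apparatus 50 is applied on the node device side of the blockchain network and includes:
a receiving unit 502, which receives a first transaction containing a first text, the first text being a text whose similarity to the target original text is to be detected;
an execution unit 504, which calls the smart contract, executes the text similarity detection logic declared by the smart contract, and obtains a similarity detection result between the first text and the target original text.
In another illustrated implementation, the first text includes at least one first text unit of a preset length; the smart contract stores several target text vectors, each generated from a target text unit of the preset length contained in the target original text; and executing the text similarity detection logic declared by the smart contract includes:
generating at least one first text vector for the at least one first text unit;
calculating the distance between the at least one first text vector and each target text vector;
comparing the distance with a preset distance threshold; when the distance is less than the preset distance threshold, the similarity detection result of the at least one first text unit is similar; the similarity detection result between the first text and the target original text includes the similarity detection result of the at least one first text unit.
In another illustrated implementation, the smart contract has generated a target text vector index for the several target text vectors.
In another illustrated implementation, the apparatus 50 further includes:
an acquisition unit, which obtains a second text containing at least one similar first text unit, a similar first text unit being a first text unit whose similarity detection result is similar;
a dividing unit, which divides the second text into a plurality of second text units of the preset length;
a sending unit, which sends a second transaction containing the second text units to the blockchain to call the smart contract, execute the text similarity detection logic declared by the smart contract, and obtain a similarity detection result between each second text unit and the target original text;
a calculation unit, which calculates the similarity between the second text and the target original text based on the second text units whose similarity detection result is similar.
In another illustrated implementation, the calculation unit is further configured to:
calculate, as the similarity between the second text and the target original text, the ratio of the total content of the second text units whose similarity detection result is similar to the entire content of the second text, or the ratio of that total content to the entire content of the target original text.
In another illustrated implementation, when the similarity between the second text and the target original text is greater than a preset similarity threshold, the sending unit is further configured to:
send a deposit certificate transaction to the blockchain, the deposit certificate transaction including the second text and source information of the second text.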
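For how the functions and roles of the units in the apparatus 50 are realized, see the implementation of the corresponding steps in the blockchain-based text similarity detection method performed by the blockchain node device; for related details, refer to the description of the method embodiments, which is not repeated here.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the units or modules can be selected according to actual needs to achieve the purpose of the solution in this specification. Those of ordinary skill in the art can understand and implement it without creative work.
The apparatuses, units and modules described in the above embodiments can be implemented by a computer chip or entity, or by a product with a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email sending and receiving device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Corresponding to the above method embodiments, the embodiments of this specification also provide a computer device. As shown in Figure 6, the computer device includes a memory and a processor. The memory stores a computer program that can be run by the processor; when running the stored computer program, the processor executes the steps of the blockchain-based text similarity detection method performed by the blockchain node device in the embodiments of this specification. For a detailed description of those steps, refer to the preceding content; it is not repeated here.
Corresponding to the above method embodiments, the embodiments of this specification also provide a computer-readable storage medium on which computer programs are stored. When run by a processor, these computer programs execute the steps of the blockchain-based text similarity detection method performed by the blockchain node device in the embodiments of this specification. For a detailed description of those steps, refer to the preceding content; it is not repeated here.
The above are merely preferred embodiments of this specification and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of this specification shall fall within its scope of protection.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
Memory may include non-permanent memory, random access memory (RAM) and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules or other data.
Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined in this document, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "includes a ..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes the element.
Those skilled in the art should understand that the embodiments of this specification can be provided as methods, systems or computer program products. Therefore, the embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of this specification can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.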

Claims (14)

  1. 一种基于区块链的文本相似性检测方法,应用于部署有用于检测与目标原创文本相似度的智能合约的区块链网络,所述方法由所述区块链网络的节点设备执行,包括:
    接收包含第一文本的第一交易,所述第一文本为待检测与所述目标原创文本的相似度的文本;
    调用所述智能合约,执行所述智能合约声明的文本相似性检测逻辑,获得所述第一文本与所述目标原创文本的相似性检测结果。
  2. 根据权利要求1所述的方法,所述第一文本包括至少一个预设长度的第一文本单元;所述智能合约存储有若干个目标文本向量,每个目标文本向量基于所述目标原创文本所包含的预设长度的目标文本单元而生成;所述执行所述智能合约声明的文本相似性检测逻辑包括:
    为所述至少一个第一文本单元生成至少一个第一文本向量;
    计算所述至少一个第一文本向量与每个目标文本向量的距离;
    对比所述距离与预设的距离阈值;当所述距离小于预设的距离阈值时,所述至少一个第一文本单元的相似性检测结果为相似;所述第一文本与所述目标原创文本的相似性检测结果包括所述至少一个第一文本单元的相似性检测结果。
  3. 根据权利要求2所述的方法,所述智能合约为所述若干个目标文本向量生成有目标文本向量索引。
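  4. The method according to claim 2 or 3, further comprising:
    obtaining a second text containing at least one similar first text unit, the similar first text unit being a first text unit whose similarity detection result is similar;
    dividing the second text into a plurality of second text units of the preset length;
    sending a second transaction containing the second text units to the blockchain, so as to invoke the smart contract and execute the text similarity detection logic declared by the smart contract, to obtain a similarity detection result between each second text unit and the target original text; and
    calculating the similarity between the second text and the target original text based on the second text units whose similarity detection results are similar.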
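  5. The method according to claim 4, wherein calculating the similarity between the second text and the target original text based on the second text units whose similarity detection results are similar comprises:
    calculating, as the similarity between the second text and the target original text, the ratio of the total content of the second text units whose similarity detection results are similar to the entire content of the second text, or the ratio of that total content to the entire content of the target original text.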
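  6. The method according to claim 4, wherein when the similarity between the second text and the target original text is greater than a preset similarity threshold, the method further comprises:
    sending an evidence-deposit transaction to the blockchain, the evidence-deposit transaction comprising the second text and source information of the second text.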
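  7. A blockchain-based text similarity detection apparatus, applied to a blockchain network on which a smart contract for detecting similarity to a target original text is deployed, the apparatus being applied on the node device side of the blockchain network and comprising:
    a receiving unit, configured to receive a first transaction containing a first text, the first text being a text whose similarity to the target original text is to be detected; and
    an execution unit, configured to invoke the smart contract and execute the text similarity detection logic declared by the smart contract, to obtain a similarity detection result between the first text and the target original text.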
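  8. The apparatus according to claim 7, wherein the first text comprises at least one first text unit of a preset length; the smart contract stores a plurality of target text vectors, each target text vector being generated based on a target text unit of the preset length contained in the target original text; and executing the text similarity detection logic declared by the smart contract comprises:
    generating at least one first text vector for the at least one first text unit;
    calculating the distance between the at least one first text vector and each target text vector; and
    comparing the distance with a preset distance threshold, wherein when the distance is less than the preset distance threshold, the similarity detection result of the at least one first text unit is similar; and the similarity detection result between the first text and the target original text comprises the similarity detection result of the at least one first text unit.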
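  9. The apparatus according to claim 8, wherein the smart contract has generated a target text vector index for the plurality of target text vectors.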
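  10. The apparatus according to claim 8 or 9, further comprising:
    an obtaining unit, configured to obtain a second text containing at least one similar first text unit, the similar first text unit being a first text unit whose similarity detection result is similar;
    a dividing unit, configured to divide the second text into a plurality of second text units of the preset length;
    a sending unit, configured to send a second transaction containing the second text units to the blockchain, so as to invoke the smart contract and execute the text similarity detection logic declared by the smart contract, to obtain a similarity detection result between each second text unit and the target original text; and
    a calculation unit, configured to calculate the similarity between the second text and the target original text based on the second text units whose similarity detection results are similar.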
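  11. The apparatus according to claim 10, wherein the calculation unit is further configured to:
    calculate, as the similarity between the second text and the target original text, the ratio of the total content of the second text units whose similarity detection results are similar to the entire content of the second text, or the ratio of that total content to the entire content of the target original text.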
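  12. The apparatus according to claim 10, wherein when the similarity between the second text and the target original text is greater than a preset similarity threshold, the sending unit is further configured to:
    send an evidence-deposit transaction to the blockchain, the evidence-deposit transaction comprising the second text and source information of the second text.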
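  13. A computer device, comprising a memory and a processor, wherein the memory stores a computer program executable by the processor, and the processor, when running the computer program, performs the method according to any one of claims 1 to 6.
  14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.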
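
The detection logic recited in claims 2, 4, and 5 can be illustrated with a minimal off-chain sketch. It is not the smart contract itself and makes several assumptions that the claims do not fix: the unit length, the distance threshold, the vector dimension, the use of Euclidean distance, the use of character count as the measure of "content", and above all the stand-in vectorizer `to_vector`, which simply buckets character code points. All function and constant names are hypothetical.

```python
import math
from typing import List, Sequence

# All constants and helper names below are hypothetical illustrations.
UNIT_LEN = 8           # assumed preset unit length, in characters
DIST_THRESHOLD = 0.5   # assumed preset distance threshold
DIM = 64               # assumed vector dimension


def split_units(text: str, unit_len: int = UNIT_LEN) -> List[str]:
    """Divide text into consecutive units of unit_len characters (the last may be shorter)."""
    return [text[i:i + unit_len] for i in range(0, len(text), unit_len)]


def to_vector(unit: str, dim: int = DIM) -> List[float]:
    """Stand-in vectorizer (an assumption): L2-normalized character counts bucketed by code point."""
    vec = [0.0] * dim
    for ch in unit:
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero for empty units
    return [v / norm for v in vec]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_similar(unit: str, target_vectors: Sequence[Sequence[float]],
               threshold: float = DIST_THRESHOLD) -> bool:
    """A unit is 'similar' when its vector is closer than the threshold to any target vector."""
    vec = to_vector(unit)
    return any(distance(vec, t) < threshold for t in target_vectors)


def similarity_ratio(second_text: str, target_vectors: Sequence[Sequence[float]],
                     unit_len: int = UNIT_LEN, threshold: float = DIST_THRESHOLD) -> float:
    """Share of the second text (by character count) that lies in similar units."""
    if not second_text:
        return 0.0
    units = split_units(second_text, unit_len)
    similar_chars = sum(len(u) for u in units if is_similar(u, target_vectors, threshold))
    return similar_chars / len(second_text)


# Target vectors that the contract would hold for the target original text.
target_original = "abcdefghijklmnop"
target_vectors = [to_vector(u) for u in split_units(target_original)]

# The first unit of the candidate copies the target; the second does not.
candidate = "abcdefghzzzzzzzz"
ratio = similarity_ratio(candidate, target_vectors)
print(ratio)  # prints 0.5
```

A caller could compare `ratio` with a preset similarity threshold to decide whether to send the evidence-deposit transaction of claim 6. The linear scan over `target_vectors` is the simplest choice; the index of claim 3 would replace it with a faster nearest-neighbour lookup.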
PCT/CN2020/072148 2019-07-26 2020-01-15 Blockchain-based text similarity detection method and apparatus, and electronic device WO2021017440A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/782,938 US10909317B2 (en) 2019-07-26 2020-02-05 Blockchain-based text similarity detection method, apparatus and electronic device
US17/164,741 US11100284B2 (en) 2019-07-26 2021-02-01 Blockchain-based text similarity detection method, apparatus and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910683370.2A CN110472201B (zh) 2019-07-26 2019-07-26 Blockchain-based text similarity detection method and apparatus, and electronic device
CN201910683370.2 2019-07-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/782,938 Continuation US10909317B2 (en) 2019-07-26 2020-02-05 Blockchain-based text similarity detection method, apparatus and electronic device

Publications (1)

Publication Number Publication Date
WO2021017440A1 WO2021017440A1 (zh) 2021-02-04

Family

ID=68508366

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/072148 WO2021017440A1 (zh) 2019-07-26 2020-01-15 Blockchain-based text similarity detection method and apparatus, and electronic device

Country Status (3)

Country Link
CN (2) CN111898360B (zh)
TW (1) TWI737183B (zh)
WO (1) WO2021017440A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
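CN113821474A (zh) * 2021-11-22 2021-12-21 武汉龙津科技有限公司 Data processing method, apparatus, device, and storage medium
CN113837629A (zh) * 2021-09-29 2021-12-24 土巴兔集团股份有限公司 Original content protection method, apparatus, and readable storage medium
CN114492373A (zh) * 2022-04-07 2022-05-13 中国信息通信研究院 Blockchain-based method and apparatus for determining infringement of works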

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
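CN111898360B (zh) * 2019-07-26 2023-09-26 创新先进技术有限公司 Blockchain-based text similarity detection method and apparatus, and electronic device
US10909317B2 (en) 2019-07-26 2021-02-02 Advanced New Technologies Co., Ltd. Blockchain-based text similarity detection method, apparatus and electronic device
CN110991358B (zh) * 2019-12-06 2024-03-19 腾讯科技(深圳)有限公司 Blockchain-based text comparison method and apparatus
CN110851761A (zh) * 2020-01-15 2020-02-28 支付宝(杭州)信息技术有限公司 Blockchain-based infringement detection method, apparatus, device, and storage medium
CN111414589B (zh) * 2020-03-20 2021-11-16 支付宝(杭州)信息技术有限公司 Blockchain-based method, apparatus, and device for reviewing the originality of works
CN113553839B (zh) * 2020-04-26 2024-05-10 北京中科闻歌科技股份有限公司 Text originality identification method, apparatus, electronic device, and storage medium
CN111539853B (zh) * 2020-06-19 2020-11-06 支付宝(杭州)信息技术有限公司 Method, apparatus, and device for determining a standard cause of action
CN112819616A (zh) * 2020-06-24 2021-05-18 支付宝(杭州)信息技术有限公司 Blockchain-based original work transaction method and apparatus, and electronic device
CN111917859B (zh) * 2020-07-28 2022-08-12 腾讯科技(深圳)有限公司 Data transmission method, apparatus, computer device, and storage medium
CN111930809A (zh) 2020-09-17 2020-11-13 支付宝(杭州)信息技术有限公司 Data processing method, apparatus, and device
CN113128592B (zh) * 2021-04-20 2022-10-18 重庆邮电大学 Identifier resolution method, system, and storage medium for heterogeneous medical devices
CN113177107B (zh) * 2021-05-25 2022-05-27 浙江工商大学 Smart contract similarity detection method based on syntax tree matching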

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832306A (zh) * 2017-11-28 2018-03-23 武汉大学 Doc2vec-based similar entity mining method
CN109002693A (zh) * 2018-07-17 2018-12-14 大连理工大学 Blockchain-based manuscript protection method
CN109086577A (zh) * 2018-08-06 2018-12-25 深圳市网心科技有限公司 Blockchain-based original music work management method and related device
CN109597878A (zh) * 2018-11-13 2019-04-09 北京合享智慧科技有限公司 Method for determining text similarity and related apparatus
US20190179951A1 (en) * 2017-12-08 2019-06-13 International Business Machines Corporation Distributed match and association of entity key-value attribute pairs
CN110472201A (zh) * 2019-07-26 2019-11-19 阿里巴巴集团控股有限公司 Blockchain-based text similarity detection method and apparatus, and electronic device

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8620872B1 (en) * 2008-09-10 2013-12-31 Amazon Technologies, Inc. System for comparing content
KR101577376B1 (ko) * 2014-01-21 2015-12-14 (주) 아워텍 System and method for determining copyright infringement based on text reference points
US20170075877A1 (en) * 2015-09-16 2017-03-16 Marie-Therese LEPELTIER Methods and systems of handling patent claims
CN106227897A (zh) * 2016-08-31 2016-12-14 青海民族大学 Tibetan paper copy detection method and system based on Tibetan sentence level
CN106649221A (zh) * 2016-12-06 2017-05-10 北京锐安科技有限公司 Method and apparatus for detecting duplicate text
CN107451553B (zh) * 2017-07-26 2019-08-02 北京大学深圳研究生院 Method for detecting violent events in video based on hypergraph transition
CN107832384A (zh) * 2017-10-28 2018-03-23 北京安妮全版权科技发展有限公司 Infringement detection method, apparatus, storage medium, and electronic device
CN107992470A (zh) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 Similarity-based text duplicate checking method and system
CN110019216B (zh) * 2017-12-07 2022-10-14 中国科学院上海高等研究院 Blockchain-based intellectual property data storage method, medium, and computer device
CN108197102A (zh) * 2017-12-26 2018-06-22 百度在线网络技术(北京)有限公司 Text data statistics method, apparatus, and server
US10909150B2 (en) * 2018-01-19 2021-02-02 Hypernet Labs, Inc. Decentralized latent semantic index using distributed average consensus
CN108550041A (zh) * 2018-03-20 2018-09-18 深圳市元征科技股份有限公司 Method, apparatus, and terminal for protecting original works
KR101938878B1 (ko) * 2018-06-14 2019-01-15 김보언 Blockchain-based copyright management system
CN108920633B (zh) * 2018-07-01 2021-12-03 湖北通远格知科技有限公司 Method for detecting paper similarity
CN108876560B (zh) * 2018-07-18 2020-10-02 阿里巴巴集团控股有限公司 Method and apparatus for blockchain-based credit evaluation of work publishers
CN109345416B (zh) * 2018-09-12 2021-09-21 连尚(新昌)网络科技有限公司 Method and device for recording citation relationships between works
CN109492982B (zh) * 2018-09-18 2023-07-18 平安科技(深圳)有限公司 Blockchain-based collaborative creation method, apparatus, and electronic device
KR101981699B1 (ko) * 2018-10-22 2019-05-23 김보언 Blockchain-based copyright management system for non-digital works
CN113283905A (zh) * 2018-10-26 2021-08-20 创新先进技术有限公司 Blockchain-based data evidence deposit and acquisition method and apparatus
CN109614775A (zh) * 2018-11-20 2019-04-12 安徽大学 Blockchain-based protection framework and method for copyright tracing
CN110457917B (zh) * 2019-01-09 2022-12-09 腾讯科技(深圳)有限公司 Method and related apparatus for filtering illegal content from blockchain data
CN110046480A (zh) * 2019-03-29 2019-07-23 阿里巴巴集团控股有限公司 Blockchain-based method and apparatus for allocating copyrights of works
PL3662637T3 (pl) * 2019-05-20 2021-09-20 Advanced New Technologies Co., Ltd. Identifying copyrighted material using embedded copyright information

Also Published As

Publication number Publication date
TW202105372A (zh) 2021-02-01
CN110472201A (zh) 2019-11-19
CN111898360B (zh) 2023-09-26
TWI737183B (zh) 2021-08-21
CN111898360A (zh) 2020-11-06
CN110472201B (zh) 2020-07-21

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20848550; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20848550; Country of ref document: EP; Kind code of ref document: A1)