CN111414648A - Corpus authentication method and apparatus - Google Patents

Corpus authentication method and apparatus

Info

Publication number
CN111414648A
Authority
CN
China
Prior art keywords
node
corpus
hash value
authenticated
hash
Prior art date
Legal status
Granted
Application number
CN202010143671.9A
Other languages
Chinese (zh)
Other versions
CN111414648B (en)
Inventor
何征宇
谭峰
Current Assignee
Transn Iol Technology Co ltd
Original Assignee
Transn Iol Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Transn Iol Technology Co ltd
Priority to CN202010143671.9A
Publication of CN111414648A
Application granted
Publication of CN111414648B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

An embodiment of the invention provides a corpus authentication method and apparatus. The method comprises: obtaining a corpus entry to be authenticated; updating the hash value of the corpus used for authentication according to the entry to be authenticated, to obtain an updated hash value; and determining the hash value of the corpus from the blockchain and, if the hash value determined from the blockchain is the same as the updated hash value, determining that the entry to be authenticated originates from the corpus. Because all entries of the whole corpus are used to attest a small number of entries to be authenticated, the reliability of the authentication process is improved. Because a blockchain cannot be tampered with, the hash value of a previously constructed corpus can be stored stably and remain credible over the long term, and the time at which the hash value was placed on the chain can serve as proof of the time of ownership of the corpus. Finally, only the hash value of the whole corpus is stored on the chain, which greatly reduces storage consumption compared with uploading the hash value of every entry in the corpus.

Description

Corpus authentication method and apparatus
Technical Field
The invention relates to the technical field of information retrieval, in particular to a corpus authentication method and device.
Background
A corpus is a repository of language material and a convenient tool for the quantitative analysis and qualitative study of language. Language material that actually occurs in real life, such as spoken sentences, passages from literary works, and passages from newspapers and magazines, is collected and then annotated or translated so that it is explained for different fields. The result is a corpus in which each entry consists of an original text and an explanation of that text. Sharing corpus resources maximizes the academic value and social benefit of a corpus.
Building a high-quality corpus requires its author to invest a great deal of effort and time in accumulating corpus data, including the annotations or translations of its entries. At present, however, there is no authentication scheme for corpora, so corpora are stolen and their authors suffer heavy losses.
Disclosure of Invention
Embodiments of the present invention provide a corpus authentication method and apparatus that overcome the above problems or at least partially solve the above problems.
In a first aspect, an embodiment of the present invention provides a corpus authentication method, including:
obtaining a corpus entry to be authenticated;
updating the hash value of the corpus used for authentication according to the entry to be authenticated, to obtain an updated hash value;
and determining the hash value of the corpus from the blockchain, and if the hash value determined from the blockchain is the same as the updated hash value, determining that the entry to be authenticated originates from the corpus.
Updating the hash value of the corpus used for authentication according to the entry to be authenticated comprises:
acquiring the double-array Trie of the corpus;
updating the node hash value of the root node of the double-array Trie according to the entry to be authenticated;
wherein the original text of each entry in the corpus is represented by a path of nodes in the double-array Trie, each node represents one splitting unit of the original text, and the order of the nodes along the path corresponds to the order of the splitting units in the original text;
and the node hash value of a node is obtained from the entry represented by the path between that node and the root node and from the node hash values of the node's child nodes.
Further, updating the node hash value of the root node of the double-array Trie according to the entry to be authenticated specifically includes:
determining, as the target node, the deepest node in the double-array Trie that belongs to the original text of the entry to be authenticated; taking each node on the path from the target node to the root node as a first associated node, and taking each sibling of a first associated node as a second associated node;
acquiring the node hash values of the child nodes of the target node as determined when the hash value of the corpus was uploaded to the blockchain, and updating the node hash value of the target node according to the entry to be authenticated and the node hash values of the child nodes of the target node;
acquiring the first hash value of each first associated node and the node hash value of each second associated node as determined when the hash value of the corpus was uploaded to the blockchain;
updating the node hash value of the root node of the double-array Trie according to the node hash value of the target node, the first hash values of the first associated nodes and the node hash values of the second associated nodes;
wherein the first hash value of a node is obtained from the entry represented by the path between that node and the root node.
Further, updating the node hash value of the target node according to the entry to be authenticated and the node hash values of the child nodes of the target node specifically includes:
performing a hash operation on the original text and the explanation of the entry to be authenticated to obtain the first hash value of the target node;
performing a hash operation on the node hash values of the child nodes of the target node to obtain the second hash value of the target node;
and performing a hash operation on the first hash value and the second hash value of the target node to obtain the node hash value of the target node.
Further, updating the node hash value of the root node of the double-array Trie according to the node hash value of the target node, the first hash values of the first associated nodes, and the node hash values of the second associated nodes specifically means executing the following iterative process, starting from the target node:
performing a hash operation on the node hash value of the current iteration node and the node hash values of its second associated nodes (its siblings) to obtain the second hash value of the parent node of the current iteration node; the current iteration node is the target node or a first associated node;
if the parent node is not the root node, performing a hash operation on the first hash value and the second hash value of the parent node to obtain the node hash value of the parent node, and taking the parent node as the node of the next iteration;
and if the parent node is the root node, taking the second hash value of the parent node as the node hash value of the root node.
Further, before the entry to be authenticated is obtained, the method further includes:
generating the hash value of the corpus, and uploading the hash value of the corpus to the blockchain;
correspondingly, generating the hash value of the corpus specifically includes:
constructing the double-array Trie of the corpus;
performing the following iterative process, starting from the leaf nodes of the double-array Trie:
for the node of the current iteration, if the node is a leaf node, performing a hash operation on the original text and the explanation of the entry represented by the path between the node and the root node to obtain the first hash value of the node, taking the first hash value as the node hash value of the node, and taking the parent node of the node as the node of the next iteration;
if the node is neither a leaf node nor the root node, performing a hash operation on the original text and the explanation of the entry represented by the path between the node and the root node to obtain the first hash value of the node, and performing a hash operation on the node hash values of the child nodes of the node to obtain the second hash value of the node; performing a hash operation on the first hash value and the second hash value of the node to obtain the node hash value of the node, and taking the parent node of the node as the node of the next iteration;
and if the node is the root node, performing a hash operation on the node hash values of the child nodes of the node to obtain the second hash value of the node, taking the second hash value as the node hash value of the node and also as the hash value of the corpus, and ending the iterative process.
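The generation procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it uses a plain dictionary-based trie rather than a true double-array layout, SHA-256 with length-prefixed inputs as the hash operation, a post-order recursion in place of the leaf-to-root iteration (the resulting hashes are the same), children ordered by splitting unit, and an empty explanation for nodes whose path is only a prefix. The names `h`, `Node`, `build_trie` and `compute_hash` are hypothetical.

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """SHA-256 over length-prefixed parts (choice of hash is an assumption)."""
    m = hashlib.sha256()
    for p in parts:
        m.update(len(p).to_bytes(4, "big") + p)
    return m.digest()

class Node:
    def __init__(self):
        self.children = {}      # splitting unit -> Node
        self.explanation = ""   # empty when the path is only a prefix (assumption)
        self.node_hash = b""

def build_trie(corpus: dict) -> Node:
    """corpus maps original text -> explanation; one node per character."""
    root = Node()
    for text, explanation in corpus.items():
        node = root
        for unit in text:
            node = node.children.setdefault(unit, Node())
        node.explanation = explanation
    return root

def compute_hash(node: Node, path: str = "", is_root: bool = True) -> bytes:
    # Children are hashed first, so every parent sees final child hashes.
    child_hashes = [compute_hash(child, path + unit, False)
                    for unit, child in sorted(node.children.items())]
    if is_root:
        node.node_hash = h(*child_hashes)            # root: second hash only
    else:
        first = h(path.encode("utf-8"), node.explanation.encode("utf-8"))
        if not child_hashes:
            node.node_hash = first                   # leaf: first hash only
        else:
            node.node_hash = h(first, h(*child_hashes))
    return node.node_hash

corpus = {"ab": "x", "ac": "y", "d": "z"}
corpus_hash = compute_hash(build_trie(corpus))       # value to place on chain
print(corpus_hash.hex())
```

Because every node hash feeds into its parent, changing the explanation of any single entry changes the root hash, which is what lets one on-chain value stand for the whole corpus.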
Further, the split unit is a byte.
In a second aspect, an embodiment of the present invention provides a corpus authentication apparatus, including:
an obtaining module, configured to obtain a corpus entry to be authenticated;
a hash value updating module, configured to update the hash value of the corpus used for authentication according to the entry to be authenticated, to obtain an updated hash value;
and a comparison module, configured to determine the hash value of the corpus from the blockchain and, if the hash value determined from the blockchain is the same as the updated hash value, to determine that the entry to be authenticated originates from the corpus.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method provided in the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first aspect.
With the corpus authentication method and apparatus provided by the embodiments of the invention, all entries of the whole corpus are used to attest a small number of entries to be authenticated, which improves the reliability of the authentication process. Because a blockchain cannot be tampered with, the hash value of a previously constructed corpus can be stored stably and remain credible over the long term, and the time at which the hash value was placed on the chain can serve as proof of the time of ownership of the corpus. Finally, only the hash value of the whole corpus is stored on the chain, which greatly reduces storage consumption compared with uploading the hash value of every entry in the corpus.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a corpus authentication method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a double-array Trie according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a corpus authentication device according to an embodiment of the present invention;
FIG. 4 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To solve the problem that corpora are stolen because the prior art has no authentication scheme for them, the inventive concept of the embodiments is as follows. After a corpus is constructed, the node hash value of the root node of the corpus is generated from all entries in the corpus, and only that node hash value is uploaded to the blockchain for storage. After an entry to be authenticated is obtained, the hash value of the corpus used for authentication is updated with that entry to obtain an updated node hash value. Because the hash value is generated from all entries in the corpus each time, the entry to be authenticated originates from the corpus if the hash values before and after the update are the same. Using all entries of the whole corpus to attest a small number of entries to be authenticated improves the reliability of the authentication process. Because a blockchain cannot be tampered with, the hash value of a previously constructed corpus can be stored stably and remain credible over the long term, and the time at which the hash value was placed on the chain can serve as proof of the time of ownership of the corpus. Finally, only the hash value of the whole corpus is stored on the chain, which greatly reduces storage consumption compared with uploading the hash value of every entry in the corpus.
Fig. 1 is a schematic flow chart of a corpus authentication method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
S101, obtaining the corpus entry to be authenticated.
The language of the entry to be authenticated is not particularly limited in the embodiment of the present invention; it may be, for example, Chinese, English, Japanese, or the like.
S102, updating the hash value of the corpus used for authentication according to the entry to be authenticated, to obtain the updated hash value.
It can be understood that, for it to be possible to authenticate whether the entry originates from the corpus, the corpus must contain at least one entry whose original text is the same as the original text of the entry to be authenticated.
The embodiment computes the hash value of the corpus used for authentication with the same algorithm by which the hash value was computed in advance. The embodiment may replace the entry in the corpus that has the same original text with the entry to be authenticated, thereby obtaining an updated corpus. Because an entry consists of an original text and an explanation, the updated corpus may be identical to the original corpus or different from it.
S103, determining the hash value of the corpus from the blockchain, and if the hash value determined from the blockchain is the same as the updated hash value, determining that the entry to be authenticated originates from the corpus.
It should be noted that, in the embodiment of the present invention, the hash value of the corpus is uploaded to the blockchain after the corpus is created. During authentication, the hash value of the corpus is read from the blockchain and compared with the updated hash value; if the two are the same, the entry to be authenticated originates from the corpus.
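Steps S101 to S103 can be sketched end to end as follows. This is a deliberately naive illustration under stated assumptions: the blockchain is replaced by an in-memory dictionary, and the corpus hash is a flat SHA-256 over the sorted entries rather than the Trie-based root hash described later in the text. The names `corpus_hash`, `chain`, `publish` and `authenticate` are hypothetical.

```python
import hashlib

def corpus_hash(corpus: dict) -> str:
    """Hash all (original text, explanation) pairs of the corpus in sorted order."""
    m = hashlib.sha256()
    for original, explanation in sorted(corpus.items()):
        for part in (original, explanation):
            data = part.encode("utf-8")
            m.update(len(data).to_bytes(4, "big") + data)
    return m.hexdigest()

# Hypothetical stand-in for a blockchain: corpus id -> published hash value.
chain = {}

def publish(corpus_id: str, corpus: dict) -> None:
    chain[corpus_id] = corpus_hash(corpus)

def authenticate(corpus_id: str, corpus: dict, original: str, explanation: str) -> bool:
    if original not in corpus:
        return False                      # no entry shares this original text
    updated = dict(corpus)
    updated[original] = explanation       # S102: replace the matching entry
    return corpus_hash(updated) == chain[corpus_id]   # S103: compare with chain

corpus = {"hello": "greeting", "bye": "farewell"}
publish("c1", corpus)
print(authenticate("c1", corpus, "hello", "greeting"))     # True
print(authenticate("c1", corpus, "hello", "salutation"))   # False
```

The sketch requires the verifier to hold the whole corpus; the Trie-based scheme in the text avoids this by recomputing the root from a small set of stored node hashes.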
In this corpus authentication method, all entries of the whole corpus are used to attest a small number of entries to be authenticated, which improves the reliability of the authentication process. Because a blockchain cannot be tampered with, the hash value of a previously constructed corpus can be stored stably and remain credible over the long term, and the time at which the hash value was placed on the chain can serve as proof of the time of ownership of the corpus. Finally, only the hash value of the whole corpus is stored on the chain, which greatly reduces storage consumption compared with uploading the hash value of every entry in the corpus.
When computing the hash value of the corpus, an embodiment of the invention may convert the corpus into a tree structure. One option is the Merkle tree (also called a hash tree), which originated as a solution to the authentication problem in multiple one-time signatures. A Merkle tree can reduce a set of any size to a single hash value that serves as a fingerprint of the data: the hash values of the individual data items (the entries) of the set (the corpus) are combined in a binary tree until a single hash value remains, and any value in the set can be proven to belong to the set by holding only a verifiable branch (the adjacent hashes along the path).
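For readers unfamiliar with the "verifiable branch" idea, the following sketch builds a small binary Merkle tree and checks a membership proof. It is an illustration only: SHA-256 and the convention of pairing an unpaired last node with itself are assumptions (several conventions exist), and the names `merkle_levels`, `merkle_proof` and `verify` are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(leaves):
    """Return all levels of the tree, from leaf hashes up to the root."""
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]   # odd count: pair the last hash with itself
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def merkle_proof(levels, index):
    """Collect the adjacent hash at each level on the path to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index               # node was paired with itself
        proof.append((level[sibling], index % 2 == 0))  # (hash, sibling is on the right)
        index //= 2
    return proof

def verify(leaf, proof, root):
    acc = h(leaf)
    for sibling, sibling_is_right in proof:
        acc = h(acc + sibling) if sibling_is_right else h(sibling + acc)
    return acc == root

entries = [b"entry-1", b"entry-2", b"entry-3"]
levels = merkle_levels(entries)
root = levels[-1][0]
proof = merkle_proof(levels, 2)
print(verify(b"entry-3", proof, root))    # True
print(verify(b"forged", proof, root))     # False
```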
However, although a Merkle tree can provide a proof of existence, with each entry corresponding to the hash value of one leaf node, a large corpus yields a very large binary tree, which makes it inefficient to locate the node of a given entry. In theory, different entries may also produce the same hash value. In addition, the structure of the binary tree is unrelated to the semantics of the entries.
Therefore, on the basis of the above embodiment, and to overcome these problems of the Merkle tree (low query efficiency; the possibility that different entries produce the same hash value, that is, low query accuracy; and the limited information carried by a hash value because the tree is unrelated to the entries), updating the hash value of the corpus used for authentication according to the entry to be authenticated includes:
acquiring the double-array Trie of the corpus;
updating the node hash value of the root node of the double-array Trie according to the entry to be authenticated;
wherein the original text of each entry in the corpus is represented by a path of nodes in the double-array Trie, each node represents one splitting unit of the original text, and the order of the nodes along the path corresponds to the order of the splitting units in the original text.
It should be noted that a double-array Trie is a Trie with low space complexity that is commonly used to build word-segmentation dictionaries. It combines the fast access of arrays with the compactness of linked storage. A double-array Trie also supports prefix lookup, that is, it can determine whether the tree contains other words that have a given word as a prefix. By constructing the double-array Trie of the corpus, the embodiment can therefore retrieve entries quickly. Thanks to the prefix-search property, the hash of an individual node can be replaced easily, so the corpus can be updated while only a small part of the double-array Trie that provides the proof of existence needs to be recomputed. Finally, a corpus can be uniquely verified by a single hash value. The creator of the corpus can store this hash value on the blockchain to make it public, and the decentralized and tamper-proof nature of the blockchain guarantees the existence and the ownership of the corpus.
In the embodiment of the present invention, the splitting unit may be a byte, a character, or a word. Take the original text "I love eating hot dry noodles" (我爱吃热干面) as an example. When the splitting unit is a word, the text is split into "I", "love" and "hot dry noodles". When the splitting unit is a character, it is split into "I", "love", "eat", "hot", "dry" and "face". When the splitting unit is a byte, a Chinese character is encoded into different bytes depending on the encoding; if the text is split using UTF-8 (Universal Character Set/Unicode Transformation Format), the split result is written as &#x6211;, &#x7231;, &#x5403;, &#x70ED;, &#x5E72;, &#x9762; (the hexadecimal code points of the six characters), and after dropping the common "&#x" prefix the sequence can be written as "62117231540370ED5E729762". As this example shows, the smaller the splitting unit, the more nodes correspond to one entry. It can be understood that in the double-array Trie there is a path between any child node and the root node, and the sequence of all nodes on that path represents an entry.
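The effect of the splitting unit on node count can be checked with a short sketch. It reproduces the character split and the code-point sequence from the example above, and it also shows the actual UTF-8 byte split for comparison; note that the hexadecimal values quoted in the text are Unicode code points, whereas under UTF-8 each of these six characters encodes to three bytes. The function names are hypothetical, and word-level splitting is omitted because it requires a segmenter.

```python
text = "我爱吃热干面"  # the example text, recovered from the code points quoted above

def split_chars(s: str) -> list:
    """Character-level split: one Trie node per character."""
    return list(s)

def split_bytes(s: str, encoding: str = "utf-8") -> list:
    """Byte-level split: one Trie node per encoded byte."""
    return list(s.encode(encoding))

chars = split_chars(text)
code_points = "".join(f"{ord(c):04X}" for c in chars)
utf8_bytes = split_bytes(text)

print(chars)                         # 6 character nodes
print(code_points)                   # 62117231540370ED5E729762
print(len(chars), len(utf8_bytes))   # 6 18
```

A byte-level Trie for this text therefore has a path three times as long as the character-level Trie, which matches the observation that a smaller splitting unit yields more nodes per entry.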
FIG. 2 is a schematic structural diagram of a double-array Trie according to an embodiment of the present invention. As shown in FIG. 2, the entries in the double-array Trie use a character as the splitting unit. The tree has 5 branches in total, and these five branches contain at least 5 entries: "o", "ning abuse lack", "ning strong", "prefer" and "Egypt". Note that within a branch of the double-array Trie in FIG. 2, the longest path is not necessarily the only entry; for example, the prefixes "ning abuse" and "ning" may also be entries. The double-array Trie shown in FIG. 2 is constructed as follows. First, the first character of the character sequence of each entry is made a child node of the root node; entries whose first characters are the same share one child node, which gives 3 child nodes: "o", "ning" and "Angstrom". Then the second character of each entry is made a child node of the preceding node; entries whose second characters are the same share one child node; and so on, until the last character of every entry has been placed. It will be appreciated that if different entries contain the same character but their character sequences differ, that character is represented by two separate nodes in the double-array Trie. For example, the two entries "I love China" and "what is eaten at noon" both contain the character rendered here as "middle", so the double-array Trie built from these two entries contains two nodes that both represent that character.
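The construction rule described for FIG. 2 can be sketched with a plain pointer-based trie. This is an assumption-laden illustration: a real double-array Trie stores the same transitions in two integer arrays, which is not shown here, and the ASCII sample entries are invented stand-ins for the figure's entries. The names `Node`, `build_trie` and `count_nodes` are hypothetical.

```python
class Node:
    def __init__(self, unit):
        self.unit = unit          # splitting unit represented by this node
        self.children = {}        # unit -> Node
        self.is_entry = False     # True if the root-to-node path is itself an entry

def build_trie(texts):
    root = Node(None)
    for text in texts:
        node = root
        for unit in text:         # character-level split
            if unit not in node.children:
                node.children[unit] = Node(unit)   # new branch
            node = node.children[unit]             # shared prefix reuses the node
        node.is_entry = True
    return root

def count_nodes(node, unit):
    """Count how many nodes in the subtree represent the given unit."""
    own = 1 if node.unit == unit else 0
    return own + sum(count_nodes(c, unit) for c in node.children.values())

# Sample entries (invented): "ab" is both an entry and a prefix of "abc" and "abd".
root = build_trie(["ab", "abc", "abd", "ca"])
print(sorted(root.children))      # ['a', 'c']: shared first characters merge
print(count_nodes(root, "a"))     # 2: the same character under different prefixes
```

The second printed value mirrors the remark above about a character shared by two entries with different sequences: it occupies two distinct nodes because its prefixes differ.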
Because the double-array Trie of the corpus is constructed only from the original texts of the entries, the double-array Trie itself does not need to be updated during authentication; only its hash values need to be updated, which improves authentication efficiency.
In the embodiment of the invention, the node hash value of a node is obtained from the entry represented by the path between the node and the root node and from the node hash values of the node's child nodes. The node hash value therefore reflects both the semantic information of the entry and the position of the node in the double-array Trie, which makes the node hash value more distinctive. In addition, the authentication of a single entry is verified against the whole Trie, which strengthens the reliability of the verification result.
On the basis of the foregoing embodiments, updating the node hash value of the root node of the double-array Trie according to the entry to be authenticated specifically includes:
S201, determining, as the target node, the deepest node in the double-array Trie that belongs to the original text of the entry to be authenticated; taking each node on the path from the target node to the root node as a first associated node, and taking each sibling of a first associated node as a second associated node.
Specifically, taking the double-array Trie shown in FIG. 2 as an example, if the entry to be authenticated is "preferable", then "preferable" is taken as the target node, the first associated node is "ning", and the second associated nodes of "ning" are "o" and "a".
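Step S201 can be sketched as a lookup over a trie with parent pointers. The sketch is hypothetical throughout: the trie is pointer-based rather than double-array, the sample entries are ASCII stand-ins shaped loosely like FIG. 2, and the root node itself is excluded from the first associated nodes, as in the example above. The names `Node`, `build_trie` and `locate` are assumptions.

```python
class Node:
    def __init__(self, unit=None, parent=None):
        self.unit = unit
        self.parent = parent
        self.children = {}        # unit -> Node, in insertion order

def build_trie(texts):
    root = Node()
    for text in texts:
        node = root
        for unit in text:
            if unit not in node.children:
                node.children[unit] = Node(unit, node)
            node = node.children[unit]
    return root

def locate(root, text):
    """Return (target, first_associated, second_associated) for an original text."""
    if not text:
        raise ValueError("empty original text")
    node = root
    for unit in text:
        node = node.children[unit]      # KeyError if the text is not in the trie
    target = node                       # deepest node of the original text
    first = []
    node = target.parent
    while node.parent is not None:      # walk up, stopping before the root
        first.append(node)
        node = node.parent
    # Siblings of each first associated node, aligned with `first`.
    second = [[s.unit for s in n.parent.children.values() if s is not n]
              for n in first]
    return target, first, second

root = build_trie(["o", "nab", "ns", "nw", "eg"])   # invented sample entries
target, first, second = locate(root, "nw")
print(target.unit, [n.unit for n in first], second)  # w ['n'] [['o', 'e']]
```

In practice the siblings of the target node are needed as well, because the iteration described later hashes the current node together with its siblings to obtain the parent's second hash value.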
S202, acquiring the node hash values of the child nodes of the target node as determined when the hash value of the corpus was uploaded to the blockchain, and updating the node hash value of the target node according to the entry to be authenticated and the node hash values of the child nodes of the target node.
Specifically, updating the node hash value of the target node according to the entry to be authenticated and the node hash values of the child nodes of the target node further includes:
S301, performing a hash operation on the original text and the explanation of the entry to be authenticated to obtain the first hash value of the target node;
S302, performing a hash operation on the node hash values of the child nodes of the target node to obtain the second hash value of the target node;
S303, performing a hash operation on the first hash value and the second hash value of the target node to obtain the node hash value of the target node.
Taking the double-array Trie shown in FIG. 2 as an example, for the target node "ning", the child nodes are "small", "willing", and "abusing". The node hash values of these three child nodes can be stored locally by the creator of the corpus when the double-array Trie is created and the node hash values are generated, and provided to the executing entity of the embodiment when the node hash value for the entry to be authenticated is calculated. The executing entity should therefore be a third party other than the creator of the corpus and the creator of the entry to be authenticated, which ensures the fairness of the authentication.
In the embodiment of the invention, the second hash value of the target node is computed from the node hash values of all child nodes of the target node. It can be understood that the second hash value of each parent node depends only on the node hash values of its own child nodes and not on any other node, which reduces the amount of computation during authentication and further improves authentication efficiency.
The specific formulas for performing the hash operation in steps S301 to S303 in the embodiment of the present invention may be the same or different, and this is not further limited in the embodiment of the present invention.
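Steps S301 to S303 can be sketched with pluggable hash functions, reflecting the statement that the three operations may use the same or different formulas. Everything specific here is an assumption for illustration: the `hashlib` algorithms, the zero-byte separator between original text and explanation, the simple concatenation of child hashes, and the name `target_node_hash`.

```python
import hashlib

def target_node_hash(original, explanation, child_node_hashes,
                     hash1=hashlib.sha256, hash2=hashlib.sha256,
                     hash3=hashlib.sha256):
    # S301: first hash value from the original text and the explanation.
    first = hash1(original.encode("utf-8") + b"\x00" +
                  explanation.encode("utf-8")).digest()
    # S302: second hash value from the child node hash values.
    second = hash2(b"".join(child_node_hashes)).digest()
    # S303: node hash value from the first and second hash values.
    return hash3(first + second).digest()

children = [hashlib.sha256(b"child-1").digest(),
            hashlib.sha256(b"child-2").digest()]

same_algo = target_node_hash("text", "explanation", children)
mixed_algo = target_node_hash("text", "explanation", children,
                              hash1=hashlib.sha3_256, hash3=hashlib.blake2b)
print(len(same_algo), len(mixed_algo))   # 32 64
```

Whatever combination is chosen, the verifier must use the same one that was used when the corpus hash was generated, otherwise the recomputed root cannot match the on-chain value.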
S203, obtaining the first hash value of the first associated node and the node hash value of the second associated node which are determined when the hash values of the corpus are uploaded to the blockchain.
It is understood that although the corpus owner of the embodiment uploads only the hash values of the corpus to the blockchain, the hash values of all nodes (including the first hash value, the second hash value and the node hash value) obtained in the process of calculating the corpus hash value are stored locally. During authentication, the owner needs to send the first hash value of the first associated node and the node hash value of the second associated node to the execution main body according to the embodiment of the present invention. Wherein, the first hash value is obtained according to the corpus represented by the path formed from the node to the root node
S204, updating the node hash value of the root node of the double-array Trie tree according to the node hash value of the target node, the first hash value of the first associated node and the node hash value of the second associated node.
It should be noted that, in the embodiment of the present invention, updating the node hash value of the root node of the double-array Trie tree according to the corpus to be authenticated does not require the node hash values of all the nodes in the tree. Only the hash values of a subset of nodes are needed, namely those of the first associated nodes and the second associated nodes related to the target node, so that the authentication efficiency is greatly improved.
On the basis of the foregoing embodiments, as an optional embodiment, the updating the node hash value of the root node of the double-array Trie according to the node hash value of the target node, the first hash value of the first associated node, and the node hash value of the second associated node specifically includes:
performing the following iterative process starting from the target node:
performing a hash operation according to the node hash value of the current iteration node and the node hash values of the second associated nodes of the current iteration node to obtain a second hash value of the parent node of the current iteration node, wherein the current iteration node is the target node or a first associated node;
if the parent node is not the root node, performing a hash operation according to the first hash value of the parent node and the second hash value of the parent node to obtain a node hash value of the parent node, and taking the parent node as the node of the next iteration;
and if the parent node is the root node, taking the second hash value of the parent node as the node hash value of the root node.
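The iterative process above can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 with length-prefixed inputs stands in for the unspecified hash operation, and the `Step` record (sibling hashes before and after the current node, plus the parent's first hash value) is a hypothetical layout for the data the corpus owner sends to the verifier.

```python
import hashlib
from typing import List, NamedTuple, Optional


def h(*parts: bytes) -> bytes:
    """Hash length-prefixed parts (SHA-256 is an assumption, not from the source)."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.digest()


class Step(NamedTuple):
    """Owner-supplied data for one level of the path (hypothetical layout)."""
    siblings_before: List[bytes]         # node hashes of siblings ordered before the current node
    siblings_after: List[bytes]          # node hashes of siblings ordered after the current node
    parent_first_hash: Optional[bytes]   # None when the parent is the root node


def update_root_hash(target_node_hash: bytes, steps: List[Step]) -> bytes:
    """Walk from the target node to the root, recomputing one parent per step."""
    current = target_node_hash
    for step in steps:
        child_hashes = [*step.siblings_before, current, *step.siblings_after]
        parent_second = h(*child_hashes)
        if step.parent_first_hash is None:
            # The parent is the root: its second hash value is its node hash value.
            return parent_second
        current = h(step.parent_first_hash, parent_second)
    raise ValueError("the path never reached the root node")
```

The verifier touches one parent per level, so the work grows with the depth of the target node and the number of siblings along its path rather than with the size of the whole tree.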
On the basis of the above embodiments, as an optional embodiment, the splitting unit is a byte. It can be seen from the above embodiments that the smaller the splitting unit is, the more nodes and branches the double-array Trie tree has, so that the node hash values can be calculated more accurately.
The foregoing embodiments describe the corpus authentication process. It can be understood that, on the basis of the foregoing embodiments, the embodiments of the present invention further include a process of confirming the rights to the corpus, which specifically includes:
constructing a double-array Trie tree of the corpus;
performing the following iterative process starting from a leaf node of the double-array Trie tree:
for the node of the current iteration, if the node is a leaf node, performing a hash operation according to the corpus original text and the corpus explanation of the corpus represented by the path formed from the node to the root node to obtain a first hash value of the node, taking the first hash value of the node as the node hash value of the node, and taking the parent node of the node as the node of the next iteration;
if the node is neither a leaf node nor the root node, performing a hash operation according to the corpus original text and the corpus explanation of the corpus represented by the path formed from the node to the root node to obtain a first hash value of the node, and performing a hash operation according to the node hash values of the child nodes of the node to obtain a second hash value of the node; performing a hash operation according to the first hash value and the second hash value of the node to obtain a node hash value of the node, and taking the parent node of the node as the node of the next iteration;
and if the node is the root node, performing a hash operation according to the node hash values of the child nodes of the node to obtain a second hash value of the node, taking the second hash value of the node as the node hash value of the node and also as the hash value of the corpus, and ending the iterative process.
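The construction above can be sketched in Python under several assumptions that are not part of the source: a plain dictionary-based trie stands in for the double-array Trie tree, SHA-256 stands in for the unspecified hash operation, nodes at which no corpus ends are given an empty explanation, and a recursive post-order traversal replaces the leaf-to-root iteration (the resulting hashes are the same). The names `Node`, `build_trie` and `compute_hashes` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


def h(*parts: bytes) -> bytes:
    """Hash length-prefixed parts (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.digest()


@dataclass
class Node:
    children: Dict[int, "Node"] = field(default_factory=dict)  # byte value -> child
    explanation: str = ""    # empty when no corpus ends at this node (assumption)
    node_hash: bytes = b""


def build_trie(corpus: Dict[str, str]) -> Node:
    """Insert every (original text, explanation) pair, one byte per node."""
    root = Node()
    for original, explanation in corpus.items():
        node = root
        for byte in original.encode("utf-8"):
            node = node.children.setdefault(byte, Node())
        node.explanation = explanation
    return root


def compute_hashes(node: Node, prefix: bytes = b"", is_root: bool = True) -> bytes:
    """Compute node hash values bottom-up and return the hash of `node`."""
    child_hashes = [
        compute_hashes(node.children[b], prefix + bytes([b]), False)
        for b in sorted(node.children)  # fixed child order, so the result is reproducible
    ]
    second = h(*child_hashes)
    if is_root:
        node.node_hash = second              # root: second hash is the corpus hash
    else:
        first = h(prefix, node.explanation.encode("utf-8"))
        # Leaf: node hash is the first hash; inner node: hash of first and second.
        node.node_hash = h(first, second) if child_hashes else first
    return node.node_hash
```

For example, `compute_hashes(build_trie({"ab": "x", "ac": "y"}))` returns a 32-byte value that would play the role of the corpus hash value uploaded to the blockchain. The recursion depth equals the byte length of the longest corpus original text, so very long entries would call for an iterative traversal.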
The embodiment of the invention is now explained for the process of inserting a new corpus into the corpus, which is also a process of updating the node hash values of the corpus. As can be seen from the foregoing embodiments, when updating the node hash values, the embodiment of the present invention does not need to start the calculation from all leaf nodes of the double-array Trie tree. The corresponding node is first found in the double-array Trie tree through the byte sequence of the corpus (taking a byte as the splitting unit as an example). A hash operation is then performed on the corpus (including the corpus original text and the corpus explanation) to obtain a first hash value, which is stored in the node. Since the first hash value of the node is updated, the node hash value of the node also changes and is recalculated. The double-array Trie tree records the address of the parent node of each node, so the parent node of the node can be found. All child nodes of the parent node (namely, the node corresponding to the corpus to be inserted and its sibling nodes) are found through the prefix search of the double-array Trie tree. A hash operation is performed on the node hash values of all these child nodes, and the result is stored as the updated second hash value of the parent node. Because the second hash value of the parent node changes, the node hash value of the parent node also changes. The parent node of the parent node is then found and updated in the same way, and so on until the root node of the double-array Trie tree has been updated, which completes the insertion. A single insertion therefore only needs to update the nodes on the path between the node corresponding to the inserted corpus and the root node.
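The insertion flow described above can be sketched as follows. As before, this is a sketch under assumptions rather than the implementation of the embodiment: a dictionary-based trie with parent pointers stands in for the double-array Trie tree, SHA-256 stands in for the hash operation, and the names `Node`, `refresh` and `insert` are hypothetical.

```python
import hashlib


def h(*parts: bytes) -> bytes:
    """Hash length-prefixed parts (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.digest()


class Node:
    def __init__(self, parent=None, prefix=b""):
        self.parent = parent      # parent pointer, as the double-array Trie tree records
        self.prefix = prefix      # bytes on the path from the root to this node
        self.children = {}        # byte value -> Node
        self.explanation = ""     # empty when no corpus ends here (assumption)
        self.node_hash = b""


def refresh(node: Node) -> None:
    """Recompute one node's hash from its own corpus and its children's stored hashes."""
    first = h(node.prefix, node.explanation.encode("utf-8"))
    second = h(*(node.children[b].node_hash for b in sorted(node.children)))
    if node.parent is None:
        node.node_hash = second               # root node
    elif not node.children:
        node.node_hash = first                # leaf node
    else:
        node.node_hash = h(first, second)     # inner node


def insert(root: Node, original: str, explanation: str) -> int:
    """Insert or update one corpus; return how many nodes were rehashed."""
    node = root
    for byte in original.encode("utf-8"):
        if byte not in node.children:
            node.children[byte] = Node(node, node.prefix + bytes([byte]))
        node = node.children[byte]
    node.explanation = explanation
    touched = 0
    while node is not None:       # only the path from the node up to the root
        refresh(node)
        touched += 1
        node = node.parent
    return touched
```

Calling `insert` again with an existing original text and a new explanation follows exactly the same path, which mirrors the statement that updating a corpus uses the same flow as inserting one.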
In the embodiment of the invention, the flow for updating a corpus is the same as that for inserting a corpus. Therefore, by using the double-array Trie tree, the invention avoids the drawback of the traditional Merkle tree, which is only suitable for building a tree over fixed data, and supports updating of the corpus.
Because the second hash value of the root node of the double-array Trie tree of a corpus uniquely represents the corpus, the existence of a given corpus entry in the corpus can be verified reliably as long as the node hash value of the root node of the corpus is available. After the creation of a corpus is completed, the hash value of the corpus is submitted to the blockchain for disclosure. The tamper-proof record on the blockchain identifies the submitter as the holder of the rights to the corpus.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Fig. 3 is a schematic structural diagram of a corpus authentication apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus includes: a corpus to be authenticated obtaining module 201, a hash value updating module 202, and a comparing module 203, wherein:
a to-be-authenticated corpus obtaining module 201, configured to obtain a to-be-authenticated corpus;
a hash value updating module 202, configured to update a hash value of the corpus used for authentication according to the corpus to be authenticated, so as to obtain an updated hash value;
a comparing module 203, configured to determine a hash value of the corpus from a blockchain, and if the hash value determined from the blockchain is the same as the updated hash value, determine that the corpus to be authenticated originates from the corpus.
The corpus authentication device provided in the embodiment of the present invention specifically executes the flows of the above method embodiments; for details, please refer to the contents of the above corpus authentication method embodiments, which are not repeated here. The invention uses all corpora of the whole corpus to authenticate a small number of corpora to be authenticated, which improves the reliability of the authentication process. It also makes use of the tamper-proof property of the blockchain, so that the hash value of a historically constructed corpus can be stored stably and remain credible for a long time, and the time at which the hash value was recorded on the blockchain can serve as proof of the time at which the rights to the corpus were established. Finally, only the hash value of the whole corpus is stored on the chain, which greatly reduces the storage space occupied compared with uploading the hash value of each corpus in the corpus.
Fig. 4 is a schematic entity structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 4, the electronic device may include: a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may call a computer program stored in the memory 330 and operable on the processor 310 to execute the corpus authentication method provided by the above embodiments, for example, including: obtaining a corpus to be authenticated; updating the hash value of the corpus used for authentication according to the corpus to be authenticated to obtain an updated hash value; and determining a hash value of the corpus from the blockchain, and if the hash value determined from the blockchain is the same as the updated hash value, determining that the corpus to be authenticated originates from the corpus.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the corpus authentication method provided in the foregoing embodiments, for example, including: obtaining a corpus to be authenticated; updating the hash value of the corpus used for authentication according to the corpus to be authenticated to obtain an updated hash value; and determining a hash value of the corpus from the blockchain, and if the hash value determined from the blockchain is the same as the updated hash value, determining that the corpus to be authenticated originates from the corpus.
In addition, the logic instructions in the memory may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other media capable of storing program code.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A corpus authentication method, comprising:
obtaining a corpus to be authenticated;
updating the hash value of the corpus used for authentication according to the corpus to be authenticated to obtain an updated hash value;
and determining a hash value of the corpus from the blockchain, and if the hash value determined from the blockchain is the same as the updated hash value, determining that the corpus to be authenticated originates from the corpus.
2. The corpus authentication method according to claim 1, wherein the updating the hash value of the corpus for authentication according to the corpus to be authenticated comprises:
acquiring a double-array Trie tree of the corpus;
updating the node hash value of the root node of the double-array Trie tree according to the corpus to be authenticated;
wherein the corpus original text of an original corpus in the corpus is represented by a path formed by nodes in the double-array Trie tree, each node represents a splitting unit of the corpus original text, and the sequence of the nodes in the path corresponds to the sequence, in the corpus original text, of the splitting units represented by the nodes;
and the node hash value of a node is obtained according to the corpus represented by the path formed from the node to the root node and the node hash values of the child nodes of the node.
3. The corpus authentication method according to claim 2, wherein the updating of the node hash value of the root node of the double-array Trie tree according to the corpus to be authenticated specifically comprises:
determining, as a target node, the node with the maximum depth in the double-array Trie tree for the corpus original text of the corpus to be authenticated; taking a node on the path from the target node to the root node as a first associated node, and taking a sibling node of the first associated node as a second associated node;
acquiring the node hash values of the child nodes of the target node which were determined when the hash value of the corpus was uploaded to the blockchain, and updating the node hash value of the target node according to the corpus to be authenticated and the node hash values of the child nodes of the target node;
acquiring the first hash value of the first associated node and the node hash value of the second associated node which were determined when the hash value of the corpus was uploaded to the blockchain;
updating the node hash value of the root node of the double-array Trie tree according to the node hash value of the target node, the first hash value of the first associated node and the node hash value of the second associated node;
wherein the first hash value is obtained according to the corpus represented by the path formed from the node to the root node.
4. The corpus authentication method according to claim 3, wherein the updating the node hash value of the target node according to the corpus to be authenticated and the node hash values of the child nodes of the target node comprises:
performing hash operation according to the corpus original text and the corpus explanation of the corpus to be authenticated to obtain a first hash value of the target node;
performing hash operation according to the node hash values of the child nodes of the target node to obtain a second hash value of the target node;
and carrying out hash operation according to the first hash value and the second hash value of the target node to obtain the node hash value of the target node.
5. The corpus authentication method according to claim 3, wherein the node hash value of the root node of the double-array Trie tree is updated according to the node hash value of the target node, the first hash value of the first associated node, and the node hash value of the second associated node, specifically by performing the following iterative process starting from the target node:
performing a hash operation according to the node hash value of the current iteration node and the node hash values of the second associated nodes of the current iteration node to obtain a second hash value of the parent node of the current iteration node, wherein the current iteration node is the target node or a first associated node;
if the parent node is not the root node, performing a hash operation according to the first hash value of the parent node and the second hash value of the parent node to obtain a node hash value of the parent node, and taking the parent node as the node of the next iteration;
and if the parent node is the root node, taking the second hash value of the parent node as the node hash value of the root node.
6. The corpus authentication method according to any one of claims 2 to 5, wherein before the obtaining the corpus to be authenticated, the method further comprises:
generating a hash value of the corpus, and uploading the hash value of the corpus to the blockchain;
correspondingly, the generating the hash value of the corpus specifically includes:
constructing a double-array Trie tree of the corpus;
performing the following iterative process starting from a leaf node of the double-array Trie tree:
for the node of the current iteration, if the node is a leaf node, performing a hash operation according to the corpus original text and the corpus explanation of the corpus represented by the path formed from the node to the root node to obtain a first hash value of the node, taking the first hash value of the node as the node hash value of the node, and taking the parent node of the node as the node of the next iteration;
if the node is neither a leaf node nor the root node, performing a hash operation according to the corpus original text and the corpus explanation of the corpus represented by the path formed from the node to the root node to obtain a first hash value of the node, and performing a hash operation according to the node hash values of the child nodes of the node to obtain a second hash value of the node; performing a hash operation according to the first hash value and the second hash value of the node to obtain a node hash value of the node, and taking the parent node of the node as the node of the next iteration;
and if the node is the root node, performing a hash operation according to the node hash values of the child nodes of the node to obtain a second hash value of the node, taking the second hash value of the node as the node hash value of the node and also as the hash value of the corpus, and ending the iterative process.
7. The corpus authentication method according to claim 2, wherein the splitting unit is a byte.
8. A corpus authentication device, comprising:
a to-be-authenticated corpus obtaining module, configured to obtain a corpus to be authenticated;
a hash value updating module, configured to update the hash value of the corpus used for authentication according to the corpus to be authenticated to obtain an updated hash value;
and a comparison module, configured to determine a hash value of the corpus from the blockchain, and if the hash value determined from the blockchain is the same as the updated hash value, determine that the corpus to be authenticated originates from the corpus.
9. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the corpus authentication method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the corpus authentication method according to any one of claims 1 to 7.
CN202010143671.9A 2020-03-04 2020-03-04 Corpus authentication method and device Active CN111414648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010143671.9A CN111414648B (en) 2020-03-04 2020-03-04 Corpus authentication method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010143671.9A CN111414648B (en) 2020-03-04 2020-03-04 Corpus authentication method and device

Publications (2)

Publication Number Publication Date
CN111414648A true CN111414648A (en) 2020-07-14
CN111414648B CN111414648B (en) 2023-05-12

Family

ID=71491177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010143671.9A Active CN111414648B (en) 2020-03-04 2020-03-04 Corpus authentication method and device

Country Status (1)

Country Link
CN (1) CN111414648B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008126A (en) * 2014-03-31 2014-08-27 北京奇虎科技有限公司 Method and device for segmentation on basis of webpage content classification
CN107102981A (en) * 2016-02-19 2017-08-29 腾讯科技(深圳)有限公司 Term vector generation method and device
CN107239549A (en) * 2017-06-07 2017-10-10 传神语联网网络科技股份有限公司 Method, device and the terminal of database terminology retrieval
JP2018081611A (en) * 2016-11-18 2018-05-24 日本電信電話株式会社 Dictionary search method, device, and program
CN108197120A (en) * 2017-12-28 2018-06-22 中译语通科技(青岛)有限公司 A kind of similar sentence machining system based on bilingual teaching mode
CN109460501A (en) * 2018-11-15 2019-03-12 成都傅立叶电子科技有限公司 A kind of global search Battle Assistant Decision-making system and method
CN109684439A (en) * 2018-12-28 2019-04-26 语联网(武汉)信息技术有限公司 The method and device of prefix index is carried out during participle
CN110032545A (en) * 2019-03-27 2019-07-19 远光软件股份有限公司 File memory method, system and electronic equipment based on block chain
CN110602239A (en) * 2019-09-20 2019-12-20 腾讯科技(深圳)有限公司 Block chain information storage method and related equipment
US20200042508A1 (en) * 2018-08-06 2020-02-06 Walmart Apollo, Llc Artificial intelligence system and method for auto-naming customer tree nodes in a data structure

Also Published As

Publication number Publication date
CN111414648B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN106776544B (en) Character relation recognition method and device and word segmentation method
US20190251165A1 (en) Conversational agent
US9195738B2 (en) Tokenization platform
US6470347B1 (en) Method, system, program, and data structure for a dense array storing character strings
US9129007B2 (en) Indexing and querying hash sequence matrices
US10104021B2 (en) Electronic mail data modeling for efficient indexing
US11205041B2 (en) Web element rediscovery system and method
US20110078562A1 (en) Method and system for tracking authorship of content in data
US10248680B2 (en) Index management
US11397855B2 (en) Data standardization rules generation
WO2016144367A1 (en) Database records associated with a trie
US20090234852A1 (en) Sub-linear approximate string match
CN113177407A (en) Data dictionary construction method and device, computer equipment and storage medium
US10248813B2 (en) Organizing key-value information sets into hierarchical representations for efficient signature computation given change information
CN111414648B (en) Corpus authentication method and device
CN111581344A (en) Interface information auditing method and device, computer equipment and storage medium
CN110795617A (en) Error correction method and related device for search terms
KR20210099661A (en) Method and apparatus for generating annotated natural language phrases
US20150154198A1 (en) Method for in-loop human validation of disambiguated features
Mendonça et al. Onception: Active learning with expert advice for real world machine translation
CN113378544A (en) Text analysis method, text data acquisition method, device, medium and equipment
US20140081986A1 (en) Computing device and method for generating sequence indexes for data files
CN116992883B (en) Entity alignment processing method and device
CN117113436B (en) Block chain-based data credibility right-confirming method and device
CN112800185B (en) Method and device for generating and matching text of interface node in mobile terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant