CN117290889B - Safe storage method for realizing electronic labor contract based on blockchain - Google Patents


Info

Publication number
CN117290889B
CN117290889B (application CN202311576468.0A)
Authority
CN
China
Prior art keywords
data
contract
sensitivity
content
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311576468.0A
Other languages
Chinese (zh)
Other versions
CN117290889A
Inventor
尹清波 (Yin Qingbo)
刘骞 (Liu Qian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Ink It Co ltd
Original Assignee
Guangzhou Ink It Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Ink It Co ltd
Priority to CN202311576468.0A
Publication of CN117290889A
Application granted
Publication of CN117290889B
Active legal status (current)
Anticipated expiration legal status


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6227 Protecting access to data where protection concerns the structure of data, e.g. records, types, queries
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/10 Services
    • G06Q 50/18 Legal services
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a blockchain-based method for securely storing electronic labor contracts, comprising the following steps: acquiring an electronic contract, analyzing its content to obtain the basic characteristics and sensitivity of the data, and marking the sensitivity level of each paragraph against a sensitivity standard; selecting a data storage location according to each content's sensitivity mark, and distributing sensitive information to an intranet server for storage; processing the contract content with a data-splitting strategy, classifying each part of the contract and obtaining its category; judging, from the confidentiality degree and data volume of each category, whether the data should be stored on the main chain, on a side chain, or internally; determining the data relationships between main-chain and side-chain storage, and establishing a data-flow path for data that must move between the two; and, after each data interaction, analyzing the integrity and accuracy of the contract data and correcting incorrect content through smart contracts.

Description

Safe storage method for realizing electronic labor contract based on blockchain
Technical Field
The invention relates to the technical field of information, in particular to a safe storage method for realizing an electronic labor contract based on a blockchain.
Background
With the widespread use of electronic contracts, secure management of contract content and protection of sensitive data have become important issues. Although data management techniques already exist, several problems remain. First, existing techniques focus mainly on storage-location selection and sensitive-information filtering, while deep analysis of contract content remains limited; as a result, the basic characteristics and sensitivity of the data are not fully understood, and the sensitivity level of contract content cannot be marked precisely according to the actual situation. Second, existing techniques fall short in accounting for the classification of contract content and differences in data volume. Different types of contracts may have different security requirements, yet the prior art cannot determine storage locations according to the category and volume of contract content, and so cannot apply differentiated management and protection to different contracts. In addition, existing techniques have not addressed contract-content interaction between the main chain and side chains: since contractual matters may require data exchange between the two, the prior art cannot guarantee data integrity and accuracy, which carries risk. In short, the prior art neither analyzes contract content deeply nor marks sensitivity levels accurately, and cannot manage and protect contracts differentially by type and data volume. With the popularization of electronic contracts, secure management of contract content and protection of sensitive data are therefore problems to be solved urgently.
Disclosure of Invention
The invention provides a safe storage method for realizing electronic labor contracts based on block chains, which mainly comprises the following steps:
acquiring an electronic contract, analyzing its content to obtain the basic characteristics and sensitivity of the data, and marking the sensitivity level of each paragraph against a sensitivity standard; selecting a data storage location according to each content's sensitivity mark, and distributing sensitive information to an intranet server for storage; processing the contract content with a data-splitting strategy, classifying each part of the contract and obtaining its category; judging, from the confidentiality degree and data volume of each category, whether the data should be stored on the main chain, on a side chain, or internally; determining the data relationships between main-chain and side-chain storage, and establishing a data-flow path for data that must move between the two; after each data interaction, analyzing the integrity and accuracy of the contract data and correcting incorrect content through smart contracts; and judging whether the data-flow path conforms to specification: if the interactively transmitted content conforms to the preset data-tag specification, a corresponding tag is automatically generated in the contract, ensuring that storage of the contract content meets the security requirements.
In one embodiment, the acquiring the electronic contract, analyzing the contract content to obtain the basic characteristics and the sensitivity of the data, and marking the sensitivity level of each paragraph according to the sensitivity standard, including:
acquiring the electronic contract from a contract management system through an API; analyzing the electronic contract text with the TF-IDF algorithm; labeling each paragraph of the contract text, training a classifier on the labeled data, applying the trained classifier, and rating the sensitivity of the analyzed data against the sensitivity standard; dividing the data into three data pools according to the determined sensitivity ratings: high sensitivity, medium sensitivity, and low sensitivity; encrypting high-sensitivity data with the AES encryption algorithm, and digesting medium- and low-sensitivity data with the MD5 hash algorithm (a one-way digest rather than reversible encryption); performing permission management through Active Directory per data pool, so that high-sensitivity data can only be accessed by specific roles; indexing the data with Elasticsearch, classified by data pool; performing batch verification with data-driven testing (DataDrivenTesting) against the data pools and data indexes; further comprises: the classifier is trained and applied using a pre-trained classifier.
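As a minimal sketch of the TF-IDF scoring and threshold-based sensitivity rating described above (the smoothed IDF term is a common convention, and the 0.7/0.5 thresholds follow the worked example later in this description; the function names are illustrative assumptions):

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF weight of `term` in tokenized `doc` against `corpus` (a list of token lists)."""
    tf = doc.count(term) / len(doc)                 # term frequency in the document
    df = sum(1 for d in corpus if term in d)        # document frequency in the corpus
    idf = math.log(len(corpus) / (1 + df)) + 1      # smoothed inverse document frequency
    return tf * idf

def sensitivity_level(weight, hi=0.7, lo=0.5):
    """Map a TF-IDF weight to a sensitivity rating using the example thresholds."""
    if weight > hi:
        return "high"
    if weight >= lo:
        return "medium"
    return "low"
```

Paragraph weights produced by `tf_idf` would then feed `sensitivity_level` to assign each paragraph to the high-, medium-, or low-sensitivity data pool.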
The classifier is trained and applied by using a pre-trained classifier, and specifically comprises the following steps:
electronic labor service contract sample data is obtained, including contracts of different industries, different terms and different formats. And (3) performing operations of removing duplicate data, unifying formats, removing special characters, standardizing dates and numbers on the acquired data. The bag of words model is used to extract useful features from the contract text. Each contract sample is assigned the correct classification label using a rule-based automatic labeling method. A naive Bayesian algorithm is selected, and a classifier model is trained by using the labeled data. Model evaluation was performed using a portion of the data, checking the classifier for accuracy, recall, and F1 score. And adjusting model parameters and increasing training data quantity according to the evaluation result.
In one embodiment, the selecting a data storage location according to the sensitive label of the content, and distributing the sensitive information to the intranet server for storage includes:
content-sensitive tagging is applied to the contract data, marking data that contains predefined keywords or phrases; using the TF-IDF algorithm, the TF-IDF weight of a word is obtained by multiplying its frequency among the sensitive marks in the text (TF) by its inverse document frequency across the whole text set (IDF), and the sensitive marks are ranked as high, medium, or low sensitivity; storage requirements are analyzed, and the sensitivity rating decides which intranet server partition the data should be stored to; the storage state of the intranet server is checked and partition capacity evaluated, particularly for the high-sensitivity partition; storage resources are allocated, and if capacity allows, data is pushed automatically to the corresponding intranet server partition; security policies enable encryption and access control lists for the high-sensitivity partitions; if the high-sensitivity partition reaches its capacity limit, the storage priority of other partitions is adjusted automatically and the data flow is redirected to lower-priority partitions through a load-balancing strategy; periodic data-integrity and security checks are performed via a digital signature algorithm and scheduled security scans, and if inconsistencies are found, re-labeling and re-storage are triggered.
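The partition-allocation and overflow-redirection logic can be sketched as follows; the partition names A/B/C follow the example later in the description, while `allocate` and its capacity bookkeeping are illustrative assumptions:

```python
def allocate(level, used, capacity):
    """Assign one record to an intranet partition (A = high, B = medium, C = low priority).

    `used` and `capacity` are dicts of record counts per partition; `used` is mutated.
    """
    prefer = {"high": "A", "medium": "B", "low": "C"}
    p = prefer[level]
    if used[p] < capacity[p]:
        used[p] += 1
        return p
    # preferred partition full: load-balancing fallback to the next partition with room
    for q in ("A", "B", "C"):
        if used[q] < capacity[q]:
            used[q] += 1
            return q
    raise RuntimeError("all partitions are at capacity")
```

A real system would redirect by free bytes rather than record counts, but the priority-ordered fallback matches the redirection rule described above.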
In one embodiment, the processing the contract content by using the data splitting policy, classifying each part of the content in the contract and obtaining the category of each part includes:
performing word segmentation, part-of-speech tagging, and named-entity recognition of the contract with the jieba segmentation library; performing TF-IDF text analysis on the segmented content to identify keywords and high-frequency phrases; obtaining the original text feature vector with the gensim Doc2Vec technique; applying a data-splitting strategy to divide the feature vector into several data subsets, each containing a certain number of keywords and phrases; classifying the data subsets with a decision tree algorithm, generating pre-classification labels from the keyword and phrase combinations; correcting the pre-classification labels against the original text feature vector through an adaptive learning method to obtain corrected classification labels; applying the decision tree algorithm again with the corrected labels to classify the data subsets finally and generate final category labels; combining the final category labels with the original text feature vector to generate a new feature vector; performing logical judgment on the new feature vectors and screening them against a built-in threshold to determine which need further processing; if further processing is needed, performing data cleaning or transformation on the new feature vector through the decision tree algorithm to obtain the final cleaned feature vector; and integrating all cleaned feature vectors and storing them in a database, obtaining the fully processed contract content.
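A hedged sketch of the data-splitting and pre-classification steps: the subsets are fixed-size groups of tokens, and the keyword-combination rules in `RULES` are hypothetical examples, not categories defined by the patent:

```python
RULES = {  # hypothetical keyword -> category rules for pre-classification
    "wage": "compensation", "salary": "compensation",
    "secret": "confidentiality", "confidential": "confidentiality",
}

def split_subsets(tokens, size):
    """Data-splitting step: divide the token sequence into fixed-size subsets."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def pre_label(subset):
    """Pre-classification label derived from the keyword combination in one subset."""
    hits = [RULES[w] for w in subset if w in RULES]
    # majority vote over rule hits; "general" when no sensitive keyword appears
    return max(set(hits), key=hits.count) if hits else "general"
```

In the described pipeline these pre-labels would then be corrected adaptively against the Doc2Vec feature vector before the final decision tree pass.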
In one embodiment, the determining whether the data should be stored in the main chain, the side chain or the internal storage according to the confidentiality degree and the data amount of each category of the contents in the contract includes:
text analysis of the contract content is carried out with the TF-IDF algorithm, extracting keywords and phrases and performing an initial classification of data categories; once the categories are determined, the confidentiality degree of each category is determined by database query, the data is classified again with a decision tree algorithm, and pre-classification labels are generated from the keyword and phrase combinations; an adaptive learning method corrects the pre-classification labels against the original text feature vector, yielding corrected classification labels; the decision tree algorithm is applied again with the corrected labels to produce final category labels, and the data is added to its respective category; the confidentiality labels are sorted and the sort result used to evaluate each category's priority; the data volume is measured for categories whose confidentiality exceeds a preset confidentiality threshold and whose priority exceeds a priority threshold; whether the data is stored on the main chain is judged from data volume and confidentiality, with hash verification performed on main-chain data; if hash verification passes, the data is written to the main chain and read-write permissions are configured; data with small volume, or confidentiality not above the preset threshold, is judged to be stored on a side chain; side-chain data is encrypted with the AES algorithm; and read-write permissions for the encrypted side-chain data are reconfigured according to its confidentiality degree and data volume.
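The routing rule above (confidentiality degree plus data volume deciding between main chain, side chain, and internal storage) can be sketched as below; the threshold values and the name `choose_store` are illustrative assumptions, since the patent leaves the concrete numbers unspecified:

```python
def choose_store(confidentiality, priority, data_kb,
                 conf_threshold=0.8, prio_threshold=2, size_limit_kb=64):
    """Route one data category to 'main_chain', 'side_chain', or 'internal' storage.

    Thresholds are assumed example values, not figures from the patent.
    """
    if confidentiality > conf_threshold and priority > prio_threshold:
        # highly confidential, high priority: main chain if the payload is small
        # enough; otherwise keep the bulk data on internal (intranet) storage
        return "main_chain" if data_kb <= size_limit_kb else "internal"
    # small or less confidential data goes to a side chain (AES-encrypted there)
    return "side_chain"
```

Hash verification and permission configuration would follow this routing decision for the main-chain branch.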
In one embodiment, the determining contract stores data relationships between backbone and side chains, and establishing a data flow path for data that requires interaction between backbone and side chains comprises:
inquiring the data characteristics of the main chain and the side chain to obtain the data structure and the type of the main chain and the side chain; judging whether the main chain has a basic condition for data interaction with the side chain or not by comparing the data characteristics of the main chain and the side chain; if the main chain has data interaction basic conditions, further checking whether the data characteristics of the side chain are matched with the main chain, especially in the aspects of data structure and type; establishing a data flow path between a main chain and a side chain according to the data interaction frequency and the data size; determining whether to start a data synchronization mechanism according to the data consistency requirement, and if so, adopting a differential synchronization algorithm; carrying out differential synchronization on data meeting the data synchronization condition between a main chain and a side chain; acquiring a data queriability attribute, determining whether the data needs to be indexed according to the attribute, and indexing by using a B-tree algorithm; determining whether to store the interaction record and the storage mode according to the data audit requirement; after data synchronization is completed between the main chain and the side chain, the data life cycle is evaluated, and whether data need to be migrated is determined.
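A minimal sketch of the differential-synchronization step between main chain and side chain, assuming records are modeled as key-value pairs; `diff_sync` and the dict representation are illustrative, not from the patent:

```python
def diff_sync(main, side):
    """Differential synchronization: push only changed or missing records to the side chain.

    `main` and `side` map record ids to values; `side` is updated in place and
    the delta that was transferred is returned.
    """
    delta = {k: v for k, v in main.items() if side.get(k) != v}
    side.update(delta)
    return delta
```

Transferring only the delta (rather than the full dataset) is what distinguishes differential synchronization from full replication along the established data-flow path.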
In one embodiment, the analyzing the integrity and accuracy of the contract data after each data interaction, correcting the incorrect content by the smart contract, includes:
checking the integrity of the contract data and, if it is found incomplete, triggering an integrity-checking algorithm in the smart contract to correct it automatically; assessing the accuracy of the integrity-checked contract data and, if accuracy falls short, executing an accuracy-correction algorithm in the smart contract; judging with the SHA-256 algorithm whether newly input data matches the corrected contract data; after new data is input and verified, encrypting it for output, and re-executing the data encryption process if output encryption fails; performing logical judgment on the encrypted output data, and executing a data rollback operation on any abnormality; after logical judgment and rollback are complete, obtaining the smart contract's running state and judging its normal execution by Boolean logic; then checking the contract execution condition and, if the Boolean judgment is false, re-triggering the smart contract's correction algorithm; after the execution condition is checked, obtaining the system's feedback information and, if the feedback is abnormal, regenerating the data digest with the SHA-256 algorithm; and after system feedback is obtained, performing a final stability evaluation of the whole data-processing flow, triggering the smart contract's self-correction mechanism if stability falls short.
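The SHA-256 matching check described above can be sketched with Python's standard `hashlib`; the helper names `digest` and `matches` are assumptions for illustration:

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 hex digest of a contract payload."""
    return hashlib.sha256(data).hexdigest()

def matches(new_data: bytes, expected: str) -> bool:
    """Check whether newly input data matches the stored digest of the corrected contract."""
    return digest(new_data) == expected
```

A mismatch here is what would trigger the rollback and re-correction steps in the smart contract flow.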
In one embodiment, the determining whether the data flow path meets the specification, if the content of the interactive transmission meets the preset data tag specification, automatically generating a corresponding tag in the contract to ensure that the storage of the contract content meets the security requirement, includes:
capturing data-flow path information in the interactive transmission with the tcpdump tool; checking the source address, target address, and intermediate-node information of each packet with a regular expression, the predetermined rule being that the address is a valid Unix-style file path beginning with a slash; verifying the captured data-flow path; if the regular expression check does not match the preset data tag, triggering data rollback and stopping the process, while a match starts an automatic tag generator that produces the corresponding tag; querying the contract content corresponding to the automatically generated tag from a relational database, matching it against the predefined keywords and phrases marked earlier, and judging storage compliance; if the content conforms to the storage specification, integrating the contract content and the automatically generated tag into a data object with StanfordNLP; and performing security verification of the data object with the SHA-256 algorithm, storing the object directly in the encrypted storage area if the algorithm's output meets the preset security standard.
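A hedged sketch of the path check and automatic tag generation: the regular expression is one plausible reading of the rule in the text (a valid Unix-style absolute path), and the `tag:` naming scheme in `make_tag` is invented for illustration:

```python
import re

# Assumed rule: an absolute Unix-style path of word/dot/dash segments
PATH_RE = re.compile(r"^/(?:[\w.-]+/)*[\w.-]+$")

def path_compliant(path: str) -> bool:
    """True if the captured address matches the assumed path specification."""
    return bool(PATH_RE.match(path))

def make_tag(path: str):
    """Auto-generate a contract tag for a compliant path; None signals rollback."""
    if not path_compliant(path):
        return None
    return "tag:" + path.strip("/").replace("/", ".")
```

A `None` result corresponds to the rollback-and-stop branch, while a generated tag proceeds to the database compliance check.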
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the invention discloses a sensitive data management technology based on contract content, which obtains the basic characteristics and sensitivity of data by analyzing the content of an electronic contract and marks the sensitivity level of each paragraph according to sensitivity standards. And selecting a proper data storage position according to the sensitive mark of the content, adopting a sensitive information filtering algorithm, and distributing the sensitive information to an intranet server for storage. And meanwhile, processing the contract content by adopting a data splitting strategy, classifying each part of content in the contract, and obtaining the category of each part. Based on the degree of confidentiality and the data amount of each category in the contract, it is judged whether the data should be stored in the main chain, the side chain or internally. Judging the data relationship between the main chain and the side chain, establishing a data flow path, and checking the integrity and the accuracy of the data through intelligent contracts for the data needing interaction between the main chain and the side chain, and correcting the data. Meanwhile, whether the data flow path meets the specification is judged according to the preset data tag specification, if so, a corresponding tag is automatically generated in the contract, and the storage of the contract content is ensured to meet the safety requirement. Through the fusion of the technologies, the safety management of contract contents and the controllability of data interaction are realized, and the safety and the accuracy of contract data are improved.
Drawings
FIG. 1 is a flow chart of a method for securely storing electronic labor contracts based on blockchain in accordance with the present invention.
FIG. 2 is a schematic diagram of a block chain based secure storage method for implementing electronic labor contracts according to the present invention.
FIG. 3 is a schematic diagram of a method for securely storing electronic labor contracts based on blockchain according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The safe storage method for realizing the electronic labor contract based on the blockchain in the embodiment specifically comprises the following steps:
s101, acquiring an electronic contract, analyzing the content of the contract to obtain the basic characteristics and the sensitivity of the data, and marking the sensitivity level of each paragraph according to the sensitivity standard.
The electronic contract is acquired from the contract management system through an API. The TF-IDF algorithm is used to analyze the content of the electronic contract text. Each paragraph of the contract text is labeled, a classifier is trained on the labeled data, the trained classifier is applied, and the analyzed data is rated for sensitivity against the sensitivity standard. The data is divided into three data pools according to the determined ratings: high, medium, and low sensitivity. High-sensitivity data is encrypted with the AES encryption algorithm; medium- and low-sensitivity data is digested with the MD5 hash algorithm. Permission management is performed through Active Directory per data pool, so that high-sensitivity data can only be accessed by specific roles. Elasticsearch indexes the data, classified by data pool. Batch verification is carried out with DataDrivenTesting against the data pools and indexes. For example, an electronic contract is obtained from the contract management system through an API, and 100 electronic contract texts are retrieved from the system. TF-IDF analysis of the contract texts yields the following keyword weights: "contract" 0.8, "management" 0.5, "system" 0.4, "data" 0.6. The paragraphs are labeled as follows: the first is high sensitivity, the second medium sensitivity, and the third low sensitivity. With the labeled data, a classifier can be trained to label new contract paragraphs automatically. The analyzed data is then rated for sensitivity against the sensitivity standard.
Sensitivity is rated as high for segments with TF-IDF values greater than 0.7, medium for values between 0.5 and 0.7, and low for values below 0.5. Applying these criteria to the 100 electronic contracts gives the following result: 10 contracts are high sensitivity, 20 medium sensitivity, and 70 low sensitivity. High-sensitivity data is encrypted with the AES algorithm, which encrypts data in 128-bit blocks. Medium- and low-sensitivity data is digested with the MD5 algorithm, which maps data to a 128-bit digest (32 hexadecimal characters). Permission management is performed through Active Directory per data pool, so that only users with specific roles can access high-sensitivity data. Elasticsearch indexes the data by pool, establishing high-, medium-, and low-sensitivity data indexes, and batch verification can then be carried out with DataDrivenTesting against the data pools and indexes. Batch verification of the 100 electronic contracts yields the following result: the high-, medium-, and low-sensitivity data all pass verification.
The classifier is trained and applied using a pre-trained classifier.
Electronic labor contract sample data is obtained, covering different industries, terms, and formats. The acquired data is cleaned by removing duplicates, unifying formats, stripping special characters, and standardizing dates and numbers. A bag-of-words model extracts useful features from the contract text. Each contract sample is assigned the correct classification label using a rule-based automatic labeling method. A naive Bayes algorithm is selected and a classifier model is trained on the labeled data; the model is evaluated on a portion of the data, checking accuracy, recall, and F1 score, with model parameters adjusted or training data added according to the results. For example, 1000 electronic labor contract samples are acquired, cleaned, and format-unified: 500 samples belong to the IT industry, 300 to finance, and 200 to education. In the bag-of-words model each contract text is converted into a vector of word frequencies; in a sample where "labor contract" appears 5 times, "wage" 2 times, "welfare" 3 times, and "performance" once, the feature vector is [5, 2, 3, 1]. The rule-based automatic labeling method matches keywords in each contract sample to determine its industry: a sample containing the keywords "software development" and "information technology" is labeled IT. The labeled data then trains a naive Bayes classifier model, with 80% of the data used for training and the remaining 20% for model evaluation.
In the evaluation, the model's accuracy is found to be 90%, its recall 85%, and its F1 score 85%. Based on the evaluation result, the smoothing parameter of the naive Bayes algorithm is adjusted, or more contract sample data is added for training. Through iterative adjustment and training, the accuracy, recall, and F1 score of the model gradually improve, so that electronic labor contract samples are classified more reliably.
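The evaluation metrics above can be computed from binary confusion-matrix counts with the standard formulas (the counts below are illustrative; the patent does not give its evaluation code):

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

For example, 85 true positives, 10 false positives, 15 false negatives and 90 true negatives give an accuracy of 87.5% and a recall of 85%, close to the figures quoted above.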
S102, selecting a data storage position according to the sensitive mark of the content, and distributing the sensitive information to an intranet server for storage.
Specifically, the contract data is tagged for content sensitivity, marking data that contains predefined keywords or phrases. Using the TF-IDF algorithm, the TF-IDF weight of a word in the text is obtained by multiplying TF and IDF, based on the frequency of the sensitive marks in the text and their inverse document frequency over the whole text set; the sensitive marks are then rated and classified into high, medium, and low sensitivity. For example, a contract text contains the sensitive keywords "secret agreement", "secret information", "legal responsibility", and "default clause", with TF-IDF weights of 0.5, 0.3, 0.2, and 0.1 respectively. With these results the sensitive tags can be ranked: "secret agreement" is high sensitivity, "secret information" is medium sensitivity, and "legal responsibility" and "default clause" are low sensitivity. The storage requirement is then analyzed, deciding to which partition of the intranet server the data should be stored according to the sensitivity rating result. For example, there are 3 intranet server partitions A, B, and C: partition A is high priority, partition B is medium priority, and partition C is low priority. According to the sensitivity rating, high-sensitivity data should be stored to partition A, medium-sensitivity data to partition B, and low-sensitivity data to partition C. The total amount of contract data is 1000 records: 200 high-sensitivity, 300 medium-sensitivity, and 500 low-sensitivity. According to these storage requirements, storage resources are allocated so that partition A stores the high-sensitivity data, partition B the medium-sensitivity data, and partition C the low-sensitivity data.
The storage state of the intranet server is checked, and the storage capacity of the server partitions is evaluated, particularly the partition holding high-sensitivity data. Partition A has a capacity of 100GB, of which 70GB is used and 30GB remains; partition B has a capacity of 200GB, of which 80GB is used and 120GB remains; partition C has a capacity of 500GB, of which 200GB is used and 300GB remains. Special attention must be paid to the storage capacity of partition A to ensure there is sufficient space for the high-sensitivity data. If the storage capacity allows, the data is automatically pushed to the corresponding intranet server partition: depending on the storage requirements and the server storage status, if the capacity of partitions A, B, and C is sufficient, the system can automatically push contract data to the corresponding partition. Security policies are applied, enabling encryption and access control lists for the high-sensitivity data partition: for partition A, which stores high-sensitivity data, the system may apply security policies such as data encryption and access control lists to ensure the security and confidentiality of the data. If the high-sensitivity partition reaches its capacity limit, the storage priority of the other partitions is adjusted automatically: if the storage space of partition A reaches the upper capacity limit, the system may redirect new contract data traffic to the lower-priority partitions B or C through a load-balancing policy to balance the storage load. Periodic data integrity and security checks are performed via a digital signature algorithm and periodic security scans; if inconsistencies occur, re-labeling and re-storing operations are triggered. The system may use a digital signature algorithm and periodic security scans to check the integrity and security of the stored contract data.
If the data is found to be inconsistent or a security breach exists, the system may trigger re-tagging and re-storing operations to ensure the integrity and security of the data. MD5 checks and security scans are performed periodically on the stored contract data; if a check fails or a security vulnerability is found, the data is re-marked and re-stored.
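The partition-selection and capacity-fallback behaviour described for step S102 can be sketched as follows; the capacities mirror the A/B/C example above, and all names are illustrative:

```python
# Illustrative free capacities in GB, mirroring the A/B/C example
PARTITIONS = {"A": {"free_gb": 30}, "B": {"free_gb": 120}, "C": {"free_gb": 300}}
PRIORITY = {"high": "A", "medium": "B", "low": "C"}
FALLBACK = {"A": "B", "B": "C", "C": None}

def route(sensitivity, size_gb, partitions=PARTITIONS):
    """Pick the partition for a record by sensitivity; fall back to the next
    lower-priority partition when the preferred one lacks free space."""
    part = PRIORITY[sensitivity]
    while part is not None and partitions[part]["free_gb"] < size_gb:
        part = FALLBACK[part]
    if part is None:
        raise RuntimeError("no partition has enough free space")
    partitions[part]["free_gb"] -= size_gb
    return part
```

In this sketch a high-sensitivity write that no longer fits in partition A is redirected to B, matching the load-balancing policy described above.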
S103, processing contract contents by adopting a data splitting strategy, classifying each part of contents in the contract and obtaining the category of each part.
Word segmentation, part-of-speech tagging, and named entity recognition of the contract are performed using the jieba segmentation library. Text analysis is carried out on the contract content through TF-IDF to identify keywords and high-frequency phrases. The original text feature vector is obtained using gensim's Doc2Vec technique. A data splitting strategy is applied to divide the feature vector into several data subsets, each containing a certain number of keywords and phrases. The data subsets are classified using a decision tree algorithm, and pre-classification labels are generated according to combinations of the keywords and phrases. Through an adaptive learning method, the pre-classification labels are corrected in combination with the original text feature vector to obtain corrected classification labels. Using the corrected classification labels, the decision tree algorithm is applied again to finally classify the data subsets and generate final category labels. The final category label is combined with the original text feature vector to generate a new feature vector. Logic judgment is carried out on the new feature vectors, which are screened against a built-in threshold to determine which feature vectors need further processing. If a feature vector needs further processing, its data is cleaned or transformed through the decision tree algorithm to obtain the final cleaned feature vector. All cleaned feature vectors are integrated and stored in a database, yielding the finally processed contract content. For example, a contract is subjected to text analysis and classification. First, the contract content is segmented with the jieba segmentation library, and part-of-speech tagging and named entity recognition are performed.
For example, for the sentence "this contract is jointly signed by Party A and Party B", the word segmentation result may be ["this", "contract", "by", "Party A", "and", "Party B", "jointly", "signed"], the part-of-speech tagging result may be ["r", "n", "p", "n", "c", "n", "d", "v"], and the named entity recognition result may mark "Party A" and "Party B" as organizations ("ORG") and the remaining tokens as "O". Next, the contract content is analyzed with the TF-IDF algorithm to identify keywords and high-frequency phrases; for example, the frequently occurring words in the contract are "collaboration" and "payment", and the high-frequency phrases are "collaboration agreement" and "payment method". The original text is then converted into a feature vector using gensim's Doc2Vec technique; for example, the contract content is converted into a 300-dimensional feature vector. Next, a data splitting strategy is applied to split the feature vector into several data subsets, each containing a certain number of keywords and phrases: for example, the feature vector is partitioned into two subsets, one containing the keyword "collaboration" and the phrase "collaboration agreement", the other containing the keyword "payment" and the phrase "payment method". The subsets are then classified with a decision tree algorithm, generating pre-classification labels from combinations of keywords and phrases: from the keyword "collaboration" and the phrase "collaboration agreement" the pre-classification label may be "collaboration protocol class", and from the keyword "payment" and the phrase "payment method" it may be "payment means class". Afterwards, the pre-classification labels are corrected through the adaptive learning method in combination with the original text feature vector, yielding corrected classification labels.
Based on the pre-classification label "collaboration protocol class" and the original text feature vector, the corrected classification label may be "collaboration category". Using the corrected classification labels, the decision tree algorithm is applied again to finally classify the data subset and generate the final category label; according to the corrected label, the final category label is "collaboration category". The final category label is combined with the original text feature vector to generate a new feature vector: the category label "collaboration category" combined with the original text feature vector [1, 2, 3] gives the new feature vector [1, 2, 3, "collaboration category"]. Logic judgment is carried out on the new feature vectors, which are screened against a built-in threshold to determine which need further processing; for example, if a certain feature value in the vector is greater than 5, further processing is required. If a feature vector needs further processing, its data is cleaned or transformed through the decision tree algorithm to obtain the final cleaned feature vector: a feature value is set to 1 if it is greater than 5, otherwise to 0. All cleaned feature vectors are integrated and stored in a database table, yielding the finally processed contract content.
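A compact sketch of the keyword-combination pre-classification and the threshold-based cleaning step (binarizing numeric features greater than 5); the branching rules are simplified stand-ins for the decision tree described above:

```python
def pre_classify(keywords):
    """Pre-classification label from a keyword/phrase combination,
    mimicking the decision-tree branching described above."""
    if "collaboration" in keywords and "collaboration agreement" in keywords:
        return "collaboration protocol class"
    if "payment" in keywords and "payment method" in keywords:
        return "payment means class"
    return "other"

def clean_vector(vec, threshold=5):
    """Threshold screening: numeric features greater than the threshold become 1,
    other numeric features become 0, and non-numeric entries (e.g. the
    appended category label) pass through unchanged."""
    cleaned = []
    for v in vec:
        if isinstance(v, (int, float)):
            cleaned.append(1 if v > threshold else 0)
        else:
            cleaned.append(v)
    return cleaned
```

For the example vector [1, 2, 3, "collaboration category"] every numeric feature is at most 5, so cleaning binarizes them all to 0 while keeping the label.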
S104, judging whether the data should be stored on the main chain, on a side chain, or internally according to the confidentiality degree and the data amount of each category of the content in the contract.
Text analysis is performed on the contract content with the TF-IDF algorithm, keywords and phrases are extracted, and an initial classification into data categories is carried out. After the data categories are determined, the confidentiality degree of each category is determined through a database query, the data is reclassified with a decision tree algorithm, and pre-classification labels are generated according to combinations of the keywords and phrases. Through an adaptive learning method, the pre-classification labels are corrected in combination with the original text feature vector to obtain corrected classification labels. Using the corrected classification labels, the decision tree algorithm is applied again to finally classify the data and generate final category labels, and the data is added under the respective category labels. The confidentiality labels are sorted, and the sorting result is used to evaluate the priority of each category. The data volume is metered for categories whose confidentiality degree is higher than a preset confidentiality threshold and whose priority is higher than a priority threshold. According to the data volume and the confidentiality degree, it is judged whether the data is stored on the main chain, and hash verification of the data on the main chain is carried out. If the hash verification passes, the data is written to the main chain and read-write permissions are configured for it. Data with a small volume or with a confidentiality degree not higher than the preset threshold is judged to be stored on a side chain. The data on the side chains is encrypted with the AES encryption algorithm, and the read-write permissions of the encrypted data on the side chain are reconfigured according to its confidentiality degree and data volume.
For example, there is a contract text as follows: "Party A agrees to provide Party B with marketing services for one year starting January 1, 2022, at a contract fee of $5000 per month; both parties agree to keep all information related to the contract secret." First, text analysis is performed on the contract content with the TF-IDF algorithm, extracting the keywords "Party A", "Party B", "marketing services", "1 year", "January 1, 2022", "monthly fee", "$5000", "contract", and "information". The data categories are divided into "contract terms", "payment conditions", and "confidentiality" according to the contract content: the keywords "1 year" and "January 1, 2022" are classified as "contract terms", the keywords "monthly fee" and "$5000" as "payment conditions", and the keyword "information" as "confidentiality". The database query shows that the confidentiality degree of "contract terms" is medium, of "payment conditions" low, and of "confidentiality" high; these confidentiality labels are added to the corresponding data categories. The confidentiality labels are sorted, and the sorting result is used to evaluate the priority of each category. Since the ordering is "high" > "medium" > "low", the "confidentiality" category has the highest priority, the "contract terms" category the next, and the "payment conditions" category the lowest. The confidentiality threshold is set to "high" and the priority threshold to "medium": if a data category has "high" confidentiality and "medium" or higher priority, the data volume of that category is metered.
Data with fewer than 100 records or with "low" confidentiality is stored on a side chain, and data with 100 or more records and "medium" or "high" confidentiality is stored on the main chain. If the data passes hash verification, it is written to the main chain and read-write permissions are configured for it. Data stored on a side chain is encrypted with the AES encryption algorithm to improve its security, and the read-write permissions of the data on the side chain are reconfigured according to its confidentiality degree and data volume.
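The main-chain/side-chain decision rule from the example (at least 100 records and at least "medium" confidentiality goes to the main chain, everything else to a side chain) can be sketched as:

```python
# Ordering of confidentiality levels used for the comparison
LEVELS = {"low": 0, "medium": 1, "high": 2}

def storage_target(record_count, confidentiality):
    """Main chain for 100+ records at medium/high confidentiality,
    side chain otherwise, per the thresholds in the example above."""
    if record_count >= 100 and LEVELS[confidentiality] >= LEVELS["medium"]:
        return "main chain"
    return "side chain"
```

Data routed to the main chain would then go through hash verification before writing, while side-chain data would be AES-encrypted as described above.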
S105, judging the data relation stored in the main chain and the side chain in a contract, and establishing a data flow path for data needing interaction between the main chain and the side chain.
And inquiring the data characteristics of the main chain and the side chain to obtain the data structure and the type of the main chain and the side chain. And judging whether the main chain has the basic condition of data interaction with the side chain or not by comparing the data characteristics of the main chain and the side chain. If the main chain has data interaction basic conditions, whether the data characteristics of the side chain are matched with the main chain or not is further checked, especially in the aspects of data structure and type. Data flow paths are established between the backbone and the side chains according to the frequency of data interaction and the size of the data. And determining whether to start a data synchronization mechanism according to the data consistency requirement, and if so, adopting a differential synchronization algorithm. And carrying out differential synchronization between the main chain and the side chain on the data meeting the data synchronization condition. And acquiring the data queriability attribute, determining whether the data needs to be indexed according to the attribute, and indexing by using a B-tree algorithm. And determining whether to store the interaction record and the storage mode according to the data audit requirement. After data synchronization is completed between the main chain and the side chain, the data life cycle is evaluated, and whether data need to be migrated is determined. For example, the backbone and side chains are a blockchain system and a distributed database system, respectively, with the data characteristics of the backbone being as follows. The data structure is a block chain structure, each block contains a batch of transaction records, and the data type is transaction data, including transfer information and intelligent contract execution results. The data properties of the side chains are as follows. 
The data structure is a distributed database structure that stores data using tables, and the data types are various types of data. By comparing the data properties of the main chain and the side chain, they are found to have different data structures and types, so that the data interaction between the main chain and the side chain needs to be adapted. In determining whether the backbone has the underlying conditions for data interaction with the side chains, consider whether the backbone supports the data structure and type of the side chains, and whether the backbone provides sufficient performance and capacity to handle the data interaction of the side chains. If the main chain has the basic condition of data interaction, further checking whether the data characteristic of the side chain is matched with the main chain, if the main chain is a financial blockchain, and the side chain is a database for storing image data, the data interaction has no matching property. Transaction data on the backbone is periodically synchronized into the side chains so that the side chain system remains consistent with the backbone and a data synchronization mechanism is initiated to maintain data consistency. And the differential synchronization algorithm is utilized to synchronize data between the main chain and the side chain, and only partial data which are changed are transmitted, so that the data transmission quantity is reduced. The data stored in the side chain needs to be frequently queried, and the B-tree algorithm is used for indexing, so that the query efficiency is improved. The time of data interaction between the main chain and the side chain, the two transaction parties and the transaction amount record are saved. After data synchronization between the backbone and side chains is completed, the lifecycle of the data is assessed. 
Each main-chain block is 1MB in size with a block production rate of 10 blocks per second, and each side-chain data object is 100KB in size, with 100 data objects generated per second. When a data flow path is established between the main chain and the side chains, the transaction data in each block is synchronized into the side chains, a synchronized data volume of 10MB per second. The differential synchronization mechanism is started, transmitting only changed transaction data, and an index is built over the data. At the same time, the timestamp of each transaction, both transaction parties, and the transaction amount are saved. Based on the lifecycle assessment of the data, data that no longer needs to be used is migrated to other storage media or systems.
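The differential synchronization step, transmitting only new or changed records from the main chain to a side-chain store, can be sketched as follows (a simplified in-memory model, not an actual blockchain client):

```python
def diff_sync(main_chain, side_chain):
    """Differential synchronization: copy only records that are new or changed
    on the main chain into the side-chain store, and return the keys that
    were actually transferred (the reduced transmission volume)."""
    changed = [key for key, value in main_chain.items()
               if side_chain.get(key) != value]
    for key in changed:
        side_chain[key] = main_chain[key]
    return changed
```

A second pass over unchanged data transfers nothing, which is the point of differential over full synchronization.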
S106, after each data interaction, the integrity and accuracy of the contract data are analyzed, and incorrect content is corrected through intelligent contracts.
Integrity checks are performed on the contract data; if the data is found to be incomplete, the integrity check algorithm in the intelligent contract is triggered for automatic correction. Accuracy assessment is then performed on the integrity-checked contract data; if the accuracy does not meet requirements, the accuracy correction algorithm in the intelligent contract is executed. Through the SHA-256 algorithm, it is determined whether new data input matches the corrected contract data. After the new data input is verified, the data is encrypted and output; if the encrypted output fails, the data encryption process is re-executed. Logic judgment is executed on the encrypted output data, and if the logic judgment is abnormal, a data rollback operation is executed. After logic judgment and data rollback are completed, the running state of the intelligent contract is obtained, and Boolean logic is used to judge whether the intelligent contract is executing normally. After the running state is obtained, the contract execution conditions are checked; if the Boolean logic evaluates to false, the correction algorithm of the intelligent contract is restarted. After the contract execution condition check, feedback information from the system is obtained; if the feedback is abnormal, the data digest is regenerated with the SHA-256 algorithm. After system feedback is obtained, a final stability evaluation is carried out on the whole data processing flow; if the stability does not meet requirements, the self-correction mechanism in the intelligent contract is triggered. For example, there is contract data after a completed data interaction that needs an integrity check; if the data is found to be incomplete, the intelligent contract triggers the integrity check algorithm to correct it automatically. In the integrity check, a certain field in the contract data is found to be missing.
Contract data: { "contract number": "ABC123", "second party": "Zhang San", "signing date": "2022-01-01" }. After the integrity check, the contract amount field is found to be missing. The intelligent contract automatically corrects the contract data and adds the contract amount field. Corrected contract data: { "contract number": "ABC123", "second party": "Zhang San", "signing date": "2022-01-01", "contract amount": "10000 yuan" }. Accuracy assessment is then performed on the integrity-checked and corrected contract data, and the contract amount is found to be inconsistent with the actual situation. The intelligent contract executes the accuracy correction algorithm to correct the contract amount to the right value. Corrected contract data: { "contract number": "ABC123", "second party": "Zhang San", "signing date": "2022-01-01", "contract amount": "15000 yuan" }. SHA-256 is applied to the corrected contract data and a hash value is calculated. When new data is input, the same SHA-256 algorithm is applied to it and a new hash value is calculated. The new hash value is compared with the hash of the corrected contract data to judge whether the new data input matches: if the two hash values are equal, the new data input matches the corrected contract data; if they are not equal, it does not match. If the match succeeds, the next processing step is performed. The data is then encrypted and output; if encryption fails, the data encryption process is re-executed until it succeeds. Encrypted output data: { "contract number": "ABC123", "second party": "Zhang San", "signing date": "2022-01-01", "contract amount": "15000 yuan" }.
Logic judgment is carried out on the encrypted output data; the logic judgment reports an abnormal state, so a data rollback operation is executed, restoring the last state. Data after the rollback operation: { "contract number": "ABC123", "second party": "Zhang San", "signing date": "2022-01-01", "contract amount": "10000 yuan" }. After logic judgment and data rollback are completed, the running state of the intelligent contract is obtained and Boolean logic is used to judge whether it is executing normally; the intelligent contract is executing normally, and the contract execution conditions are checked. The Boolean logic evaluates to false, triggering the correction algorithm of the intelligent contract to correct the execution condition; the corrected execution condition is true. System feedback information is obtained; the feedback is abnormal, so the data digest is regenerated with the hash algorithm, giving the regenerated digest d41d8cd98f00b204e9800998ecf8427e. After system feedback is obtained, a final stability evaluation is carried out on the whole data processing flow; the stability does not meet requirements, and the self-correction mechanism in the intelligent contract is triggered.
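The SHA-256 matching step, comparing a new data input against the corrected contract data by digest, can be sketched with the standard library; the serialization scheme (sorted key=value pairs) is an assumption for illustration:

```python
import hashlib

def digest(data):
    """SHA-256 digest of a contract record serialized as sorted key=value pairs.

    The serialization must be deterministic so that equal records always
    produce equal digests."""
    payload = "|".join(f"{k}={v}" for k, v in sorted(data.items()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def matches(new_data, corrected_data):
    """The new input matches the corrected contract data iff the digests are equal."""
    return digest(new_data) == digest(corrected_data)
```

Any change to a field, such as the contract amount, changes the digest and causes the match to fail, which is what drives the correction and rollback logic above.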
And S107, judging whether the data flow path accords with the specification, and if the content of the interactive transmission accords with the preset data tag specification, automatically generating a corresponding tag in the contract to ensure that the storage of the contract content accords with the safety requirement.
Data flow path information in the interactive transmission is captured with the tcpdump tool. Regular expressions are applied to examine the source address, destination address, and intermediate node information of the data packets, with the predetermined rule that an address must be a path beginning with a slash, i.e., a valid Unix file path. The captured data flow path is then verified: if the result of the regular expression check does not match the preset data label, a data rollback is triggered and the process is stopped; if it matches, an automatic label generator is started to generate the corresponding label. The contract content corresponding to the automatically generated label is queried from the relational database, and the queried content is matched against the previously marked data containing predefined keywords or phrases to check whether it contains specific keywords or phrases, thereby judging storage compliance. If the content meets the storage specification, the contract content and the automatically generated tag are integrated into a data object using StanfordNLP. The data object is security-verified with the SHA-256 algorithm; if the output of the SHA-256 algorithm meets the preset security standard, the data object is stored directly in the encrypted storage area. For example, the result of the data flow path verification algorithm is path A, but the preset data label is path B: in this case a data rollback is triggered and the process is terminated, i.e., the data flow path is rolled back to its previous state and the associated process is stopped. If the regular expression check matches the preset data label, the automatic label generator is started to generate the corresponding label, for example "Contract123".
The contract content corresponding to "Contract123" is queried from the relational database. Keyword matching is carried out on the contract content, with "confidential" as the keyword to match; if the contract content contains this keyword, the content meets the storage specification. According to the statistics, the keyword "confidential" appears 5 times in the contract content. Finally, the contract content and the automatically generated tag are integrated into a data object, and the data object is security-verified with the SHA-256 algorithm. A security standard threshold is set: if the output begins with "0x87", the security standard is met. The output of the SHA-256 algorithm is "0x87456abf", which meets the security standard.
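A sketch of the path check, tag generation, and digest-prefix security check; the regular expression and the "87" prefix policy are illustrative interpretations of the example (note that a real SHA-256 hex digest is 64 characters with no "0x" prefix):

```python
import hashlib
import re

# Predetermined rule from the text: a valid address is a Unix file path
# starting with a slash (character class chosen for illustration)
PATH_RULE = re.compile(r"^/[\w./-]+$")

def check_and_tag(path, contract_id):
    """Verify the data flow path against the regular-expression rule; on a
    match, generate the tag ("Contract" + id), otherwise return None so the
    caller can trigger a rollback and terminate the process."""
    if not PATH_RULE.match(path):
        return None
    return f"Contract{contract_id}"

def meets_security_standard(data, prefix="87"):
    """Illustrative security check: the SHA-256 hex digest of the data object
    must start with a preset prefix (a real policy would differ)."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest().startswith(prefix)
```

A prefix-based acceptance test like this is contrived (digest prefixes are effectively random), so in practice the "preset security standard" would be a signature or policy check rather than a prefix match.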
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments of the present application and the features of the embodiments may be combined with one another arbitrarily, provided no conflict arises.

Claims (8)

1. A method for implementing secure storage of electronic labor contracts based on blockchain, the method comprising:
acquiring an electronic contract, analyzing the contract content to obtain the basic characteristics and sensitivity of the data, and marking the sensitivity level of each paragraph according to a sensitivity standard; selecting a data storage location according to the sensitivity mark of the content, and distributing sensitive information to an intranet server for storage; processing the contract content with a data splitting strategy, classifying each part of the contract content and obtaining the category of each part; judging whether the data should be stored in the main chain, a side chain, or internal storage according to the confidentiality degree and data amount of each category in the contract content; judging the data relation between contract data stored in the main chain and the side chain, and establishing a data flow path for data requiring interaction between the main chain and the side chain; after each data interaction, analyzing the integrity and accuracy of the contract data, and correcting incorrect content through smart contracts; judging whether the data flow path meets the specification, and if the content of the interactive transmission meets the preset data label specification, automatically generating a corresponding label in the contract, thereby ensuring that the storage of the contract content meets the security requirements.
2. The method of claim 1, wherein the acquiring the electronic contract, analyzing the basic characteristics and sensitivity of the data through contract content, marking the sensitivity level of each paragraph according to the sensitivity standard, comprises:
acquiring the electronic contract from a contract management system through an API interface; analyzing the content of the electronic contract text with the TF-IDF algorithm; labeling each paragraph of the contract text and grading the sensitivity of the data with a trained classifier; dividing the data into three distinct data pools: high sensitivity, medium sensitivity, and low sensitivity; encrypting the high-sensitivity data with the AES encryption algorithm, and processing the medium- and low-sensitivity data with the MD5 algorithm; performing authority management through Active Directory, such that high-sensitivity data can only be accessed by specific roles; indexing the data with Elasticsearch and classifying it by data pool; and performing batch verification with data-driven testing according to the data pool and the data index.
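The TF-IDF scoring and sensitivity grading steps above can be illustrated with a self-contained sketch. The claim relies on a trained classifier; here a simple keyword-vote rule stands in for it, and the sensitive-term list and thresholds are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf(paragraphs):
    """Tiny TF-IDF over whitespace-tokenized paragraphs: term frequency
    within a paragraph times smoothed inverse document frequency."""
    docs = [p.lower().split() for p in paragraphs]
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return [
        {t: (c / len(d)) * math.log((1 + n) / (1 + df[t]))
         for t, c in Counter(d).items()}
        for d in docs
    ]

SENSITIVE_TERMS = {"salary", "id"}  # assumed sensitive-keyword list

def sensitivity_level(paragraph: str) -> str:
    """Grade a paragraph into the three data pools by counting
    sensitive-term hits (a stand-in for the trained classifier)."""
    hits = sum(1 for t in paragraph.lower().split() if t in SENSITIVE_TERMS)
    if hits >= 2:
        return "high"
    return "medium" if hits == 1 else "low"
```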
3. The method of claim 1, wherein selecting a data storage location according to the sensitive tag of the content, and distributing the sensitive information to an intranet server for storage, comprises:
performing content-sensitivity marking on the contract data; grading the sensitive marks with the TF-IDF algorithm and classifying them into high, medium, and low sensitivity; deciding the data storage location according to the sensitivity rating; checking the storage state of the intranet server and evaluating the storage capacity of each server partition; allocating storage resources and pushing data to the corresponding intranet server partitions; applying security policies that enable encryption and access control lists for the high-sensitivity data partitions; adjusting storage priority and redirecting data to other partitions through a load-balancing strategy; and performing periodic data integrity checks with a digital signature algorithm.
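The partition selection and load-balancing redirection in this claim can be sketched as a small routing function. The partition names, the sensitivity-to-partition mapping, and the capacity model are all assumptions for illustration; the claim specifies only that the sensitivity rating and partition capacity drive the decision.

```python
def choose_partition(level: str, partitions: dict) -> str:
    """Pick an intranet server partition for a sensitivity level.
    `partitions` maps partition name -> free capacity (in units).
    When the preferred partition is full, redirect to any partition
    with free capacity (the load-balancing fallback)."""
    preferred = {"high": "secure-a", "medium": "general-a", "low": "general-b"}[level]
    if partitions.get(preferred, 0) > 0:
        partitions[preferred] -= 1
        return preferred
    for name, free in partitions.items():   # redirection step
        if free > 0:
            partitions[name] -= 1
            return name
    raise RuntimeError("no storage capacity available")
```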
4. The method of claim 1, wherein the processing the contract content using the data splitting policy, classifying the portions of the content in the contract and deriving a category for each portion, comprises:
performing word segmentation, part-of-speech tagging, and named entity recognition on the contract with the jieba word segmentation library; identifying keywords and high-frequency phrases through TF-IDF; obtaining text feature vectors with gensim Doc2Vec; dividing the feature vectors into data subsets, each containing keywords and phrases; classifying the data subsets with a decision tree algorithm and generating pre-classification labels; obtaining corrected classification labels through an adaptive learning method; performing final classification again with the decision tree algorithm; combining the final category labels with the text feature vectors to obtain new feature vectors; executing logic judgments to determine which feature vectors need processing; cleaning or transforming the feature vectors that need processing to obtain final feature vectors; and integrating all feature vectors to obtain the processed contract content.
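The split-and-classify pipeline of this claim can be outlined as follows. The claim names jieba segmentation, Doc2Vec features, and a decision tree; this sketch substitutes a whitespace tokenizer and a keyword vote so it runs standalone, and the category names and keyword lists are illustrative assumptions.

```python
from collections import Counter

def split_and_label(contract: str, categories: dict) -> dict:
    """Split a contract into clauses (on ';') and assign each clause a
    pre-classification label by keyword vote. `categories` maps a
    label to its list of indicative keywords."""
    labels = {}
    for clause in (c.strip() for c in contract.split(";") if c.strip()):
        tokens = clause.lower().split()
        votes = Counter()
        for label, keywords in categories.items():
            votes[label] = sum(tokens.count(k) for k in keywords)
        best, score = votes.most_common(1)[0]
        labels[clause] = best if score else "other"
    return labels
```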
5. The method of claim 1, wherein the determining whether the data should be stored in the main chain, the side chain or the internal storage according to the confidentiality degree and the data amount of each category in the contract contents comprises:
performing text analysis with the TF-IDF algorithm to extract keywords and phrases; determining the confidentiality degree of the data through database queries and classifying the data with a decision tree algorithm; performing pre-classification label correction in combination with the text feature vectors; classifying again with the decision tree algorithm using the corrected labels to generate final labels; assigning data to the respective labels; sorting the confidentiality-degree labels and evaluating the priority of each class; judging the storage location according to the data amount and confidentiality degree; performing hash verification and data writing in the main chain; applying the AES encryption algorithm to the data in the side chain; and reconfiguring the read-write rights of the encrypted data in the side chain.
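The main-chain/side-chain/internal-storage decision of this claim can be sketched as a simple rule. The thresholds below are assumptions; the claim requires only that confidentiality degree and data amount jointly determine the tier.

```python
def storage_tier(confidentiality: str, size_kb: int) -> str:
    """Decide where a category of contract data lands.
    Thresholds are illustrative, not from the patent."""
    if confidentiality == "high":
        return "internal"      # most confidential data stays off-chain
    if size_kb > 256:
        return "side-chain"    # bulky, lower-secrecy data goes to a side chain
    return "main-chain"        # small, low-secrecy data on the main chain
```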
6. The method of claim 1, wherein the determining the contract stores a data relationship between the backbone and the side chain, and establishing a data flow path for data that requires interaction between the backbone and the side chain comprises:
querying the data characteristics of the main chain and the side chain to obtain the data structures and data types; comparing the data characteristics of the main chain and the side chain to judge the interaction conditions; establishing a data flow path between the main chain and the side chain according to the interaction frequency and the data size; determining whether to start data synchronization according to the data consistency requirements, adopting a differential synchronization algorithm; performing differential synchronization of the data between the two chains; acquiring data attributes to determine whether indexing is needed, and indexing with a B-tree algorithm; determining and storing the interaction recording mode according to the data auditing requirements; and, after synchronization is completed, evaluating the data lifecycle to determine data migration.
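The differential-synchronization step named in this claim can be illustrated minimally: transfer only records that are new or changed on the main chain, rather than the whole dataset. The dict-based record model is an assumption for illustration.

```python
def differential_sync(main: dict, side: dict) -> dict:
    """Push only records that are new or changed on the main chain to
    the side chain, and return the records that moved. Keys are
    record ids, values are record payloads."""
    moved = {k: v for k, v in main.items() if side.get(k) != v}
    side.update(moved)
    return moved
```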
7. The method of claim 1, wherein said analyzing the integrity and accuracy of the contract data after each data interaction, correcting incorrect content by smart contracts, comprises:
triggering an integrity check algorithm in the smart contract to automatically correct incomplete data; executing an accuracy correction algorithm in the smart contract to correct contract data that does not meet accuracy requirements; judging whether new data input matches the corrected contract data through the SHA-256 algorithm; performing encrypted data output, retrying until the encrypted output succeeds; performing a logic judgment on the encrypted output data, and executing a data rollback operation in an abnormal state; after acquiring the running state of the smart contract, checking the contract execution status; if the feedback information is abnormal, regenerating the data digest with the SHA-256 algorithm; and, after system feedback, performing a stability evaluation and triggering the self-correction mechanism of the smart contract.
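The SHA-256 matching step of this claim, comparing new input against the corrected contract data, can be sketched directly with the standard library; the function names are illustrative.

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 digest of the (corrected) contract data, as hex."""
    return hashlib.sha256(data).hexdigest()

def input_matches(new_data: bytes, corrected_digest: str) -> bool:
    """Compare the digest of incoming data against the stored digest of
    the corrected contract data; a mismatch signals that rollback or
    digest regeneration is needed."""
    return digest(new_data) == corrected_digest
```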
8. The method according to claim 1, wherein the determining whether the data flow path meets the specification, if the content of the interactive transmission meets the preset data tag specification, automatically generating a corresponding tag in the contract, and ensuring that the storage of the contract content meets the security requirement, includes:
capturing the data flow path information with the tcpdump tool; checking the source address, destination address, and intermediate-node information of the data packets with a regular expression; verifying the captured data flow path, terminating the process if the result does not match the preset data label, and starting the automatic label generator when the result matches the preset data label; querying the relational database for the contract content corresponding to the automatically generated label, and judging the storage compliance of the queried contract content; integrating the contract content and the automatically generated label into a data object; and performing security verification on the data object with the SHA-256 algorithm, storing data objects that meet the preset security standard in the encrypted storage area.
CN202311576468.0A 2023-11-24 2023-11-24 Safe storage method for realizing electronic labor contract based on blockchain Active CN117290889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311576468.0A CN117290889B (en) 2023-11-24 2023-11-24 Safe storage method for realizing electronic labor contract based on blockchain


Publications (2)

Publication Number Publication Date
CN117290889A CN117290889A (en) 2023-12-26
CN117290889B (en) 2024-03-12

Family

ID=89248343


Country Status (1)

Country Link
CN (1) CN117290889B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694238A (en) * 2018-05-14 2018-10-23 腾讯科技(深圳)有限公司 Business data processing method, device based on block chain and storage medium
CN109299188A (en) * 2018-08-21 2019-02-01 平安科技(深圳)有限公司 Utilize block chain date storage method, device and electronic equipment
CN109388642A (en) * 2018-10-23 2019-02-26 北京计算机技术及应用研究所 Sensitive data based on label tracks source tracing method
WO2020238051A1 (en) * 2019-05-24 2020-12-03 平安普惠企业管理有限公司 Block chain-based electronic contract storage method and apparatus, electronic device and computer non-volatile readable storage medium
CN114239066A (en) * 2021-12-20 2022-03-25 中国电信股份有限公司 Contract processing method based on block chain and related equipment
CN114638005A (en) * 2022-03-25 2022-06-17 蚂蚁区块链科技(上海)有限公司 Data processing method, device and system based on block chain and storage medium
CN116842559A (en) * 2023-06-07 2023-10-03 陕西手一网络科技有限公司 Data encryption storage model and data encryption storage method based on blockchain
CN117009988A (en) * 2023-06-19 2023-11-07 北京理工大学 Encryption data storage and query method based on blockchain

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10318979B2 (en) * 2016-12-26 2019-06-11 International Business Machines Corporation Incentive-based crowdvoting using a blockchain




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant