CN117892331B - Data security storage method of scientific and technological achievement sharing platform - Google Patents
Data security storage method of scientific and technological achievement sharing platform Download PDFInfo
- Publication number
- CN117892331B CN117892331B CN202410288260.7A CN202410288260A CN117892331B CN 117892331 B CN117892331 B CN 117892331B CN 202410288260 A CN202410288260 A CN 202410288260A CN 117892331 B CN117892331 B CN 117892331B
- Authority
- CN
- China
- Prior art keywords
- sub
- sentence
- word
- scientific
- paragraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 48
- 230000035945 sensitivity Effects 0.000 claims abstract description 39
- 239000013598 vector Substances 0.000 claims abstract description 33
- 235000013399 edible fruits Nutrition 0.000 claims abstract description 3
- 230000011218 segmentation Effects 0.000 claims description 10
- 238000004364 calculation method Methods 0.000 claims description 6
- 238000013500 data storage Methods 0.000 abstract description 3
- 238000005516 engineering process Methods 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 6
- 238000004458 analytical method Methods 0.000 description 3
- 230000007547 defect Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 230000002708 enhancing effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000001010 compromised effect Effects 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 238000013523 data management Methods 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 238000004321 preservation Methods 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 230000003252 repetitive effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Probability & Statistics with Applications (AREA)
- Bioethics (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of data storage, in particular to a data security storage method of a scientific and technological achievement sharing platform, which comprises the following steps: acquiring technological achievement text data to be stored; analyzing sub-word frequency in the scientific and technological achievement text, and constructing sub-word frequency sensitivity coefficients; analyzing the data type and data information to be encrypted in the scientific and technological achievement text, and constructing sub-word frequency sensitivity dense indexes; obtaining a local vector sequence of a subword window; analyzing sentence similarity in paragraphs according to similarity of sensitive data description in the scientific and technological achievement text, and constructing paragraph sensitive dense similarity; acquiring random numbers by combining the sensitive dense similarity of the paragraphs; and the data security storage of the science and technology fruit sharing platform is completed by combining an elliptic curve encryption algorithm. The invention aims to acquire random numbers by utilizing the text data characteristics needing encryption, and increase the randomness and the security of the secret key.
Description
Technical Field
The application relates to the technical field of data storage, in particular to a data security storage method of a scientific and technological achievement sharing platform.
Background
In the internet big data age, the scientific and technological achievement sharing platform faces the challenges of complexity of a data management environment and imperfect data security storage mechanism. These challenges lead to frequent occurrence of problems such as data leakage, loss, etc. Considering that scientific and technological data is large in scale, complex in type and contains a large amount of private data, once revealed or lost, great economic loss is brought. In order to ensure the safe storage of scientific and technological achievement data, only authorized users can decrypt and access the data, so that the adoption of a data safe storage mode based on an encryption mechanism is particularly important.
The elliptic curve (Elliptic Curve Cryptography, ECC) encryption algorithm has the advantages of short key, high security performance and the like. Because of various types of scientific and technological achievements, random numbers R are required to be generated to generate keys in the ECC encryption process, after different R encrypt data, ciphertext data can be quite different, and the larger the R value is, the more the running times of an encryption algorithm are, and the higher the security of ciphertext is. If the pseudorandom number generator is not secure enough or is compromised, predictability of the key may result, thereby reducing the security of the encryption. The existing scientific and technological achievement sharing platform has the defect of sensitivity to pseudo-random number generation when ECC (error correction code) secure storage is adopted.
Disclosure of Invention
In order to solve the technical problems, the invention provides a data security storage method of a scientific and technological achievement sharing platform, which aims to solve the existing problems.
The data security storage method of the scientific and technological achievement sharing platform adopts the following technical scheme:
The embodiment of the invention provides a data security storage method of a scientific and technological achievement sharing platform, which comprises the following steps:
Obtaining each sub word and corresponding semantic vector by adopting a word segmentation model to the scientific and technological achievement text data to be stored;
Counting the occurrence frequency of each sub word in the paragraph; aiming at each sentence, acquiring sub word frequency sensitivity coefficients of each sub word in the sentence according to the occurrence frequency of each sub word and the difference between the occurrence frequency and the position of the rest sub words in the sentence; acquiring sub-word frequency sensitivity dense indexes of each sentence of the paragraph according to the data distribution characteristics of all sub-words in each sentence of the paragraph and the sub-word frequency sensitivity coefficients; dividing the subword windows for each subword, and acquiring a local vector sequence of each subword window by combining the semantic vector of each subword; obtaining the semantic similarity distance between two sentences according to the similarity between the local vector sequences of the sub-word windows in the two sentences; acquiring paragraph sensitive dense similarity of each paragraph according to the distribution of sub-word frequency sensitive dense indexes of continuous sentences in the paragraphs and the semantic similarity distance between sentences;
Acquiring random numbers according to the distribution of the paragraph sensitive dense similarity of all paragraphs; and (3) completing the data security storage of the scientific and technological fruit sharing platform according to the random number and UTF-8 coding and elliptic curve encryption algorithm.
Preferably, the counting the occurrence frequency of each sub-word in the paragraph includes:
Taking all the sub-words as input, and outputting the sub-words processed by the stop words by adopting a stop word list; and taking the subwords processed by the stop words of each paragraph as input, and outputting the occurrence frequency of each subword in the paragraph by adopting an N-gram model.
Preferably, the sub-word frequency sensitivity coefficient is specifically:
for each sub-word in the sentence, calculating the sum of the occurrence frequencies of each sub-word and the rest sub-words; calculating the absolute value of the difference value of the order of each sub word and the rest sub words; calculating the ratio of the sum value of each subword to the other subwords to the absolute value of the difference value; and taking the sum value of the ratio obtained by calculating each sub word and all other sub words as the sub word frequency sensitivity coefficient of each sub word in the sentence.
Preferably, the sub word frequency sensitive dense index of each sentence of the paragraph is obtained according to the data distribution characteristics of all sub words in each sentence of the paragraph, and the expression is:
In the method, in the process of the invention, Frequency-sensitive dense index of sub-words representing the c-th sentence in paragraph t,/>, andRepresenting the total number of subwords contained in the c-th sentence,/>Representing total number of key sub-words of scientific and technological achievement contained in paragraph t,/>As an exponential function based on natural constants,/>Representing the difference between the order values of the ith sub-word and the kth scientific and technological achievement key data sub-word in the paragraph,/>Sub-word frequency sensitivity coefficient representing the ith sub-word of the c-th sentence,/>Representing the sensitivity score of the ith subword in the c-th sentence.
Preferably, the sensitivity score is specifically:
acquiring a key data set of scientific and technological achievements; if each sub-word in the sentence belongs to the scientific and technological achievement key data set, taking 3 as the sensitive score of each sub-word in the sentence; if each sub-word in the sentence does not belong to the scientific and technological achievement key data set, 1 is used as the sensitivity score of each sub-word in the sentence.
Preferably, the acquiring a key dataset of a scientific and technological achievement specifically includes:
the key data of technological achievement is composed of English letters, greek letters commonly used in basic mathematics and numerals; and taking the set of the scientific and technological achievement key data as a scientific and technological achievement key data set.
Preferably, the dividing the subword window for each subword, and obtaining the local vector sequence of each subword window by combining the semantic vector of each subword includes:
For each sentence, taking each sub-word as a center, and respectively taking m sub-words forwards and backwards to form each sub-word window with each sub-word; and forming the semantic vectors of all the subwords in each subword window into a local vector sequence of each subword window, wherein m is a preset value.
Preferably, the semantic similarity distance between two sentences is obtained according to the similarity between the local vector sequences of the subword windows in the two sentences, and the expression is:
In the method, in the process of the invention, Represents the/>Sentences and/>Semantic similarity distance of individual sentences,/>Represents the/>Number of subword windows in each sentence,/>Represents the/>Number of subword windows in each sentence,/>To adjust the parameters,/>Represents the/>First/>, in the sentenceLocal vector sequence of subword window,/>Represents the/>First/>, in the sentenceLocal vector sequence of subword window,/>Representing a cosine similarity function.
Preferably, the sensitive dense similarity of the paragraphs of each paragraph is specifically:
For each paragraph, starting from the second sentence, taking the semantic similarity distance between each sentence and the previous sentence as the previous semantic similarity distance of each sentence; the semantic similarity distance between each sentence and the following sentence is used as the following semantic similarity distance of each sentence; calculating the sum of the front semantic similarity distance and the rear semantic similarity distance of each sentence; calculating the average value of sub-word frequency sensitive dense indexes of each sentence, the sentence before each sentence and the sentence after each sentence; calculating the ratio of the average value to the sum value of each sentence; and taking the average value of the ratios of all sentences of each paragraph as the paragraph sensitive dense similarity of each paragraph.
Preferably, the obtaining the random number according to the distribution of the paragraph sensitive dense similarity of all paragraphs includes:
The mean value of the sensitive dense similarity of all paragraphs is rounded to a rounded value as a random number.
The invention has at least the following beneficial effects:
The invention analyzes the data of the document to be encrypted based on the word segmentation and word frequency statistics of the scientific and technological achievement text. Because the data to be encrypted is mostly characterized by repeated words, sub-word frequency sensitivity coefficients of sentences are constructed; and then, setting a sensitivity score according to the specificity and the distribution density of the data type and the key degree of the data type, and effectively judging whether the data in the text needs encryption operation or not. And finally, calculating the sensitive dense similarity of the paragraphs according to the description similarity of sentences in the paragraphs, calculating random numbers of an ECC encryption algorithm on the basis of the sensitive dense similarity, generating a key according to the characteristics of the data, overcoming the defect that the existing algorithm excessively depends on pseudo random numbers, and enhancing the randomness and the safety of key generation.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for securely storing data of a scientific and technological achievement sharing platform provided by the invention;
FIG. 2 is a flow chart of the acquisition of paragraph sensitive dense similarity.
Detailed Description
In order to further describe the technical means and effects adopted by the present invention to achieve the preset purpose, the following detailed description refers to the specific implementation, structure, features and effects of a data security storage method of a scientific and technological achievement sharing platform according to the present invention with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of the data security storage method of the scientific and technological achievement sharing platform provided by the invention with reference to the accompanying drawings.
The embodiment of the invention provides a data security storage method of a scientific and technological achievement sharing platform.
Specifically, a method for securely storing data of a scientific and technological achievement sharing platform is provided, please refer to fig. 1, the method includes the following steps:
step S001: preprocessing technological achievement text data to be stored.
In the embodiment, encryption storage is considered for the scientific and technological achievement text data, and word segmentation processing is conducted on text content for convenience of subsequent processing and analysis due to the fact that data types in the scientific and technological achievement text are complex. In the embodiment, the BERT word segmentation model is adopted to carry out word segmentation processing, all sentences are split into sub-words, and each sub-word has a corresponding semantic vector. The input of the BERT word segmentation model is a single sentence, the input is a sentence subjected to word segmentation processing, the sentence comprises all sub words in the sentence and vector representation results of each sub word, and the vector representation results of each sub word are used as semantic vectors of each sub word. The BERT word segmentation model is a well-known technology, and the specific process is not described in detail.
So far, each sub word of all sentences of the scientific and technological achievement text data and corresponding semantic vectors are obtained.
Step S002: based on sub-word frequency analysis in scientific and technological achievement documents, sub-word frequency sensitivity coefficients are constructed, key data types and distribution densities thereof in data are analyzed, sub-word frequency sensitivity density indexes are constructed, and paragraph sensitivity density similarity is constructed based on description similarity features of sensitive data.
Because of the huge size of scientific and technological achievements, in order to prevent data confusion and loss, encryption and decryption are needed in the data access process, so that the safety of data storage is improved. The present embodiment replaces the conventional distributed access management mode of the shared platform with a hash access data algorithm. The input of the hash algorithm is the sum of semantic vectors of all subwords of the scientific and technological achievement title, and then the corresponding positions are selected in the memory to store according to the calculated hash value. The calculation of the hash algorithm is a well-known technique, and the specific process is not described in detail.
In the process of storage, since a piece of scientific and technological achievement text data is more, algorithm complexity is necessarily increased if all data in the text are encrypted, and contents such as names of scientific and technological achievements, project introduction, related introduction and the like do not need to be encrypted and stored. While some of these words and paragraphs may relate to sensitive information, which needs to be stored encrypted. Relatively sensitive information in a technological achievement document is usually found in technical details and in paragraphs of study data analysis, and analysis finds that the descriptions of these paragraphs often appear as repetitive words. For example, when describing experimental results, the occurrence frequency of partial words is high for comparison, so that sensitive information in a document is identified according to the characteristics.
In a paragraph of a scientific and technological achievement article, if the distribution of high-frequency words with more occurrence times is close, the sensitive characteristics of the words in the article are more obvious, so that the sub-word frequency sensitive coefficient is calculated according to the frequency and the distribution characteristics of the words in the paragraph, and specifically, before word frequency statistics is carried out, words with high occurrence frequency but lower information content in the text need to be removed. Such as conjunctions, prepositions, pronouns, etc. The present embodiment uses the stop word list provided by spaCy libraries in Python for stop word processing. The input of spaCy library is all the sub-words in the document, and the output is all the sub-words after the stop word processing. And then, carrying out word frequency statistics by adopting an N-gram model, wherein the input of the algorithm is a sub-word of each paragraph in the document, and the output is the occurrence frequency of each sub-word in the paragraph. The spaCy library and the N-gram model are known techniques, and detailed processes are not repeated. The frequency sensitivity coefficient of the subword is calculated as follows:
In the method, in the process of the invention, Sub-word frequency sensitivity coefficient representing the i-th sub-word in the c-th sentence of paragraph t,/>Represents the total number of subwords contained in the c-th sentence, j represents the j-th subword except i in the c-th sentence,/>、/>Respectively represent the occurrence frequency of the ith sub word and the jth sub word in the c th sentence,/>、/>The order values of the ith sub-word and the jth sub-word in sentence c are respectively represented.
When a certain subword in a sentence occurs more frequently in that paragraph,Or/>The larger the value of (c) indicates that the more frequently the paragraph describes the ith and jth subwords, the higher the importance in the paragraph. Meanwhile, if the positions of two words are closer, the position distance value/>, in the sentenceThe smaller the two subwords, the higher the correlation may be, indicating that the higher frequency of the subwords is more closely related to the overall paragraph. Finally calculated sub-word frequency sensitivity coefficient/>The higher the likelihood that the subword appears in the technical details and the research data analyzing the paragraphs is, the higher the sensitivity is.
In addition, except for the presence of high-frequency words, the sensitive paragraphs usually relate to data such as letters and numbers, in this embodiment, sub-words composed of 26 English letters, basic math common Greek characters and numbers are used as key data of technological achievements, and a set composed of key data of technological achievements is used as key data set of technological achievements. If the sub-words with higher occurrence frequency in the paragraphs are similar to the positions of the key data of the technological achievements in the paragraphs, namely the high-frequency words and the key data of the technological achievements in the paragraphs show cross distribution characteristics, the more sensitive the information data in the paragraphs are. Further, if the occurrence frequency of the key data of the technological achievement in the paragraph is high, and meanwhile, the distribution of the key data of the technological achievement is dense, the information content in the paragraph is large, the sensitivity of the information is strong, the sub-word frequency sensitive dense index is calculated according to the distribution characteristics between the high-frequency vocabulary and the key data of the technological achievement in the paragraph, and a specific calculation formula is as follows:
In the method, in the process of the invention, Frequency-sensitive dense index of sub-words representing the c-th sentence in paragraph t,/>, andRepresenting the total number of subwords contained in the c-th sentence,/>Representing total number of key sub-words of scientific and technological achievement contained in paragraph t,/>As an exponential function based on natural constants,/>Representing the difference between the order values of the ith sub-word and the kth scientific and technological achievement key data sub-word in the paragraph,/>Sub-word frequency sensitivity coefficient representing the ith sub-word of the c-th sentence,/>Representing the sensitivity score of the ith subword in the c-th sentence,/>Data content representing i-th word,/>Representing a key dataset of technological achievements.
If key data such as letters or characters exist in the neighborhood of the subword,The smaller the value of (2) is, the closer to the key data is, and the higher the sub word frequency sensitivity coefficient of the sub word i is, the obtained/>The smaller is, make/>The larger the value, the more critical data that appears near the subword, the denser the distribution of critical data. If the subword belongs to the collection/>The higher the corresponding score, the more sensitive the corresponding subword is explained. Meanwhile, the greater the sub-word frequency sensitivity coefficient of the sentence is, the more the sub-word frequency sensitivity dense index/>, the more the sub-word frequency sensitivity isThe larger. The more likely the sentence is to belong to the key data in the scientific and technological achievement.
The key data typically appears in the form of paragraphs in which sentences have a high degree of semantic similarity from sentence to sentence. Further, sensitive paragraphs in the scientific and technological achievement document are analyzed on the basis of the key data sentences. Firstly, dividing a sentence into sub-word windowsAnd m subwords before and after the subword window, it should be noted that the situation that m subwords cannot be taken before and after the beginning and the end of a sentence occurs, and only the situation that m can be taken is considered in this embodiment, and the value of m is 2 in this embodiment. For example, in sentence c, to/>The subword window being centered is denoted/>Wherein/>The value of (2) should be greater than 2 and less than/>Is an integer of (a). And then the semantic vectors of the sub words in the sub word window are orderly formed into a local vector sequence of the window. Combining the semantic association degree of the sentence whole, constructing paragraph sensitive dense similarity, wherein the expression is as follows:
In the method, in the process of the invention, Represents the/>Sentences and/>Semantic similarity distance of individual sentences,/>Represents the/>Number of subword windows in each sentence,/>Represents the/>Number of subword windows in each sentence,/>To adjust the parameters,/>Represents the/>First/>, in the sentenceLocal vector sequence of subword window,/>Represents the/>First/>, in the sentenceLocal vector sequence of subword window,/>Representing a cosine similarity function; it should be noted that, the value of the adjustment parameter in this embodiment is 2;
In the method, in the process of the invention, Representing paragraph sensitive dense similarity of paragraph t,/>Representing the total number of sentences contained in paragraph t,/>、/>And/>The sub-word frequency sensitive dense indexes of the c-1 th sentence, the c-1 th sentence and the c+1 th sentence in the paragraph t are respectively expressed by/>The mean function is represented by a function of the mean,Representing the semantic similarity distance between the c-1 th sentence and the c-th sentence,/>Representing the semantically similar distance between the c-th sentence and the c+1th sentence. Will/>Preservation as pre-semantic similarity distance of the c-th sentence, will/>The post-semantic similarity distance is saved as the c-th sentence. The flow of obtaining the paragraph sensitive dense similarity is shown in fig. 2.
If the description modes among sentences are closer, calculatingThe higher the value, the resulting semantic similarity distance/>The smaller the value, the more similar the semantics of the subword between two sentences are, the two sentences may be describing the same object. Then calculate the/>, between the sentence and the two sentences before and afterThe higher the sub-word frequency sensitive dense index is, the corresponding average valueThe larger the more densely the sensitive data is explained. Paragraph sensitive dense similarity/>The larger the sensitive information is related to the paragraph t in the scientific and technological achievement document, the higher the security of the encrypted and safe storage is.
Step S003: and (5) combining an elliptic curve encryption algorithm to carry out encryption and safe storage on the scientific and technological achievements.
Through the steps, the paragraph sensitive dense similarity value of each paragraph is obtained through calculation, and then the scientific and technological achievement document is stored in an encrypted mode. The encryption mode is elliptic curve encryption algorithm. The embodiment adopts a UTF-8 scheme to encode the text to be encrypted, and converts the text data into a numerical value which can be input by an ECC algorithm. UTF-8 encoding is a well-known technique, and specific steps are not repeated. The input of the algorithm is UTF-8 coded data and a random number R, R is calculated by calculating all paragraphsThe average of the values is rounded off and the output of the algorithm is the encrypted ciphertext. The ECC algorithm is a well-known technique, and the detailed process is not repeated. And finally, placing the encrypted paragraph data and the unencrypted paragraph data in a memory.
Thus, the data security storage of the scientific and technological achievement sharing platform is completed.
In summary, the embodiment of the invention analyzes the data of the document to be encrypted based on the word segmentation and word frequency statistics of the scientific and technological achievement text. Because the data to be encrypted is mostly characterized by repeated words, sub-word frequency sensitivity coefficients of sentences are constructed; and then, setting a sensitivity score according to the specificity and the distribution density of the data type and the key degree of the data type, and effectively judging whether the data in the text needs encryption operation or not. And finally, calculating the sensitive dense similarity of the paragraphs according to the description similarity of sentences in the paragraphs, calculating random numbers of an ECC encryption algorithm on the basis of the sensitive dense similarity, generating a key according to the characteristics of the data, overcoming the defect that the existing algorithm excessively depends on pseudo random numbers, and enhancing the randomness and the safety of key generation.
It should be noted that: the sequence of the embodiments of the present invention is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment mainly describes differences from other embodiments.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; the technical solutions described in the foregoing embodiments are modified or some of the technical features are replaced equivalently, so that the essence of the corresponding technical solutions does not deviate from the scope of the technical solutions of the embodiments of the present application, and all the technical solutions are included in the protection scope of the present application.
Claims (7)
1. The data security storage method of the scientific and technological achievement sharing platform is characterized by comprising the following steps of:
Obtaining each sub word and corresponding semantic vector by adopting a word segmentation model to the scientific and technological achievement text data to be stored;
Counting the occurrence frequency of each sub word in the paragraph; aiming at each sentence, acquiring sub word frequency sensitivity coefficients of each sub word in the sentence according to the occurrence frequency of each sub word and the difference between the occurrence frequency and the position of the rest sub words in the sentence; acquiring sub-word frequency sensitivity dense indexes of each sentence of the paragraph according to the data distribution characteristics of all sub-words in each sentence of the paragraph and the sub-word frequency sensitivity coefficients; dividing the subword windows for each subword, and acquiring a local vector sequence of each subword window by combining the semantic vector of each subword; obtaining the semantic similarity distance between two sentences according to the similarity between the local vector sequences of the sub-word windows in the two sentences; acquiring paragraph sensitive dense similarity of each paragraph according to the distribution of sub-word frequency sensitive dense indexes of continuous sentences in the paragraphs and the semantic similarity distance between sentences;
Acquiring random numbers according to the distribution of the paragraph sensitive dense similarity of all paragraphs; completing the data security storage of the scientific and technological fruit sharing platform according to the random number and UTF-8 coding and elliptic curve encryption algorithm;
The sub-word frequency sensitivity coefficient is specifically:
For each sub-word in the sentence, calculating the sum of the occurrence frequencies of each sub-word and the rest sub-words; calculating the absolute value of the difference value of the order of each sub word and the rest sub words; calculating the ratio of the sum value of each subword to the other subwords to the absolute value of the difference value; taking the sum of the ratio obtained by calculating each sub word and all other sub words as the sub word frequency sensitivity coefficient of each sub word in the sentence;
Acquiring sub word frequency sensitive dense indexes of each sentence of the paragraph according to the data distribution characteristics of all sub words in each sentence of the paragraph, wherein the expression is as follows:
In the method, in the process of the invention, Frequency-sensitive dense index of sub-words representing the c-th sentence in paragraph t,/>, andRepresenting the total number of subwords contained in the c-th sentence,/>Representing total number of key sub-words of scientific and technological achievement contained in paragraph t,/>As an exponential function based on natural constants,/>Representing the difference between the order values of the ith sub-word and the kth scientific and technological achievement key data sub-word in the paragraph,Sub-word frequency sensitivity coefficient representing the ith sub-word of the c-th sentence,/>A sensitivity score representing an ith sub-word in the c-th sentence;
The sensitive dense similarity of the paragraphs of each paragraph is specifically:
For each paragraph, starting from the second sentence, taking the semantic similarity distance between each sentence and the previous sentence as the previous semantic similarity distance of each sentence; the semantic similarity distance between each sentence and the following sentence is used as the following semantic similarity distance of each sentence; calculating the calculation result of adding the front semantic similarity distance and the rear semantic similarity distance of each sentence; calculating the average value of sub-word frequency sensitive dense indexes of each sentence, the sentence before each sentence and the sentence after each sentence; calculating the ratio of the average value to the calculation result of each sentence; and taking the average value of the ratio of the average value of all sentences of each paragraph to the calculation result as the paragraph sensitive dense similarity of each paragraph.
2. The method for securely storing data of a scientific and technological achievement sharing platform according to claim 1, wherein the step of counting occurrence frequencies of each sub-word in a paragraph includes:
Taking all the sub-words as input, and outputting the sub-words processed by the stop words by adopting a stop word list; and taking the subwords processed by the stop words of each paragraph as input, and outputting the occurrence frequency of each subword in the paragraph by adopting an N-gram model.
3. The method for securely storing data of a scientific and technological achievement sharing platform according to claim 1, wherein the sensitivity score is specifically:
acquiring a key data set of scientific and technological achievements; if each sub-word in the sentence belongs to the scientific and technological achievement key data set, taking 3 as the sensitive score of each sub-word in the sentence; if each sub-word in the sentence does not belong to the scientific and technological achievement key data set, 1 is used as the sensitivity score of each sub-word in the sentence.
4. The method for securely storing data of a scientific and technological achievement sharing platform according to claim 3, wherein the step of obtaining a scientific and technological achievement key data set is specifically:
the key data of technological achievement is composed of English letters, greek letters commonly used in basic mathematics and numerals; and taking the set of the scientific and technological achievement key data as a scientific and technological achievement key data set.
5. The method for securely storing data of a scientific and technological achievement sharing platform according to claim 1, wherein the dividing the subword window for each subword and combining the semantic vector of each subword to obtain the local vector sequence of each subword window comprises:
For each sentence, taking each sub-word as a center, and respectively taking m sub-words forwards and backwards to form each sub-word window with each sub-word; and forming the semantic vectors of all the subwords in each subword window into a local vector sequence of each subword window, wherein m is a preset value.
6. The method for securely storing data of a scientific and technological achievement sharing platform according to claim 1, wherein the semantic similarity distance between two sentences is obtained according to the similarity between local vector sequences of sub-word windows in the two sentences, and the expression is:
In the method, in the process of the invention, Represents the/>Sentences and/>Semantic similarity distance of individual sentences,/>Represents the/>Number of subword windows in each sentence,/>Represents the/>Number of subword windows in each sentence,/>To adjust the parameters,/>Represents the/>First/>, in the sentenceLocal vector sequence of subword window,/>Represents the/>First/>, in the sentenceLocal vector sequence of subword window,/>Representing a cosine similarity function.
7. The method for securely storing data of a scientific and technological achievement sharing platform according to claim 1, wherein the step of obtaining random numbers according to distribution of paragraph sensitive dense similarity of all paragraphs includes:
The mean value of the sensitive dense similarity of all paragraphs is rounded to a rounded value as a random number.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410288260.7A CN117892331B (en) | 2024-03-14 | 2024-03-14 | Data security storage method of scientific and technological achievement sharing platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410288260.7A CN117892331B (en) | 2024-03-14 | 2024-03-14 | Data security storage method of scientific and technological achievement sharing platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117892331A CN117892331A (en) | 2024-04-16 |
CN117892331B true CN117892331B (en) | 2024-05-24 |
Family
ID=90649098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410288260.7A Active CN117892331B (en) | 2024-03-14 | 2024-03-14 | Data security storage method of scientific and technological achievement sharing platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117892331B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118468321B (en) * | 2024-07-11 | 2024-10-18 | 山东圣剑医学研究有限公司 | Basic research data encryption storage method based on block chain technology |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013101679A (en) * | 2013-01-30 | 2013-05-23 | Nippon Telegr & Teleph Corp <Ntt> | Text segmentation device, method, program, and computer-readable recording medium |
CN110096710A (en) * | 2019-05-09 | 2019-08-06 | 董云鹏 | A kind of article analysis and the method from demonstration |
CN113239148A (en) * | 2021-05-14 | 2021-08-10 | 廖伟智 | Scientific and technological resource retrieval method based on machine reading understanding |
WO2021164302A1 (en) * | 2020-09-07 | 2021-08-26 | 平安科技(深圳)有限公司 | Sentence vector generation method, apparatus, device and storage medium |
CN114265936A (en) * | 2021-12-23 | 2022-04-01 | 深圳供电局有限公司 | Method for realizing text mining of science and technology project |
CN114462424A (en) * | 2022-04-12 | 2022-05-10 | 北京思源智通科技有限责任公司 | Method, system, readable medium and device for analyzing and annotating article paragraphs |
CN114490959A (en) * | 2021-07-18 | 2022-05-13 | 北京理工大学 | Keyword-driven dynamic graph neural network multi-hop reading understanding method |
CN115659954A (en) * | 2022-10-31 | 2023-01-31 | 北京工业大学 | Composition automatic scoring method based on multi-stage learning |
CN117648724A (en) * | 2024-01-30 | 2024-03-05 | 北京点聚信息技术有限公司 | Data security transmission method for layout file |
-
2024
- 2024-03-14 CN CN202410288260.7A patent/CN117892331B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013101679A (en) * | 2013-01-30 | 2013-05-23 | Nippon Telegr & Teleph Corp <Ntt> | Text segmentation device, method, program, and computer-readable recording medium |
CN110096710A (en) * | 2019-05-09 | 2019-08-06 | 董云鹏 | A kind of article analysis and the method from demonstration |
WO2021164302A1 (en) * | 2020-09-07 | 2021-08-26 | 平安科技(深圳)有限公司 | Sentence vector generation method, apparatus, device and storage medium |
CN113239148A (en) * | 2021-05-14 | 2021-08-10 | 廖伟智 | Scientific and technological resource retrieval method based on machine reading understanding |
CN114490959A (en) * | 2021-07-18 | 2022-05-13 | 北京理工大学 | Keyword-driven dynamic graph neural network multi-hop reading understanding method |
CN114265936A (en) * | 2021-12-23 | 2022-04-01 | 深圳供电局有限公司 | Method for realizing text mining of science and technology project |
CN114462424A (en) * | 2022-04-12 | 2022-05-10 | 北京思源智通科技有限责任公司 | Method, system, readable medium and device for analyzing and annotating article paragraphs |
CN115659954A (en) * | 2022-10-31 | 2023-01-31 | 北京工业大学 | Composition automatic scoring method based on multi-stage learning |
CN117648724A (en) * | 2024-01-30 | 2024-03-05 | 北京点聚信息技术有限公司 | Data security transmission method for layout file |
Non-Patent Citations (3)
Title |
---|
Automatic textual Knowledge Extraction based on Paragraph Constitutive Relations;Zuquan Peng 等;2019 6th International Conference on Systems and Informatics (ICSAI);20191231;全文 * |
基于RNN和CNN的蒙汉神经机器翻译研究;包乌格德勒;赵小兵;;中文信息学报;20180815(第08期);全文 * |
基于编辑距离的词序敏感相似度度量方法;张雷;崔荣一;;延边大学学报(自然科学版);20200620(第02期);全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN117892331A (en) | 2024-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN117892331B (en) | Data security storage method of scientific and technological achievement sharing platform | |
US10917235B2 (en) | High-precision privacy-preserving real-valued function evaluation | |
Kim et al. | Forward secure dynamic searchable symmetric encryption with efficient updates | |
US10810313B2 (en) | System and method for preserving privacy of data in the cloud | |
Xiang et al. | A linguistic steganography based on word indexing compression and candidate selection | |
Mohan et al. | An efficient technique for cloud storage using secured de-duplication algorithm | |
CN113609284B (en) | Automatic text abstract generation method and device integrating multiple semantics | |
WO2021212221A1 (en) | Method and system for confidential string-matching and deep packet inspection | |
Li et al. | Topic-aware neural linguistic steganography based on knowledge graphs | |
He et al. | A retrieval algorithm of encrypted speech based on syllable-level perceptual hashing | |
CN115296862B (en) | Network data safety transmission method based on data coding | |
He et al. | Stretching cube attacks: improved methods to recover massive superpolies | |
Wu et al. | Research on coverless text steganography based on single bit rules | |
CN115442024A (en) | Chaos-based MapReduce data compression information protection method | |
Arısoy | LZW-CIE: a high-capacity linguistic steganography based on LZW char index encoding | |
Forré | Methods and instruments for designing S-boxes | |
Giné et al. | Kernel density estimators: convergence in distribution for weighted sup-norms | |
Wu et al. | Coverless text steganography based on maximum variable bit embedding rules | |
Zhang et al. | An efficient retrieval approach for encrypted speech based on biological hashing and spectral subtraction | |
Bourdon | Size and path length of Patricia tries: dynamical sources context | |
Voloshynovskiy et al. | Conception and limits of robust perceptual hashing: towards side information assisted hash functions | |
WO2022170092A1 (en) | Method and apparatus for comparing and ranking long documents | |
Qiao | Integer least squares: Sphere decoding and the LLL algorithm | |
CN114610843A (en) | Multi-keyword fuzzy ciphertext retrieval method and system | |
Wu et al. | Generative Text Steganography with Large Language Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |