CN113315623B - Symmetric encryption method for DNA storage - Google Patents

Info

Publication number
CN113315623B
CN113315623B CN202110557922.2A CN202110557922A CN113315623B CN 113315623 B CN113315623 B CN 113315623B CN 202110557922 A CN202110557922 A CN 202110557922A CN 113315623 B CN113315623 B CN 113315623B
Authority
CN
China
Prior art keywords
sequence
sequences
dna
binary
dna storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110557922.2A
Other languages
Chinese (zh)
Other versions
CN113315623A (en)
Inventor
刘文斌
昝乡镇
姚祥宇
李树栋
许�鹏
方刚
陈智华
石晓龙
鲍振申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202110557922.2A priority Critical patent/CN113315623B/en
Publication of CN113315623A publication Critical patent/CN113315623A/en
Application granted granted Critical
Publication of CN113315623B publication Critical patent/CN113315623B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/0618Block ciphers, i.e. encrypting groups of characters of a plain text message using fixed encryption transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0816Key establishment, i.e. cryptographic processes or cryptographic protocols whereby a shared secret becomes available to two or more parties, for subsequent use
    • H04L9/0819Key transport or distribution, i.e. key establishment techniques where one party creates or otherwise obtains a secret value, and securely transfers it to the other(s)

Abstract

The invention provides a symmetric encryption method for DNA storage, comprising the following steps: acquiring a file to be encrypted, and encrypting the binary information of the file according to an encryption key to obtain DNA storage sequences; performing confusion operation processing on the DNA storage sequences; synthesizing the DNA storage sequences subjected to the confusion operation processing to obtain DNA molecule sequences and storing them; and sequencing the DNA molecule sequences to obtain read lengths (reads), then decrypting the read lengths with the encryption key to recover the binary information. Under different error rates, the data recovery rate of the method rises as the sequencing depth increases. The method also effectively reduces information redundancy in the encryption and decryption processes, is more robust, is difficult to crack maliciously, and offers better confidentiality. It can be widely applied in the technical field of systems biology research.

Description

Symmetric encryption method for DNA storage
Technical Field
The invention relates to the technical field of system biology research, in particular to a symmetric encryption method for DNA storage.
Background
With the development of cloud computing and big data technology, the human demand for data storage is growing exponentially. International Data Corporation (IDC) predicts that the total amount of data generated worldwide will reach 175 ZB in 2025. Storing data on this scale poses a serious challenge to present storage technology (optical discs, hard discs and the like based on the electromagnetic principle) in terms of maintenance cost, service life and data reliability. Meanwhile, DNA, the carrier of genetic information, has attracted attention because of its high density, low energy consumption, long storage life and other features. In recent years, DNA storage has become a hot topic of interdisciplinary research.
Despite its many advantages over traditional storage techniques, DNA storage faces problems that stem from its own biochemical properties: 1) uneven amplification of DNA molecule sequences can cause certain DNA sequences to be lost; 2) base insertion, deletion and substitution errors occur within DNA sequences during synthesis and sequencing. To overcome these problems, many error correction algorithms have been proposed to improve the reliability of DNA storage.
Disclosure of Invention
In view of the above, to at least partially solve one of the above technical problems, embodiments of the present invention provide a symmetric encryption method for DNA storage, which can implement DNA storage of confidential data files.
The technical scheme of the application provides a symmetric encryption method for DNA storage, which comprises the following steps:
acquiring a file to be encrypted, and encrypting binary information of the file to be encrypted according to an encryption key to obtain a DNA storage sequence;
performing confusion operation processing on the DNA storage sequence;
synthesizing the DNA storage sequences after the confusion operation processing to obtain DNA molecule sequences, storing the DNA molecule sequences, sequencing the DNA molecule sequences to obtain read lengths, and decrypting the read lengths with the encryption key to obtain the binary information.
In a possible embodiment of the present disclosure, the step of obtaining a file to be encrypted, and encrypting binary information of the file to be encrypted according to an encryption key to obtain a DNA storage sequence includes:
grouping the binary information to obtain a plurality of binary groups, generating an index value part for each binary group, and forming a binary group to be modulated from the index value part and the binary group;
and generating a two-dimensional binary sequence according to the key sequence of the encryption key and the binary group to be modulated, and performing DNA base substitution on the two-dimensional binary sequence to obtain a DNA storage sequence.
In a possible embodiment of the present disclosure, the step of performing confusion operation processing on the DNA storage sequence comprises:
determining a sequence copy number threshold and an interference error rate threshold, constructing a first set, and storing the DNA storage sequences into the first set;
extracting a first sequence from the first set, and copying the first sequence to obtain a plurality of repeated sequences not less than the sequence copy number threshold;
and modifying the sequence paragraphs of the repeated sequences according to the interference error rate threshold, and adding and integrating the modified repeated sequences to obtain the DNA storage sequence subjected to the confusion operation processing.
In a possible embodiment of the solution of the present application, the method further comprises the steps of:
sequencing the DNA molecules to obtain read lengths, and correcting sequence paragraph modification contents in the read lengths according to the encryption key;
grouping the corrected read lengths, and obtaining a consistency sequence from the grouping of the read lengths;
and sequencing the consistency sequence according to the index value part, removing the index value part of the sequenced consistency sequence, and performing binary conversion to obtain the binary information.
In a possible embodiment of the present disclosure, the step of sequencing the DNA molecule to obtain a read length, and correcting the sequence paragraph modification content in the read length according to the encryption key includes:
carrying out base-by-base replacement according to the read length to obtain modulation binary information, and integrating the modulation binary information to obtain an observation modulation code sequence;
and comparing the observed modulation code sequence with the key sequence of the encryption key through a global comparison algorithm, and correcting the sequence paragraph modification content in the read length according to the observed modulation code.
In a possible embodiment of the present disclosure, the step of grouping the corrected read lengths and obtaining a consistency sequence from the read length grouping includes:
generating correction information of the read length, and dividing the read length into a non-correction sequence and a correction sequence according to the correction information;
grouping the correction-free sequences according to the index value part to obtain a second set;
filtering out sequences in the second set according to index value parts and/or expanding the second set according to the corrected sequences;
extracting a high-frequency storage sequence from the second set after the screening and/or the expansion as the consistency sequence.
In a possible embodiment of the solution of the present application, the step of sifting out the sequences in the second set according to the index value part includes:
determining a grouping purification threshold value, determining that the Hamming distance mean value of the second sequence and other sequences in the second set is larger than the grouping purification threshold value, and deleting the second sequence from the second set;
the step of expanding the second set according to the corrected sequence, comprising:
sorting the corrected sequences according to the correction information and the Hamming distance of the index value part to obtain a first list;
determining that the Hamming distance mean of a third sequence in the first list to the other sequences in the first list is less than the grouping purification threshold, and adding the third sequence to the second set.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
The technical scheme of the application performs DNA storage encryption based on the idea of information modulation. First, a computer digital file is converted into an encrypted DNA storage ciphertext using a key, following the principle of symmetric encryption; then the physical sequence confusion and the sequence confusion caused by biochemical processes are fully exploited to realize DNA storage encryption. Under different error rates, the data recovery rate rises as the sequencing depth increases; the method also effectively reduces information redundancy in the encryption and decryption processes, is more robust, is difficult to crack maliciously, and offers better confidentiality.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a flow chart of a symmetric encryption method for DNA storage according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the structure of an encoded storage sequence according to an embodiment of the present invention;
FIG. 3 is a flow chart of an algorithm for decoding DNA data according to an embodiment of the present invention;
FIG. 4 is a line graph of data recovery rates at different sequencing depths under high error rates according to an embodiment of the present invention;
FIG. 5 is a line graph of the accuracy of key inference at different sequencing depths under different error rates, provided by an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
The technical scheme of the application provides a symmetric DNA storage encryption method based on the idea of information modulation. The method first converts a computer digital file into an encrypted DNA storage ciphertext using a key, and then fully exploits physical sequence confusion and the sequence confusion caused by biochemical processes (that is, the insertion, deletion and substitution errors of bases in a sequence caused by the DNA storage process) to realize DNA storage encryption. In addition, the key strength of the method is related to the length of the storage sequence, namely 2^M, where M is the DNA storage sequence length. For example, if the length of the DNA storage sequence is 128, the key strength reaches 2^128.
In a first aspect, as shown in fig. 1, to solve the technical problems pointed out in the foregoing background art, the present application provides a DNA storage encryption method based on an information modulation idea, which has the characteristics of no redundancy and high robustness, and includes steps S100-S300:
S100, acquiring a file to be encrypted, and encrypting binary information of the file to be encrypted according to an encryption key to obtain a DNA storage sequence;
Specifically, the sending end in the embodiment selects a key and encrypts the binary information of the file to be encrypted (a computer file) to obtain a series of DNA storage sequences, namely the ciphertext.
S200, performing confusion operation processing on the DNA storage sequence;
The "confusion operation" here refers to operations such as the insertion, deletion, or substitution errors of bases in a sequence during DNA storage.
S300, synthesizing a plurality of DNA storage sequences subjected to confusion operation processing to obtain DNA molecule sequences, and storing the DNA molecule sequences;
The DNA molecule sequences are sequenced to obtain read lengths, and the read lengths are decrypted with the encryption key to obtain the binary information. Specifically, the synthesized DNA storage sequences are DNA molecule sequences that can be stored in vitro or in vivo: the embodiment synthesizes the encoded DNA storage sequences in vitro to form DNA molecules, which can then be stored using currently available in vivo or in vitro storage methods. The receiving end obtains the ciphertext, namely the DNA molecule sequences, from the in vitro or in vivo storage space, sequences the encrypted and stored DNA molecule sequences, decrypts the read lengths (reads) obtained by sequencing using the key distributed by the sending end, and recovers the plaintext of the digital file, such as the binary information, that was encrypted and stored in the DNA molecules.
In some alternative embodiments, the DNA storage sequence in embodiments comprises an index value portion and a data field portion, depending on the characteristics of in vitro storage and in vivo storage;
Specifically, the embodiment uses a DNA storage unit of M bases (M > 0), which can store M bits of binary information. The index value part (index) identifies the order of the storage unit and occupies N bases (0 ≤ N ≤ M) of the DNA storage unit. Thus, one storage sequence is composed of an index value part and a data field part. FIG. 2 shows a storage sequence structure in which M is 64 and N is 8.
Based on the feature that the DNA storage sequence includes the index value portion and the data field portion, a step S100 of obtaining a file to be encrypted, and encrypting binary information of the file to be encrypted according to an encryption key to obtain the DNA storage sequence may include steps S110 and S120:
S110, grouping the binary information to obtain a plurality of binary groups, generating an index value part according to the binary groups, and forming the binary groups to be modulated according to the index value part and the binary groups;
Specifically, the embodiment groups the binary data of the file to be encrypted in units of (M-N) bits; the last group may be shorter than (M-N) bits and is grouped at its actual length. The embodiment then adds N bits of index information to the head of each resulting binary group to form a series of binary storage sequences.
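The grouping in step S110 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `group_with_index` is hypothetical, and the index is assumed to be the ordinal number of the group written as an N-bit binary counter (the text above says only that N bits of index information are added to the head of each group).

```python
def group_with_index(bits: str, m: int = 64, n: int = 8) -> list[str]:
    """Split a bit string into (m - n)-bit groups and prefix each with an n-bit index.

    Assumption (not stated in the source): the index is the group's ordinal
    number, zero-padded to n bits.
    """
    if not 0 < n < m:
        raise ValueError("this sketch assumes 0 < n < m")
    payload_len = m - n
    # The last chunk may be shorter than payload_len and is kept at its actual length.
    chunks = [bits[i:i + payload_len] for i in range(0, len(bits), payload_len)]
    if len(chunks) > 2 ** n:
        raise ValueError("n index bits cannot number this many groups")
    return [format(i, f"0{n}b") + chunk for i, chunk in enumerate(chunks)]
```

With M = 64 and N = 8 as in FIG. 2, each full group carries 56 payload bits, so a 120-bit input yields two full groups and one short group.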
S120, generating a two-dimensional binary sequence according to the key sequence of the encryption key and the binary group to be modulated in the step S110, and performing DNA base substitution on the two-dimensional binary sequence to obtain a DNA storage sequence; the two-dimensional binary sequence comprises a first row of encryption keys and a second row of binary groups to be modulated;
Specifically, the embodiment generates a binary key of length M and, bit by bit, replaces each pair formed by a key bit (modulation code bit) and the corresponding bit of a binary storage sequence obtained in step S110 with the DNA base given in Table 1, thereby obtaining a DNA storage sequence.
Table 1
Modulation code bit    0    0    1    1
Information bit        0    1    0    1
DNA base               A    T    C    G
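As an illustration of the substitution in Table 1, the following Python sketch maps each (modulation code bit, information bit) pair to a base. The names `TABLE_1` and `modulate` are hypothetical; only the mapping itself comes from the table.

```python
# (modulation code bit, information bit) -> DNA base, as listed in Table 1.
TABLE_1 = {("0", "0"): "A", ("0", "1"): "T", ("1", "0"): "C", ("1", "1"): "G"}

def modulate(key: str, info: str) -> str:
    """Combine a key with an information bit string, position by position."""
    if len(info) > len(key):
        raise ValueError("the information string must not be longer than the key")
    # zip stops at the shorter string, so a short final group uses a key prefix.
    return "".join(TABLE_1[(k, b)] for k, b in zip(key, info))

print(modulate("0011", "0101"))  # ATCG
```

Read this way, the key bit selects the base class (A/T for 0, C/G for 1) and the information bit selects the base within that class.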
Further, in step S120, the process of generating the key of M length in the embodiment follows the following two principles:
Principle 1: the numbers of 0s and 1s in the key sequence are approximately equal; for example, the proportion of 1s is between 40% and 60%;
Principle 2: the embodiment randomly generates a binary sequence of length M in which the number of consecutive 0s or consecutive 1s is less than a given threshold γ, where γ is a non-negative integer less than 4.
In some alternative embodiments, the step S200 of performing the confusion operation processing on the DNA storage sequence may further include the steps S210 to S230:
S210, determining a sequence copy number threshold and an interference error rate threshold, constructing a first set, and storing DNA storage sequences into the first set;
Here the first set is a set X of DNA storage sequences. Specifically, the embodiment first determines a sequence copy number threshold (a non-negative integer) and an interference error rate threshold E; for example, in the embodiment the interference error rate threshold takes a value in the range 0.2 ≤ E ≤ 0.35. The embodiment then sets up an empty set Y and stores the DNA storage sequences in the set X.
S220, extracting a first sequence from the first set, and copying the first sequence to obtain a plurality of repeated sequences not less than a sequence copy number threshold;
Specifically, the embodiment takes a sequence x from the set X, deletes x from X, and copies x as many times as the sequence copy number threshold, forming that number of repeated copies of x.
S230, modifying the sequence paragraphs of the repeated sequences according to an interference error rate threshold, and adding and integrating the modified repeated sequences to obtain a DNA storage sequence subjected to confusion operation processing;
Specifically, for each of the repeated copies of x, the embodiment selects error positions with probability E and generates a substitution error with equal probability at each error position. In addition, the errors added to the sequences must satisfy the following criterion: the error positions of different copies cannot overlap. After this is done, the copies of x carrying errors are added to the set Y. Steps S220 and S230 are repeated until the set X is empty, and the set Y is output; the set Y contains the DNA storage sequences that are finally synthesized.
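Steps S210 to S230 can be sketched in Python as follows. The function name `confuse` and its parameters are hypothetical. Two details are assumptions of this sketch rather than statements of the source: a substituted base is drawn uniformly from the three other bases, and the non-overlap rule is enforced by skipping positions already altered in an earlier copy of the same sequence.

```python
import random

BASES = "ACGT"

def confuse(sequences, copies: int, error_rate: float, rng=None):
    """Return `copies` error-carrying copies of every input sequence (the set Y)."""
    rng = rng or random.Random()
    output = []
    for seq in sequences:
        used = set()  # positions already altered in some copy of this sequence
        for _ in range(copies):
            chars = list(seq)
            for pos, base in enumerate(seq):
                if pos not in used and rng.random() < error_rate:
                    # Substitution error: pick one of the three other bases.
                    chars[pos] = rng.choice([b for b in BASES if b != base])
                    used.add(pos)
            output.append("".join(chars))
    return output
```

Because each position is altered in at most one copy, a base-by-base majority vote over the copies of one sequence still recovers the original base when there are at least three copies.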
In some optional embodiments, as shown in fig. 3, the embodiment method may further include decryption steps S400-S600:
S400, sequencing the DNA molecules to obtain read lengths, and correcting sequence paragraph modification contents in the read lengths according to the encryption key;
Specifically, the embodiment sequences the DNA molecules corresponding to the digital file to be restored, performs error detection on the read lengths (reads) obtained by sequencing according to the key, and preliminarily corrects the insertion and deletion errors occurring in the reads.
S500, grouping the corrected read lengths, and obtaining a consistency sequence from the read length grouping;
specifically, the embodiment groups the resulting reads and generates a unique consensus sequence for each packet.
S600, sequencing the consistency sequence according to the index value part, removing the index value part of the sequenced consistency sequence, and performing binary conversion to obtain binary information;
Specifically, the embodiment sorts the consistency sequences in ascending order of index value and removes the index part of each consistency sequence; it then converts the consistency sequences into binary sequences in order and splices the binary sequences to form the computer binary file. Illustratively, in the embodiment, the consistency sequence is converted into a binary sequence according to the following principles:
Principle 1: the consistency sequence is converted base by base into the corresponding binary bits to form a binary sequence;
Principle 2: DNA bases are converted into binary according to the rule A -> 0, T -> 1, C -> 0 and G -> 1.
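Step S600 can be sketched in Python as below. The names `to_bits` and `reassemble` are hypothetical, and the sketch assumes that the first n decoded bits of each consistency sequence form a binary counter giving its position in the file.

```python
# Base -> information bit, following the rule A->0, T->1, C->0, G->1.
INFO_BIT = {"A": "0", "T": "1", "C": "0", "G": "1"}

def to_bits(sequence: str) -> str:
    """Convert a DNA sequence to its information bits, base by base."""
    return "".join(INFO_BIT[base] for base in sequence)

def reassemble(consistency_sequences, n: int) -> str:
    """Sort by the n-bit index, drop the index, and splice the payloads."""
    decoded = [to_bits(seq) for seq in consistency_sequences]
    decoded.sort(key=lambda bits: int(bits[:n], 2))  # ascending index order
    return "".join(bits[n:] for bits in decoded)

print(reassemble(["ATGT", "CATC"], 2))  # 1011
```

In the example, "ATGT" decodes to index 01 with payload 11 and "CATC" to index 00 with payload 10, so the spliced result is 1011.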
In some alternative embodiments, the step S400 of sequencing the DNA molecules to obtain read lengths and correcting the sequence paragraph modification content in the read lengths according to the encryption key may include steps S410 and S420:
S410, replacing base by base according to the read length to obtain modulation binary information, and integrating the modulation binary information to obtain an observed modulation code sequence;
Specifically, the embodiment replaces each read base by base with the corresponding modulation binary information, thereby forming a modulation binary sequence, also called the observed modulation code sequence.
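A short Python sketch of step S410 follows. It assumes the modulation bit of each base is read from Table 1 (A and T carry modulation bit 0, C and G carry 1); the names `MOD_BIT` and `observed_modulation_code` are hypothetical.

```python
# Base -> modulation code bit, read from Table 1.
MOD_BIT = {"A": "0", "T": "0", "C": "1", "G": "1"}

def observed_modulation_code(read: str) -> str:
    """Replace every base of a read with its modulation code bit."""
    return "".join(MOD_BIT[base] for base in read)

print(observed_modulation_code("ATCG"))  # 0011
```

For an error-free read, this string equals the key sequence, which is what makes the later comparison against the key possible.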
S420, comparing the observed modulation code sequence with a key sequence of an encryption key through a global comparison algorithm, and correcting the modified content of the sequence paragraph in the read length according to the observed modulation code;
Specifically, the embodiment aligns the observed modulation code sequence with the key sequence using a global alignment algorithm, such as the Needleman-Wunsch algorithm, and then calibrates the corresponding read lengths (reads) according to the observed modulation code to form corrected sequences (reads).
It should be noted that, in the embodiment, the corresponding read length is calibrated according to the observed modulation code, and the calibration is performed according to the following principle:
Principle 1: if the character of the observed modulation code at an aligned position is "-" (a gap), a base is inserted at the corresponding position of the read: if the corresponding key character is "0", an "A" base or a "T" base is inserted, each with a probability of 50%; otherwise, a "C" base or a "G" base is inserted, each with a probability of 50%.
Principle 2: if the character of the observed modulation code differs from the corresponding key character, the base at the corresponding position of the read is changed: if it is "A" or "T", it is converted into a "C" base or a "G" base (each with a probability of 50%); if it is "C" or "G", it is converted into an "A" base or a "T" base (each with a probability of 50%).
Principle 3: if the character of the original modulation code (the key) at an aligned position is "-", and therefore differs from the character of the observed modulation code, the base at the corresponding position of the read is deleted.
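Step S420 and the three calibration principles can be sketched in Python as follows. This is a minimal sketch: the function names `align` and `correct_read`, the scoring values (match 1, mismatch -1, gap -1) and the tie-breaking order in the traceback are all assumptions, since the source names only the alignment algorithm.

```python
import random

MOD_BIT = {"A": "0", "T": "0", "C": "1", "G": "1"}
BASES_FOR = {"0": "AT", "1": "CG"}  # bases allowed under each modulation bit

def align(observed: str, key: str, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns both gapped strings."""
    rows, cols = len(observed) + 1, len(key) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if observed[i - 1] == key[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap, score[i][j - 1] + gap)
    a, b, i, j = [], [], len(observed), len(key)
    while i > 0 or j > 0:  # trace back from the bottom-right corner
        sub = match if i and j and observed[i - 1] == key[j - 1] else mismatch
        if i and j and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(observed[i - 1]); b.append(key[j - 1]); i -= 1; j -= 1
        elif i and score[i][j] == score[i - 1][j] + gap:
            a.append(observed[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(key[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

def correct_read(read: str, key: str, rng=None):
    """Return (corrected read, per-base flags; True means the base was corrected)."""
    rng = rng or random.Random()
    obs, ref = align("".join(MOD_BIT[b] for b in read), key)
    out, flags, pos = [], [], 0  # pos indexes the original read
    for o, k in zip(obs, ref):
        if o == "-":    # principle 1: insert a base matching the key bit
            out.append(rng.choice(BASES_FOR[k])); flags.append(True)
        elif k == "-":  # principle 3: delete the extra base
            pos += 1
        elif o != k:    # principle 2: switch to the other base class
            out.append(rng.choice(BASES_FOR[k])); flags.append(True); pos += 1
        else:           # modulation bit agrees with the key: keep the base
            out.append(read[pos]); flags.append(False); pos += 1
    return "".join(out), flags
```

The corrected read always has the length of the key and its modulation code always equals the key; the flags are the per-base correction information used by the later grouping steps.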
In some alternative embodiments, the step S500 of grouping the corrected read lengths and obtaining the consistency sequence from the read length grouping may include the steps S510 to S530:
S510, generating correction information of the read length, and dividing the read length into a non-correction sequence and a correction sequence according to the correction information;
Specifically, while correcting the read lengths (reads) obtained by sequencing according to the key, the embodiment retains, for each base in a read, the information on whether that base was corrected; this is called the read length correction information.
S520, grouping the correction-free sequences according to the index value part to obtain a second set;
Specifically, from the corrected sequences, the embodiment extracts those in which the correction information of every base of the index is "uncorrected" and forms the set T, namely the second set; sequences in which the correction information of any base of the index is "corrected" are placed, as the remaining sequences, in the set U.
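The split into the sets T and U can be sketched in Python as below. The function name `split_by_index_flags` is hypothetical, and each read is assumed to be represented as a pair of the corrected sequence and a list of per-base flags in which True means "corrected".

```python
def split_by_index_flags(reads, n: int):
    """Split (sequence, flags) pairs by whether any of the n index bases was corrected.

    Returns (T, U): T holds reads whose index is entirely uncorrected,
    U holds the remaining reads.
    """
    t_set, u_set = [], []
    for sequence, flags in reads:
        target = u_set if any(flags[:n]) else t_set
        target.append((sequence, flags))
    return t_set, u_set
```

Only the flags of the first n positions matter here; corrections in the data field do not move a read out of T.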
S530, screening out sequences in the second set according to the index value part, and/or expanding the second set according to the corrected sequences; extracting the high-frequency storage sequences from the second set after the screening and/or expansion as the consistency sequences.
Specifically, the embodiment groups the sequences of the set T by index value, placing sequences with the same index value in one group, and records the result as a set C; each element of C is a group containing the corrected sequences that share an index value. The embodiment then purifies each group in C according to the base content of its sequences, removing from the group the members that do not belong to it and adding them to the set U. At the same time, the embodiment expands each group in C by selecting matching sequences from the set U, until the expanded group satisfies the mature-group condition or every sequence in U has been traversed. Finally, for each group in C, the embodiment generates the high-frequency storage sequence of the group by base-by-base majority voting, takes it as the consistency sequence, and puts it into the set H.
It should be noted that a mature group in step S530 is a group in which, at any given position, the number of sequences whose correction information at that position is "uncorrected" is greater than a specified threshold θ, where θ is a non-negative integer; in the embodiment, θ is set according to the sampling sequencing depth.
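The base-by-base majority vote that produces a consistency sequence can be sketched in Python as follows. The function name `consensus` is hypothetical; the sketch assumes all sequences in a group have the same length (as corrected reads do) and leaves ties to whichever base `Counter` reports first.

```python
from collections import Counter

def consensus(group) -> str:
    """Most frequent base at every position across the sequences of one group."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # majority base in this column
        for column in zip(*group)
    )

print(consensus(["ACGT", "ACGA", "TCGT"]))  # ACGT
```

Each column of the example has a clear majority, so the isolated errors in the second and third sequences are voted out.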
In some possible embodiments, the process of screening out the sequences in the second set according to the index value portion may include steps S530a to S530d:
Step S530a: the embodiment determines the grouping purification threshold τ, for example in the range 0 < τ < M, and sets G to an empty set.
Step S530b: the embodiment takes each sequence in the group in turn as the second sequence and calculates the mean of the Hamming distances between that sequence and the other sequences in the group; if the mean is less than or equal to τ, the sequence is added to the set G. The Hamming distance between two sequences is the number of positions at which the corresponding characters of the two sequences differ.
Step S530c: sequences that belong to the group but not to G are added to the set U.
Step S530d: the set G is assigned to the original group, so that the group now contains only the sequences in G.
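Steps S530a to S530d can be sketched in Python as below. The names `hamming` and `purify` are hypothetical, and a group with a single member is assumed to keep that member, a case the source does not address.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def purify(group, tau: float):
    """Return (G, rejected): members with mean distance <= tau, and the rest."""
    kept, rejected = [], []
    for i, seq in enumerate(group):
        others = group[:i] + group[i + 1:]
        mean = sum(hamming(seq, o) for o in others) / len(others) if others else 0.0
        (kept if mean <= tau else rejected).append(seq)
    return kept, rejected

print(purify(["AAAA", "AAAT", "AAAA", "GGGG"], 2))
# (['AAAA', 'AAAT', 'AAAA'], ['GGGG'])
```

The rejected sequences correspond to those moved to the set U in step S530c, and the kept list plays the role of G in step S530d.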
In some possible embodiments, expanding the second set according to the corrected sequence may include steps S530 e-S530 h:
Step S530e: first, the embodiment sets up an empty list L.
Step S530f: the embodiment traverses each sequence s of the set U, counts the number s1 of bases in the index field of s whose read length correction information is "uncorrected", calculates the Hamming distance s2 between the index field of s and the index field of the group, and then adds the element (s, s1, s2) to the list L.
Step S530g: the embodiment sorts the elements of L by the following rule: descending order of the second attribute and, when the second attribute values are equal, ascending order of the third attribute.
Step S530h: the embodiment extracts the sequences contained in the elements of L in order and, for each, calculates the mean d of the Hamming distances between that sequence and all sequences in the group; if d is less than or equal to τ, the sequence is added to the group. Extraction stops when the group satisfies the mature-group condition.
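Steps S530e to S530h can be sketched in Python as follows. The names `is_mature` and `expand` are hypothetical. The sketch assumes each read is a pair of a sequence and per-base flags (True means "corrected"), and it reads the mature-group condition as "at every position, more than θ members are uncorrected", which is one interpretation of the description above.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def is_mature(group, theta: int) -> bool:
    """True if every position has more than theta uncorrected members."""
    columns = zip(*(flags for _, flags in group))
    return bool(group) and all(sum(not f for f in col) > theta for col in columns)

def expand(group, group_index: str, candidates, n: int, tau: float, theta: int):
    """Grow a non-empty group with candidates from U, best-ranked first."""
    if not group:
        raise ValueError("this sketch expects a non-empty group")
    # Rank by s1 (uncorrected index bases) descending, then s2 (index distance) ascending.
    ranked = sorted(
        candidates,
        key=lambda c: (-sum(not f for f in c[1][:n]), hamming(c[0][:n], group_index)),
    )
    group = list(group)
    for seq, flags in ranked:
        if is_mature(group, theta):
            break  # stop extracting once the group is mature
        d = sum(hamming(seq, member) for member, _ in group) / len(group)
        if d <= tau:
            group.append((seq, flags))
    return group
```

Ranking first by uncorrected index bases gives preference to candidates whose index is most trustworthy, with index distance to the group used only to break ties.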
The embodiment of the application performs DNA storage encryption based on the idea of information modulation; the sending end and the receiving end use the same key, and the key strength is related to the length of the storage sequence, namely 2^M. For example, M = 104 and the sequence is "10011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001".
The experimental results are as follows:
1) With a 10% error rate and 80X sequencing depth, the data can be completely recovered.
2) With a 20% error rate and 160X sequencing depth, the data can be completely recovered.
3) With a 30% error rate and 200X sequencing depth, the data recovery rate reaches 95.49%. Further increasing the sequencing depth achieves complete recovery of the data.
4) With a 40% error rate and 200X sequencing depth, the data recovery rate is 16.38%. The data can be recovered by increasing the sequencing depth.
In addition, as shown in FIG. 4 and FIG. 5, in another embodiment, after the DNA sequence ciphertexts were correctly grouped, the multiple sequence alignment software MAFFT, made available by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), was used to infer the key, and the experiment was repeated 1000 times. The key-cracking experimental data show that, at error rates of 0.2 and 0.3, the key inference accuracy consistently fluctuated between 50% and 56% as the sequencing depth increased from 100 to 1000. It should be noted that, in a real-world scenario, an attacker without the key cannot correctly group the DNA sequence ciphertexts, which further increases the difficulty of cracking the key.
From the above specific implementation process, it can be concluded that the technical solution provided by the present invention has the following advantages over the prior art:
the technical scheme of the application performs DNA storage encryption based on the idea of information modulation: first, a computer digital file is converted into an encrypted DNA storage ciphertext using a key under the symmetric encryption principle; the encryption is then strengthened by making full use of physical sequence confusion and the sequence confusion caused by biochemical processes. At each tested error rate, the data recovery rate rises as the sequencing depth increases. The method can also effectively reduce information redundancy in the encryption and decryption processes, is more robust, is difficult to crack maliciously, and provides a better confidentiality effect.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise specified to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A symmetric encryption method for DNA storage, comprising the steps of:
acquiring a file to be encrypted, and encrypting binary information of the file to be encrypted according to an encryption key to obtain a DNA storage sequence;
performing confusion operation processing on the DNA storage sequence;
synthesizing a plurality of DNA storage sequences subjected to the confusion operation processing to obtain DNA molecule sequences, storing the DNA molecule sequences, obtaining read lengths by sequencing the DNA molecule sequences, and decrypting the read lengths according to the encryption key to obtain the binary information;
wherein the step of acquiring a file to be encrypted, and encrypting binary information of the file to be encrypted according to an encryption key to obtain a DNA storage sequence comprises:
grouping the binary information to obtain a plurality of binary groups, generating an index value part according to the binary groups, and forming a binary group to be modulated according to the index value part and the binary groups;
and generating a two-dimensional binary sequence according to the key sequence of the encryption key and the binary group to be modulated, and performing DNA base substitution on the two-dimensional binary sequence to obtain a DNA storage sequence.
2. The method of claim 1, wherein the step of performing the obfuscation operation on the DNA storage sequence comprises:
determining a sequence copy number threshold and an interference error rate threshold, constructing a first set, and storing the DNA storage sequences into the first set;
extracting a first sequence from the first set, and copying the first sequence to obtain a plurality of repeated sequences not less than the sequence copy number threshold;
and modifying the sequence paragraphs of the repeated sequences according to the interference error rate threshold, and adding and integrating the modified repeated sequences to obtain the DNA storage sequence subjected to the confusion operation processing.
3. The symmetric encryption method for DNA storage according to claim 2, further comprising the steps of:
sequencing the DNA molecules to obtain the read length, and correcting the sequence paragraph modification content in the read length according to the encryption key;
grouping the corrected read lengths, and obtaining a consistency sequence from the read length grouping;
and sequencing the consistency sequence according to the index value part, removing the index value part of the sequenced consistency sequence, and performing binary conversion to obtain the binary information.
4. The symmetric encryption method for DNA storage according to claim 3, wherein the step of sequencing the DNA molecule to obtain the read length, and correcting the sequence paragraph modification content in the read length according to the encryption key comprises:
carrying out base-by-base replacement according to the read length to obtain modulation binary information, and integrating the modulation binary information to obtain an observation modulation code sequence;
and comparing the observation modulation code sequence with the key sequence of the encryption key through a global comparison algorithm, and correcting the sequence paragraph modified content in the read length according to the observation modulation code.
5. The method of claim 3, wherein the step of grouping the corrected read lengths to obtain a consensus sequence from the read length groups comprises:
generating correction information of the read length, and dividing the read length into a non-correction sequence and a correction sequence according to the correction information;
grouping the correction-free sequences according to the index value part to obtain a second set;
filtering out sequences in the second set according to an index value part and/or expanding the second set according to the corrected sequences;
extracting a high frequency storage sequence from the second set after the screening and/or the expansion as the consistency sequence.
6. The symmetric encryption method for DNA storage according to claim 5, wherein the step of screening out the sequences in the second set according to the index value portion comprises:
determining a packet purification threshold, determining that the mean Hamming distance between a second sequence in the second set and the other sequences in the second set is greater than the packet purification threshold, and deleting the second sequence from the second set.
7. The method of claim 5, wherein the step of expanding the second set according to the corrected sequences comprises:
sorting the corrected sequences according to the correction information and the Hamming distance of the index value part to obtain a first list;
determining that a Hamming distance mean of a third sequence in the first list and other sequences in the first list is less than a packet purification threshold, adding the third sequence to the second set.
CN202110557922.2A 2021-05-21 2021-05-21 Symmetric encryption method for DNA storage Active CN113315623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110557922.2A CN113315623B (en) 2021-05-21 2021-05-21 Symmetric encryption method for DNA storage

Publications (2)

Publication Number Publication Date
CN113315623A (en) 2021-08-27
CN113315623B (en) 2023-01-24

Family

ID=77374047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110557922.2A Active CN113315623B (en) 2021-05-21 2021-05-21 Symmetric encryption method for DNA storage

Country Status (1)

Country Link
CN (1) CN113315623B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106817218A (en) * 2015-12-01 2017-06-09 国基电子(上海)有限公司 Encryption method based on DNA technique
CN111737956A (en) * 2020-06-24 2020-10-02 任兆瑞 DNA data storage coding method
CN111988144A (en) * 2020-08-18 2020-11-24 大连大学 DNA one-time pad image encryption method based on multiple keys

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018089567A1 (en) * 2016-11-10 2018-05-17 Life Technologies Corporation Methods, systems and computer readable media to correct base calls in repeat regions of nucleic acid sequence reads
RU2659025C1 (en) * 2017-06-14 2018-06-26 Общество с ограниченной ответственностью "ЛЭНДИГРАД" Methods of encoding and decoding information
CN110706751A (en) * 2019-09-25 2020-01-17 东南大学 DNA storage encryption coding method
CN112582030B (en) * 2020-12-18 2023-08-15 广州大学 Text storage method based on DNA storage medium
CN112802549B (en) * 2021-01-26 2022-05-13 武汉大学 Coding and decoding method for DNA sequence integrity check and error correction

Also Published As

Publication number Publication date
CN113315623A (en) 2021-08-27

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant