GB2619782A

GB2619782A - DNA storage coding optimization method based on double-strategy black spider algorithm

Info

Publication number: GB2619782A
Application number: GB2211537.2A
Authority: GB
Inventors: Zhang Qiang; Wang Bin; Wang Pengfei; Wu Jieqiong; Wei Xiaopeng
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2021-09-18
Filing date: 2022-05-27
Publication date: 2023-12-20
Also published as: GB202211537D0

Abstract

The present invention relates to a DNA storage coding optimization method based on a double-strategy black spider algorithm. The method comprises: constructing a DNA coding sequence set meeting a current constraint combination, performing fitness evaluation on sequences in the DNA coding sequence set, and sorting according to a result; introducing the double-strategy black spider algorithm to optimize the set, and obtaining optimized DNA coding sequences having high fitness; screening the optimized DNA coding sequences by means of combination constraint, and reserving sequences meeting the combination constraint; and merging the reserved sequence set into the DNA coding sequence set, and outputting an optimal coding sequence set meeting the combination constraint. The optimized double-strategy black spider algorithm is applied to the DNA coding sequence set, and the purpose of optimizing the sequence is achieved. The optimized sequence set has better performance in a fitness function; the optimized sequence set is screened by means of end constraint to construct a DNA coding sequence set having stable physical and thermodynamic characteristics.

Description

METHOD FOR OPTIMIZING CODING FOR DEOXYRIBONUCLEIC ACID (DNA) STORAGE BASED ON DUAL-POLICY BLACK WIDOW OPTIMIZATION (BWO) ALGORITHM

CROSS REFERENCE TO RELATED APPLICATION

[0001] This patent application claims the benefit and priority of Chinese Patent Application No. 202111101673.2, entitled "A method for optimizing coding for deoxyribonucleic acid (DNA) storage based on a dual-policy black widow optimization (BWO) algorithm", filed on September 18, 2021, the disclosure of which is incorporated by reference herein in its entirety as part of the present application.

TECHNICAL FIELD

[0002] The present disclosure relates to the field of coding design in DNA storage, and specifically, to a method for optimizing coding for DNA storage based on a BWO algorithm.

BACKGROUND ART

[0003] In the context of data explosion, DNA is considered an ideal carrier for information storage due to its advantages of high storage density, abundant resources, easy access, long storage time, and low energy consumption. In recent years, DNA storage technology has been continuously developed and applied, and various coding methods of DNA coding sets have also emerged.

[0004] Currently, a widely used approach is to combine intelligent, population-based algorithms with a DNA coding method. However, the population update rules and logic of some of these algorithms are simple, which makes them prone to insufficient population diversity and to becoming trapped in local optima. Due to the special biological properties of DNA sequences, sequences with low similarity can avoid non-specific hybridization as far as possible. Meanwhile, sequence sets with relatively stable physical and thermodynamic properties are less error-prone during storage. A high-quality DNA coding sequence set can therefore reduce error rates during storage. In addition to improving algorithm efficiency and quality and expanding storage sequence sets, improving the stability of the physical and thermodynamic properties of sequences is thus also an urgent issue to be addressed.

SUMMARY

[0005] The present disclosure aims to perform search and optimization on a DNA coding sequence set through a meta-heuristic algorithm, to finally obtain a DNA coding sequence set with more sequences and more stable physical and thermodynamic properties, thereby improving quality of the DNA coding sequence set and ensuring stability of DNA storage.

[0006] To achieve the above objectives, the present disclosure adopts the following technical solutions:

[0007] Firstly, a DNA coding sequence set that satisfies a current combined constraint is constructed, fitness of sequences in the DNA coding sequence set is evaluated, and the sequences are sorted by fitness; secondly, the DNA coding sequence set is optimized by using a dual-policy BWO algorithm, to obtain optimized DNA coding sequences with high fitness; thirdly, the optimized DNA coding sequences are screened again through a combined constraint, and sequences that satisfy the combined constraint are retained; and finally, the retained sequences are added into the DNA coding sequence set, and an optimal coding sequence set that satisfies the combined constraint is output.

[0008] The foregoing technical solution can achieve the following technical effects:

[0009] 1. A random swap policy and a weight-based selection policy are introduced to improve the development and exploration capabilities of the algorithm. The random swap policy can improve the diversity of sequences, and the weight-based selection policy selects different update methods in the process of generating next-generation sequences: in the first half part, finding an optimal solution around the sequences to improve the exploration capability of the algorithm; and in the second half part, finding an alternative solution far away from the current optimal solution to improve the development capability of the algorithm and prevent the sequence selection from falling into a local optimum.

[0010] 2. The optimized dual-policy BWO algorithm is applied to a DNA coding sequence set, to optimize the sequences. The optimized sequence set performs better in a fitness function.

[0011] 3. The optimized sequence set is screened through a terminal constraint to construct a DNA coding sequence set with stable physical and thermodynamic properties.

[0012] 4. The storage sequence set with stable physical and thermodynamic properties is applied to DNA storage to improve the stability and reliability of storage.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] FIG. 1 is a flowchart of a method for optimizing coding for DNA storage based on a dual-policy BWO algorithm.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0014] The technical solution in an embodiment of the present disclosure is now described clearly and completely with reference to the accompanying drawings for examples of the present disclosure. It will be understood that the described examples are merely a part of, rather than all, examples of the present disclosure. All other examples derived from the examples of the present disclosure by those skilled in the art without creative work shall fall within the protection scope of the present disclosure.

[0015] In the present disclosure, the algorithm is a dual-policy BWO algorithm, the constraints are Hamming distance, non-consecutiveness constraint, guanine-cytosine (GC) content, and terminal constraint, and the indicators used for measuring sequence stability are hairpin structure, melting temperature, and minimum free energy.

[0016] The two policies in the dual-policy BWO algorithm in the present disclosure are a random swap policy and a weight-based selection policy, which are used to improve the exploration and development capabilities of the algorithm. The random swap policy is used to enhance the diversity of sequences, and the weight-based selection policy is used to prevent results from falling into a local optimum and enhance the optimization. The entire sequence update is divided into four stages: initialization, offspring generation, cannibalism, and mutation.

[0017] In the present disclosure, the Hamming distance constraint means that any two sequences in a sequence set must differ in at least n bases to maintain stability. The GC content is the percentage of guanine (G) and cytosine (C) bases relative to the total number of bases in a sequence. Because different base pairs form different numbers of hydrogen bonds, a relatively stable state can usually be achieved when the GC content of the storage sequence remains at 40% to 60%. In order to maintain stable biological properties of the sequence, the GC content in this embodiment is 50%. The non-consecutiveness constraint means that there cannot be more than two consecutive identical bases in a storage sequence. The terminal constraint means that there cannot be three or more G or C bases among the last five bases at the terminal of the storage sequence, and its expression is as follows:

[0018] $|G_{S_{Last5}}| + |C_{S_{Last5}}| < 3$ (1)

[0019] where $|G_{S_{Last5}}|$ and $|C_{S_{Last5}}|$ respectively represent the number of guanine (G) bases and the number of cytosine (C) bases among the last five bases of a sequence S.

[0020] Referring to FIG. 1, the specific steps are as follows:

[0021] Step 1: Initialize a DNA coding sequence set Pop and parameters required for updating a DNA coding sequence.

[0022] For example, the parameters required for updating a DNA coding sequence include nPop, nvar, pMutation, pCannibalism, and MIter, where nPop represents a sequence size; nvar represents the number of sequence dimensions; pMutation represents a mutation rate; pCannibalism represents a cannibalism rate; and MIter represents the maximum number of iterations.
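The combined constraint described in paragraph [0017], which is used to screen candidate sequences for Pop, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: all function names are hypothetical, the GC content is fixed at exactly 50% as in the embodiment, and sequences are plain strings over A, C, G, and T.

```python
def gc_count(seq: str) -> int:
    """Number of G and C bases in the sequence."""
    return seq.count("G") + seq.count("C")


def has_long_run(seq: str, max_run: int = 2) -> bool:
    """True if some base repeats more than max_run times in a row."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


def terminal_ok(seq: str) -> bool:
    """Terminal constraint (1): fewer than three G/C among the last five bases."""
    tail = seq[-5:]
    return tail.count("G") + tail.count("C") < 3


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def satisfies_sequence_constraints(seq: str) -> bool:
    """GC content of exactly 50%, non-consecutiveness, and terminal constraint."""
    return 2 * gc_count(seq) == len(seq) and not has_long_run(seq) and terminal_ok(seq)


def can_join(seq: str, code_set: list[str], d_min: int) -> bool:
    """True if seq passes every constraint and is at least d_min away from each member."""
    return satisfies_sequence_constraints(seq) and all(
        hamming(seq, other) >= d_min for other in code_set
    )
```

For instance, `can_join("GCAGATCTAG", ["TCTGCAGAGT"], 8)` evaluates to `True`, because the two sequences differ in eight positions and the candidate passes the three single-sequence checks.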

[0023] Step 2: Use Hamming distance as a fitness function to perform fitness-based sorting on the initialized DNA coding sequence set Pop to obtain a current optimal solution.
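Fitness-based sorting with the Hamming distance can be sketched as follows. The disclosure does not state how a per-sequence fitness value is derived from pairwise distances, so this sketch assumes, purely for illustration, that the fitness of a sequence is its minimum Hamming distance to any other sequence in Pop; the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def fitness(i: int, pop: list[str]) -> int:
    """Assumed fitness: minimum Hamming distance from pop[i] to every other member."""
    return min(
        (hamming(pop[i], other) for j, other in enumerate(pop) if j != i),
        default=len(pop[i]),  # a lone sequence has no neighbour to collide with
    )


def sort_by_fitness(pop: list[str]) -> list[str]:
    """Return the sequences ordered from highest to lowest fitness."""
    order = sorted(range(len(pop)), key=lambda i: fitness(i, pop), reverse=True)
    return [pop[i] for i in order]


# The first element after sorting plays the role of the current optimal solution.
best = sort_by_fitness(["AAAA", "AAAT", "CGCG"])[0]
```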

[0024] Step 3: Obtain a parameter S in a weight-based selection policy.

[0025] For example, the parameter S is initialized as 0; if fitness of a current solution is smaller than the current optimal solution, the parameter S is divided by 2 to control the size; and if the fitness of the current solution is greater than the current optimal solution, the parameter S is increased by 1.

[0026] Step 4: Determine, during update of the DNA coding sequence set Pop, whether to use a random swap policy.

[0027] For example, during the sequence update, whether to use the random swap policy is determined by using formulas (2) and (3). If a parameter α is less than a parameter β, the random swap policy will be used to replace a dimension in Pop with a dimension of the current optimal solution; otherwise, the policy will not be used, and the current solution and optimal solution in Pop are maintained.

[0028] $\alpha = \tan(\pi \times (rand - 0.5))$ (2)

[0029] $\beta = 1 - CIter/MIter$ (3)

[0030] CIter is the current number of iterations, and rand is a random number in the interval (0,1).
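The decision rule of formulas (2) and (3) can be sketched as follows. The text does not say which dimension is replaced, so this sketch assumes that one randomly chosen position of the current solution is overwritten with the value at the same position of the current optimal solution; `random_swap` and its arguments are hypothetical names.

```python
import math
import random


def alpha(rng: random.Random) -> float:
    """Formula (2): tan(pi * (rand - 0.5)) with rand drawn uniformly."""
    return math.tan(math.pi * (rng.random() - 0.5))


def beta(c_iter: int, m_iter: int) -> float:
    """Formula (3): decreases linearly from 1 to 0 over the run."""
    return 1.0 - c_iter / m_iter


def random_swap(current: str, best: str, c_iter: int, m_iter: int,
                rng: random.Random) -> str:
    """Apply the random swap policy when alpha < beta, else keep the solution."""
    if alpha(rng) < beta(c_iter, m_iter):
        k = rng.randrange(len(current))  # assumed: a uniformly random position
        return current[:k] + best[k] + current[k + 1:]
    return current
```

Because β shrinks as CIter approaches MIter, the swap is applied less often late in the run under this reading.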

[0031] Step 5: Obtain parameters Wa and Wb in the weight-based selection policy, and update the DNA coding sequence set Pop based on the parameters Wa and Wb.

[0032] For example, the parameters Wa and Wb used in the weight-based selection policy are generated by using formulas (4) and (5).

[0033] $W_a = (1 - CIter/MIter)^{1 - (rand - 0.5) \times S/MIter}$ (4)

[0034] $W_b = (2 - 2 \times CIter/MIter)^{1 - (rand - 0.5) \times S/MIter}$ (5)

[0035] The parameter S is generated in step 3. Wa and Wb have different value ranges. Wa is used to improve the development capability of the algorithm, and Wb is used to improve the exploration capability of the algorithm.
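The update of S from step 3 and the two weights can be sketched as follows. The sketch assumes that the exponent in formulas (4) and (5) reads 1 − (rand − 0.5) × S/MIter, and it leaves S unchanged when the two fitness values are equal, a case the text does not cover; the function names are hypothetical.

```python
def update_s(s: float, current_fitness: float, best_fitness: float) -> float:
    """Step 3: halve S on a worse solution, add 1 on a better one."""
    if current_fitness < best_fitness:
        return s / 2
    if current_fitness > best_fitness:
        return s + 1
    return s  # assumption: the equal case is not specified in the text


def weights(c_iter: int, m_iter: int, s: float, rand: float) -> tuple[float, float]:
    """Formulas (4) and (5) under the assumed reading of the exponent."""
    exponent = 1.0 - (rand - 0.5) * s / m_iter
    w_a = (1.0 - c_iter / m_iter) ** exponent
    w_b = (2.0 - 2.0 * c_iter / m_iter) ** exponent
    return w_a, w_b


# At the start of a run with S = 0 the exponent is 1, so Wa = 1 and Wb = 2.
w_a, w_b = weights(c_iter=0, m_iter=100, s=0, rand=0.3)
```

Under this reading, the base of Wa lies in [0, 1] and the base of Wb in [0, 2], which is consistent with the statement that the two weights have different value ranges.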

[0036] Step 6: Use the Hamming distance as the fitness function to obtain fitness of sequences in the updated DNA coding sequence set Pop, perform fitness-based sorting, retain a sequence that satisfies the Hamming distance constraint, and add the sequence to a DNA coding sequence set Pop2.

[0037] For example, formula (6) is used to determine usage of the two parameters. In the first half of a sequence loop, Wa is used to generate offspring, that is, formula (7); in the second half of the sequence loop, Wb is used to generate offspring, that is, formula (8).

[0038] $CIter/MIter < 0.5$ (6)

[0039] $\begin{cases} y_1 = W_a \times rand \,.\times\, x_1 + (1 - rand) \,.\times\, x_2 \\ y_2 = W_a \times rand \,.\times\, x_2 + (1 - rand) \,.\times\, x_1 \end{cases} \quad CIter/MIter < 0.5$ (7)

[0040] $\begin{cases} y_1 = rand \,.\times\, x_1 + W_b \times (1 - rand) \,.\times\, x_2 \\ y_2 = rand \,.\times\, x_2 + W_b \times (1 - rand) \,.\times\, x_1 \end{cases} \quad CIter/MIter > 0.5$ (8)

[0041] Wa and Wb are generated by using formulas (4) and (5) respectively; x1 and x2 are sequences in the original DNA sequence set Pop; and y1 and y2 are sequences in the updated DNA sequence set Pop.

[0042] Fitness-based sorting is performed on the updated DNA sequence set Pop, a DNA sequence with low fitness is replaced, a DNA sequence with high fitness is retained in the current sequence set, and the retained sequences are stored in Pop2.
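Offspring generation by formulas (7) and (8) can be sketched as follows. The sketch assumes that each sequence is held as a vector of numbers (one per dimension), that rand is a vector of independent uniform draws applied element-wise, and that the boundary case CIter/MIter = 0.5 falls under formula (8). How the real-valued offspring are mapped back to bases is not described in the text and is not modelled here; `make_offspring` is a hypothetical name.

```python
import random


def make_offspring(x1: list[float], x2: list[float], w_a: float, w_b: float,
                   c_iter: int, m_iter: int,
                   rng: random.Random) -> tuple[list[float], list[float]]:
    """Produce two offspring from parents x1 and x2."""
    r = [rng.random() for _ in x1]  # one random coefficient per dimension
    if c_iter / m_iter < 0.5:
        # Formula (7): Wa scales the first term in the first half of the run.
        y1 = [w_a * ri * a + (1 - ri) * b for ri, a, b in zip(r, x1, x2)]
        y2 = [w_a * ri * b + (1 - ri) * a for ri, a, b in zip(r, x1, x2)]
    else:
        # Formula (8): Wb scales the second term in the second half of the run.
        y1 = [ri * a + w_b * (1 - ri) * b for ri, a, b in zip(r, x1, x2)]
        y2 = [ri * b + w_b * (1 - ri) * a for ri, a, b in zip(r, x1, x2)]
    return y1, y2
```

With both weights equal to 1, either branch reduces to a convex combination of the parents, so y1 + y2 equals x1 + x2 in every dimension.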

[0043] Step 7: Obtain, based on the number of sequences in the current DNA coding sequence set Pop and a mutation rate, a partial data set that should be mutated, randomly swap values in any two dimensions in the partial data set, and store the mutation result to a DNA coding sequence set Pop3.

[0044] For example, the number of sequences in the mutated partial data set is nMutation, as shown in formula (9). Values in two dimensions of a sequence in the partial data set are randomly swapped, and the mutation result is stored in Pop3.

[0045] $nMutation = Pop \times pMutation$ (9)

[0046] Step 8: Filter out, by using the fitness function, sequences that do not satisfy a combined constraint from the DNA coding sequence set Pop2 and the DNA coding sequence set Pop3, and remove the sequences.

[0047] For example, the physical and thermodynamic properties of a sequence are separately measured by using the hairpin structure, melting temperature, and free energy. A calculation formula for the hairpin structure is as follows:

[0048] $Hairpin(S) = \sum_{k} Hairpin(S, k)$ (10)

[0049] where r represents a length of a shortest subsequence required to form a hairpin loop, and pinlen represents a length of a subsequence that forms a hairpin stem. In the sequence S, if a hairpin structure is generated at the k-th base of the sequence, and the number of complementary bases in the hairpin stem is more than half of the number of bases that make up a stem length, a value of Hairpin(S,k) is set to 1; otherwise, the value is set to 0.

[0050] For the DNA sequence set S with m sequences, $F_{Tm}(S)$ is used to represent a difference between Tm values of sequences, and a formula is as follows:

[0051] $F_{Tm}(S) = \sum_{i=1}^{m} \left( Tm(S_i) - \overline{Tm}(S) \right)^2$ (11)

[0052] where $Tm(S_i)$ represents a melting temperature of the i-th sequence in the DNA sequence set S, and $\overline{Tm}(S)$ represents a mean value of melting temperatures of the DNA sequence set S.

[0053] Step 9: Store the DNA coding sequence set Pop2 and the DNA coding sequence set Pop3 to the DNA coding sequence set Pop.

[0054] Step 10: Determine whether the number of updates of the current DNA coding sequence set Pop has reached the maximum number of iterations MIter; if yes, output the DNA coding sequence set Pop; otherwise, go to step 2.
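The melting-temperature indicator of formula (11), used in step 8, can be sketched as follows. The disclosure does not state which model is used to compute the melting temperature of a single sequence, so this sketch takes precomputed Tm values as input; `f_tm` is a hypothetical name, and the sum of squared deviations follows the formula as reconstructed from the text.

```python
def f_tm(tm_values: list[float]) -> float:
    """Sum of squared deviations of the Tm values from their mean."""
    mean_tm = sum(tm_values) / len(tm_values)
    return sum((tm - mean_tm) ** 2 for tm in tm_values)


# Three sequences melting at 50, 52 and 54 degrees: mean 52, deviations -2, 0, +2.
spread = f_tm([50.0, 52.0, 54.0])  # 4 + 0 + 4 = 8.0
```

A smaller value indicates that the sequences in the set melt at more similar temperatures.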

[0055] Embodiment 1

[0056] In this embodiment, a length of a DNA code is 20, the Hamming distance n is greater than or equal to 17, and the sequences also satisfy the combined constraint.

[0057] Step 1: Initialize parameters required for updating a DNA coding sequence, where nPop=2000, nvar=20, pMutation=0.4, pCannibalism=0.5, and MIter=2500. Filter the DNA sequence set initialized with these parameter settings by a combined constraint, and add each sequence that satisfies the combined constraint to a sequence set Pop, which serves as an initial DNA coding sequence set. With these settings, a size of the initial sequence set is 130.

[0058] Step 2: Use Hamming distance as a fitness function to sort the sequences in Pop to obtain a current optimal solution.

[0059] Step 3: Obtain a parameter S in a weight-based selection policy by comparing a current solution with the optimal solution. If fitness of the current solution is less than that of the current optimal solution, divide S by 2; and if the fitness of the current solution is greater than that of the current optimal solution, increase S by 1.

[0060] Step 4: Use a random swap policy in the current sequence set Pop to perform sequence mutation.

[0061] Step 5: Use formulas (4) and (5) to generate parameters Wa and Wb used in the weight-based selection policy; and use formula (6) to determine usage of the two parameters, where in the first half of a sequence loop, Wa is used to generate offspring, that is, formula (7); in the second half of the sequence loop, Wb is used to generate offspring, that is, formula (8).

[0062] Step 6: Perform fitness-based sorting on Pop, replace a DNA sequence with low fitness, retain a DNA sequence with high fitness in the current sequence set, and store the retained sequence to Pop2.

[0063] Step 7: Randomly swap values in two dimensions in each of the nMutation DNA sequences, and store the mutation result to Pop3.
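The mutation of step 7 can be sketched as follows. The sketch assumes that nMutation is the current number of sequences multiplied by pMutation and rounded down, and that the sequences to mutate are picked uniformly at random; neither the rounding nor the selection rule is specified in the text, and `mutate` is a hypothetical name.

```python
import random


def mutate(pop: list[str], p_mutation: float, rng: random.Random) -> list[str]:
    """Swap two randomly chosen positions in each selected sequence."""
    n_mutation = int(len(pop) * p_mutation)  # formula (9), rounded down (assumption)
    pop3 = []
    for seq in rng.sample(pop, n_mutation):  # assumed: uniform random selection
        bases = list(seq)
        i, j = rng.sample(range(len(bases)), 2)  # two distinct dimensions
        bases[i], bases[j] = bases[j], bases[i]
        pop3.append("".join(bases))
    return pop3
```

Swapping two positions leaves the base composition of a sequence unchanged, so a mutated sequence keeps the GC content of its parent; the non-consecutiveness, terminal, and Hamming distance constraints still have to be rechecked in step 8.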

[0064] Step 8: Filter out, by using the fitness function, sequences that do not satisfy a constraint from Pop2 and Pop3, and remove the sequences.

[0065] Step 9: Store Pop2 and Pop3 to Pop.

[0066] Step 10: Determine whether the number of updates of the current DNA coding sequence set Pop has reached the maximum number of iterations; if yes, output the DNA coding sequence set Pop; otherwise, go to step 2.

[0067] The present disclosure proposes a method for combining the dual-policy BWO algorithm with the combined constraint to construct a stable DNA coding sequence set. The random swap policy and weight-based selection policy are introduced into the dual-policy BWO algorithm, which improves the development and exploration capabilities of the algorithm while improving the sequence diversity. The experiments of the present disclosure are completed on a desktop computer with Intel(R) CPU 3.6 GHz, 4.0 GB RAM, and Windows 8.

[0068] Table 1 Results when n=10 and d>7

TCTGCAGAGT    GCAGATCTAG    TCGCACTATG
TGTCGTAGTC    AGCAGTCAGA    TGCGAGATCA
CTCACTACAG    GTACGCATGA    CTGTCGTACT
CGTGTCTCTA    AGAGCATGAC    CGATGAGATG

[0069] Table 2 Physical and thermodynamic performance of a sequence set when n=9 and d>7

Number of hairpin structures    F_Tm(S)    6(0)
38                              3.65       0.34

[0070] Table 1 and Table 2 respectively show the results of a sequence set when the sequence length n=10 and Hamming distance d>7 and the physical and thermodynamic performance when the sequence length n=9 and Hamming distance d>7. According to comparison between the experiment results and other known coding methods, the present disclosure can not only increase the code quantity, but also improve the physical and thermodynamic stability of the coding set.

[0071] The above are merely descriptions of preferred embodiments, and are not intended to limit the present disclosure. It should be noted that many modifications and variations can be made by those of ordinary skill in the art without departing from the technical principle of the present disclosure. These modifications and variations should also be deemed as falling within the protection scope of the present disclosure.

Claims

WHAT IS CLAIMED IS:

1. A method for optimizing coding for deoxyribonucleic acid (DNA) storage based on a dual-policy black widow optimization (BWO) algorithm, comprising the following steps:
step 1: initializing a DNA coding sequence set Pop and parameters required for updating a DNA coding sequence;
step 2: using Hamming distance as a fitness function to perform fitness-based sorting on the initialized DNA coding sequence set Pop to obtain a current optimal solution;
step 3: obtaining a parameter S in a weight-based selection policy;
step 4: determining, during update of the DNA coding sequence set Pop, whether to use a random swap policy;
step 5: obtaining parameters Wa and Wb in the weight-based selection policy, and updating the DNA coding sequence set Pop based on the parameters Wa and Wb;
step 6: using the Hamming distance as the fitness function to obtain fitness of sequences in the updated DNA coding sequence set Pop, performing fitness-based sorting, retaining a sequence that satisfies the Hamming distance constraint, and adding the sequence to a DNA coding sequence set Pop2;
step 7: obtaining, based on the number of sequences in the current DNA coding sequence set Pop and a mutation rate, a partial data set that should be mutated, randomly swapping values in any two dimensions in the partial data set, and storing the mutation result to a DNA coding sequence set Pop3;
step 8: filtering out, by using the fitness function, sequences that do not satisfy a combined constraint from the DNA coding sequence set Pop2 and the DNA coding sequence set Pop3, and removing the sequences;
step 9: storing the DNA coding sequence set Pop2 and the DNA coding sequence set Pop3 to the DNA coding sequence set Pop; and
step 10: determining whether the update times of the current DNA coding sequence set Pop reach the maximum number of iterations; and if yes, outputting the DNA coding sequence set Pop; otherwise, going to step 2.
2. The method for optimizing coding for DNA storage based on a dual-policy BWO algorithm according to claim 1, wherein the parameters required for updating a DNA coding sequence comprise: sequence size nPop, number of sequence dimensions nvar, mutation rate pMutation, cannibalism rate pCannibalism, and maximum number of iterations MIter.
3. The method for optimizing coding for DNA storage based on a dual-policy BWO algorithm according to claim 1, wherein the combined constraint comprises a Hamming distance constraint, guanine-cytosine (GC) content, a non-consecutiveness constraint, and a terminal constraint, wherein the terminal constraint is as follows: $|G_{S_{Last5}}| + |C_{S_{Last5}}| < 3$ (1) wherein $|G_{S_{Last5}}|$ and $|C_{S_{Last5}}|$ respectively represent the number of guanine (G) bases and the number of cytosine (C) bases among the last five bases of a sequence S.
4. The method for optimizing coding for DNA storage based on a dual-policy BWO algorithm according to claim 1, wherein the determining, during update of the DNA coding sequence set Pop, whether to use a random swap policy is specifically: $\alpha = \tan(\pi \times (rand - 0.5))$ (2) $\beta = 1 - CIter/MIter$ (3) wherein rand is a random number in the interval (0,1), CIter is the current number of iterations, and MIter is the maximum number of iterations; and if a parameter α is less than a parameter β, a random swap policy will be used to replace a dimension in the DNA coding sequence set Pop with a dimension of the current optimal solution; otherwise, the random swap policy will not be used, and a current solution and the optimal solution in the DNA coding sequence set Pop are maintained.
5. The method for optimizing coding for DNA storage based on a dual-policy BWO algorithm according to claim 1, wherein the manners for obtaining parameters Wa and Wb in the weight-based selection policy are: $W_a = (1 - CIter/MIter)^{1 - (rand - 0.5) \times S/MIter}$ (4) $W_b = (2 - 2 \times CIter/MIter)^{1 - (rand - 0.5) \times S/MIter}$ (5) wherein the parameter S is generated in step 3 and initialized as 0; if fitness of a current solution is smaller than the current optimal solution, the parameter S is divided by 2; and if the fitness of the current solution is greater than the current optimal solution, S is increased by 1.
6. The method for optimizing coding for DNA storage based on a dual-policy BWO algorithm according to claim 5, wherein the manners for updating the DNA coding sequence set Pop based on the parameters Wa and Wb are: $CIter/MIter < 0.5$ (6) $\begin{cases} y_1 = W_a \times rand \,.\times\, x_1 + (1 - rand) \,.\times\, x_2 \\ y_2 = W_a \times rand \,.\times\, x_2 + (1 - rand) \,.\times\, x_1 \end{cases} \quad CIter/MIter < 0.5$ (7) $\begin{cases} y_1 = rand \,.\times\, x_1 + W_b \times (1 - rand) \,.\times\, x_2 \\ y_2 = rand \,.\times\, x_2 + W_b \times (1 - rand) \,.\times\, x_1 \end{cases} \quad CIter/MIter > 0.5$ (8) wherein Wa and Wb are generated by using formulas (4) and (5) respectively, x1 and x2 are sequences in the original DNA sequence set Pop, and y1 and y2 are sequences in the updated DNA sequence set Pop.
7. A method for optimizing coding for DNA storage based on a dual-policy BWO algorithm, comprising:
step 1: initializing a DNA coding sequence set Pop and parameters required for updating a DNA coding sequence, to obtain the initialized DNA coding sequence set Pop;
step 2: using Hamming distance as a fitness function to perform fitness-based sorting on the initialized DNA coding sequence set Pop to obtain a current optimal solution;
step 3: obtaining parameters Wa and Wb in a weight-based selection policy, and updating the DNA coding sequence set Pop based on the parameters Wa and Wb to obtain the updated DNA coding sequence set Pop;
step 4: using the Hamming distance as the fitness function to obtain fitness of sequences in the updated DNA coding sequence set Pop, performing fitness-based sorting, retaining a sequence that satisfies the Hamming distance constraint, and adding the sequence to a DNA coding sequence set Pop2;
step 5: determining, based on the number of sequences in the updated DNA coding sequence set Pop and a mutation rate, a partial data set to be mutated, randomly swapping values in any two dimensions in the partial data set, and storing the mutation result to a DNA coding sequence set Pop3;
step 6: filtering out, by using the fitness function, sequences that do not satisfy a combined constraint from the DNA coding sequence set Pop2 and the DNA coding sequence set Pop3, and removing the sequences to obtain the processed DNA coding sequence set Pop2 and the processed DNA coding sequence set Pop3;
step 7: storing the processed DNA coding sequence set Pop2 and the processed DNA coding sequence set Pop3 to the updated DNA coding sequence set Pop to obtain a DNA coding sequence set Pop4; and
step 8: determining whether the update times of the DNA coding sequence set Pop4 reach the maximum number of iterations; and if yes, outputting the DNA coding sequence set Pop4; otherwise, going to step 2.
8. The method for optimizing coding for DNA storage based on a dual-policy BWO algorithm according to claim 7, wherein before the obtaining parameters Wa and Wb in a weight-based selection policy, the method further comprises: obtaining a parameter S in the weight-based selection policy; and initializing the parameter S as 0, wherein if fitness of a current solution is smaller than the current optimal solution, the parameter S is divided by 2; and if the fitness of the current solution is greater than the current optimal solution, S is increased by 1.
9. The method for optimizing coding for DNA storage based on a dual-policy BWO algorithm according to claim 8, wherein before the obtaining parameters Wa and Wb in a weight-based selection policy, the method further comprises: determining, during update of the DNA coding sequence set Pop, whether to use a random swap policy, which is specifically: $\alpha = \tan(\pi \times (rand - 0.5))$ (2) $\beta = 1 - CIter/MIter$ (3) wherein rand is a random number in the interval (0,1), CIter is the current number of iterations, and MIter is the maximum number of iterations; and if a parameter α is less than a parameter β, the random swap policy will be used to replace a dimension in the initialized DNA coding sequence set Pop with a dimension of the current optimal solution; otherwise, the random swap policy will not be used, and the current solution and the optimal solution in the initialized DNA coding sequence set Pop are maintained.
10. The method for optimizing coding for DNA storage based on a dual-policy BWO algorithm according to claim 9, wherein the manners for obtaining parameters Wa and Wb in a weight-based selection policy are: $W_a = (1 - CIter/MIter)^{1 - (rand - 0.5) \times S/MIter}$ (4) $W_b = (2 - 2 \times CIter/MIter)^{1 - (rand - 0.5) \times S/MIter}$ (5).