CN113380322A

CN113380322A - Artificial nucleic acid sequence watermark encoding system, watermark character string and encoding and decoding method

Info

Publication number: CN113380322A
Application number: CN202110716040.6A
Authority: CN
Inventors: 康康; 陈钰
Original assignee: Beisheng Biotechnology Shenzhen Co ltd
Current assignee: Beisheng Biotechnology Shenzhen Co ltd
Priority date: 2021-06-25
Filing date: 2021-06-25
Publication date: 2021-09-10
Anticipated expiration: 2041-06-25
Also published as: CN113380322B

Abstract

The invention provides an artificial nucleic acid sequence watermark encoding system, a watermark character string, and encoding and decoding methods. The artificial nucleic acid sequence watermark encoding system comprises a variable encoding section and at least one watermark string connected to the variable encoding section; the variable encoding section includes at least one information unit; each information unit comprises a nucleic acid characteristic part and a numbering part that is connected to and corresponds to the nucleic acid characteristic part. The encoding system provided by the invention can serve as a watermark sequence to protect the intellectual property of gene products, and because the corresponding sequence is determined from the nucleic acid characteristics, storage capacity and sequencing efficiency are greatly improved.

Description

Artificial nucleic acid sequence watermark encoding system, watermark character string and encoding and decoding method

Technical Field

The invention belongs to the technical field of artificial nucleic acid sequences, and particularly relates to an artificial nucleic acid sequence watermark encoding system, a watermark character string and encoding and decoding methods.

Background

A DNA sequence, or gene sequence, is the primary structure of a real or hypothetical DNA molecule carrying genetic information, represented as a string of letters. Genetic engineering refers to deliberate design according to human intent: organisms are endowed with new genetic characteristics through in vitro DNA recombination and transgenic technology, creating new organism types and biological products that better meet human needs. Because genetic engineering is design and construction at the level of the DNA molecule, it is also known as recombinant DNA technology.

Plasmid vectors are plasmids that have been artificially constructed for adaptation to laboratory procedures on the basis of native plasmids. Compared with the natural plasmid, the plasmid vector usually has one or more selective marker genes (such as antibiotic resistance genes) and a synthetic multiple cloning site sequence containing multiple restriction enzyme recognition sites, and most of non-essential sequences are removed, so that the molecular weight is reduced as much as possible, thereby facilitating the genetic engineering operation.

The intellectual property of artificial nucleic acid sequences used in genetic engineering, such as the vector plasmids described above, should belong to the company or laboratory that designed the sequence. However, an artificial nucleic acid sequence is often assembled from publicly disclosed nucleic acid sequences and carries no specific information on intellectual property ownership, so its intellectual property is difficult to protect.

In addition, artificial nucleic acid sequences tend to be long; vector plasmids, for example, typically have DNA sequences of 3-10 kbp. Identifying an artificial nucleic acid sequence usually requires sequencing to obtain its complete functional module composition. Taking Sanger sequencing, the most common method, as an example, a single Sanger reaction effectively covers only 300-600 bp, so 5-20 Sanger reactions are needed to fully identify a vector plasmid sequence, which is costly.

Disclosure of Invention

The invention provides an artificial nucleic acid sequence watermark encoding system, a watermark character string and encoding and decoding methods, which aim to overcome the defects in the prior art.

In order to solve the above problems, the present invention provides an artificial nucleic acid sequence watermark encoding system, comprising:

a variable encoding section and at least one watermark string connected to the variable encoding section; wherein the variable coding section is arranged at a midstream position in the artificial nucleic acid sequence watermark coding system, and the watermark character string is arranged at an upstream position and/or a downstream position in the artificial nucleic acid sequence watermark coding system;

the variable coded segment includes at least one information unit;

the information unit comprises a nucleic acid characteristic part and a numbering part which is connected with the nucleic acid characteristic part and corresponds to the nucleic acid characteristic part.

Preferably, the watermark string length is not less than 3;

in the variable coding segment, a plurality of the information units are arranged in 5' to 3' order to indicate the kind and the corresponding number of each functional fragment on the artificial nucleic acid fragment.

Preferably, the nucleic acid characteristic part in each of the information units is used to indicate DNA assembly method and interface information, screening marker information, replication origin information, reporter gene information, insert backbone species information, insert promoter information, insert RBS information, insert CDS information, or insert terminator information.

Preferably, the system further comprises a usage scene segment for indicating the end of the artificial nucleic acid sequence watermark encoding system and its usage scene;

the usage scene segment is connected with the variable coding segment;

the usage scene segment is arranged at an upstream position or a downstream position in the artificial nucleic acid sequence watermark encoding system;

the variable encoding section is provided at a midstream position between the usage scene segment and/or the watermark string.

Preferably, the usage scene segment comprises a bacterial artificial chromosome marker, a yeast artificial chromosome marker, a nuclear genome marker, a mitochondrial genome marker, a chloroplast genome marker and/or an RNA fragment marker for identifying the usage scene.

In addition, in order to solve the above problems, the present application further provides a watermark string in an artificial nucleic acid sequence watermark encoding system, wherein the length of the watermark string is not less than 15 bp/nt; and the watermark character string can be used as a general PCR primer sequence to amplify and sequence the artificial nucleic acid fragment.

Preferably, the watermark string is the nucleic acid sequence, the reverse complement of the nucleic acid sequence, the RNA sequence or the reverse complement of the RNA sequence itself, and/or a subsequence thereof, and/or a sequence having a similarity of not less than 95% to any of these four sequences; wherein,

the nucleic acid sequence is:

the reverse complement of the nucleic acid sequence is:

the RNA sequence is:

the reverse complement of the RNA sequence is:

In addition, to solve the above problems, the present application further provides an encoding and decoding method of an artificial nucleic acid sequence watermark encoding system, comprising:

the coding method comprises the following steps:

presetting a watermark character string, a variable coding section and a use scene section;

encoding the watermark character string, the variable encoding section and the use scene section to obtain a target artificial nucleic acid sequence;

the decoding method comprises the following steps:

performing Sanger sequencing on the target artificial nucleic acid sequence to obtain a sequencing result;

acquiring the watermark character strings, variable coding sections and nucleic acid sequences corresponding to the use scene sections at the upstream position, the midstream position and the downstream position from the sequencing result;

and performing information interpretation conversion on the nucleic acid sequences corresponding to the watermark character string, the variable coding section and the use scene section to obtain decoding information.

Preferably, in the encoding method, "encoding the watermark string, the variable encoding section, and the usage scenario section to obtain the target artificial nucleic acid sequence" includes:

converting each character of the watermark character string, the variable coding section and the use scene section into an eight-bit binary number according to its ASCII byte code, to obtain a first information set;

converting each eight-bit number in the first information set into four quaternary digits, two bits per nucleotide, according to the mapping 00 = A, 01 = G, 10 = C, 11 = T or U, to obtain the target artificial nucleic acid sequence;

in the decoding method, "performing information interpretation conversion on the nucleic acid sequences corresponding to the watermark character string, the variable encoding section, and the usage scene section to obtain decoding information" includes:

converting the nucleic acid sequences corresponding to the watermark character string, the variable coding segment and the use scene segment into a binary digit string according to the rule A = 00, G = 01, C = 10, T or U = 11, to obtain a second information set;

and decoding the second information set as ASCII (American Standard Code for Information Interchange) in groups of 8 bits, converting it into a character string to obtain the decoding information corresponding to the target artificial nucleic acid sequence.
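The ASCII transcoding steps above can be sketched in a few lines of Python. This is a minimal illustration under the stated mapping (00 = A, 01 = G, 10 = C, 11 = T or U); the function names `encode_ascii` and `decode_ascii` are hypothetical and do not come from the patent.

```python
# Mapping stated in the text: every two bits become one nucleotide.
BITS_TO_NT = {"00": "A", "01": "G", "10": "C", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}
NT_TO_BITS["U"] = "11"  # RNA reads decode the same way as T


def encode_ascii(text: str) -> str:
    """Encode each character as 8 ASCII bits, then 2 bits per nucleotide."""
    bits = "".join(format(byte, "08b") for byte in text.encode("ascii"))
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_ascii(seq: str) -> str:
    """Reverse the mapping: nucleotides -> bits -> 8-bit groups -> characters."""
    bits = "".join(NT_TO_BITS[nt] for nt in seq.upper())
    if len(bits) % 8:
        raise ValueError("sequence length must be a multiple of 4 nt")
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))


encoded = encode_ascii("A003")
print(encoded)                # GAAGATAAATAAATAT
print(decode_ascii(encoded))  # A003
```

Each character costs four nucleotides under this scheme, so a 25-character string already yields 100 nt.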

Preferably, as an alternative, the encoding includes: respectively acquiring the non-numeric characters and the decimal numeric characters in the watermark character string, the variable coding section and the use scene section;

converting the non-numeric characters into eight-bit binary numbers according to their ASCII byte codes, to obtain a third information set;

converting each eight-bit number in the third information set into four quaternary digits according to the mapping 00 = A, 01 = G, 10 = C, 11 = T or U, to obtain a fourth information set;

converting the decimal numeric characters into quaternary numbers with a fixed number of digits, left-padding with 0 any converted number that has fewer digits than the fixed number, to obtain a fifth information set;

mapping the fifth information set into nucleotides according to the rule 0 = A, 1 = G, 2 = C, 3 = T or U, and combining the result with the fourth information set, to obtain the target artificial nucleic acid sequence;

and the corresponding decoding includes: cutting the nucleic acid sequences corresponding to the watermark character string, the variable coding section and the use scene section into subsequences of length 4 + n, each subsequence serving as a decoding unit, wherein n is the fixed number of digits of the quaternary numbers used in encoding;

converting the first four nucleotides of each decoding unit into an eight-bit binary ASCII code according to the rule A = 00, G = 01, C = 10, T or U = 11, and converting the ASCII code into a character, to obtain a sixth information set; and,

converting the last n nucleotides of each decoding unit into a quaternary number according to the rule A = 0, G = 1, C = 2, T or U = 3, and then converting the quaternary number into a decimal number, to obtain a seventh information set;

and connecting the sixth information set and the seventh information set in all decoding units to obtain the decoding information.
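The hybrid decoding steps above can be sketched as follows. This is an illustration only: it assumes n = 4 quaternary digits per number, that every decoding unit is one letter followed by one number, and that numbers are printed as three decimal digits as in the examples. The name `decode_hybrid` and the sample input are hypothetical.

```python
NT_TO_BITS = {"A": "00", "G": "01", "C": "10", "T": "11"}
NT_TO_DIGIT = {"A": 0, "G": 1, "C": 2, "T": 3}


def decode_hybrid(seq: str, n: int = 4, decimals: int = 3) -> str:
    """Decode units of 4 + n nt: 4 nt -> ASCII letter, n nt -> quaternary number."""
    seq = seq.upper().replace("U", "T")  # treat RNA reads like DNA
    unit_len = 4 + n
    if len(seq) % unit_len:
        raise ValueError(f"length must be a multiple of {unit_len} nt")
    decoded = []
    for start in range(0, len(seq), unit_len):
        unit = seq[start:start + unit_len]
        # First four nucleotides: two bits each, giving one ASCII byte.
        letter = chr(int("".join(NT_TO_BITS[nt] for nt in unit[:4]), 2))
        # Last n nucleotides: base-4 digits, most significant first.
        value = 0
        for nt in unit[4:]:
            value = value * 4 + NT_TO_DIGIT[nt]
        decoded.append(f"{letter}{value:0{decimals}d}")
    return "".join(decoded)


print(decode_hybrid("GAAGAAATGAATAACC"))  # A003C010
```

In the sample, GAAG decodes to the letter A and AAAT to quaternary 0003, while GAAT decodes to C and AACC to quaternary 0022, which is decimal 10.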

The invention provides an artificial nucleic acid sequence watermark encoding system, a watermark character string, and encoding and decoding methods. The artificial nucleic acid sequence watermark encoding system comprises a variable encoding section and at least one watermark string connected to the variable encoding section; the variable encoding section is arranged at a midstream position in the system, and the watermark string is arranged at an upstream position and/or a downstream position; the variable encoding section includes at least one information unit; each information unit comprises a nucleic acid characteristic part and a numbering part that is connected to and corresponds to the nucleic acid characteristic part. In the encoding and decoding method provided by the invention, the relative positions of the variable encoding section, the watermark string and the use scene section are set by design: the variable encoding section is fixed at the midstream position, and the watermark string and the other sections are fixed upstream and downstream on either side. Once the artificial nucleic acid sequence watermark encoding system is converted into a nucleic acid sequence, that sequence has no high homology with any known natural genome, so it can serve as a watermark sequence that identifies artificial nucleic acid sequences designed by a specific organization or laboratory and protects the intellectual property of gene products. In addition, the system is 100-300 bp long and can be covered by a single Sanger sequencing reaction, so the component composition of the artificial nucleic acid fragment can be obtained with one Sanger read, without sequencing the complete artificial nucleic acid sequence. The complete sequence of the artificial nucleic acid fragment can also be deduced from the preset mapping among the nucleic acid characteristic part (assembly method, etc.), the numbering part (element library number) and the sequence.

Detailed Description

The technical solution of the present invention is further described in detail by way of the following specific embodiments, but the present invention is not limited thereto, and any limited number of modifications made by anyone within the scope of the claims of the present invention are still within the scope of the claims of the present invention.

The embodiment provides an artificial nucleic acid sequence watermark encoding system, which comprises:

the variable coded segment includes at least one information unit;

The artificial nucleic acid sequence watermark encoding system has an overall length of 100-300 bp. The artificial nucleic acid fragment that carries it may be, but is not limited to, a vector plasmid, a partial fragment of a plasmid, a BAC fragment, a YAC fragment, a nucleic acid fragment inserted into the genome of an organism, or an RNA fragment.

The watermark string and the variable coding segment of the artificial nucleic acid sequence watermark encoding system should be in the same engineered organism as the artificial nucleic acid fragment, but not necessarily in the same nucleic acid molecule (e.g., the same chromosome or the same vector plasmid).

The artificial nucleic acid sequence watermark encoding system is composed of several parts and may comprise a variable encoding section and at least one watermark character string connected to it. The length of the variable encoding section is set by design and may be long or short, so the question of joining several variable encoding sections does not arise. There may be one or more watermark character strings, which represent information such as the designing organization. Multiple pieces of information may be joined by the underscore character "_".

Within the artificial nucleic acid sequence watermark encoding system, an upstream position, a midstream position and a downstream position can be distinguished. The variable encoding section can be placed at the midstream position and/or the downstream position, and since there may be one or more watermark character strings, the following arrangements can occur:

1. watermark string _ variable encoding section;

2. watermark string 1_ variable encoding section _ watermark string 2.

Fixing the variable coding section at the midstream position, and fixing the other parts relative to it, allows the boundaries of the artificial nucleic acid sequence watermark encoding system to be determined quickly. This improves the accuracy and efficiency of identification and avoids the garbled output that arises during encoding and decoding when the position of the variable coding section is undetermined.

As described above, the variable coding segment is composed of one or more information units, each of which includes two parts, a nucleic acid characteristic part and a numbering part. The nucleic acid characteristic part marks the nucleic acid assembly method or the type of nucleic acid element; the numbering part identifies the number of the specific assembly method or the number of the selected nucleic acid element in the element library.

Furthermore, the element library is a preset database in which each numbering part serves as the index key for its corresponding DNA sequence; it can be used for both encoding and decoding of the artificial nucleic acid sequence watermark encoding system.
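An element library of this kind can be pictured as a lookup table keyed by feature letter and number. The sketch below is illustrative only: the dictionary layout, the function names and, above all, the short sequences are hypothetical placeholders, not entries from any real library.

```python
# Hypothetical element library: (feature letter, number) -> DNA sequence.
# The sequences are short placeholders for illustration, not real parts.
ELEMENT_LIBRARY = {
    ("P", 1): "TTGACA",
    ("B", 1): "AGGAGG",
    ("C", 10): "ATGAAATAA",
    ("T", 3): "GCGCTTTT",
}


def lookup(feature: str, number: int) -> str:
    """Return the library sequence indexed by a feature letter and number."""
    try:
        return ELEMENT_LIBRARY[(feature, number)]
    except KeyError:
        raise KeyError(f"no library entry for {feature}{number:03d}") from None


def assemble(units) -> str:
    """Concatenate library sequences for units listed in 5' to 3' order."""
    return "".join(lookup(feature, number) for feature, number in units)


print(assemble([("P", 1), ("B", 1), ("C", 10), ("T", 3)]))
```

With a populated library, the same lookup run over decoded information units is what would allow the full fragment sequence to be deduced from the watermark alone.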

As described above, the variable coding segment is composed of a plurality of information units, each of which includes a nucleic acid characteristic part and a numbering part; the nucleic acid characteristic part may be represented by a single letter and the numbering part by several digits (e.g., A123B456C789). The number of information units can be increased, decreased and customized according to actual needs, the amount of information required and the applicable conditions.

The watermark string may be placed at an upstream or downstream position of the artificial nucleic acid sequence watermark encoding system and is designed by the sequence design organization to identify that organization; for example, the watermark string "Biosysen" indicates that the sequence was designed and provided by the company "Biosysen Ltd".

For example, the artificial nucleic acid sequence watermark encoding system is:

Biosysen_A003S001O001R001I005P001B001C010T003; the system includes two parts, a watermark string "Biosysen" at the upstream position and a variable coding segment "A003S001O001R001I005P001B001C010T003" at the midstream (or mid-downstream) position. The variable coding segment comprises 9 information units, namely "A003", "S001", "O001", "R001", "I005", "P001", "B001", "C010" and "T003"; in the first information unit, the nucleic acid characteristic part is "A" and the numbering part is "003". The corresponding element library can therefore be indexed by the numbering part of each information unit, and the corresponding sequence determined according to the nucleic acid characteristic part, which greatly improves storage capacity and sequencing efficiency.
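Splitting a variable coding segment into its information units is a simple pattern match. The sketch below assumes one uppercase letter followed by one or more decimal digits per unit, as in the example above; `InfoUnit` and `parse_variable_segment` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class InfoUnit(NamedTuple):
    feature: str  # nucleic acid characteristic part, e.g. "A"
    number: int   # numbering part, e.g. 3 for "003"


def parse_variable_segment(segment: str) -> List[InfoUnit]:
    """Split a string such as 'A003S001' into (letter, number) units."""
    if not re.fullmatch(r"(?:[A-Z][0-9]+)+", segment):
        raise ValueError(f"not a valid variable coding segment: {segment!r}")
    return [InfoUnit(letter, int(digits))
            for letter, digits in re.findall(r"([A-Z])([0-9]+)", segment)]


units = parse_variable_segment("A003S001O001R001I005P001B001C010T003")
print(len(units))  # 9
print(units[0])    # InfoUnit(feature='A', number=3)
```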

In the artificial nucleic acid sequence watermark encoding system provided by the application, the relative positions of the variable coding section and the watermark character string are set by design: the variable coding section is fixed at the midstream position, and the watermark character string and the other sections are fixed upstream and downstream on either side. Once the system is converted into a nucleic acid sequence, that sequence has no high homology with any known natural genome, so it can serve as a watermark sequence that marks artificial nucleic acid sequences designed by a specific organization or laboratory and protects the intellectual property of gene products. In addition, the system is 100-300 bp long and can be covered by a single Sanger sequencing reaction, so the component composition of the artificial nucleic acid fragment can be obtained with one Sanger read, without sequencing the complete artificial nucleic acid sequence. The complete sequence of the artificial nucleic acid fragment can also be deduced from the preset mapping among the nucleic acid characteristic part (assembly method, etc.), the numbering part (element library number) and the sequence.

Further, the length of the watermark character string is not less than 3;

Since the length of the watermark string is 3 or more, the universal primer can be guaranteed to be 16 nt after encoding into nucleic acid.

In the variable coding section described above, a plurality of the information units are arranged in 5' to 3' order, for example:

Biosysen_A003S001O001R001I005P001B001C010T003;

A003 is the start information, identifying the nucleic acid assembly method and interface information, where 003 is the numbering part used to index the corresponding element library. After it, each information unit gives the type and number of a functional fragment on the artificial nucleic acid fragment in 5' to 3' order (for a circular nucleic acid system such as a plasmid, generally starting from the first functional element downstream of the insert). If information units follow Ixxx and precede the downstream usage scene string, they describe the internal structure of the insert. In this example, I005P001B001C010T003 denotes insert backbone type 005, whose internal structure consists of promoter No. 001, RBS No. 001, CDS No. 010 and terminator No. 003 connected in sequence.
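The reading rule above (units up to and including Ixxx describe the construct, units after it describe the inside of the insert) can be sketched as below. The letters A, I, P, B, C and T follow the worked example in the text; the meanings given for S, O and R are an assumption inferred from the order in which the feature types are listed, and `describe` is a hypothetical helper name.

```python
import re

FEATURE_NAMES = {
    "A": "assembly method and interface",
    "S": "screening marker",     # assumed from the listed feature order
    "O": "replication origin",   # assumed from the listed feature order
    "R": "reporter gene",        # assumed from the listed feature order
    "I": "insert backbone type",
    "P": "insert promoter",
    "B": "insert RBS",
    "C": "insert CDS",
    "T": "insert terminator",
}


def describe(segment: str) -> dict:
    """Label each unit and split the list at the insert marker Ixxx."""
    units = re.findall(r"([A-Z])([0-9]+)", segment)
    letters = [letter for letter, _ in units]
    # Units after the first I unit describe the insert's internal structure.
    split = letters.index("I") + 1 if "I" in letters else len(units)
    labelled = [(FEATURE_NAMES.get(letter, "unknown"), int(digits))
                for letter, digits in units]
    return {"construct": labelled[:split], "insert_internal": labelled[split:]}


info = describe("A003S001O001R001I005P001B001C010T003")
for name, number in info["insert_internal"]:
    print(f"{name}: No. {number:03d}")
```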

Further, the nucleic acid characteristic part in each of the information units is used to indicate DNA assembly method and interface information, screening marker information, replication origin information, reporter gene information, insert backbone species information, insert promoter information, insert RBS information, insert CDS information, or insert terminator information.

As described above, the nucleic acid characteristic part in each information unit may indicate, without limitation, DNA assembly method and interface information, screening marker information, replication origin information, reporter gene information, insert backbone species information, insert promoter information, insert RBS information, insert CDS information, insert terminator information, and the like.

The nucleic acid characteristic part represents the nucleic acid assembly method or the type of nucleic acid element; in this example it is a single letter, although it is not limited to this. In this embodiment, the definition and representation may be performed as follows:

The nucleic acid characteristic part of each information unit identifies the nucleic acid assembly method or the type of nucleic acid element, and the numbering part identifies the corresponding number in the element library, so the corresponding nucleic acid information can be indexed accurately and quickly. Through the corresponding encoding and decoding, a compact artificial nucleic acid sequence watermark encoding system made up of several information units can therefore store more information about the nucleic acid sequence. Sequencing the encoding system alone reveals the complete functional module composition of the artificial nucleic acid sequence, and even allows the complete nucleic acid sequence to be deduced, which greatly reduces sequencing cost, improves sequencing efficiency and simplifies sequencing work.

Further, the system comprises a usage scene segment for indicating the end of the artificial nucleic acid sequence watermark encoding system and its usage scene;

the usage scene segment is connected with the variable coding segment;

Further, the usage scene segment includes a bacterial artificial chromosome identifier, a yeast artificial chromosome identifier, a nuclear genome identifier, a mitochondrial genome identifier, a chloroplast genome identifier and/or an RNA fragment identifier for identifying the usage scene.

As described above, the usage scene segment includes, but is not limited to, a bacterial artificial chromosome identifier, a yeast artificial chromosome identifier, a nuclear genome identifier, a mitochondrial genome identifier, a chloroplast genome identifier and/or an RNA fragment identifier for identifying the usage scene. The usage scene segment can mark the end of the encoding system and can also identify the usage scene of the artificial nucleic acid sequence. It may be placed at an upstream position or a downstream position, for example:

1. watermark string _ variable encoding section _ usage scene section;

2. usage scene segment _ variable encoding section _ watermark string.

For example: Biosysen_A003S001O001R001I005P001B001C010T003_Vector;

Here, Vector is a usage scene segment located at the downstream position and indicates that the artificial nucleic acid sequence is a vector plasmid. In addition, "BAC", "YAC", "Genome", "Mito", "choro", "RNA" and the like may be used to denote, without limitation, artificial nucleic acid sequences for bacterial artificial chromosomes, yeast artificial chromosomes, nuclear genomes, mitochondrial genomes, chloroplast genomes, RNA fragments and the like. With the usage scene segment, the scene in which the sequence is used, such as a vector plasmid, can be obtained during sequencing, encoding and decoding, which improves sequencing efficiency and allows the related information to be obtained accurately.
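A full system string can be split into its three kinds of parts by cutting at the underscores and classifying each piece. The sketch below is one possible approach, assuming three-digit numbering parts as in the examples and using the scene tags quoted in the text; `split_system` is a hypothetical name.

```python
import re

# Scene tags as quoted in the text; the list is described there as open-ended.
SCENE_TAGS = {"Vector", "BAC", "YAC", "Genome", "Mito", "choro", "RNA"}
# Assumption: each information unit is one letter plus three digits.
VARIABLE_RE = re.compile(r"(?:[A-Z][0-9]{3})+")


def split_system(system: str) -> dict:
    """Classify underscore-separated parts as watermark, variable or scene."""
    result = {"watermarks": [], "variable": None, "scene": None}
    for part in system.split("_"):
        if result["variable"] is None and VARIABLE_RE.fullmatch(part):
            result["variable"] = part
        elif result["scene"] is None and part in SCENE_TAGS:
            result["scene"] = part
        else:
            result["watermarks"].append(part)
    if result["variable"] is None:
        raise ValueError("no variable coding segment found")
    return result


print(split_system("Biosysen_A003S001O001R001I005P001B001C010T003_Vector"))
```

Because classification does not depend on position, the same function handles both arrangements listed above, with the usage scene segment upstream or downstream.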

In addition, in order to solve the above problems, the present embodiment further provides a watermark string in an artificial nucleic acid sequence watermark encoding system, wherein the length of the watermark string is not less than 15 bp/nt; and the watermark character string can be used as a general PCR primer sequence to amplify and sequence the artificial nucleic acid fragment.

Further, the watermark character string is the nucleic acid sequence, the reverse complement of the nucleic acid sequence, the RNA sequence or the reverse complement of the RNA sequence itself, and/or a subsequence thereof, and/or a sequence having a similarity of not less than 95% to any of these four sequences; wherein,

the nucleic acid sequence is:

the reverse complement of the nucleic acid sequence is:

the RNA sequence is:

the reverse complement of the RNA sequence is:

The nucleic acid sequence is the sequence obtained by encoding the watermark character string "Biosysen"; it has no high homology with the genome of any known natural organism (a blastn search against the NCBI NT database returns no hits), and can therefore be regarded as a specific sequence designed by the company Biosysen Ltd.

Accordingly, the following sequences are covered:

(1) the four sequences themselves;

(2) any subsequence of the four sequences that is at least 15 bp/nt long;

(3) any sequence with at least 95% similarity (blastn percent identity) to the four sequences or to a subsequence of at least 15 bp/nt of the four sequences;
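Cases (1) and (2) above can be checked with plain string operations, as in the sketch below. It derives the four forms from a DNA watermark and looks for any exact 15-nt window in a query; case (3), the 95% identity criterion, needs an alignment tool such as blastn and is not covered here. The function names and the sample sequences are hypothetical.

```python
def reverse_complement(seq: str, rna: bool = False) -> str:
    """Reverse complement; emits U instead of T when rna=True."""
    table = str.maketrans("ACGTU", "UGCAA" if rna else "TGCAA")
    return seq.upper().translate(table)[::-1]


def watermark_forms(dna: str) -> list:
    """DNA, its reverse complement, the RNA form, and its reverse complement."""
    dna = dna.upper()
    rna = dna.replace("T", "U")
    return [dna, reverse_complement(dna), rna, reverse_complement(rna, rna=True)]


def contains_watermark(query: str, dna_watermark: str, min_len: int = 15) -> bool:
    """True if the query holds an exact window of min_len nt from any form."""
    query = query.upper()
    for form in watermark_forms(dna_watermark):
        # Any shared subsequence >= min_len contains a window of exactly min_len.
        for start in range(len(form) - min_len + 1):
            if form[start:start + min_len] in query:
                return True
    return False


sample_watermark = "GAACGCCGGCTTGTATGTCGG"  # sample input for illustration
print(contains_watermark("TTTT" + sample_watermark[2:18] + "AAAA", sample_watermark))
```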

All of the above cases can be regarded as watermark information of artificial nucleic acid sequences designed by Biosysen Ltd; artificial nucleic acid sequences or genetically engineered organisms containing such watermark information are considered to contain intellectual property of Biosysen Ltd.

By inserting the watermark character string into the artificial nucleic acid sequence watermark encoding system, the organization information of the artificial nucleic acid sequence can be obtained quickly during encoding or decoding, so that the producer of the sequence is identified as Biosysen Ltd. For example:

Biosysen_A003S001O001R001I005P001B001C010T003_Vector;

For this system, the four corresponding sequences can be derived from "Biosysen_" during encoding; conversely, during sequencing, the four sequences yield the watermark character string "Biosysen_", from which the producer is known to be Biosysen Ltd. Specific producer information is thus kept in the gene fragment and can be recovered during encoding or decoding, which strengthens intellectual property protection and provides a further means of intellectual property protection, product anti-counterfeiting and traceability.

In addition, to solve the above problem, the present embodiment further provides an encoding and decoding method of an artificial nucleic acid sequence watermark encoding system, including:

the coding method comprises the following steps:

the decoding method comprises the following steps:

As described above, the watermark string, the variable encoding section and the usage scene section must be encoded before they can be loaded into the complete sequence, and decoded to be read back. For encoding, the corresponding watermark character string, variable encoding section and usage scene section are preset and then encoded to obtain the corresponding target artificial nucleic acid sequence. To obtain the corresponding information from a sequence obtained by sequencing, that sequence is treated as the target artificial nucleic acid sequence and decoded.

In the encoding and decoding method provided by this embodiment, the variable encoding section, the watermark character string and the usage scene section are set by design and then encoded to obtain a target artificial nucleic acid sequence; alternatively, a target artificial nucleic acid sequence obtained by sequencing is decoded by the decoding method to obtain the corresponding decoding information. Once the artificial nucleic acid sequence watermark encoding system is converted into a nucleic acid sequence, that sequence has no high homology with any known natural genome, so it can serve as a watermark sequence that marks artificial nucleic acid sequences designed by a specific organization or laboratory and protects the intellectual property of gene products. In addition, the system is 100-300 bp long and can be covered by a single Sanger sequencing reaction, so the component composition of the artificial nucleic acid fragment can be obtained with one Sanger read, without sequencing the complete artificial nucleic acid sequence. The complete sequence of the artificial nucleic acid fragment can also be deduced from the preset mapping among the nucleic acid characteristic part (assembly method, etc.), the numbering part (element library number) and the sequence.

Example 1:

In the encoding method, the step of "encoding the watermark character string, the variable encoding section and the usage scenario section to obtain the target artificial nucleic acid sequence" includes:

In the encoding process of this embodiment, the ASCII transcoding method is used to obtain the target artificial nucleic acid sequence: each character in the artificial nucleic acid sequence watermark encoding system is converted into its ASCII byte code (an eight-bit binary number), and every two bits are then mapped to one base according to 00 = A, 01 = G, 10 = C, 11 = T (or U), so that each character becomes a four-base quaternary sequence over (A, T/U, C, G).
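The two-bit mapping described above can be sketched in Python. This is a minimal illustration under the stated mapping, not the applicant's implementation; the names `ascii_to_dna` and `BITS_TO_BASE` are hypothetical.

```python
# Two-bit-to-base table taken from the text: 00 = A, 01 = G, 10 = C, 11 = T.
BITS_TO_BASE = {"00": "A", "01": "G", "10": "C", "11": "T"}


def ascii_to_dna(text: str) -> str:
    """Encode an ASCII string as DNA, four bases per character."""
    bases = []
    for byte in text.encode("ascii"):      # raises on non-ASCII input
        bits = format(byte, "08b")         # eight-bit binary, 'B' -> '01000010'
        for i in range(0, 8, 2):           # walk the byte two bits at a time
            bases.append(BITS_TO_BASE[bits[i:i + 2]])
    return "".join(bases)


print(ascii_to_dna("B"))          # GAAC
print(ascii_to_dna("Biosysen_"))  # GAACGCCGGCTTGTATGTCGGTATGCGGGCTCGGTT
```

Under this mapping the nine characters of "Biosysen_" give the 36-nt sequence recited as the nucleic acid sequence in claim 7. Each character costs four bases, so a 36-character identifier such as Biosysen_A003S001O001R001I001_Vector becomes 144 nt, inside the 100-300 bp range stated for the system.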

For example: Biosysen_A003S001O001R001I005P001B001C010T003_Vector;

With the method of this example, the artificial nucleic acid sequence watermark encoding system of this case is converted into the nucleic acid sequence:

In the decoding process, the sequencing result is converted into a binary string according to A = 00, G = 01, C = 10, T (or U) = 11, and the binary string is then converted into characters by ASCII decoding in groups of 8 bits, which yields the variable encoding string.
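The decoding direction can be sketched the same way. The function name `dna_to_ascii` is hypothetical, and the sketch assumes a clean read whose length is a multiple of four; a real Sanger read would first need trimming and base-call checks, which are not shown.

```python
# Base-to-bits table from the text; U is treated like T so RNA reads decode too.
BASE_TO_BITS = {"A": "00", "G": "01", "C": "10", "T": "11", "U": "11"}


def dna_to_ascii(seq: str) -> str:
    """Decode a DNA/RNA string back to ASCII, one character per four bases."""
    seq = seq.upper()
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    # Regroup the bit string into bytes and map each byte to its character.
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))


print(dna_to_ascii("GAACGCCGGCTTGTATGTCGGTATGCGGGCTCGGTT"))  # Biosysen_
```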

Specifically, the target artificial nucleic acid sequence encoded by the ASCII transcoding method of Example 1 was sequenced:

The plasmid numbered Biosysen_A001S001O001R001I001_Vector and the primer 5'-GAACGCCGGCTTGTATGTCGG-3' were selected for Sanger sequencing, and the sequencing result was:

The sequence was decoded by the method of Example 1, with the result:

Biosysen_A001S001O001R001I001_Vector. The encoded information was completely recovered, and the test was successful.

Example 2:

In the encoding process, a hybrid encoding method is used: non-numeric characters in the artificial nucleic acid sequence watermark encoding system are converted as in Example 1, while decimal numeric strings are converted directly into quaternary numbers. The converted quaternary numbers have a fixed number of digits, and a number with fewer digits is left-padded with 0 (this example uses four-digit quaternary numbers, so the quaternary number 130 is padded to 0130). The digits are then mapped to bases according to 0 = A, 1 = G, 2 = C, 3 = T (or U), e.g. 0130 → AGTA.
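A minimal Python sketch of this hybrid encoding follows. The names `char_to_dna`, `number_to_dna` and `hybrid_encode` are hypothetical, and the sketch assumes that each run of decimal digits is read as one integer (so "003" is the number 3) and that the fixed width is four quaternary digits, as in this example.

```python
import re

BITS_TO_BASE = {"00": "A", "01": "G", "10": "C", "11": "T"}
DIGIT_TO_BASE = "AGCT"  # quaternary digits 0, 1, 2, 3 -> A, G, C, T
WIDTH = 4               # fixed number of quaternary digits (value from the text)


def char_to_dna(ch: str) -> str:
    """Non-numeric character: ASCII byte, then two bits per base."""
    bits = format(ch.encode("ascii")[0], "08b")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, 8, 2))


def number_to_dna(value: int, width: int = WIDTH) -> str:
    """Decimal number: fixed-width quaternary, then one digit per base."""
    if not 0 <= value < 4 ** width:
        raise ValueError("number does not fit in the fixed quaternary width")
    digits = []
    for _ in range(width):             # always emit `width` digits (zero-padded)
        value, digit = divmod(value, 4)
        digits.append(DIGIT_TO_BASE[digit])
    return "".join(reversed(digits))   # most significant digit first


def hybrid_encode(text: str) -> str:
    """Split text into digit runs and single characters, encode each in turn."""
    out = []
    for token in re.findall(r"[0-9]+|[^0-9]", text):
        if token[0] in "0123456789":
            out.append(number_to_dna(int(token)))
        else:
            out.append(char_to_dna(token))
    return "".join(out)


print(number_to_dna(28))      # AGTA  (decimal 28 = quaternary 130 -> 0130)
print(hybrid_encode("A003"))  # GAAGAAAT
```

With four quaternary digits a number field can hold values from 0 to 255 in four bases, whereas the pure ASCII method of Example 1 spends twelve bases on a three-digit decimal number.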

For example: Biosysen_A003S001O001R001I005P001B001C010T003_Vector;

The nucleic acid sequence obtained from the artificial nucleic acid sequence watermark encoding system of this case by the method of Example 2 is:

In the decoding process, the sequencing result is first cut into subsequences of length 4 + n (n is the fixed number of quaternary digits used when encoding numbers, 4 in this case), each of which represents one information unit. In each information unit, the first 4 bases are converted into an 8-bit binary ASCII code according to A = 00, G = 01, C = 10, T (or U) = 11 and then into a character; the following n bases are converted into a quaternary number according to A = 0, G = 1, C = 2, T (or U) = 3 and then into a decimal number. The characters and decimal numbers recovered from all information units are concatenated to restore the variable encoding string.
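The unit-by-unit decoding can be sketched as follows. The function name `hybrid_decode` is hypothetical. The parameter `decimal_width=3` is an assumption made here so that numbers print with the three decimal digits seen in identifiers such as A003; the text itself does not specify how the decimal number is padded.

```python
BASE_TO_BITS = {"A": "00", "G": "01", "C": "10", "T": "11", "U": "11"}
BASE_TO_DIGIT = {"A": 0, "G": 1, "C": 2, "T": 3, "U": 3}


def hybrid_decode(seq: str, n: int = 4, decimal_width: int = 3) -> str:
    """Decode information units of 4 + n bases: one letter plus one number."""
    seq = seq.upper()
    unit = 4 + n
    if len(seq) % unit != 0:
        raise ValueError("sequence length must be a multiple of 4 + n")
    parts = []
    for start in range(0, len(seq), unit):
        letter_bases = seq[start:start + 4]
        number_bases = seq[start + 4:start + unit]
        # First four bases -> eight bits -> one ASCII character.
        bits = "".join(BASE_TO_BITS[b] for b in letter_bases)
        letter = chr(int(bits, 2))
        # Remaining n bases -> quaternary number -> decimal value.
        value = 0
        for b in number_bases:
            value = value * 4 + BASE_TO_DIGIT[b]
        # Zero-padding to decimal_width is an assumption of this sketch.
        parts.append(f"{letter}{value:0{decimal_width}d}")
    return "".join(parts)


print(hybrid_decode("GAAGAAATGGATAAAG"))  # A003S001
```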

Sequencing was performed for Example 2:

The plasmid numbered Biosysen_A003S001O001R001I001_Vector and the primer 5'-GAACGCCGGCTTGTATGTCGG-3' were selected for Sanger sequencing, and the sequencing result was:

Decoding was performed by the method of Example 2, with the result:

Biosysen_A003S001O001R001I001_Vector. The encoded information was completely recovered, and the test was successful.

Example 3:

In addition, biological products are widely used in medical care, food safety, crop planting, livestock breeding and other fields, and are closely tied to human health and safety. Ensuring that each biological product remains unique and matched to its record during logistics, storage and clinical delivery is therefore important for biosafety.

To solve the above problem, this embodiment provides an encryption and decryption method for verifying a specific biological product, which combines the artificial nucleic acid sequence watermark encoding system, the watermark character string technique and an encryption algorithm, and specifically includes the following steps:

After the step of presetting the watermark character string, the variable encoding section and the usage scenario section, the method further includes the following steps.

In the encoding method:

generating a public key and a private key based on the SHA-256 algorithm according to the preset watermark character string, variable encoding section and usage scenario section;

encrypting the watermark character string, the variable encoding section and the usage scenario section with the private key to obtain the private-key-encrypted watermark character string, variable encoding section and usage scenario section;

in the step of encoding the watermark character string, the variable encoding section and the usage scenario section to obtain the target artificial nucleic acid sequence, the watermark character string, variable encoding section and usage scenario section used are the private-key-encrypted ones.

In the decoding method:

after the step of interpreting the nucleic acid sequences corresponding to the watermark character string, the variable encoding section and the usage scenario section to obtain the decoding information, the method further includes:

decrypting the decoding information with the public key to obtain a first hash value, and computing the SHA-256 hash of the decoding information to obtain a second hash value;

comparing the first hash value with the second hash value to determine whether they are consistent;

and, if the first hash value is consistent with the second hash value, determining that the verification is passed.
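One way to read the two-hash comparison is as a conventional hash-then-sign check. The Python sketch below follows that reading, which is an assumption of this illustration rather than a procedure the text spells out. It uses a toy RSA key built from two well-known Mersenne primes purely so that it runs with the standard library; such a key is not secure. All names (`sha256_int`, `sign`, `verify`, `P`, `Q`, `N`, `E`, `D`) are hypothetical, and the sketch leaves out how the private-key output is written into the nucleic acid sequence.

```python
import hashlib

# Toy RSA key from two Mersenne primes. Illustration only, NOT secure.
P = 2**127 - 1
Q = 2**521 - 1
N = P * Q                           # public modulus, larger than any 256-bit digest
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def sha256_int(text: str) -> int:
    """SHA-256 digest of the text as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(text.encode("ascii")).digest(), "big")


def sign(text: str) -> int:
    """Producer side: apply the private key to the digest."""
    return pow(sha256_int(text), D, N)


def verify(decoded_text: str, signature: int) -> bool:
    """Transferee side: compare the two hash values."""
    first_hash = pow(signature, E, N)        # recovered with the public key
    second_hash = sha256_int(decoded_text)   # recomputed from the decoding information
    return first_hash == second_hash


sig = sign("Biosysen_A003S001O001R001I001_Vector")
print(verify("Biosysen_A003S001O001R001I001_Vector", sig))  # True
print(verify("Biosysen_A003S001O001R001I002_Vector", sig))  # False
```

A production system would use a vetted cryptographic library with proper key generation and signature padding rather than raw modular exponentiation.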

In the product preparation process, the public key and the private key are generated together, wherein the public key can be calculated by the private key, but the private key cannot be calculated by the public key.

Specifically, the private key is used to encrypt the watermark character string, the variable encoding section and the usage scenario section, giving the private-key-encrypted watermark character string, variable encoding section and usage scenario section. Each item (category) is processed with the private key to obtain a corresponding 256-bit value, and the resulting value is then encoded to obtain the target artificial nucleic acid sequence.

The public key is disclosed and may be held by the transferee; for example, it may be handed over together with the product, or obtained by the transferee from a public source.

The transferee applies the public key to the nucleic acid sequence obtained after decoding to obtain a 256-bit hash value, namely the first hash value.

Meanwhile, the SHA-256 algorithm is applied to the nucleic acid sequence to obtain a 256-bit hash value, namely the second hash value.

The two hash values are then compared. If the first hash value equals the second hash value, the following conclusions can be drawn:

1. the decoding information pertains to the received product carrying the target artificial nucleic acid sequence;

2. the decoding information originates from the transferor of the product;

3. the received decoding information is identical to the transferor's original record for the delivered product and has not been tampered with.

SHA-256, a member of the SHA-2 family, is a hash function (also called a hash algorithm), i.e. a method for creating a small digital "fingerprint" from any kind of data. A hash function compresses a message or data into a digest, reducing the amount of data and fixing its format. The function scrambles the data and produces a fingerprint called a hash value. The hash value is typically written as a short string of seemingly random letters and numbers.
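The fixed-size property can be checked with Python's standard `hashlib` module. The helper name `sha256_hex` is hypothetical, and upper-case hexadecimal is chosen here only to match the way hash values are printed in this document.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of the text as 64 upper-case hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest().upper()


short_digest = sha256_hex("A")
long_digest = sha256_hex("A" * 10_000)

# Inputs of very different size both yield 64 hex characters (256 bits).
print(len(short_digest), len(long_digest))  # 64 64
# Different inputs give different fingerprints.
print(short_digest != long_digest)          # True
```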

For example, after hospital X receives, via cold-chain transport, biological product A shipped by company B according to hospital X's order, Sanger sequencing is performed on biological product A, and the sequencing result is:

The sequencing result is decoded by the method of Example 2, and the result (i.e., the decoding information) is:

Biosysen_A003S001O001R001I001_Vector;

The decoding information "Biosysen_A003S001O001R001I001_Vector" is processed with the public key corresponding to biological product A to obtain the first hash value, namely:

55CA1E14F0FC5D3C3FDE11E223B8ACA5A7944297866DCF923F1F5E5DD80B2075;

Meanwhile, an SHA-256 calculation tool is applied to the decoding information "Biosysen_A003S001O001R001I001_Vector" to obtain the second hash value, namely:

55CA1E14F0FC5D3C3FDE11E223B8ACA5A7944297866DCF923F1F5E5DD80B2075;

The first hash value is compared with the second hash value. If the two are equal, it can be determined that:

1. the decoding information pertains to the received biological product A carrying the target artificial nucleic acid sequence;

2. the decoding information was produced by company B for biological product A;

3. the received decoding information is identical to company B's original record for the shipped product A and has not been tampered with.

That is, the product A received by hospital X was produced by company B, and is the specified product that matches hospital X's order and was shipped to hospital X by company B.

In this embodiment, the information handled by the encoding and decoding methods is encrypted and decrypted with the RSA asymmetric encryption algorithm. Throughout the circulation of the product, from the production source upstream to the product transferee downstream, the encrypted information is written into the sequence of the biological product. After receiving the product, the transferee sequences it, decodes it, decrypts it with the public key and finally compares the hash values to confirm the identity of the product. On the one hand, this improves the security, uniqueness and correspondence of the product; on the other hand, it protects the intellectual property rights of the design, research and development, and production parties, so that the product achieves a high degree of confidentiality at the level of biological information.

This embodiment greatly raises the threshold for encryption and decryption and, for the first time, combines the RSA asymmetric encryption method used in blockchains with the artificial nucleic acid sequence watermark encoding system and the encoding and decoding of its character strings. Across research and development, production, clinical care, industrial application and other stages, it further guards against the product being swapped or counterfeited, and against its information being tampered with, during logistics and during transfer and acceptance, and thus helps ensure the safety of biological products applied in specific food and medical and health fields.

While the preferred embodiment and the corresponding examples of the present invention have been described, it should be understood that those skilled in the art may make various changes and modifications, including but not limited to adjustments of proportions, processes and amounts, without departing from the inventive concept, and that such changes and modifications fall within the scope of protection of the invention.

Claims

1. An artificial nucleic acid sequence watermark encoding system, comprising:

the variable coded segment includes at least one information unit;

2. The artificial nucleic acid sequence watermark encoding system of claim 1,

the length of the watermark character string is not less than 3;

3. The artificial nucleic acid sequence watermark encoding system of claim 1,

the nucleic acid features in each of the information elements are used to indicate DNA assembly methods and interface information, screening marker information, replication origin information, reporter gene information, insert backbone species information, insert promoter information, insert RBS information, and insert CDS information or insert terminator information.

4. The artificial nucleic acid sequence watermark encoding system of claim 1,

further comprising a usage scenario segment for representing an end of the artificial nucleic acid sequence watermark encoding system and a usage scenario;

the usage scene segment is connected with the variable coding segment;

5. The artificial nucleic acid sequence watermark encoding system of claim 4,

the usage scenario segment includes a bacterial artificial chromosome identifier, a yeast artificial chromosome identifier, a nuclear genome identifier, a mitochondrial genome identifier, a chloroplast genome identifier and/or an RNA fragment identifier for identifying the usage scenario.

6. A watermark string in an artificial nucleic acid sequence watermark encoding system according to any one of claims 1 to 5,

the length of the watermark character string in the artificial nucleic acid sequence watermark coding system is not less than 15 bp/nt; and the watermark character string can be used as a general PCR primer sequence to amplify and sequence the artificial nucleic acid fragment.

7. The watermark string in an artificial nucleic acid sequence watermark encoding system of claim 6,

the watermark character string is the nucleic acid sequence, the reverse complement of the nucleic acid sequence, the RNA sequence, the reverse complement of the RNA sequence and/or a subsequence thereof, or is a sequence having a similarity of not less than 95% to the nucleic acid sequence, the reverse complement of the nucleic acid sequence, the RNA sequence or the reverse complement of the RNA sequence; wherein,

the nucleic acid sequence is:

GAACGCCGGCTTGTATGTCGGTATGCGGGCTCGGTT;

the reverse complement of the nucleic acid sequence is:

AACCGAGCCCGCATACCGACATACAAGCCGGCGTTC;

the RNA sequence is:

GAACGCCGGCUUGUAUGUCGGUAUGCGGGCUCGGUU;

the reverse complement of the RNA sequence is:

AACCGAGCCCGCAUACCGACAUACAAGCCGGCGUUC.

8. An encoding and decoding method for an artificial nucleic acid sequence watermark encoding system, comprising:

the coding method comprises the following steps:

the decoding method comprises the following steps:

9. The encoding and decoding method of an artificial nucleic acid sequence watermark encoding system of claim 8,

10. The encoding and decoding method of an artificial nucleic acid sequence watermark encoding system of claim 8,