CN115409174B - Base sequence filtering method and device based on DRAM memory calculation - Google Patents
- Publication number
- CN115409174B CN202211354686.5A
- Authority
- CN
- China
- Prior art keywords
- data
- row
- base sequence
- calculation
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/123—DNA computing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Biology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Genetics & Genomics (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Biotechnology (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a base sequence filtering method and device based on DRAM memory calculation. The method comprises the following steps: step one, according to the row width of a storage array of a DRAM and the starting address of a target base sequence to be screened, screening out the target base sequence and then rearranging and combining it; step two, marking the rearranged and combined target base sequence for each of the bases A (adenine), G (guanine), C (cytosine) and T (thymine) to obtain a marker row for the corresponding base; step three, shifting the marker row data and then counting the number of positions whose value is 1 in the marker row to obtain the count of the corresponding base; and step four, comparing the statistical result of the reference base sequence with the statistical result of the target base sequence, and filtering the screened target base sequence. The invention performs position-matching screening inside the memory subarray, which reduces the transfer of large amounts of data between the CPU and the memory, improves calculation efficiency severalfold and reduces power consumption.
Description
Technical Field
The invention relates to the field of computer memory calculation, in particular to a base sequence filtering method and device based on DRAM memory calculation.
Background
Genes are functional fragments of DNA (deoxyribonucleic acid) molecules that carry genetic information and support the basic structure and function of life. The prior art already has a mature processing procedure for DNA samples, which generally consists of three steps: DNA sequencing, DNA sequence ordering, and gene mapping and mutation detection. In DNA sequencing, a DNA sequencer extracts the DNA of a biological sample and converts it into data sequences (Reads) that a computer can recognize. The base sequence formed by linking the four bases A, C, T and G is generally identified by a chemical method and then converted into a character string consisting of the four characters A, C, T and G (A: adenine, G: guanine, C: cytosine, T: thymine). A Read is a DNA fragment of fixed length and is the basic unit of subsequent DNA sequence processing. For example, referring to FIG. 1, if a Read is 10 bp (base pairs) long, the sequence TCCTAATCTG is one Read. DNA sequencing produces a large pile of Reads, but the order among these Reads is unknown. DNA sequence ordering compares these unordered Reads with the putative DNA reference sequence to obtain the best matching position of each Read in the reference sequence.
Because the amount of DNA data is enormous, sequence fragments are usually screened and filtered before ordering. Screening and filtering are well suited to parallel computation, and in-memory computing provides a good platform for it: it reduces repeated movement of large amounts of data and can effectively improve system performance.
In modern computer systems, the movement of data between computing units and memory accounts for a significant proportion of system power consumption and program run time. With the advent of multi-core processors, more and more cores are integrated into the same chip, but the total memory bandwidth does not increase proportionally. This creates a mismatch between computing power and data transfer and leads to the so-called "memory wall" problem. Meanwhile, although computing resources have increased, the communication delay between the computing resources and the dynamic random access memory (hereinafter referred to as "DRAM") has not improved, so data movement has become one of the system bottlenecks.
In order to address these challenges, many new computing methods have been proposed in succession, including near-memory computing, in-memory processors and in-memory computing. In-memory computing is one of the key technologies for solving the memory wall problem. As the name suggests, it performs operations inside the memory, which can significantly reduce the latency and power consumption caused by data exchange. Various in-memory computing technologies are currently emerging based on different storage media, including RRAM, PCM, STT-MRAM and DRAM. A common DRAM memory microarchitecture is shown in FIG. 2.
However, because the amount of base sequence data is large and there are many possible matching positions, the conventional method still requires a huge amount of calculation.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a base sequence filtering method and device based on DRAM memory calculation. Position-matching screening is carried out inside the memory subarray, that is, by DRAM memory calculation, based on the principle that charging and discharging of DRAM capacitors can complete basic logic operations. The numbers of A, G, C and T bases in a section of the gene sequence, namely the reference sequence, are counted and compared with the base counts of the target sequence; if a certain threshold is exceeded, that section is considered not to match the target sequence. This achieves the purpose of screening and positioning and supports the ordering calculation of base sequences. The specific technical scheme is as follows:
a base sequence filtering method based on DRAM memory calculation comprises the following steps:
step one, according to the row width of a storage array of a DRAM and the starting address of a target base sequence to be screened, the target base sequence is screened out and then rearranged and combined;
step two, marking the rearranged and combined target base sequence for each of the bases A adenine, G guanine, C cytosine and T thymine to obtain a marker row for the corresponding base;
step three, performing a shift operation on the marker row data, and then counting the number of positions whose value is 1 in each marker row to obtain the counts of A adenine, G guanine, C cytosine and T thymine;
step four, comparing the statistical result of the reference base sequence with the statistical result of the target base sequence, and filtering the screened target base sequence.
Further, the first step specifically comprises:
recording the number of invalid information data and the position of the invalid information data according to the length of the target base sequence and the column width of the storage array;
setting start segment mask data and end segment mask data according to the screening start address and writing them into the storage array;
selecting the target row data sequence of the previous row of the target base sequence and the start segment mask data for a bitwise AND calculation to obtain valid start segment target row data; selecting the target row data sequence of the next row of the target base sequence and the end segment mask data for a bitwise AND calculation to obtain valid end segment target row data;
performing a bitwise OR calculation on the valid start segment target row data and the valid end segment target row data, merging the valid parts of the two rows of data into one row to obtain complete valid target base sequence data, wherein the start part and the end part of the valid target base sequence data are in the same row and have no overlapping positions;
performing a row-to-column conversion operation on the valid target base sequence data in the row memory format to obtain first high bit data and first low bit data arranged in column format in the memory array;
generating a column-type GCT mask and an A mask according to the number and the positions of the invalid information data; performing an AND operation of the GCT mask with the first high bit data and the first low bit data respectively, which sets irrelevant data to 0, and storing the resulting column-type GCT data as second high bit data and second low bit data; performing a bitwise negation of the second high bit data and the second low bit data and storing the results as third high bit data and third low bit data; performing an OR operation of the A mask with the first high bit data and the first low bit data respectively, which sets all irrelevant data to 1, and storing the resulting column-type A data as fourth high bit data and fourth low bit data.
Further, the start segment mask data and the end segment mask data are composed of 0 and M, where M consists of two binary 1 bits; ANDing the target row sequence data with 0 sets irrelevant data to 0, and ANDing the target row sequence data with 1 retains valid data.
Further, the specific method for labeling the base A adenine in the target base sequence in the second step comprises the following steps:
copying the fourth high bit data and the fourth low bit data stored by the column A data to a first row and a second row of a calculation area of the memory array, respectively;
performing OR operation on the data between the first line and the second line according to bits to obtain a result R1;
performing negation operation on the result R1;
and copying the result after the inversion operation to a mark row of the A adenine in the memory array, wherein the position value of the A adenine is 1, and the rest position values are 0.
Further, the specific method for labeling the base C cytosine in the target base sequence in the second step comprises the following steps:
copying the second high bit data and the second low bit data of the target base sequence to a first row and a second row of a calculation area of the memory array respectively;
carrying out bitwise AND operation on the data of the first row and the data of the second row in the calculation area to obtain a result R2;
the result R2 is copied to the marked row of C-cytosines in the memory array, where the position value containing the C-cytosine is 1 and the remaining position values are 0.
Further, the specific method for labeling the base G guanine in the target base sequence in the second step comprises the following steps:
copying the third low bit data to the first row in the compute region of the memory array;
copying the second high bit data to a second row in the calculation area;
carrying out bitwise AND operation on the data of the first line and the second line in the calculation area to obtain a result R3;
the result R3 is copied to a mark row of G guanine in the memory array, wherein the position value containing G guanine is 1, and the rest position values are 0.
Further, the specific method for labeling the base T thymine in the target base sequence in the second step comprises the following steps:
copying the third high-order bit data to a first row of a calculation area of the storage array;
copying the second low bit data to a second row of the calculation region;
carrying out bitwise AND operation on the data of the first row and the second row to obtain a result R4;
the result R4 is copied to a tag row of T thymine in the memory array, where the position value containing T thymine is 1 and the remaining position values are 0.
Further, the specific method for statistics in the third step includes the following three steps:
step 2, copying the data of row a and row a_s to a first row and a second row of a calculation area of the storage array, and summing the data in the same column: an exclusive OR operation on the first row and the second row gives their sum s, and an AND operation on the first row and the second row gives the carry term c of the sum; the summation result is written back to a temporary storage area of the storage array; the value n of the current row counter is divided by 2, and it is judged whether the result is 1; if the result is 1, the calculation ends; if the result is not 1, a new round of shifting and summing is performed on the summation result in the temporary storage area (each completed summation adds one row to the calculation result), that is, the operation of step 1 is executed, the calculation result is shifted, and the shifted result is accumulated column-wise;
step 3, when the value n of the row counter is finally judged to be 1, the final result is obtained in the first row of the calculation result and is stored.
Further, step four is specifically: the complement values of the obtained counts of A adenine, G guanine, C cytosine and T thymine of the target base sequence are placed in the same row in column form, the complement values of the counts of the reference base sequence are placed in the corresponding columns, and the differences are calculated; finally, the four differences are summed and the sum is compared with a threshold; if the sum of the differences is greater than the threshold, the target base sequence is excluded; if it is less than the threshold, the sequence at the screening position is marked as the target base sequence.
A base sequence filtering device based on DRAM memory calculation comprises:
the memory array is composed of DRAM subarrays and used for storing target base sequences, and binary expression is used for setting base information, and the memory array specifically comprises the following components: the binary expression corresponding to A-adenine is 00, the binary expression corresponding to G-guanine is 10, the binary expression corresponding to C-cytosine is 11, the binary expression corresponding to T-thymine is 01, and each base information is composed of 2bit data;
the DRAM subarray has a width of N, namely each row is provided with N columns of storage units; storing base sequence information of length N requires two rows of storage units, marked respectively as an H row storing the high bits and an L row storing the low bits, namely the high bit and the low bit of the same base are stored in the same column;
the storage array is provided with a calculation area for data calculation, an original data storage area for storing original base sequence data, a column type data area for storing data converted into a column format, and a temporary storage area for temporarily storing intermediate results generated in the calculation process;
the control module receives external addresses, data and commands, decodes them and sends control signals to the word line controller, the bit line controller, the shift and inversion module and the buffer, converts the base sequence data format from a same-row mode to a same-column mode, writes the data into a DRAM subarray, and controls the calculation process; the word line controller controls row signals of the memory array, and the bit line controller controls column signals of the memory array; the buffer is used for buffering data; the shift and inversion module comprises a shift module and an inversion module and can perform shift and inversion operations on a row of data as the calculation requires;
the counting module is internally provided with a group of counters, comprising a shift counter and a row counter; when the reference sequence is written, it counts the different types of bases separately, records the final result, and writes the A, G, C and T statistical values of the reference sequence to a fixed address of the target array.
Advantageous effects:
the invention performs position-matching screening inside the memory subarray, which reduces the transfer of large amounts of data between the CPU and the memory, improves calculation efficiency severalfold and reduces power consumption.
Drawings
FIG. 1 is a schematic diagram of the base sequence read in a DNA sequence fragment;
FIG. 2 is a schematic diagram of a generic DRAM memory microarchitecture;
FIG. 3 is a schematic block diagram of a base sequence filtering apparatus based on DRAM memory calculation according to the present invention;
FIG. 4 is a schematic flow chart of a base sequence filtering method based on DRAM memory calculation according to the present invention;
FIG. 5 is a schematic flow chart showing a detailed first step of the method of the present invention;
FIG. 6 is a schematic diagram showing a manner of storing data when rearranging and combining target nucleotide sequences in a memory array of the apparatus of the present invention;
FIGS. 7 to 10 are schematic views showing the manner of storing data when labeling the base A adenine, the base C cytosine, the base G guanine, and the base T thymine in the target base sequence according to the present invention;
FIG. 11 is a schematic diagram of a piece of base sequence information stored in a memory array according to an embodiment of the present invention;
FIG. 12 is a schematic diagram showing a manner of storing data when a target base sequence is rearranged and combined in a memory array according to an embodiment of the present invention;
FIG. 13 is a diagram illustrating a portion of data of a binary representation after shifting according to an embodiment of the present invention;
FIG. 14 is a schematic diagram showing a data storage method in the memory array according to the embodiment of the present invention when AND operation is performed on a base GCT mask;
FIG. 15 is a schematic diagram illustrating a storage manner of data when an OR operation is performed on a base A mask in a memory array according to an embodiment of the present invention;
FIGS. 16 to 19 are schematic views showing the manner of storing data when labeling the base A adenine, the base C cytosine, the base G guanine, and the base T thymine in the target base sequence according to the example of the present invention;
FIGS. 20 to 26 are data diagrams illustrating shift and column-wise summation of the label rows for each base type with a value of 1 as a statistic, according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments of the specification.
As shown in FIG. 3, the present invention provides a base sequence filtering apparatus based on DRAM memory calculation, comprising:
the memory array 404, which is composed of DRAM subarrays, is used to store a target base sequence, and sets binary expression for base information, specifically: the binary expression corresponding to A-adenine is 00, the binary expression corresponding to G-guanine is 10, the binary expression corresponding to C-cytosine is 11, the binary expression corresponding to T-thymine is 01, and each base information is composed of 2bit data;
the DRAM subarray has a width of N, namely each row is provided with N columns of storage units; storing base sequence information of length N requires two rows of storage units, marked respectively as an H row storing the high bits and an L row storing the low bits, namely the high bit and the low bit of the same base are stored in the same column;
the memory array 404 is provided with a calculation area 501 for data calculation, an original data storage area for storing original base sequence data, a column-type data area for storing data converted into a column format, and a temporary storage area 4010 for temporarily storing intermediate results generated during calculation.
The control module 403 receives external addresses, data and commands, decodes them and sends control signals to the word line controller 402, the bit line controller 401, the shift and inversion module 406 and the buffer 405, converts the base sequence data format from a same-row mode to a same-column mode, writes the data into a DRAM subarray, and controls the calculation process. The word line controller 402 controls row signals of the memory array 404, and the bit line controller 401 controls column signals of the memory array 404. The buffer 405 is used for buffering data. The shift and inversion module 406 comprises a shift module and an inversion module and can perform shift and inversion operations on a row of data as the calculation requires.
The counting module 407 is provided with a set of counters. When the reference sequence is written, it counts the different types of bases separately, records the final result, and writes the A, G, C and T statistics of the reference sequence to fixed addresses of the target array target_array, denoted target_A, target_G, target_C and target_T.
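The storage layout described above can be illustrated with a minimal Python sketch. The 2-bit codes are the ones given in the text; the names `CODE` and `to_h_l_rows` are hypothetical, and ordinary lists stand in for DRAM rows.

```python
# 2-bit codes taken from the text: A=00, G=10, C=11, T=01.
CODE = {"A": 0b00, "G": 0b10, "C": 0b11, "T": 0b01}

def to_h_l_rows(seq):
    """Split a base string into an H row (high bits) and an L row (low bits).

    Column i of both rows belongs to base i, so the two bits of one base
    sit in the same column of two different rows.
    """
    h_row = [(CODE[base] >> 1) & 1 for base in seq]
    l_row = [CODE[base] & 1 for base in seq]
    return h_row, l_row

h_row, l_row = to_h_l_rows("AGTTTCTCCG")
print(h_row)
print(l_row)
```

Keeping each bit plane in its own row is what lets a single row-wide AND, OR or NOT act on all bases of the sequence at once.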
Based on the base sequencing filtering device, the base sequence filtering method based on DRAM memory calculation adopted by the invention is shown in FIG. 4, and specifically comprises the following steps:
Step one, in the statistical data preparation stage, according to the screening start address set by the system, a target base sequence with a length smaller than N is screened out in the storage array 404 and is rearranged and combined.
As shown in FIGS. 5 and 6, the control module 403 records the number of invalid information data and the position 4006 of the invalid information data based on the length of the target base sequence and the column width of the memory array 404. Start segment mask data 4003 and end segment mask data 4004 are set according to the screening start address and written into the memory array 404. The mask data consist of 0 and M, where M consists of two binary 1 bits: ANDing the sequence data with 0 sets irrelevant data to 0, and ANDing the sequence data with 1 retains valid data. The target row data sequence 4001 of the previous row of the target base sequence is bitwise ANDed with the start segment mask data 4003 to obtain valid start segment target row data 4001_1; the target row data sequence 4002 of the next row is bitwise ANDed with the end segment mask data 4004 to obtain valid end segment target row data 4002_1. A bitwise OR of 4001_1 and 4002_1 then merges the valid parts of the two rows into one row, giving the complete valid target base sequence data and ensuring that the beginning and the end of one sequence lie in the same row without overlapping positions.
After screening, a row-to-column conversion operation is performed on the data in the row memory format to obtain first high-bit data 4005_h and first low-bit data 4005_l, stored respectively in two rows of memory cells of the memory array; the data 4005_h and 4005_l are written back to the memory array 404 for further calculation.
The control module 403 generates a column-type GCT mask (10 alternate bit strings) and an A mask (combinations of 01 alternate bit strings and all-1 bit strings) according to the number of invalid information data and the positions of the invalid information data. The GCT mask is ANDed with the data 4005_l and 4005_h respectively, which sets all irrelevant data to 0; the resulting column-type GCT data are stored in rows 4007_l and 4007_h. At the same time, a bitwise negation is performed by the negation module, and the negated data are stored in rows 4009_l and 4009_h. The A mask is ORed with the data 4005_l and 4005_h respectively, which sets all irrelevant data to 1; the resulting column-type A data are stored in rows 4008_l and 4008_h.
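A minimal Python sketch of how the second, third and fourth high and low bit data could be derived from the first ones. The per-column `valid` flag list is a hypothetical simplification that stands in for the GCT mask and the A mask; the function name `prepare_rows` is likewise an assumption.

```python
def prepare_rows(h1, l1, valid):
    """Derive the second, third and fourth data rows from the first ones.

    h1, l1: first high/low bit rows (lists of 0/1, one entry per column).
    valid:  hypothetical per-column flag, 1 for a real base and 0 for an
            invalid column; it stands in for the masks described in the text.
    """
    invalid = [1 - v for v in valid]
    # GCT mask: AND forces invalid columns to 0.
    h2 = [h & v for h, v in zip(h1, valid)]
    l2 = [l & v for l, v in zip(l1, valid)]
    # Bitwise negation of the GCT-masked rows.
    h3 = [1 - h for h in h2]
    l3 = [1 - l for l in l2]
    # A mask: OR forces invalid columns to 1.
    h4 = [h | i for h, i in zip(h1, invalid)]
    l4 = [l | i for l, i in zip(l1, invalid)]
    return {"h2": h2, "l2": l2, "h3": h3, "l3": l3, "h4": h4, "l4": l4}

# Columns: T, C, one invalid column holding stale bits, A.
rows = prepare_rows([0, 1, 1, 0], [1, 1, 0, 0], [1, 1, 0, 1])
print(rows)
```

One plausible reading of why A gets its own mask: A is coded as 00, so after the AND mask an invalid column looks exactly like an A. Forcing invalid columns to 1 in the fourth data keeps them from being marked as A later.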
In step two, the control module 403 marks the rearranged and combined target base sequences stored in the DRAM subarray with bases of a adenine, G guanine, C cytosine, and T thymine, respectively, and obtains a mark row of the corresponding base.
As shown in FIG. 7, the specific method for labeling the base A adenine in the target base sequence comprises the following steps:
copying the 4008_h row and 4008_l row data of the target base sequence to a first row and a second row of the calculation region 501, respectively;
performing a bitwise OR operation on the data of the first row and the second row to obtain a result R1;
sending the result R1 to the inversion module for an inversion operation;
the result of the above inversion operation is copied to a mark line 502 of a adenine, in which the position value containing a adenine is 1 and the remaining position values are 0.
As shown in FIG. 8, the specific method for labeling the base C cytosine in the target base sequence comprises the following steps:
copying the 4007_h row and 4007_l row data of the target base sequence to a first row and a second row of the calculation region 501, respectively;
performing bitwise and operation on the data in the first row and the data in the second row of the calculation area 501 to obtain a result R2;
the result R2 is copied to the C cytosine labeled row 503, where the C cytosine is contained at a position value of 1 and the remaining positions are 0.
As shown in FIG. 9, the specific steps of labeling the base G guanine in the target base sequence are as follows:
taking the bitwise negation value 4009_l of the 4007_l row data of the target base sequence and copying it to a first row in the calculation area 501;
copying the 4007_h row data to a second row in the calculation area 501;
performing bitwise and operation on the data of the first row and the data of the second row in the calculation area 501 to obtain a result R3;
the result R3 is copied to the G guanine label line 504, where the G guanine-containing position has a value of 1 and the remaining positions have a value of 0.
As shown in FIG. 10, the specific method for labeling the base T thymine in the target base sequence comprises the following steps:
copying the negation value 4009_h of the data of the line 4007_h of the target base sequence to the first line of the calculation region 501;
copying the data of the 4007_l line of the target base sequence to the second line of the calculation region 501;
carrying out bitwise AND operation on the data of the first line and the second line to obtain a result R4;
the result R4 is copied to the tag line 505 of T thymine, which contains T thymine with position values of 1 and the remaining position values of 0.
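The four marking procedures reduce to one row-wide logic expression each. The Python sketch below uses lists in place of DRAM rows; the function name `mark_bases` and the example data are assumptions made for illustration.

```python
def mark_bases(h2, l2, h3, l3, h4, l4):
    """Compute the four marker rows from the prepared data rows.

    h2/l2: GCT-masked high/low bits, h3/l3: their bitwise negation,
    h4/l4: A-masked high/low bits (invalid columns forced to 1).
    """
    return {
        "A": [1 - (h | l) for h, l in zip(h4, l4)],  # NOT(H OR L): code 00
        "C": [h & l for h, l in zip(h2, l2)],        # H AND L: code 11
        "G": [nl & h for nl, h in zip(l3, h2)],      # NOT(L) AND H: code 10
        "T": [nh & l for nh, l in zip(h3, l2)],      # NOT(H) AND L: code 01
    }

# Hypothetical example columns: A, G, C, one invalid column, T.
markers = mark_bases(
    h2=[0, 1, 1, 0, 0], l2=[0, 0, 1, 0, 1],
    h3=[1, 0, 0, 1, 1], l3=[1, 1, 0, 1, 0],
    h4=[0, 1, 1, 1, 0], l4=[0, 0, 1, 1, 1],
)
print(markers)
```

In this example the invalid column is 0 in every marker row, so it contributes to none of the four counts.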
And thirdly, carrying out displacement operation on the marking line data, and then counting the number of the marking lines with the position value of 1 to obtain the counting results of A adenine, G guanine, C cytosine and T thymine.
Specifically, let the column width of the DRAM subarray be N, i.e., each row has N columns of memory cells, N is an integer power of 2, each marker row occupies a row of memory space, and the number of the marker rows having a position value of 1 is counted.
A shift counter and a row counter are arranged in the counting module 407, the shift counter is used for counting the current shift times, the initial value is 0, and the counting is cleared after the counting is finished; the initial value of the column counter is N, and the initial value is restored after the calculation is completed.
The specific statistical method comprises the following three steps:
step 2, copying the data of row a and row a_s to the first row and the second row of the calculation area 501 and summing the data in the same column, that is, performing an exclusive-OR operation on the first row and the second row to obtain their sum s, performing an AND operation on the first row and the second row to obtain the carry term c of the sum, and writing the summation result back to the temporary storage area 4010 of the storage array 404; the bit line controller 401 divides the value n of the current column counter by 2 and determines whether the result is 1; if the result is 1, the calculation ends; if the result is not 1, a new round of shift-and-sum is performed on the summation result in the temporary storage area 4010 (the calculation result grows by one row each time a summation is completed), that is, the operation of step 1 is executed again, the rows of the calculation result are each copied to the shift module and shifted, and the shifted results are accumulated column by column;
step 3, when the value n of the column counter is finally determined to be 1, the final result is available in the first column of the calculation result, and this result is stored at a specified location.
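Why the repeated shift-and-sum leaves the total count in a single column can be checked with a short derivation. The notation is introduced here for illustration and is not the patent's: $m_j$ is the marker bit in column $j$ (columns numbered from 0 at the left), $c^{(i)}_j$ is the partial sum held in column $j$ after $i$ rounds, and a left shift is assumed to fill the vacated columns with 0.

```latex
\begin{align*}
c^{(0)}_j &= m_j, \qquad m_j = 0 \ \text{for } j \ge N, \\
c^{(i+1)}_j &= c^{(i)}_j + c^{(i)}_{j+2^{i}}
  && \text{(sum of a row with its copy shifted left by } 2^{i}\text{)} \\
\Rightarrow\ c^{(i)}_j &= \sum_{k=0}^{2^{i}-1} m_{j+k}
  && \text{(by induction on } i\text{)} \\
\Rightarrow\ c^{(\log_2 N)}_0 &= \sum_{k=0}^{N-1} m_k
  && \text{(total number of 1s, held in column 0)}
\end{align*}
```

Since $c^{(i)}_j \le 2^{i}$, each partial sum needs $i+1$ bits, which is consistent with the calculation result growing by one row per round, and the column counter reaches 1 after $\log_2 N$ halvings.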
Step four, comparing the statistical result of the reference base sequence with that of the target base sequence, and filtering the screened target base sequence accordingly.
Specifically, the complement values of the A adenine, G guanine, C cytosine and T thymine counts obtained for the target base sequence are placed in the same row in column form, the complement values of the counts of the reference base sequence are placed in the corresponding columns, and the differences are calculated. Finally, the four differences are summed and compared with a threshold: if the sum of the differences is greater than the threshold, the target base sequence is excluded; if it is less than the threshold, the sequence at the screened position is marked as the target base sequence and subsequent processing can proceed.
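The comparison in step four can be sketched in software as follows. This is a minimal illustration, not the in-memory implementation: the function names, the 8-bit count width, the use of absolute differences, and treating a sum equal to the threshold as a pass are all assumptions, since the text does not specify them.

```python
from collections import Counter

BASES = "AGCT"
WIDTH = 8  # assumed bit width of each stored count (counts must stay below 128)


def base_counts(sequence: str) -> dict:
    """Count each base; stands in for the in-memory marker-row counting."""
    counts = Counter(sequence)
    return {base: counts.get(base, 0) for base in BASES}


def subtract_via_complement(a: int, b: int, width: int = WIDTH) -> int:
    """Compute a - b by adding the two's complement of b in a fixed width."""
    mask = (1 << width) - 1
    raw = (a + ((~b + 1) & mask)) & mask
    # Reinterpret the fixed-width result as a signed value.
    return raw - (1 << width) if raw >> (width - 1) else raw


def keep_target(target: str, reference: str, threshold: int) -> bool:
    """Return True if the target passes the filter, False if it is excluded."""
    t, r = base_counts(target), base_counts(reference)
    # Assumption: the four differences are summed as absolute values.
    total = sum(abs(subtract_via_complement(t[b], r[b])) for b in BASES)
    # Assumption: a sum equal to the threshold is kept.
    return total <= threshold
```

For example, `keep_target("AGTTTCTCCG", "AGTTTCTCCG", 2)` keeps the target because all four differences are zero. Comparing base counts is a cheap pre-filter: two sequences with very different base composition cannot be close matches, so they can be discarded before any costlier alignment.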
An embodiment is as follows:
assume that base sequence information is stored in the current storage area, that the selected sequence has length 10, and that the sequence is AGTTTCTCCG, as shown in fig. 11.
Taking the address at the arrow in fig. 11 as the screening start point, the start segment mask is set to binary 0000000000000000111111111111 and the end segment mask to binary 111111110000000000000000, and both masks are written into the storage array. A bitwise AND of the previous row, TCTTTGAAAGTTTC, with the start segment mask gives the valid start segment data 00000000AGTTTC (each 0 represents a 2-bit 0). A bitwise AND of the next row, TCCGAGGATGTGGT, with the end segment mask gives the valid end segment data TCCG0000000000 (each 0 represents a 2-bit 0). A bitwise OR of the valid target row data 00000000AGTTTC and TCCG0000000000 merges the valid parts of the two rows into one row, giving the complete target sequence TCCG0000AGTTTC, as shown in fig. 12.
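The screening and merging above can be reproduced with ordinary integer bit operations. The sketch below uses the 2-bit base encoding stated in claim 10 (A=00, G=10, C=11, T=01); the function names are hypothetical, and the masks are built from the number of bases to keep in a 14-base row rather than copied from the text.

```python
CODE = {"A": 0b00, "G": 0b10, "C": 0b11, "T": 0b01}  # 2-bit encoding from claim 10


def encode(seq: str) -> int:
    """Pack a base string into an integer, leftmost base in the highest bits."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def screen_and_merge(prev_row: str, next_row: str,
                     tail_bases: int, head_bases: int) -> int:
    """Keep the last tail_bases of prev_row and the first head_bases of next_row."""
    row_bases = len(prev_row)
    assert len(next_row) == row_bases
    assert tail_bases + head_bases <= row_bases  # head and tail must not overlap
    # Start segment mask: zeros, then 11 for every base kept at the row's end.
    start_mask = int("00" * (row_bases - tail_bases) + "11" * tail_bases, 2)
    # End segment mask: 11 for every base kept at the row's start, then zeros.
    end_mask = int("11" * head_bases + "00" * (row_bases - head_bases), 2)
    start_part = encode(prev_row) & start_mask  # 00000000AGTTTC
    end_part = encode(next_row) & end_mask      # TCCG0000000000
    return start_part | end_part                # TCCG0000AGTTTC


merged = screen_and_merge("TCTTTGAAAGTTTC", "TCCGAGGATGTGGT",
                          tail_bases=6, head_bases=4)
```

Because A is encoded as 00, a zeroed slot is indistinguishable from adenine in the merged row, which is why the later steps track the number and position of invalid slots separately.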
After screening, a column conversion storage operation is performed on the data in row storage format to obtain the column-arranged data 4005_h and 4005_l, and the result is written back to the storage array for the next calculation. The mask data 4003 is composed of 0 and M, where M represents the 2 bits 11, i.e., M consists of multiple binary 1s. When a 0 is ANDed with the sequence data, the irrelevant data is set to 0; when a 1 is ANDed with the sequence data, the valid data is retained.
Take the partial data TCCG0000 from the above example, where 0 is invalid data. Its binary expression, in which the lower 8 bits of 0 are invalid data, is 0111111000000000;
after the shift module shifts the data left by one bit, as shown in fig. 13, the valid column-arranged data lies within the dotted line;
the control module generates the column-type GCT mask and A mask according to the number and position of the invalid information, where the GCT mask is a bit string of alternating 1 and 0 bits: 101010101010;
the GCT mask is ANDed with 4005_l and 4005_h respectively, setting all irrelevant data to 0, to generate the column-type GCT data, which is stored in 4007_l and 4007_h respectively, as shown in fig. 14;
the A mask combines an alternating 01 bit string with an all-1 bit string: 0101010111111111. It is ORed with 4005_l and 4005_h respectively, setting all irrelevant data to 1, and the generated column-type A data is stored in 4008_l and 4008_h respectively, as shown in fig. 15.
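The column conversion and masking can be traced with a small sketch. Two points are inferred from the example values rather than stated in the text, and should be read as assumptions: the high-bit row is taken to be the unshifted row and the low-bit row the row shifted left by one bit, and for a 16-column row with 4 valid bases the GCT mask is taken to be 1010101000000000. With these assumptions the sketch reproduces the row values quoted for the labeling steps. The function names are hypothetical.

```python
N = 16  # column width of the subarray in the example


def bits(value: int, width: int = N) -> str:
    """Render a row as a fixed-width bit string, column 0 on the left."""
    return format(value, f"0{width}b")


def column_rows(row: int, width: int = N):
    """Return (high-bit row, low-bit row); valid bits land in the even columns."""
    full = (1 << width) - 1
    return row, (row << 1) & full


def column_masks(valid_bases: int, width: int = N):
    """Build the GCT mask (AND) and A mask (OR) for valid_bases leading bases."""
    pad = width - 2 * valid_bases
    gct_mask = int("10" * valid_bases + "0" * pad, 2)  # clears irrelevant bits
    a_mask = int("01" * valid_bases + "1" * pad, 2)    # sets irrelevant bits
    return gct_mask, a_mask


row = int("0111111000000000", 2)  # TCCG followed by four invalid 2-bit slots
h_4005, l_4005 = column_rows(row)
gct_mask, a_mask = column_masks(valid_bases=4)
h_4007, l_4007 = h_4005 & gct_mask, l_4005 & gct_mask  # column-type GCT data
h_4008, l_4008 = h_4005 | a_mask, l_4005 | a_mask      # column-type A data
```

Two masks are needed because adenine is encoded as 00: clearing irrelevant bits to 0 would make every invalid slot look like A, so the A data sets them to 1 instead.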
Then A adenine, G guanine, C cytosine and T thymine are each labeled.
As shown in fig. 16, adenine A is labeled as follows:
copy row H (0111111111111111) and row L (1111110111111111) of the target sequence to the first and second rows of the calculation area 501, respectively;
perform a bitwise OR between the first row and the second row to obtain the result R1 (1111111111111111);
negate R1 to obtain (0000000000000000);
copy the result to the marker row of A adenine, in which positions containing A adenine have the value 1 and all other positions have the value 0; in the current result the adenine marks are all 0.
As shown in fig. 17, cytosine C is labeled as follows:
copy row 4007_h (0010101000000000) and row 4007_l (1010100000000000) of the target sequence to the first and second rows of the calculation area 501, respectively;
perform a bitwise AND on the first row and the second row to obtain the result R2;
copy the result to the C cytosine marker row 503 (0010100000000000), in which positions containing C cytosine have the value 1 and all other positions have the value 0.
As shown in fig. 18, guanine G is labeled as follows:
copy 4009_l (0101011111111111), the bitwise negation of row 4007_l of the target sequence, to one row of the calculation area 501, and copy row 4007_h of the target sequence to the other row;
perform a bitwise AND on the two rows in the calculation area to obtain the result R3 (0000001000000000);
copy the result R3 to the G guanine marker row 504, in which positions containing G guanine have the value 1 and all other positions have the value 0.
As shown in fig. 19, thymine T is labeled as follows:
copy 4009_h, the bitwise negation of row 4007_h of the target sequence, to one row of the calculation area;
copy row 4007_l of the target sequence to another row of the calculation area, so that the two occupy the first row and the second row;
perform a bitwise AND on the first row and the second row to obtain the result R4;
copy the result R4 to the T thymine marker row 505, in which positions containing T thymine have the value 1 and all other positions have the value 0.
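The four labeling operations reduce to one or two row-wide logic operations each. The sketch below applies them to the row values quoted in the example; the function name and the dictionary return type are illustrative choices, and Python integers stand in for DRAM rows.

```python
N = 16
FULL = (1 << N) - 1  # all-ones row, used to keep negation within N bits


def mark_bases(h_gct: int, l_gct: int, h_a: int, l_a: int) -> dict:
    """Return one marker row per base from the column-type high/low rows."""
    return {
        "A": ~(h_a | l_a) & FULL,         # A = 00: neither bit set
        "C": h_gct & l_gct,               # C = 11: both bits set
        "G": h_gct & (~l_gct & FULL),     # G = 10: high set, low clear
        "T": (~h_gct & FULL) & l_gct,     # T = 01: high clear, low set
    }


marks = mark_bases(
    h_gct=int("0010101000000000", 2),  # 4007_h
    l_gct=int("1010100000000000", 2),  # 4007_l
    h_a=int("0111111111111111", 2),    # row H of the A data
    l_a=int("1111110111111111", 2),    # row L of the A data
)
```

For the partial data TCCG this yields no adenine marks, cytosine marks in columns 2 and 4, a guanine mark in column 6, and a thymine mark in column 0, matching the values of R1, R2 and R3 given above.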
Then the 1 values in the marker row of each base type are counted, as follows. Assume the column width of the current subarray is 16 (N is generally an integer power of 2), i.e., each row has 16 columns of memory cells, and each marker row occupies one row of memory space. Take one marker row and count its 1s, for example the marker row of C (0010100000000000). The counting module contains a shift counter that records the current number of shifts; its initial value is 0, and it is cleared when counting finishes.
The specific statistical method comprises the following steps:
The controller determines whether the current n is 1. Since n = 16 at this point, the marker row is read out and the read result is sent to the shift module for a left shift by 2 to the power 0, i.e., 1 bit. After the shift operation is completed, the value i of the shift counter is incremented by 1 to become 1, and the result is written back to the subarray where the marker row is located. The original marker row is denoted row a and the shifted result row a_s.
Row a and row a_s are copied to the first row and the second row of the calculation area, and the data in the same column are summed to obtain a1 and a0, as shown in fig. 20.
The controller divides the current n by 2 to obtain the new n, which is 8. Since the result is not 1, the rows of the calculation result are each copied to the shift row and shifted by 2 to the power 1, i.e., 2 bits; after the shift, the result is written back to the subarray and continues to take part in the operation.
a1 and a0 are shifted, and after the shift operation is completed the value i of the shift counter is incremented by 1 to become 2, as shown in fig. 21;
the shifted rows are summed column by column, as shown in fig. 22;
the controller divides the current n by 2 to obtain the new n, which is 4; since the result is not 1, a2, a1 and a0 are shifted, and after the shift is completed the value i of the shift counter is incremented by 1 to become 3, as shown in fig. 23;
the shifted rows are summed column by column, as shown in fig. 24;
the controller divides the current n by 2 to obtain the new n, which is 2; since the result is not 1, a3, a2, a1 and a0 are shifted, and after the shift is completed the value i of the shift counter is incremented by 1 to become 4, as shown in fig. 25;
the shifted rows are summed column by column, as shown in fig. 26;
when the value of the column counter is finally determined to be 1, the final result is available in the first column of the calculation result. The complement values of the A adenine, G guanine, C cytosine and T thymine counts obtained for the target base sequence are then placed in the same row in column form, the complement values of the counts of the reference base sequence are placed in the corresponding columns, and the differences are calculated. Finally, the four differences are summed and compared with the threshold: if the sum of the differences is greater than the threshold, the target base sequence is excluded; if it is less than the threshold, the sequence at the screened position is marked as the target base sequence.
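The counting rounds of this example can be simulated with rows held as integers and partial sums stored as bit planes (a0 is the least significant row). This is a sketch under assumptions: the text names only XOR for the sum and AND for the carry, which is enough for the first round, so the full-adder carry used for later rounds is an extension introduced here, as are the function and variable names.

```python
def count_ones_in_marker_row(marker_row: str) -> int:
    """Count the 1s in a marker row by log2(N) rounds of shift-and-sum."""
    n_cols = len(marker_row)
    assert n_cols & (n_cols - 1) == 0, "column width must be a power of 2"
    full = (1 << n_cols) - 1
    planes = [int(marker_row, 2)]  # planes[b]: bit b of every column's partial sum
    n, i = n_cols, 0               # column counter, shift counter
    while n != 1:
        shift = 1 << i             # shift distance is 2 to the power i
        shifted = [(p << shift) & full for p in planes]
        i += 1
        carry, summed = 0, []
        for a, b in zip(planes, shifted):
            summed.append(a ^ b ^ carry)              # sum bit (XOR)
            carry = (a & b) | (carry & (a ^ b))       # carry bit (AND, then ripple)
        summed.append(carry)       # the result grows by one row per round
        planes = summed
        n //= 2
    # Read the vertically stored number in the first (leftmost) column.
    return sum(((p >> (n_cols - 1)) & 1) << b for b, p in enumerate(planes))


c_count = count_ones_in_marker_row("0010100000000000")
```

For the C marker row of the example the loop runs four rounds (n = 16, 8, 4, 2) with shifts of 1, 2, 4 and 8 bits. Every column is reduced in parallel, so the number of rounds depends only on the column width, not on how many 1s the row contains.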
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described in detail the practice of the invention, it will be appreciated by those skilled in the art that variations may be applied to the embodiments described in the foregoing examples, or equivalents may be substituted for elements thereof. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.
Claims (10)
1. A base sequence filtering method based on DRAM memory calculation is characterized by comprising the following steps:
step one, according to the row width of a storage array of a DRAM and the starting address of a target base sequence to be screened, the target base sequence is screened out and then rearranged and combined;
step two, labeling the rearranged and combined target base sequence for the bases A adenine, G guanine, C cytosine and T thymine respectively to obtain the marker row of each base;
step three, after performing a shift operation on the marker row data, counting the positions whose value is 1 in each marker row to obtain the counts of A adenine, G guanine, C cytosine and T thymine;
step four, comparing the statistical result of the reference base sequence with the statistical result of the target base sequence, and filtering the screened target base sequence.
2. The method for filtering base sequences based on DRAM memory calculation according to claim 1, wherein the first step is specifically:
recording the number and position of the invalid information data according to the length of the target base sequence and the column width of the storage array;
setting start segment mask data and end segment mask data according to the screening start address and writing them into the storage array;
selecting the target row data sequence of the previous row in which the target base sequence is located and performing a bitwise AND with the start segment mask data to obtain valid start segment target row data; selecting the target row data sequence of the next row in which the target base sequence is located and performing a bitwise AND with the end segment mask data to obtain valid end segment target row data;
performing a bitwise OR on the valid start segment target row data and the valid end segment target row data, merging the valid parts of the two rows of data into one row to obtain complete valid target base sequence data, wherein the head and the tail of the valid target base sequence data are in the same row and do not overlap;
performing a column conversion operation on the valid target base sequence data in row storage format to obtain first high bit data and first low bit data arranged in columns in the storage array;
generating a column-type GCT mask and an A mask according to the number and position of the invalid information data; ANDing the GCT mask with the first high bit data and the first low bit data respectively, setting the irrelevant data to 0, to generate column-type GCT data stored as second high bit data and second low bit data; bitwise negating the second high bit data and the second low bit data and storing the results as third high bit data and third low bit data; ORing the A mask with the first high bit data and the first low bit data respectively, setting all the irrelevant data to 1, to generate column-type A data stored as fourth high bit data and fourth low bit data.
3. The method as claimed in claim 2, wherein the start segment mask data and the end segment mask data are composed of 0 and M, M consisting of two binary 1s; when a 0 is ANDed with the target row sequence data, the irrelevant data is set to 0, and when a 1 is ANDed with the target row sequence data, the valid data is retained.
4. The method for filtering base sequences based on DRAM memory calculation as claimed in claim 2, wherein the specific method for labeling the base A adenine in the target base sequence in the second step is as follows:
copying the fourth high bit data and the fourth low bit data, in which the column-type A data is stored, to a first row and a second row of a calculation area of the storage array, respectively;
performing a bitwise OR on the data of the first row and the second row to obtain a result R1;
performing a negation operation on the result R1;
and copying the negated result to the marker row of A adenine in the storage array, wherein positions containing A adenine have the value 1 and the remaining positions have the value 0.
5. The method for filtering a base sequence based on DRAM memory calculation as claimed in claim 2, wherein the specific method steps for labeling the base C cytosine in the target base sequence in the second step are as follows:
copying the second high bit data and the second low bit data of the target base sequence to a first row and a second row of a calculation area of the memory array respectively;
carrying out bitwise AND operation on the data of the first row and the data of the second row in the calculation area to obtain a result R2;
the result R2 is copied to a tag row of C-cytosine in the memory array, where the position value of the C-cytosine is 1 and the remaining position values are 0.
6. The method for filtering a base sequence based on DRAM memory calculation as claimed in claim 2, wherein the specific method steps for labeling the base G guanine in the target base sequence in the second step are as follows:
copying the third low bit data to the first row in the compute region of the memory array;
copying the second high-bit data to a second row in the calculation area;
carrying out bitwise AND operation on the data of the first row and the second row in the calculation area to obtain a result R3;
the result R3 is copied to a mark row of G guanine in the memory array, wherein the position value containing G guanine is 1, and the rest position values are 0.
7. The method for filtering base sequences based on DRAM memory calculation as claimed in claim 2, wherein the specific method for labeling the base T thymine in the target base sequence in the second step is as follows:
copying the third high-order bit data to a first row of a calculation area of the storage array;
copying the second low bit data to a second row of the calculation region;
carrying out bitwise AND operation on the data of the first row and the second row to obtain a result R4;
the result R4 is copied to a tag row of T thymine in the memory array, where the position value containing T thymine is 1 and the remaining position values are 0.
8. The method for filtering base sequences based on DRAM memory calculation as claimed in claim 1, wherein the specific method of statistics in the third step comprises the following three steps:
step 1, using a column counter and a shift counter, first determining whether the value n of the current column counter is 1; if not, reading out the marker row and shifting the read result left by 2 to the power i bits, where i is the value of the shift counter; after the shift operation is completed, adding 1 to the value i of the shift counter and writing the result back to the DRAM subarray where the marker row is located; denoting the original marker row as row a and the shifted result as row a_s; if n is 1, ending the calculation and proceeding to step 3;
step 2, copying the data of row a and row a_s to a first row and a second row of a calculation area of the storage array and summing the data in the same column, that is, performing an exclusive-OR operation on the first row and the second row to obtain their sum s, performing an AND operation on the first row and the second row to obtain the carry term c of the sum, and writing the summation result back to a temporary storage area of the storage array; dividing the value n of the current column counter by 2 and determining whether the result is 1; if the result is 1, ending the calculation; if the result is not 1, performing a new round of shift-and-sum on the summation result in the temporary storage area, the calculation result growing by one row each time a summation is completed, that is, executing the operation of step 1 again, shifting the calculation result, and accumulating the shifted results column by column;
step 3, when the value n of the column counter is finally determined to be 1, the final result is obtained in the first column of the calculation result and is stored.
9. The method for filtering base sequences based on DRAM memory calculation according to claim 1, wherein the fourth step is specifically: putting the complement values of the statistical results of the obtained A adenine, G guanine, C cytosine and T thymine of the target base sequence in the same row in a column form, putting the complement values of the statistical results of the reference base sequence in the corresponding column, calculating difference values, and finally, summing the four difference values and comparing the sum values with a threshold value; if the sum of the differences is greater than the threshold, excluding the target base sequence; if the value is less than the threshold value, the selected sequence is marked as the target base sequence.
10. A base sequence filtering apparatus based on DRAM memory calculation, comprising:
a storage array, composed of DRAM subarrays and used for storing the target base sequence, in which base information is represented in binary as follows: A adenine corresponds to 00, G guanine to 10, C cytosine to 11 and T thymine to 01, each base being represented by 2 bits of data;
the column width of the DRAM subarray is N, that is, each row has N columns of storage units; storing base sequence information of length N requires two rows of storage units, denoted row H for storing the high bits and row L for storing the low bits, so that the high bit and the low bit of the same base are stored in the same column;
the storage array is provided with a calculation area for data calculation, an original data storage area for storing original base sequence data, a column type data area for storing data converted into a column format, and a temporary storage area for temporarily storing intermediate results generated in the calculation process;
the control module is used for receiving external addresses, data and commands, decoding them, and sending control signals to the word line controller, the bit line controller, the shift and negation module and the buffer; it converts the base sequence data format from same-row mode to same-column mode, writes the data into a DRAM subarray, and controls the calculation process; the word line controller 402 controls the row signals of the storage array, and the bit line controller controls the column signals of the storage array; the buffer is used for buffering data; the shift and negation module comprises a shift module and a negation module and can shift or negate a row of data as the calculation requires;
and the counting module contains a group of counters, including a shift counter and a column counter; it counts the different base types separately when the reference sequence is written, records the final result, and writes the AGCT statistics of the reference sequence to a fixed address of the target array.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211354686.5A CN115409174B (en) | 2022-11-01 | 2022-11-01 | Base sequence filtering method and device based on DRAM memory calculation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115409174A | 2022-11-29 |
CN115409174B | 2023-03-31 |
Family
ID=84169305
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211354686.5A (Active) CN115409174B (en) | 2022-11-01 | 2022-11-01 | Base sequence filtering method and device based on DRAM memory calculation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115409174B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116665772B (en) * | 2023-05-30 | 2024-02-13 | 之江实验室 | Genome map analysis method, device and medium based on memory calculation |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001184381A (en) * | 1999-12-24 | 2001-07-06 | Kanegafuchi Chem Ind Co Ltd | Method and device for calculating optimum solution of multiplex variation protein amino acid array and storage medium storing program for conducting the same |
JP2003167883A (en) * | 2001-11-30 | 2003-06-13 | Celestar Lexico-Sciences Inc | Array information processor, array information processing method, program and recording medium |
CN1829805A (en) * | 2003-05-23 | 2006-09-06 | 冷泉港实验室 | Virtual representations of nucleotide sequences |
CN101466847A (en) * | 2005-06-15 | 2009-06-24 | 考利达基因组股份有限公司 | Single molecule arrays for genetic and chemical analysis |
CN106796628A (en) * | 2014-09-03 | 2017-05-31 | 陈颂雄 | Secure transaction device, system and method based on synthetic gene group variant |
CN111132999A (en) * | 2017-07-07 | 2020-05-08 | 阿瓦克塔生命科学有限公司 | Scaffold proteins |
CN112789680A (en) * | 2019-03-21 | 2021-05-11 | 因美纳有限公司 | Artificial intelligence based quality scoring |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6414746B1 (en) * | 1999-11-24 | 2002-07-02 | Advanced Scientific Concepts, Inc. | 3-D imaging multiple target laser radar |
US6754135B2 (en) * | 2002-09-13 | 2004-06-22 | International Business Machines Corporation | Reduced latency wide-I/O burst architecture |
US7114023B2 (en) * | 2003-08-29 | 2006-09-26 | Intel Corporation | Non-sequential access pattern based address generator |
EP2495337A1 (en) * | 2006-02-24 | 2012-09-05 | Callida Genomics, Inc. | High throughput genome sequencing on DNA arrays |
EP2107125A1 (en) * | 2008-03-31 | 2009-10-07 | Eppendorf Array Technologies SA (EAT) | Real-time PCR of targets on a micro-array |
JP5667049B2 (en) * | 2008-06-25 | 2015-02-12 | ライフ テクノロジーズ コーポレーション | Method and apparatus for measuring analytes using large-scale FET arrays |
US8969090B2 (en) * | 2010-01-04 | 2015-03-03 | Life Technologies Corporation | DNA sequencing methods and detectors and systems for carrying out the same |
CN103540589A (en) * | 2013-10-28 | 2014-01-29 | 深圳市第二人民医院 | Mononucleotide polymorphism sequence of telomerase reverse transcriptase (TERT) promoter |
CN104850761B (en) * | 2014-02-17 | 2017-11-07 | 深圳华大基因科技有限公司 | Nucleotide sequence joining method and device |
CN104200133B (en) * | 2014-09-19 | 2017-03-29 | 中南大学 | A kind of genome De novo sequence assembly methods based on reading and range distribution |
EP3481950A4 (en) * | 2016-07-07 | 2019-07-10 | Cemvita Technologies LLC. | Cognitive cell with coded chemicals for generating outputs from environmental inputs and method of using same |
KR102622275B1 (en) * | 2017-01-10 | 2024-01-05 | 로스웰 바이오테크놀로지스 인코포레이티드 | Methods and systems for DNA data storage |
WO2020213736A1 (en) * | 2019-04-17 | 2020-10-22 | 株式会社PEZY Computing | Information processing device, information processing method, program and storage medium |
CN112802556B (en) * | 2021-01-20 | 2023-05-09 | 天津大学合肥创新发展研究院 | Accelerator device for multi-marker sequence parallel identification of sequencing data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |