CN115409174A

CN115409174A - Base sequence filtering method and device based on DRAM memory calculation

Info

Publication number: CN115409174A
Application number: CN202211354686.5A
Authority: CN
Inventors: 杨弢; 毛旷; 汤昭荣; 潘秋红; 叶茂伟; 黄智华; 王京; 王颖
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2022-11-01
Filing date: 2022-11-01
Publication date: 2022-11-29
Anticipated expiration: 2042-11-01
Also published as: CN115409174B

Abstract

The invention discloses a base sequence filtering method and device based on DRAM memory calculation. The method comprises the following steps: step one, according to the row width of a DRAM storage array and the starting address of a target base sequence to be screened, the target base sequence is screened out and then rearranged and combined; step two, the rearranged and combined target base sequence is marked for the bases A adenine, G guanine, C cytosine and T thymine respectively, to obtain a marker row for each base; step three, the marker row data are shifted and the number of positions with value 1 in each marker row is counted, to obtain the count of the corresponding base; and step four, the statistical result of the reference base sequence is compared with the statistical result of the target base sequence, and the screened target base sequence is filtered. The invention performs position-matching screening inside the memory subarray, which reduces the transfer of large amounts of data between the CPU and the memory, improves calculation efficiency severalfold and reduces power consumption.

Description

Base sequence filtering method and device based on DRAM memory calculation

Technical Field

The invention relates to the field of computer memory calculation, in particular to a base sequence filtering method and device based on DRAM memory calculation.

Background

Genes are functional fragments of DNA (deoxyribonucleic acid) molecules that carry genetic information, and they underpin the basic structure and properties of life. The prior art already has a mature processing pipeline for DNA samples, which generally consists of three steps: DNA sequencing, DNA sequence ordering, and gene mapping and mutation detection. In DNA sequencing, a DNA sequencer extracts the DNA of a biological sample and converts it into data sequences (reads) that a computer can process. Typically, the base sequence formed by the four bases A, C, T and G in the DNA is identified by a chemical method and then converted into a character string composed of the four characters A, C, T and G (A: adenine, G: guanine, C: cytosine, T: thymine). Each read is a DNA fragment of fixed length and is the basic unit of subsequent DNA sequence processing. For example, referring to FIG. 1, if the read length is 10 bp (base pairs), the data sequence TCCTAATCTG is one read. DNA sequencing produces a large pile of reads, but the order among these reads is unknown. DNA sequence ordering compares these unordered reads with a putative DNA reference sequence to obtain the best matching position of each read in the reference sequence.

Because the volume of DNA data is enormous, sequence fragments are usually screened and filtered before ordering. Screening and filtering lend themselves to parallel computation, and memory calculation provides a good platform for it: it reduces repeated movement of large amounts of data and can effectively improve system performance.

In modern computer systems, the movement of data between compute units and memory accounts for a significant share of system power consumption and program run time. With the advent of multi-core processors, more and more cores are integrated into the same chip, but the total memory bandwidth does not increase proportionally. This creates a mismatch between computing power and data transfer and leads to the so-called "memory wall" problem. Meanwhile, although computing resources have increased, the communication latency between them and dynamic random access memory (hereinafter "DRAM") has not improved, so data movement has become one of the system bottlenecks.

To address these challenges, many new computing approaches have been proposed, including near-memory computing, processing-in-memory and in-memory computing. Memory computing is one of the key technologies for solving the memory wall problem. As the name suggests, it performs operations inside the memory, and it can markedly reduce the latency and power consumption caused by data exchange. Various memory computing technologies are emerging, based on different storage media including RRAM, PCM, STT-MRAM and DRAM. A common DRAM memory microarchitecture is shown in FIG. 2.

However, because the amount of base sequence data is large, there are many candidate matching positions, and conventional methods still require a huge amount of computation.

Disclosure of Invention

In order to solve the technical problems in the prior art, the invention provides a base sequence filtering method and device based on DRAM memory calculation. Position-matching screening is carried out inside the memory subarray, i.e. as DRAM memory calculation, based on the principle that charging and discharging of DRAM capacitors can perform basic logic operations. The numbers of A, G, C and T bases in a section of a gene sequence, namely the reference sequence, are counted and compared with the base counts of the target sequence. If the difference exceeds a certain threshold, that section is considered not to match the target sequence. This achieves screening and positioning and supports the ordering calculation of the base sequence. The specific technical scheme is as follows:

a base sequence filtering method based on DRAM memory calculation comprises the following steps:

step one, according to the row width of a storage array of a DRAM and the starting address of a target base sequence to be screened, screening out the target base sequence and then rearranging and combining it;

step two, marking the rearranged and combined target base sequence for the bases A adenine, G guanine, C cytosine and T thymine respectively, to obtain a marker row for each base;

step three, performing a shift operation on the marker row data and then counting the number of positions with value 1 in each marker row, to obtain the counts of A adenine, G guanine, C cytosine and T thymine;

step four, comparing the statistical result of the reference base sequence with the statistical result of the target base sequence, and filtering the screened target base sequence.

Further, the step one is specifically:

recording the number of invalid information data and the position of the invalid information data according to the length of the target base sequence and the column width of the storage array;

setting start segment mask data and end segment mask data according to the screening starting address and writing them into the storage array;

selecting the target row data sequence of the earlier row occupied by the target base sequence and ANDing it bitwise with the start segment mask data to obtain valid start-segment target row data; selecting the target row data sequence of the later row occupied by the target base sequence and ANDing it bitwise with the end segment mask data to obtain valid end-segment target row data;

performing a bitwise OR on the valid start-segment target row data and the valid end-segment target row data, merging the valid parts of the two rows into one row to obtain complete valid target base sequence data, wherein the start and end parts of the valid target base sequence data are in the same row and do not overlap;

performing a row-to-column conversion on the valid target base sequence data in row storage format to obtain first high bit data and first low bit data arranged in columns in the memory array;

generating a column-type GCT mask and an A mask according to the number and positions of the invalid information data, wherein the GCT mask is ANDed with the first high bit data and the first low bit data respectively, setting irrelevant data to 0, to generate column-type GCT data that are stored as second high bit data and second low bit data, which are then negated bitwise and stored as third high bit data and third low bit data; the A mask is ORed with the first high bit data and the first low bit data respectively, setting all irrelevant data to 1, to generate column-type A data that are stored as fourth high bit data and fourth low bit data.

Further, the start segment mask data and the end segment mask data are composed of 0 and M, where M consists of two binary 1 bits (11); ANDing the target row sequence data with 0 sets irrelevant data to 0, and ANDing it with M retains valid data.

Further, the specific method for labeling the base A adenine in the target base sequence in the second step comprises the following steps:

copying the fourth high bit data and the fourth low bit data stored by the column A data to a first row and a second row of a calculation area of the memory array, respectively;

performing OR operation on the data between the first line and the second line according to bits to obtain a result R1;

performing negation operation on the result R1;

and copying the result after the inversion operation to a mark row of the A adenine in the memory array, wherein the position value of the A adenine is 1, and the rest position values are 0.

Further, the specific method for labeling the base C cytosine in the target base sequence in the second step comprises the following steps:

copying the second high bit data and the second low bit data of the target base sequence to a first row and a second row of a calculation area of the memory array respectively;

performing bitwise AND operation on the data of the first row and the second row of the calculation area to obtain a result R2;

the result R2 is copied to the marked row of C-cytosines in the memory array, where the position value containing the C-cytosine is 1 and the remaining position values are 0.

Further, the specific method for labeling the base G guanine in the target base sequence in the second step is as follows:

copying the third low bit data to the first row in the compute region of the memory array;

copying the second high-bit data to a second row in the calculation area;

performing bitwise AND operation on the data of the first line and the second line in the calculation area to obtain a result R3;

the result R3 is copied to the tag row of G guanine in the memory array, where the position containing G guanine has a value of 1 and the remaining positions have a value of 0.

Further, the specific method for labeling the base T thymine in the target base sequence in the second step comprises the following steps:

copying the third high-order bit data to a first row of a calculation area of the storage array;

copying the second low bit data to a second row of the calculation region;

carrying out bitwise AND operation on the data of the first row and the second row to obtain a result R4;

the result R4 is copied to a tag row of T thymine in the memory array, where the position value containing T thymine is 1 and the remaining position values are 0.

Further, the specific method for statistics in the third step includes the following three steps:

step 1, using a column counter and a shift counter, first judging whether the value n of the column counter is 1; if n is 1, ending the calculation and proceeding to step 3; if not, reading the marker row and shifting the read result left by 2^i bits, where i is the value of the shift counter; after the shift, incrementing the shift counter value i by 1 and writing the shifted result back to the DRAM subarray where the marker row is located; the original marker row is denoted row a and the shifted result row a_s;

step 2, copying the data of row a and row a_s to a first row and a second row of a calculation area of the storage array and summing the data in the same column: an exclusive-OR of the first row and the second row gives the sum s, and an AND of the first row and the second row gives the carry term c; writing the summation result back to a temporary storage area of the storage array; dividing the value n of the column counter by 2 and judging whether the result is 1; if the result is 1, ending the calculation; if not, performing a new round of shifting and summation on the summation result in the temporary storage area, i.e. returning to step 1, shifting the calculation result and accumulating the shifted result column by column; each completed summation adds one row to the calculation result;

step 3, when the value n of the column counter is finally judged to be 1, the final result is obtained in the first row of the calculation result and is stored.

Further, the fourth step specifically includes: putting the complement values of the statistical results of the obtained A adenine, G guanine, C cytosine and T thymine of the target base sequence in the same row in a column form, putting the complement values of the statistical results of the reference base sequence in the corresponding column, calculating difference values, and finally, summing the four difference values and comparing the sum values with a threshold value; if the sum of the differences is greater than the threshold, excluding the target base sequence; if the value is less than the threshold value, marking the sequence of the screening position as the target base sequence.

A base sequence filtering device based on DRAM memory calculation comprises:

the memory array is composed of DRAM subarrays and used for storing target base sequences, and binary expression is used for setting base information, and the method specifically comprises the following steps: the binary expression corresponding to A-adenine is 00, the binary expression corresponding to G-guanine is 10, the binary expression corresponding to C-cytosine is 11, the binary expression corresponding to T-thymine is 01, and each base information consists of 2-bit data;

the DRAM subarray has a width of N, i.e. each row has N columns of storage units; storing a row of base sequence information of length N requires two rows of storage units, denoted the H row for storing the high bits and the L row for storing the low bits, i.e. the high bit and the low bit of the same base are stored in the same column;

the storage array is provided with a calculation area for data calculation, an original data storage area for storing original base sequence data, a column data area for storing data converted into a column format, and a temporary storage area for temporarily storing intermediate results generated in the calculation process;

the control module is used for receiving external addresses, data and commands, decoding them, and sending control signals to the word line controller, the bit line controller, the shift and negation module and the buffer; it converts the base sequence data format from a same-row format to a same-column format, writes the data into a DRAM subarray, and controls the calculation process; the word line controller controls row signals of the memory array, and the bit line controller controls column signals of the memory array; the buffer is used for buffering data; the shift and negation module comprises a shift module and a negation module and can perform shift and negation operations on a row of data as the calculation requires;

the counting module is internally provided with a group of counters comprising a shift counter and a column counter; when the reference sequence is written, it counts the different types of bases respectively, records the final result, and writes the AGCT statistical values of the reference sequence into a fixed address of the target array.

Advantageous effects:

The invention performs position-matching screening inside the memory subarray, which reduces the transfer of large amounts of data between the CPU and the memory, improves calculation efficiency severalfold and reduces power consumption.

Drawings

FIG. 1 is a schematic diagram of the base sequence read in a DNA sequence fragment;

FIG. 2 is a schematic diagram of a generic DRAM memory microarchitecture;

FIG. 3 is a schematic block diagram of a DRAM memory-based base sequence filtering apparatus according to the present invention;

FIG. 4 is a schematic flow chart of a base sequence filtering method based on DRAM memory calculation according to the present invention;

FIG. 5 is a flow chart showing details of step one of the method of the present invention;

FIG. 6 is a schematic diagram showing a manner of storing data when a target base sequence is rearranged and combined in a memory array of the apparatus of the present invention;

FIGS. 7 to 10 are schematic views showing the manner of storing data when labeling the base A adenine, the base C cytosine, the base G guanine, and the base T thymine in the target base sequence according to the present invention;

FIG. 11 is a schematic diagram of a piece of base sequence information stored in a memory array according to an embodiment of the present invention;

FIG. 12 is a schematic diagram showing a manner of storing data when a target base sequence is rearranged and combined in a memory array according to an embodiment of the present invention;

FIG. 13 is a diagram illustrating a portion of data of a binary representation after shifting according to an embodiment of the present invention;

FIG. 14 is a schematic diagram showing a data storage method in the memory array according to the embodiment of the present invention when AND operation is performed on a base GCT mask;

FIG. 15 is a schematic diagram illustrating a storage manner of data when an OR operation is performed on a base A mask in a memory array according to an embodiment of the present invention;

FIGS. 16 to 19 are schematic views showing the manner of storing data when labeling the base A adenine, the base C cytosine, the base G guanine, and the base T thymine in the target base sequence according to the example of the present invention;

FIGS. 20 to 26 are data diagrams illustrating the shifting and column-wise summation used to count the positions with value 1 in the marker row of each base type, according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.

As shown in FIG. 3, the present invention provides a base sequence filtering apparatus based on DRAM memory calculation, comprising:

the memory array 404 is composed of DRAM subarrays, and is used for storing target base sequences, and setting binary expression for base information, specifically: the binary expression corresponding to A-adenine is 00, the binary expression corresponding to G-guanine is 10, the binary expression corresponding to C-cytosine is 11, the binary expression corresponding to T-thymine is 01, and each base information consists of 2-bit data;

the DRAM subarray has a width of N, i.e. each row has N columns of storage units; storing a row of base sequence information of length N requires two rows of storage units, denoted the H row for storing the high bits and the L row for storing the low bits, i.e. the high bit and the low bit of the same base are stored in the same column;

the memory array 404 is provided with a calculation area 501 for data calculation, an original data storage area for storing original base sequence data, a column-type data area for storing data converted into a column format, and a temporary storage area 4010 for temporarily storing intermediate results generated during calculation.

The control module 403 receives external addresses, data and commands, decodes them, and sends control signals to the word line controller 402, the bit line controller 401, the shift and negation module 406 and the buffer 405. It converts the base sequence data format from a same-row format to a same-column format, writes the data into a DRAM subarray, and controls the calculation process. The word line controller 402 controls row signals of the memory array 404, and the bit line controller 401 controls column signals of the memory array 404. The buffer 405 is used for buffering data. The shift and negation module 406 comprises a shift module and a negation module and can perform shift and negation operations on a row of data as the calculation requires.

The counting module 407 is provided with a set of counters. When the reference sequence is written, it counts the different types of bases respectively, records the final result, and writes the AGCT statistics of the reference sequence into fixed addresses of the target array target_array, denoted target_A, target_G, target_C and target_T.
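The 2-bit encoding and the bookkeeping done by the counting module can be illustrated with a short Python sketch. It is only a software model under stated assumptions: the names `BASE_CODE`, `encode_rows` and `reference_counts` are hypothetical and do not come from the patent, and the H row and L row are modelled as plain lists with one entry per base position.

```python
from collections import Counter

# 2-bit codes from the description: A=00, G=10, C=11, T=01 (high bit, low bit).
BASE_CODE = {"A": (0, 0), "G": (1, 0), "C": (1, 1), "T": (0, 1)}

def encode_rows(sequence):
    """Split a base string into an H row (high bits) and an L row (low bits).

    Entry j of both lists describes base j, mirroring how the high and low
    bits of one base share a column in the subarray.
    """
    h_row = [BASE_CODE[b][0] for b in sequence]
    l_row = [BASE_CODE[b][1] for b in sequence]
    return h_row, l_row

def reference_counts(sequence):
    """Per-base counts, standing in for target_A, target_G, target_C, target_T."""
    counts = Counter(sequence)
    return {f"target_{b}": counts.get(b, 0) for b in "AGCT"}

h_row, l_row = encode_rows("AGTTTCTCCG")
print(h_row)  # high bits, one per base
print(l_row)  # low bits, one per base
print(reference_counts("AGTTTCTCCG"))
```

In hardware the counters are updated while the reference sequence is written; the sketch simply counts a finished string.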

Based on the base sequencing filtering device, the base sequence filtering method based on DRAM memory calculation adopted by the invention is shown in FIG. 4, and specifically comprises the following steps:

Step one, in the statistical data preparation stage, according to the screening start address set by the system, a target base sequence with a length smaller than N is screened out in the storage array 404 and is rearranged and combined.

As shown in FIG. 5 and FIG. 6, the control module 403 records the number of invalid information data and the invalid information data position 4006 according to the length of the target base sequence and the column width of the memory array 404. Start segment mask data 4003 and end segment mask data 4004 are set according to the screening start address and written into the memory array 404. The mask data consist of 0 and M, where M consists of two binary 1 bits: ANDing the sequence data with 0 sets irrelevant data to 0, and ANDing it with M retains valid data. The target row data sequence 4001 in the earlier row occupied by the target base sequence is ANDed bitwise with the start segment mask data 4003 to obtain valid start-segment target row data 4001_1; the target row data sequence 4002 in the later row is ANDed bitwise with the end segment mask data 4004 to obtain valid end-segment target row data 4002_1. A bitwise OR of 4001_1 and 4002_1 merges the valid parts of the two rows into one row, giving the complete valid target base sequence data and ensuring that the beginning and the end of the sequence lie in the same row without overlapping.

After screening, a row-to-column conversion is performed on the data in row storage format to obtain first high bit data 4005_h and first low bit data 4005_l, which are arranged in columns and stored in two rows of memory cells in the memory array; 4005_h and 4005_l are written back to the memory array 404 for further calculation.

The control module 403 generates a column-type GCT mask (an alternating 10 bit string) and an A mask (a combination of an alternating 01 bit string and an all-1 bit string) according to the number and positions of the invalid information data. The GCT mask is ANDed with data 4005_l and 4005_h respectively, setting all irrelevant data to 0; the resulting column-type GCT data are stored in rows 4007_l and 4007_h. These data are also negated bitwise by the negation module, and the negated data are stored in rows 4009_l and 4009_h. The A mask is ORed with data 4005_l and 4005_h respectively, setting all irrelevant data to 1; the resulting column-type A data are stored in rows 4008_l and 4008_h.
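The effect of the two masks can be sketched in Python. This is one possible reading rather than the patented circuit: rows are modelled as strings of '0' and '1', each base occupies a two-bit slot of which only the first (even) column carries data, the GCT mask is assumed to be 00 in invalid slots, and the helper names are hypothetical. The input rows stand in for 4005_h and 4005_l for the slots TCCG followed by four invalid slots.

```python
def build_masks(valid):
    """Build the GCT mask and the A mask; valid[j] is True if slot j holds a base."""
    gct = "".join("10" if v else "00" for v in valid)  # keep data columns of valid slots
    a = "".join("01" if v else "11" for v in valid)    # force every other column to 1
    return gct, a

def bit_and(x, y):
    return "".join("1" if p == "1" and q == "1" else "0" for p, q in zip(x, y))

def bit_or(x, y):
    return "".join("1" if p == "1" or q == "1" else "0" for p, q in zip(x, y))

def bit_not(x):
    return "".join("0" if p == "1" else "1" for p in x)

valid = [True] * 4 + [False] * 4
gct_mask, a_mask = build_masks(valid)

row_h = "0111111000000000"  # stands in for 4005_h (TCCG + four invalid slots)
row_l = "1111110000000000"  # stands in for 4005_l (row_h shifted left by one bit)

gct_h, gct_l = bit_and(row_h, gct_mask), bit_and(row_l, gct_mask)  # 4007_h, 4007_l
neg_h, neg_l = bit_not(gct_h), bit_not(gct_l)                      # 4009_h, 4009_l
a_h, a_l = bit_or(row_h, a_mask), bit_or(row_l, a_mask)            # 4008_h, 4008_l

print(gct_h, gct_l)
print(a_h, a_l)
```

Setting irrelevant columns to 1 in the A data matters because an invalid slot and the base A share the code 00; without the A mask, every invalid slot would later be counted as an A.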

Step two, the control module 403 marks the rearranged and combined target base sequences stored in the DRAM subarray with bases of a adenine, G guanine, C cytosine, and T thymine respectively to obtain a marked row of the corresponding base.

As shown in FIG. 7, the specific method for labeling the base A adenine in the target base sequence comprises the following steps:

copying the 4008_h and 4008_l row data of the target base sequence to the first row and the second row of the calculation region 501, respectively;

performing a bitwise OR on the data of the first row and the second row to obtain a result R1;

sending the result R1 to the negation module for a negation operation;

copying the result of the negation to the A adenine marker row 502, in which positions containing A adenine have the value 1 and the remaining positions have the value 0.

As shown in FIG. 8, the specific method for labeling the base C cytosine in the target base sequence comprises the following steps:

copying the 4007_h and 4007_l row data of the target base sequence to the first row and the second row of the calculation region 501, respectively;

performing a bitwise AND on the data of the first row and the second row of the calculation area 501 to obtain a result R2;

copying the result R2 to the C cytosine marker row 503, in which positions containing C cytosine have the value 1 and the remaining positions have the value 0.

As shown in FIG. 9, the specific steps of the method for labeling the G-guanine base in the target nucleotide sequence are as follows:

taking the bitwise negation 4009_l of the 4007_l row data of the target base sequence and copying it to the first row in the calculation area 501;

copying the 4007_h row data to the second row in the calculation area 501;

performing a bitwise AND on the data of the first row and the second row in the calculation region 501 to obtain a result R3;

copying the result R3 to the G guanine marker row 504, in which positions containing G guanine have the value 1 and the remaining positions have the value 0.

As shown in FIG. 10, the specific method for labeling the base T thymine in the target base sequence comprises the following steps:

copying the negation 4009_h of the 4007_h row data of the target base sequence to the first row of the calculation region 501;

copying the 4007_l row data of the target base sequence to the second row of the calculation region 501;

performing a bitwise AND on the data of the first row and the second row to obtain a result R4;

copying the result R4 to the T thymine marker row 505, in which positions containing T thymine have the value 1 and the remaining positions have the value 0.
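The four marking procedures reduce to a handful of bitwise operations, which the following Python sketch models. It is a simplified software view, not the in-DRAM implementation: each row is a Python integer with one bit per base position (leftmost slot as the most significant bit), a single `valid` row replaces the GCT and A masks, and the function and variable names are hypothetical.

```python
CODE = {"A": (0, 0), "G": (1, 0), "C": (1, 1), "T": (0, 1)}

def rows_from(slots):
    """Build the high-bit row, low-bit row and validity row from a slot string."""
    h = l = valid = 0
    for s in slots:  # any character outside A/G/C/T is an invalid slot
        hb, lb = CODE.get(s, (0, 0))
        h, l = (h << 1) | hb, (l << 1) | lb
        valid = (valid << 1) | (s in CODE)
    return h, l, valid

def mark_bases(h, l, valid, width):
    """Return one marker row per base, each an integer of `width` bits."""
    full = (1 << width) - 1
    invalid = ~valid & full
    gct_h, gct_l = h & valid, l & valid          # like 4007_h, 4007_l: irrelevant -> 0
    neg_h, neg_l = ~gct_h & full, ~gct_l & full  # like 4009_h, 4009_l: bitwise negation
    a_h, a_l = h | invalid, l | invalid          # like 4008_h, 4008_l: irrelevant -> 1
    return {
        "A": ~(a_h | a_l) & full,  # A = 00: neither bit set
        "C": gct_h & gct_l,        # C = 11: both bits set
        "G": neg_l & gct_h,        # G = 10: high bit set, low bit clear
        "T": neg_h & gct_l,        # T = 01: high bit clear, low bit set
    }

slots = "TCCG0000AGTTTC"
h, l, valid = rows_from(slots)
marks = mark_bases(h, l, valid, len(slots))
for base, row in marks.items():
    print(base, format(row, "014b"))
```

Each printed row has a 1 exactly where the corresponding base occurs, and the four invalid slots stay 0 in every marker row.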

Step three, a shift operation is performed on the marker row data, and the number of positions with value 1 in each marker row is then counted to obtain the counts of A adenine, G guanine, C cytosine and T thymine.

Specifically, let the column width of the DRAM subarray be N, i.e. each row has N columns of memory cells, where N is an integer power of 2. Each marker row occupies one row of memory space, and the number of positions with value 1 in the marker row is counted.

A shift counter and a column counter are arranged in the counting module 407. The shift counter counts the current number of shifts; its initial value is 0, and it is cleared after the calculation is completed. The initial value of the column counter is N, and it is restored to N after the calculation is completed.

The specific statistical method comprises the following three steps:

step 1, the bit line controller 401 judges whether the value n of the column counter is 1; if n is 1, the calculation ends and step 3 is entered; if not, the marker row is read and the read result is sent to the shift module and shifted left by 2^i bits, where i is the value of the shift counter; after the shift, the shift counter value i is incremented by 1 and the result is written back to the subarray where the marker row is located; the original marker row is denoted row a and the shifted result row a_s;

step 2, the data of row a and row a_s are copied to the first row and the second row of the calculation area 501, and the data in the same column are summed: an exclusive-OR of the first row and the second row gives the sum s, and an AND of the first row and the second row gives the carry term c; the summation result is written back to the temporary storage area 4010 of the storage array 404; the bit line controller 401 divides the value n of the column counter by 2 and judges whether the result is 1; if the result is 1, the calculation ends; if not, a new round of shifting and summation is performed on the summation result in the temporary storage area 4010, i.e. the operation of step 1 is repeated, the rows of the calculation result are copied to the shift module and shifted, and the shifted results are accumulated column by column; each completed summation adds one row to the calculation result;

step 3, when the value n of the column counter is finally judged to be 1, the final result is obtained in the first row of the calculation result and is stored to a specified position.
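The shift-and-sum counting in steps 1 to 3 can be modelled in software as a tree reduction over whole rows. The Python sketch below is illustrative and rests on assumptions: rows are integers of n bits, the partial sum of each column is kept bit-sliced across several rows (`planes`), and the carry is rippled from one such row to the next with AND, OR and XOR. The text only names the XOR sum and the AND carry, so the ripple step and the choice of reading the total from the last column are assumptions of this sketch, as is the function name.

```python
def count_ones_shift_add(marker, n):
    """Count the 1 bits of an n-column marker row, where n is a power of two.

    Only whole-row shift, XOR, AND and OR are used. planes[k] holds bit k of
    the partial sum kept in each column, so the result grows by one row per round.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    full = (1 << n) - 1
    planes = [marker & full]
    i = 0     # shift counter
    cols = n  # column counter
    while cols != 1:
        shift = 1 << i  # shift distance is 2**i
        shifted = [(p << shift) & full for p in planes]
        carry = 0
        summed = []
        for a, b in zip(planes, shifted):
            half = a ^ b
            summed.append(half ^ carry)       # sum bit of this plane
            carry = (a & b) | (half & carry)  # carry into the next plane
        summed.append(carry)                  # one extra row per round
        planes = summed
        i += 1
        cols //= 2
    top = n - 1  # after the last round this column holds the total
    return sum(((p >> top) & 1) << k for k, p in enumerate(planes))

# T marker row of the slots TCCG0000AGTTTC, placed in a 16-column row.
print(count_ones_shift_add(0b10000000001110, 16))  # 4
```

A row of width N needs log2(N) rounds, which is why the column counter is halved each round until it reaches 1.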

Step four, the statistical result of the reference base sequence is compared with the statistical result of the target base sequence, and the screened target base sequence is filtered. Specifically, the complement values of the obtained counts of A adenine, G guanine, C cytosine and T thymine of the target base sequence are placed in the same row in column form, the complement values of the counts of the reference base sequence are placed in the corresponding columns, and the differences are calculated; finally the four differences are summed and compared with a threshold. If the sum of the differences is greater than the threshold, the target base sequence is excluded; if the sum is less than the threshold, the sequence at the screening position is marked as a target base sequence, and subsequent processing can proceed.
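The comparison can be illustrated with a small Python sketch of complement-based subtraction. Several details are assumptions made for the illustration and are not stated in the text: the counts are held in 8-bit words, the differences are taken as absolute values (signed differences of two equal-length sequences would cancel out), and a sum exactly equal to the threshold is treated as passing. The names `abs_diff` and `passes_filter` are hypothetical.

```python
WIDTH = 8            # assumed word width for the counts
MOD = 1 << WIDTH

def twos_complement(x):
    """Two's complement of x in WIDTH bits."""
    return (-x) % MOD

def abs_diff(target, reference):
    """|target - reference|, computed by adding the complement of reference.

    Valid while both counts are below MOD // 2.
    """
    d = (target + twos_complement(reference)) % MOD
    return MOD - d if d >= MOD // 2 else d  # fold negative results back

def passes_filter(target_counts, reference_counts, threshold):
    """Return False when the summed difference exceeds the threshold."""
    total = sum(abs_diff(target_counts[b], reference_counts[b]) for b in "AGCT")
    return not total > threshold

target = {"A": 1, "G": 2, "C": 3, "T": 4}
reference = {"A": 2, "G": 2, "C": 2, "T": 4}
print(passes_filter(target, reference, threshold=3))  # True
print(passes_filter(target, reference, threshold=1))  # False
```

Because only four small counts are compared per candidate position, the filter is cheap relative to a full base-by-base comparison.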

Example:

Assume that a piece of base sequence information is stored in the current storage region; the selected sequence has a length of 10 and is AGTTTCTCCG, as shown in FIG. 11.

According to the start address of the screening, which begins at the arrow in FIG. 11, the start segment mask is set to binary 000000000000111111111111 and the end segment mask to binary 111111111111110000000000000000, and the two mask data are written into the memory array. The previous row sequence TCTTTGAAAGTTC and the start segment mask are selected and ANDed bitwise to obtain the valid start segment data 00000000AGTTTC (where each 0 represents 2 bits of 0). The next row sequence TCCGAGGATGTGGT and the end segment mask are selected and ANDed bitwise to obtain the valid end segment data TCCG0000000000 (where each 0 represents 2 bits of 0). A bitwise OR of the valid target row data 00000000AGTTTC and TCCG0000000000 merges the valid portions of the two rows into one row, giving the complete target sequence TCCG0000AGTTTC, as shown in FIG. 12.
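The mask-and-merge step can be modelled in a few lines of Python. This is a sketch under assumptions: rows are taken to hold 14 bases, the previous row `TCTTTGAAAGTTTC` is an illustrative stand-in chosen so that its last six bases are AGTTTC, and the helpers `pack`, `segment_mask` and `unpack` are hypothetical names. The 2-bit encoding (A=00, G=10, C=11, T=01) follows the device description.

```python
# 2-bit encoding taken from the device description: A=00, G=10, C=11, T=01.
CODE = {"A": 0b00, "G": 0b10, "C": 0b11, "T": 0b01}
BASE = {bits: base for base, bits in CODE.items()}
ROW_BASES = 14  # bases per row in this illustration (assumption)


def pack(row):
    """Pack a string of bases into one integer, first base in the top bits."""
    value = 0
    for base in row:
        value = (value << 2) | CODE[base]
    return value


def segment_mask(keep):
    """Build a mask with M (binary 11) at kept base positions and 00 elsewhere."""
    value = 0
    for pos in range(ROW_BASES):
        value = (value << 2) | (0b11 if pos in keep else 0b00)
    return value


def unpack(value, valid):
    """Render a row, printing 0 for positions outside the valid set."""
    out = []
    for pos in range(ROW_BASES):
        bits = (value >> (2 * (ROW_BASES - 1 - pos))) & 0b11
        out.append(BASE[bits] if pos in valid else "0")
    return "".join(out)


previous_row = "TCTTTGAAAGTTTC"  # hypothetical row whose tail is the start segment
next_row = "TCCGAGGATGTGGT"      # row whose head is the end segment
start_keep = range(8, 14)        # last six bases of the previous row
end_keep = range(0, 4)           # first four bases of the next row

start_part = pack(previous_row) & segment_mask(start_keep)  # bitwise AND
end_part = pack(next_row) & segment_mask(end_keep)          # bitwise AND
merged = start_part | end_part                              # bitwise OR

merged_text = unpack(merged, set(start_keep) | set(end_keep))
print(merged_text)
```

The two masked rows never overlap, which is why a plain OR is enough to merge them. Note that a masked-out position and a genuine A both read as 00, which is the reason a separate A mask is needed later.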

After screening, a column conversion and storage operation is applied to the data in row storage format to obtain data 4005_h and 4005_l arranged in columns, and the result is written back to the storage array for the next calculation. The mask data 4003 is composed of 0 and M, where M represents the 2 bits 11, that is, M consists of binary 1s. When a 0 is ANDed with the sequence data, the irrelevant data is set to 0; when a 1 is ANDed with the sequence data, the valid data is retained.

Take the partial data TCCG0000 from the above example, where 0 is invalid data. Its binary expression is 0111111000000000, in which the lower 8 bits of 0 are invalid data;

after the shift module shifts the data left by one bit, the valid data arranged in columns is the data inside the dotted line, as shown in FIG. 13;

the control module generates the column GCT mask and the A mask according to the number and position of the invalid information, wherein the GCT mask is a bit string of alternating 1 and 0 bits: 101010101010;

the GCT mask is ANDed with 4005_l and 4005_h, respectively, setting the irrelevant data in both to 0 and generating the column-type GCT data, which is stored in 4007_l and 4007_h, respectively, as shown in FIG. 14;

the A mask is a combination of an alternating 01 bit string and an all-1 bit string: 0101010111111111. It is ORed with 4005_l and 4005_h, respectively, setting the irrelevant data in both to 1 and generating the column-type A data, which is stored in 4008_l and 4008_h, respectively, as shown in FIG. 15.
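The mask arithmetic for the TCCG0000 example can be checked with a small Python sketch. It rests on one reading of the shift step that the text does not state outright: 4005_h is taken to be the row itself (high bits of each base on the even columns) and 4005_l the row shifted left by one bit (low bits moved onto the even columns). The GCT mask is extended here to a full 16-bit alternating pattern, and all variable names are hypothetical.

```python
WIDTH = 16
FULL = (1 << WIDTH) - 1


def bits(value):
    """Format a row as a 16-character binary string."""
    return format(value, "016b")


row = 0b0111111000000000          # TCCG0000 with T=01, C=11, G=10

col_h = row                       # assumed 4005_h: high bits on even columns
col_l = (row << 1) & FULL         # assumed 4005_l: low bits on even columns

gct_mask = 0b1010101010101010     # alternating 10 pattern, 16 bits (assumption)
a_mask = 0b0101010111111111       # A mask quoted in the text

gct_h = col_h & gct_mask          # 4007_h: irrelevant bits forced to 0
gct_l = col_l & gct_mask          # 4007_l
a_h = col_h | a_mask              # 4008_h: irrelevant bits forced to 1
a_l = col_l | a_mask              # 4008_l

for name, value in (("4007_h", gct_h), ("4007_l", gct_l),
                    ("4008_h", a_h), ("4008_l", a_l)):
    print(name, bits(value))
```

Under this reading the four rows come out as 0010101000000000, 1010100000000000, 0111111111111111 and 1111110111111111, which agree with the values used in the labeling steps.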

Then, the A adenine, the G guanine, the C cytosine and the T thymine are respectively subjected to base labeling.

As shown in fig. 16, the specific labeling of adenine a is:

one row H (0111111111111111) and one row L (1111110111111111) of the target sequence are copied to the first and second rows of the calculation region 501, respectively;

performing a bitwise OR operation between the first row and the second row to obtain a result R1 (1111111111111111);

negating R1 to obtain (0000000000000000);

copying the result to the mark row of A adenine, in which the positions containing A adenine have the value 1 and the remaining positions have the value 0; in the current result the adenine marks are all 0, since the example contains no A.
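The adenine rule (OR the two rows, then invert) can be sketched as below. The function name `mark_adenine` is hypothetical, and the second input pair is an illustrative case that is not in the text: it is derived for a row TACG0000 under the same assumed column layout, with high bits on even columns and low bits obtained by a one-bit left shift.

```python
WIDTH = 16
FULL = (1 << WIDTH) - 1


def mark_adenine(a_h, a_l):
    """A is encoded 00, so a position holds A only where both rows are 0."""
    r1 = a_h | a_l          # bitwise OR of the high-bit and low-bit rows
    return ~r1 & FULL       # inversion, limited to the row width


# Rows H and L from the example (TCCG0000 after applying the A mask).
example_mark = mark_adenine(0b0111111111111111, 0b1111110111111111)
print(format(example_mark, "016b"))

# Hypothetical rows for TACG0000, built with the same A mask.
a_mask = 0b0101010111111111
tacg_row = 0b0100111000000000
tacg_mark = mark_adenine(tacg_row | a_mask, ((tacg_row << 1) & FULL) | a_mask)
print(format(tacg_mark, "016b"))
```

Forcing the irrelevant columns to 1 before the OR is what keeps padding and odd columns from being reported as adenine after the inversion.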

As shown in fig. 17, labeling cytosine C specifically is:

copying one row 4007_h (0010101000000000) and one row 4007_l (1010100000000000) of the target sequence to the first and second rows of the calculation area 501, respectively;

performing bitwise AND operation on the first row and the second row to obtain a result R2;

the result is copied to C cytosine mark line 503 (0010100000000000) where the position value containing C cytosine is 1 and the remaining position values are 0.

As shown in fig. 18, the labeling of guanine G specifically includes:

copying the bitwise inverted value 4009_l (0101011111111111) of 4007_l of the target sequence to one row of the calculation region 501;

copying 4007_h (0010101000000000) to another row of the calculation area 501;

the calculation area carries out bitwise AND operation on the two rows to obtain a result R3 (0000001000000000);

the result R3 is copied to the G guanine label line 504, where the G guanine-containing position is 1 and the remaining positions are 0.

As shown in fig. 19, labeling thymine T specifically is:

copying the negated value 4009_h of one line 4007_h of the target sequence to one line of the calculation region;

copying one row 4007_l of the target sequence to another row of the calculation area, so that the two rows occupy the first row and the second row;

performing bitwise AND operation on the first row and the second row to obtain a result R4;

the result R4 is copied to a tag row 505 of T thymines, where the positions containing T thymines have values of 1 and the remaining positions have values of 0.
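The three labeling rules for C, G and T reduce to one AND each, optionally preceded by an inversion. The sketch below applies them to the 4007_h and 4007_l rows of the example; the variable names are hypothetical, and 4007_l is taken as the 16-bit value 1010100000000000, the value consistent with the stated results.

```python
WIDTH = 16
FULL = (1 << WIDTH) - 1


def bits(value):
    """Format a row as a 16-character binary string."""
    return format(value, "016b")


gct_h = 0b0010101000000000   # 4007_h: high bits of T, C, C, G on even columns
gct_l = 0b1010100000000000   # 4007_l: low bits of T, C, C, G on even columns

inv_h = ~gct_h & FULL        # 4009_h: bitwise inversion of 4007_h
inv_l = ~gct_l & FULL        # 4009_l: bitwise inversion of 4007_l

c_mark = gct_h & gct_l       # C = 11: high bit 1 and low bit 1 (R2)
g_mark = inv_l & gct_h       # G = 10: high bit 1 and low bit 0 (R3)
t_mark = inv_h & gct_l       # T = 01: high bit 0 and low bit 1 (R4)

print("C", bits(c_mark))
print("G", bits(g_mark))
print("T", bits(t_mark))
```

Because the GCT mask has already cleared every irrelevant column in 4007_h and 4007_l, each AND has at least one operand that is 0 there, so no padding position can be marked as C, G or T.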

Then, for the mark row of each base type, the number of 1 values is counted, as follows. Assume that the column width N of the current sub-array is 16 (N is typically an integer power of 2), i.e. each row has 16 columns of memory cells, and each mark row occupies one row of storage space. Take one mark row and count its 1s, for example the mark row of C (0010100000000000). A shift counter in the counting module records the current number of shifts; its initial value is 0, and it is cleared after the counting is finished.

The specific statistical method comprises the following steps:

The controller judges whether the current n is 1. Here n = 16, so the mark row is read out and sent to the shift module for a left shift of 2 to the power 0 bits. After the shift is completed, the value i of the shift counter is incremented by 1 to become 1, and the result is written back to the sub-array in which the mark row is located. The original mark row is designated row a, and the shifted result is designated row a_s.

Row a and row a_s are copied to the first row and the second row of the calculation area, and the data in the same column are summed to obtain a1 and a0, as shown in FIG. 20.

The controller divides the current n by 2 to obtain a new n of 8. Since the result is not 1, the calculation results are copied to the shift row and shifted by 2 to the power 1 bits; after the shift, the result is written back to the sub-array and continues to participate in the operation.

Shifting a1 and a0, and adding 1 to the value i of the shift counter to 2 after the shift operation is completed, as shown in fig. 21;

column-wise summation after shifting, as shown in fig. 22;

the controller divides the current n by 2 to obtain a new n which is 4; and the judgment result is not 1, shift a2 a1 a0, and after the shift is completed, add 1 to the value i of the shift counter to be 3, as shown in fig. 23;

column-wise summation after shifting, as shown in fig. 24;

the controller divides the current n by 2 to obtain a new n which is 2; and the judgment result is not 1, shift a3 a2 a1 a0, and after the shift is completed, add 1 to the value i of the shift counter to be 4, as shown in fig. 25;

column-wise summing after shifting, as shown in fig. 26;

when the value of the column counter is finally judged to be 1, the final result is obtained in the first column of the calculation result. The complement values of the statistical results obtained for A adenine, G guanine, C cytosine and T thymine of the target base sequence are then placed in the same row in column form, the complement values of the statistical results of the reference base sequence are placed in the corresponding columns, and the differences are calculated. Finally, the four differences are summed and compared with the threshold: if the sum of the differences is greater than the threshold, the target base sequence is excluded; if it is less than the threshold, the sequence at the screened position is marked as the target base sequence.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described in detail the practice of the invention, it will be appreciated by those skilled in the art that variations may be applied to the embodiments described in the foregoing examples, or equivalents may be substituted for elements thereof. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.

Claims

1. A base sequence filtering method based on DRAM memory calculation is characterized by comprising the following steps:

and step four, comparing the statistical result of the reference base sequence with the statistical result of the target base sequence, and filtering the screened target base sequence.

2. The method for filtering base sequences based on DRAM memory calculation according to claim 1, wherein the first step is specifically:

selecting the target row data sequence in the previous row of the target base sequence and the start segment mask data and performing a bitwise AND calculation to obtain valid start segment target row data; selecting the target row data sequence in the next row of the target base sequence and the end segment mask data and performing a bitwise AND calculation to obtain valid end segment target row data;

performing a bitwise OR calculation on the valid start segment target row data and the valid end segment target row data, merging the valid parts of the two rows of data into one row to obtain complete valid target base sequence data, wherein the head and the tail of the valid target base sequence data are in the same row and have no overlapping position;

performing a column conversion operation on the effective target base sequence data in the line storage format to obtain first high bit data and first low bit data arranged in columns in the storage array;

generating an array GCT mask and an A mask according to the number of invalid information data and the position of the invalid information data, wherein the GCT mask is respectively subjected to AND operation with first high-bit data and first low-bit data, and irrelevant data is set to be 0 to generate array GCT data which is stored as second high-bit data and second low-bit data, and then bit-wise negation is performed and the array GCT data is stored as third high-bit data and third low-bit data; the a mask performs or operation with the first high bit data and the first low bit data, respectively, sets all the irrelevant data to 1, generates column a data, and stores the column a data as fourth high bit data and fourth low bit data.

3. The method as claimed in claim 2, wherein the start segment mask data and the end segment mask data are composed of 0 and M, M being composed of two bits of binary 1; when a 0 is ANDed with the target row sequence data, the irrelevant data is set to 0, and when a 1 is ANDed with the target row sequence data, the valid data is retained.

4. The method for filtering a base sequence based on DRAM memory calculation as claimed in claim 2, wherein the specific method steps for labeling the base A adenine in the target base sequence in the second step are as follows:

carrying out negation operation on the result R1;

and copying the result after the inversion operation to a mark line of the A adenine in the memory array, wherein the position value containing the A adenine is 1, and the rest position values are 0.

5. The method for filtering base sequences based on DRAM memory calculation as claimed in claim 2, wherein the specific method for labeling the base C cytosine in the target base sequence in the second step is as follows:

carrying out bitwise AND operation on the data of the first row and the data of the second row in the calculation area to obtain a result R2;

the result R2 is copied to a tag row of C-cytosine in the memory array, where the position value of the C-cytosine is 1 and the remaining position values are 0.

6. The method for filtering base sequences based on DRAM memory calculation as claimed in claim 2, wherein the specific method steps for labeling the base G guanine in the target base sequence in the second step are as follows:

copying the third low bit data to a first row in a compute region of the memory array;

copying the second high bit data to a second row in the calculation area;

carrying out bitwise AND operation on the data of the first line and the second line in the calculation area to obtain a result R3;

the result R3 is copied to a mark row of G guanine in the memory array, wherein the position value containing G guanine is 1, and the rest position values are 0.

7. The method for filtering base sequences based on DRAM memory calculation as claimed in claim 2, wherein the specific method for labeling the base T thymine in the target base sequence in the second step is as follows:

copying the second low bit data to a second row of the calculation region;

8. The method for filtering base sequences based on DRAM memory calculation as claimed in claim 1, wherein the specific method of statistics in the third step comprises the following three steps:

step 1, using a column counter and a shift counter, first judging whether the value n of the current column counter is 1; if it is not, reading the mark row and performing a left shift on the read result, wherein the number of shifted bits is 2 to the power i and i is the value of the shift counter; after the shift is completed, adding 1 to the value i of the shift counter and writing the result back to the DRAM sub-array where the mark row is located; setting the original mark row as row a and the shifted result as row a_s; if n is 1, ending the calculation and proceeding to step 3;

copying the row a and row a_s data to a first row and a second row of a calculation area of the storage array, carrying out summation calculation on the data in the same column, namely carrying out exclusive OR operation on the first row and the second row to obtain a sum s of the first row and the second row, carrying out AND operation on the first row and the second row to obtain a carry term c of the sum of the first row and the second row, and writing the sum result back to a temporary storage area of the storage array; dividing the value n of the current column counter by 2, judging whether the result is 1, and if the result is 1, finishing the calculation; if the result is not 1, carrying out a new round of shifting and summing operation on the basis of the summing result of the temporary storage area, wherein each time the summing operation is finished, the calculation result is increased by one row, namely, the operation of step 1 is carried out, the calculation result is shifted, and the shifting result is accumulated column by column;

9. The method for filtering base sequences based on DRAM memory calculation according to claim 1, wherein the fourth step is specifically: putting the complement values of the statistical results of the obtained A adenine, G guanine, C cytosine and T thymine of the target base sequence in the same row in a column form, putting the complement values of the statistical results of the reference base sequence in the corresponding column, calculating difference values, and finally, summing the four difference values and comparing the sum values with a threshold value; if the sum of the differences is greater than the threshold, excluding the target base sequence; if the value is less than the threshold value, marking the sequence of the screening position as the target base sequence.

10. A base sequence filter device based on DRAM memory calculation is characterized by comprising:

the memory array is composed of DRAM subarrays and used for storing target base sequences, and binary expression is used for setting base information, and the method specifically comprises the following steps: the binary expression corresponding to A-adenine is 00, the binary expression corresponding to G-guanine is 10, the binary expression corresponding to C-cytosine is 11, the binary expression corresponding to T-thymine is 01, and each base information is composed of 2bit data;

the storage array is provided with a calculation area for data calculation, an original data storage area for storing original base sequence data, a column type data area for storing data converted into a column format, and a temporary storage area for temporarily storing intermediate results generated in the calculation process;

the control module receives an external address, data and a command, performs decoding control, sends the decoding control to the word line controller, the bit line controller, the shift and inversion module and the buffer, converts the base sequence data format from a same-row mode to a same-column mode, writes the base sequence data format into a DRAM sub-array, and controls the calculation process; the word line controller 402 controls row signals of the memory array, and the bit line controller controls column signals of the memory array; the buffer is used for buffering data; the shifting and negating module comprises a shifting module and a negating module and can perform shifting operation and negating operation on a line of data according to calculation requirements;

and the counting module is internally provided with a group of counters which comprise a shift counter and a row counter, respectively counts different types of base sequences when the reference sequence is written, records the final result, and writes the AGCT statistical value of the reference sequence into the fixed address of the target array.